Monday, May 19 2008

Importing furigana into Word

Aozora Bunko is, more or less, the Japanese version of Project Gutenberg. As I’ve mentioned before, they have a simple markup convention to handle phonetic guides and textual notes. The notes can get a bit complicated, referring to obsolete kanji and special formatting, but the phonetic part is simple to parse.

I can easily convert it to my pop-up furigana for online use (which I think is more useful than the real thing at screen resolution), but for my reading class, it would be nice to make real furigana to print out. A while back I started tinkering with using Word’s RTF import for this, but gave up because it was a pain in the ass. Among other problems, the RTF parser is very fragile, and syntax errors can send it off into oblivion.

Tonight, while I was working on something else, I remembered that Word has an allegedly reasonable HTML parser, and IE was the first browser to support the HTML tags for furigana. So I stripped the RTF code out of my script, generated simple HTML, and sent it to Word. Success! Also a spinning beach-ball for a really long time, but only for the first document; once Word loaded whatever cruft it needed, that session would convert subsequent HTML documents quickly. It even obeys simple CSS, so I could set the main font size and line spacing, as well as the furigana size.

Two short Perl scripts: shiftjis2utf8 and aozora-ruby.

[Note that Aozora Bunko actually supplies XHTML versions of their texts with properly-tagged furigana, but they also do some other things to the text that I don’t want to try to import into Word, like replacing references to obsolete kanji with PNG files.]