Every time I include some Japanese text in a blog entry, I’m torn between adding furigana and, well, not. It’s extremely useful for people who don’t read kanji well, but it’s tedious to do by hand in HTML. At the same time, I find myself wishing that my Rosetta Stone courseware included furigana, so that I could hover the mouse over a word and see the pronunciation instead of switching from kanji to kana mode and back. I’d also like to see their example phrases in a better font, at higher resolution.

80 lines of Perl later:



[update: I tested this under IE on my Windows machine at work, and it correctly displayed the pop-up furigana but ignored the CSS that highlights the word it applies to. Apparently my machine has extra magic installed, because for some people the pop-up doesn’t use a Unicode font. Sigh. Found a fix: “fixing tooltips in IE” (about halfway down the page).]

The script takes advantage of Perl’s Encode module to natively manipulate Unicode strings, parsing a simple data format. Each phrase consists of one or more lines, like this:



 に   いる

  / は   います。

[update: to no great surprise, Internet Explorer makes a hash of this, replacing the full-width space characters with “something else”. Try it in a modern browser, and not only will the columns line up, but the CSS span.kana:hover rule will work, too.]

Entries are delimited by blank lines. If there’s just one line, it’s left alone. If there are two lines, the first contains a single word, and the second contains the furigana for it. If there are more than two lines, the first contains a phrase, the second repeats the phrase but replaces the kanji words with whitespace, and each remaining line contains the pronunciation of one such word. If two words needing furigana are right next to each other, the last character of each one is marked with a “/”.
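
To make those rules concrete, here is a minimal sketch of a parser for the format. It is written in Python rather than Perl and is not the script from this post; the names `split_entries` and `annotate` are made up for illustration. It assumes that the whitespace in the second line is either a full-width or an ASCII space, that a "/" in the second line always marks the last character of a word, and that the phrase itself contains neither.

```python
MASK_SPACES = {" ", "\u3000"}  # ASCII space and full-width (ideographic) space
END_MARK = "/"                 # assumed to mark the last character of a word


def split_entries(text):
    """Split the input into entries (lists of lines) at empty lines.

    Only truly empty lines count as blank, so a second line made up
    entirely of full-width spaces is kept.
    """
    entries, current = [], []
    for line in text.splitlines():
        if line == "":
            if current:
                entries.append(current)
                current = []
        else:
            current.append(line)
    if current:
        entries.append(current)
    return entries


def annotate(lines):
    """Turn one entry into a list of (text, reading-or-None) segments."""
    if len(lines) == 1:                 # one line: left alone
        return [(lines[0], None)]
    if len(lines) == 2:                 # two lines: word, then its furigana
        return [(lines[0], lines[1])]

    phrase, mask = lines[0], lines[1]
    readings = iter(lines[2:])          # one reading per masked word, in order
    segments, plain, word = [], [], []

    def flush_plain():
        if plain:
            segments.append(("".join(plain), None))
            plain.clear()

    def flush_word():
        if word:
            # If the entry runs out of readings, the word gets None.
            segments.append(("".join(word), next(readings, None)))
            word.clear()

    for i, ch in enumerate(phrase):
        # A second line shorter than the phrase leaves the tail as plain text.
        m = mask[i] if i < len(mask) else ch
        if m == END_MARK:               # last character of a word
            flush_plain()
            word.append(ch)
            flush_word()
        elif m in MASK_SPACES:          # inside a word that needs furigana
            flush_plain()
            word.append(ch)
        else:                           # ordinary text, copied through
            flush_word()
            plain.append(ch)
    flush_word()
    flush_plain()
    return segments


if __name__ == "__main__":
    sample = "私は学生です。\n\u3000は\u3000\u3000です。\nわたし\nがくせい\n"
    for entry in split_entries(sample):
        print(annotate(entry))
    # [('私', 'わたし'), ('は', None), ('学生', 'がくせい'), ('です。', None)]
```

Walking the two lines in step works only if both are handled as strings of characters rather than bytes, which is why native Unicode handling matters here.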

This format is fairly easy to write in, and the Perl script to parse it is pretty trivial.
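
For the output side, a rough sketch of how parsed text might become the pop-up markup mentioned in the updates. This is Python, not the original Perl, and `render` is a hypothetical name. The `kana` class comes from the span.kana:hover rule mentioned above; putting the reading in a `title` attribute is my assumption about how the pop-up is produced. The input is assumed to be a list of (text, reading) pairs, with `None` as the reading for text that needs no furigana.

```python
import html


def render(segments):
    """Render (text, reading-or-None) pairs as an HTML fragment."""
    parts = []
    for text, reading in segments:
        if reading is None:
            # Plain text: escape it and pass it through.
            parts.append(html.escape(text))
        else:
            # Annotated word: the reading goes in the tooltip (assumed markup).
            parts.append(
                '<span class="kana" title="{}">{}</span>'.format(
                    html.escape(reading, quote=True), html.escape(text)
                )
            )
    return "".join(parts)


if __name__ == "__main__":
    print(render([("私", "わたし"), ("は", None)]))
    # <span class="kana" title="わたし">私</span>は
```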

[sigh; Safari is getting too smart for its own good. If you click on the link to the script, it ignores the Content-Type: header returned by Apache and sniffs the content instead, finding the HTML lines embedded in a print statement. Update: they must have learned it from IE…]