Importing furigana into Word

Aozora Bunko is, more or less, the Japanese version of Project Gutenberg. As I’ve mentioned before, they have a simple markup convention to handle phonetic guides and textual notes. The notes can get a bit complicated, referring to obsolete kanji and special formatting, but the phonetic part is simple to parse.

I can easily convert it to my pop-up furigana for online use (which I think is more useful than the real thing at screen resolution), but for my reading class, it would be nice to make real furigana to print out. A while back I started tinkering with using Word’s RTF import for this, but gave up because it was a pain in the ass. Among other problems, the RTF parser is very fragile, and syntax errors can send it off into oblivion.

Tonight, while I was working on something else, I remembered that Word has an allegedly reasonable HTML parser, and IE was the first browser to support the HTML tags for furigana. So I stripped the RTF code out of my script, generated simple HTML, and sent it to Word. Success! Also a spinning beach-ball for a really long time, but only for the first document; once Word loaded whatever cruft it needed, that session would convert subsequent HTML documents quickly. It even obeys simple CSS, so I could set the main font size and line spacing, as well as the furigana size.

Two short Perl scripts: shiftjis2utf8 and aozora-ruby.

[Note that Aozora Bunko actually supplies XHTML versions of their texts with properly-tagged furigana, but they also do some other things to the text that I don’t want to try to import into Word, like replacing references to obsolete kanji with PNG files.]
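Since the phonetic part of the convention is so simple, the core of the conversion can be sketched in a few lines. This is an illustration of the markup convention, not the actual aozora-ruby script; the simple ruby form shown here is the one IE introduced:

```perl
#!/usr/bin/perl
# Sketch of the Aozora Bunko ruby convention: the reading follows the
# base text in 《 》 brackets, with an optional ｜ marking where the
# base text begins. Illustrative only, not the actual aozora-ruby script.
use strict;
use warnings;
use utf8;

sub aozora2html {
    my ($line) = @_;
    # explicit base text: ｜base《reading》
    $line =~ s{｜([^《》｜]+)《([^》]+)》}{<ruby>$1<rt>$2</rt></ruby>}g;
    # implicit base text: the run of kanji just before the brackets
    $line =~ s{(\p{Han}+)《([^》]+)》}{<ruby>$1<rt>$2</rt></ruby>}g;
    return $line;
}

binmode STDOUT, ':encoding(UTF-8)';
print aozora2html($_) while <STDIN>;
```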

Sony Reader 505

So I bought the second-generation Sony Reader. Thinner, faster, crisper screen, cleaned-up UI, USB2 mass storage for easy import, and some other improvements over the previous one. It still has serious limitations, and in a year or two it will be outclassed at half the price, but I actually have a real use for a book-sized e-ink reader right now: I’m finally going to Japan, and we’ll be playing tourist.

My plan is to dump any and all interesting information onto the Reader, and not have to keep track of travel books, maps, etc. It has native support for TXT, PDF, PNG, and JPG, and there are free tools for converting and resizing many other formats.

Letter and A4-sized PDFs are generally hard to read, but I have lots of experience creating my own custom-sized stuff with PDF::API2::Lite, so that’s no trouble at all. The PDF viewer has no zoom, but the picture viewer does, so I’ll be dusting off my GhostScript-based pdf2png script for maps and other one-page documents that need to be zoomed.
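The wrapper itself doesn't amount to much; a minimal sketch, assuming Ghostscript's gs is on the PATH (the resolution and file names here are illustrative, not the ones my script uses):

```perl
#!/usr/bin/perl
# Sketch of a pdf2png wrapper around Ghostscript; 170dpi is roughly
# the Reader's screen density. Illustrative only.
use strict;
use warnings;

sub gs_command {
    my ($pdf, $base, $dpi) = @_;
    return sprintf 'gs -q -dBATCH -dNOPAUSE -sDEVICE=png16m -r%d '
                 . '-sOutputFile=%s-%%03d.png %s', $dpi, $base, $pdf;
}

my $cmd = gs_command('map.pdf', 'map', 170);
print "$cmd\n";
# system($cmd) == 0 or die "ghostscript failed: $?";
```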

I’ll write up a detailed review soon, but so far there’s only one real annoyance: very limited kanji support. None at all in the book menus, which didn’t surprise me, and silent failure in the PDF viewer, which did. Basically, any embedded font in a PDF file is limited to 512 characters; if it has more, characters in that font simply won’t appear in the document at all.

The English Wikipedia and similar sites tend to work fine, because a single document will only have a few words in Japanese. That’s fine for the trip, but now that I’ve got the thing, I want to put some reference material on it. I have a script that pulls data from EDICT and KANJIDIC and generates a PDF kanji dictionary with useful vocabulary, but I can’t use it on the Reader.

…unless I embed multiple copies of the same font, and keep track of how many characters I’ve used from each one. This turns out to be trivial with PDF::API2::Lite, but it does significantly increase the size of the output file, and I can’t clean it up in Acrobat Distiller, because that application correctly collapses the duplicates down to one embedded font.
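The bookkeeping amounts to rotating to a fresh copy of the embedded font whenever the current copy fills up. A sketch of the idea; the helper name and font file are mine, not part of PDF::API2::Lite:

```perl
# Sketch of the glyph-counting trick: hand out a font object per
# character, embedding a fresh copy of the same TTF whenever the
# current copy reaches the Reader's 512-character limit.
# font_for_char() is a hypothetical helper, not a PDF::API2::Lite method.
my $MAX_GLYPHS = 512;
my (@fonts, %assigned, $used);

sub font_for_char {
    my ($pdf, $char) = @_;
    # reuse the copy this character was already assigned to
    return $assigned{$char} if exists $assigned{$char};
    if (!@fonts or $used >= $MAX_GLYPHS) {
        push @fonts, $pdf->ttfont('kyoukasho.ttf');  # fresh embedded copy
        $used = 0;
    }
    $used++;
    return $assigned{$char} = $fonts[-1];
}
```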

I haven’t checked to see if the native Librie format handles font-embedding properly. I’ll have to install the free Windows software at some point and give it a try.

[Update: I couldn’t persuade Distiller to leave the multiple copies of the font alone, because OpenType CID fonts apparently embed a unique ID in several places. FontForge was perfectly happy to convert it to a non-CID TrueType font, and then I only had to rename one string to create distinct fonts for embedding. My test PDF works fine on the Reader now.]


Restoring Chizumatic’s sidebar to its rightful place was a task worth pursuing, but since the Minx templates generate tag soup, standard validation tools produced too many errors to help much (W3C’s validator reported ~700 errors, compared to this page’s 16, 14 of which are parser errors in Amazon search URLs).

So I tried a different approach:

my @x;
my $l = 0;    # current <div> nesting level
while (<STDIN>) {
    next unless /<\/?div/i;
    if (@x = /<div/gi) {
        print "[$l] $. ", @x + 0, " $_\n";
        $l += @x;
    }
    if (@x = /<\/div/gi) {
        print "[$l] $. ", @x + 0, " $_\n";
        $l -= @x;
    }
}
print "[$l]\n";

Skimming through the output, I saw that the inline comments started at level 6, until I reached comment 8 in the “Shingu 20” entry, which started at level 7. Sure enough, what should have been a (pardon my French) </div></p></div> in the previous comment was just a </p></div>.

[Update: fixing one bad Amazon URL removed 14 of the 16 validation errors on this page, and correcting a Movable Type auto-formatting error got rid of the other two. See, validation is easy! :-)]

JMdict + XML::Twig + DBD::SQLite = ?

In theory, I’m still working at Digeo. In practice, not so much. As we wind our way closer to the layoff date, I have less and less actual work to do, and more and more “anticipating the future failures of our replacements”. On the bright side, I’ve had a lot of time to study Japanese and prepare for Level 3 of the JLPT, which is next weekend.

I’m easily sidetracked, though, and the latest side project is importing the freely-distributed JMdict/EDICT and KANJIDIC dictionaries into a database and wrapping it with Perl, so that I can more easily incorporate them into my PDF-generating scripts.

Unfortunately, all of the tree-based XML parsing libraries for Perl create massive memory structures (I killed the script after it got past a gig), and the stream-based ones don’t make a lot of sense to me. XML::Twig’s documentation is oriented toward transforming XML rather than importing it, but it’s capable of doing the right thing without writing ridiculously unPerly code:

my $twig = new XML::Twig(
        twig_handlers => { entry => \&parse_entry });
$twig->parsefile('JMdict');

sub parse_entry {
        my $ref = $_[1]->simplify;
        print "Entry ID=", $ref->{ent_seq}, "\n";
        $_[1]->purge;   # discard this entry's twig to keep memory flat
}

SQLite was the obvious choice for a back-end database. It’s fast, free, stable, serverless, and Apple’s supporting it as part of Core Data.

Putting it all together meant finally learning how to read XML DTDs, getting a crash course in SQL database design to lay out the tables in a sensible way, and writing useful and efficient queries. I’m still working on that last part, but I’ve gotten far enough that my lookup tool has basic functionality: given a complete or partial search term in kanji, kana, or English, it returns the key parts of the matching entries. Getting all of the available data assembled requires both joins and multiple queries, which is tedious to sort out.

I started with JMdict, which is a lot bigger and more complicated, so importing KANJIDIC2 is going to be easy when I get around to it. That won’t be this week, though, because the JLPT comes but once a year, and finals week comes right after.

[side note: turning off auto-commit and manually committing after every 500th entry cut the import time by 2/3]
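A sketch of that trick, using an in-memory database and dummy rows; the table layout here is illustrative, far simpler than the real JMdict schema:

```perl
#!/usr/bin/perl
# Sketch of batched commits with DBD::SQLite: turn off AutoCommit and
# commit every 500th entry. Table layout is illustrative only.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do('CREATE TABLE entry (ent_seq INTEGER PRIMARY KEY, gloss TEXT)');

my $ins = $dbh->prepare('INSERT INTO entry VALUES (?, ?)');
my $n = 0;
for my $seq (1 .. 1000) {
    $ins->execute($seq, "gloss $seq");
    $dbh->commit unless ++$n % 500;   # flush a full batch
}
$dbh->commit;                         # flush the partial batch at the end
```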

Spam-guarding email addresses

I’ve been playing with jQuery recently. The major project I’m just about ready to roll out is a significant improvement to my pop-up furigana-izer. If native tooltips actually worked reliably in browsers, it would be fine, but they don’t, so I spent a day sorting out all of the issues, and while I was at it, I also added optional kana-to-romaji conversion.

I’ll probably roll that out this weekend, updating a bunch of my old Japanese entries to use it, but while I was finishing it up I had another idea: effective spam-protection for email addresses on my comment pages. The idea is simple: strip the real address out of each mailto: link, and use jQuery to reassemble it from a hash table when the page is loaded, inserting it into the HREF attribute. Spambots harvesting the raw HTML never see a usable address.

A full working example looks like this:
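A minimal sketch of the technique; the element names, token scheme, and sample address are all illustrative:

```javascript
// Sketch: the page source carries only an opaque token; the real
// address is rebuilt from a hash table when the page loads.
var emailParts = {
  // token: [user, domain] -- sample data, not a real address
  a1: ['webmaster', 'example.com']
};

function decodeEmail(token) {
  var p = emailParts[token];
  return p ? p[0] + '@' + p[1] : null;
}

// With jQuery loaded, fix up <a class="email" data-token="a1">email</a>
// links once the page is ready:
if (typeof jQuery !== 'undefined') {
  jQuery(function ($) {
    $('a.email').each(function () {
      var addr = decodeEmail($(this).attr('data-token'));
      if (addr) { $(this).attr('href', 'mailto:' + addr); }
    });
  });
}
```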



I want a better text editor. What I really, really want, I think, is Gnu-Emacs circa 1990, with Unicode support and a fairly basic Cocoa UI. What I’ve got now is the heavily-crufted modern Gnu-Emacs supplied with Mac OS X, running in Terminal.app, and a separate GUI editor when I need to type kanji into a plain-text file.

So I’ve been trying out TextWrangler recently, whose virtues include being free and supporting a reasonable subset of Emacs key-bindings. Unfortunately, the default configuration is J-hostile, a number of settings can’t be changed for the current document (only for future opens), and its many configuration options are “less than logically sorted”.

What don’t I like?

First, the “Documents Drawer” is a really stupid idea, and turning it off involves several checkboxes in different places. What’s it like? Tabbed browsing with invisible tabs; it’s possible to have half a dozen documents open in the same window, with no visual indication that closing that window will close them all, and the default “close” command does in fact close the window rather than a single document within it.

Next, I find the concept of a text editor that needs a “show invisibles” option nearly as repulsive as a “show invisibles” option that doesn’t actually show all of the invisible characters. Specifically, if you select the default Unicode encoding, a BOM character is silently inserted at the beginning of your file. “Show invisibles” won’t tell you; I had to use /usr/bin/od to figure out why my furiganizer was suddenly off by one character.

Configuring it to use the same flavor of Unicode as TextEdit and other standard Mac apps is easy once you find it in the preferences, but fixing damaged text files is a bit more work. TextWrangler won’t show you this invisible BOM character, and /usr/bin/file doesn’t differentiate between Unicode flavors. I’m glad I caught it early, before I had dozens of allegedly-text files with embedded 文字化け (mojibake). The fix is to do a “save as…”, click the Options button in the dialog box, and select the correct encoding.

Basically, over the course of several days, I discovered that a substantial percentage of the default configuration settings either violated the principle of least surprise or just annoyed the living fuck out of me. I think I’ve got it into a “mostly harmless” state now, but the price was my goodwill; where I used to be lukewarm about the possibility of buying their higher-end editor, BBEdit, now I’m quite cool: what other unpleasant surprises have they got up their sleeves?

By contrast, I’m quite fond of their newest product, Yojimbo, a mostly-free-form information-hoarding utility. It was well worth the price, even with its current quirks and limitations.

Speaking of quirks, my TextWrangler explorations yielded a fun one. One of its many features, shared with BBEdit, is a flexible syntax-coloring scheme for programming languages. Many languages are supported by external modules, but Perl is built in, and their support for it is quite mature.

Unfortunately for anyone writing an external parser, Perl’s syntax evolved over time, and was subjected to some peculiar influences. I admit to doing my part in this, as one of the first people to realize that the arguments to the grep() function were passed by reference, and that this was really cool and deserved to be blessed. I think I was also the first to try modifying $a and $b in a sort function, which was stupid, but made sense at the time. By far the worst, however, from the point of view of clarity, was Perl poetry. All those pesky quotes around string literals were distracting, you see, so they were made optional.

This is still the case, and while religious use of use strict; will protect you from most of them, there are places where unquoted string literals are completely unambiguous, and darn convenient as well. Specifically, when an unquoted string literal appears in list context followed by the syntactic sugar “=>” [ex: (foo => "bar")], and when it appears in scalar context surrounded by braces [ex: $x{foo}].

TextWrangler and BBEdit are blissfully unaware of these “bareword” string literals, and make no attempt to syntax-color them. I think that’s a reasonable behavior, whether deliberate or accidental, but it has one unpleasant side-effect: interpreting barewords as operators.

Here’s the stripped-down example I sent them, hand-colored to match TextWrangler’s incorrect parsing:


use strict;

my %foo;
$foo{a} = 1;
$foo{x} = 0;

my %bar = (y=>1,z=>1,x=>1);

$foo{y} = f1() + f2() + f3();

sub f1 {return 0}
sub f2 {return 1}

sub f3 {return 2}

PDF::API2, Preview, kanji fonts, and me

I’d love to know why this PDF file displays its text correctly in Acrobat Reader, but not in Preview (compare to this one, which does). Admittedly, the application generating it is including the entire font, not just the subset containing the characters used (which is why it’s so bloody huge), but it’s a perfectly reasonable thing to do in PDF. A bit rude to the bandwidth-impaired, perhaps, but nothing more.

While I’m on the subject of flaws in Preview, let me point out two more. One that first shipped with Tiger is the insistence on displaying and printing Aqua data-entry fields in PDF files containing Acrobat forms, even when no data has been entered. Compare and contrast with Acrobat, which only displays the field boundaries while that field has focus. Result? Any page element that overlaps a data-entry field is obscured, making it impossible to view or print the blank form. How bad could it be? This bad (I’ll have to make a screenshot for the users…).

The other problem is something I didn’t know about until yesterday (warning: long digression ahead). I’ve known for some time that only certain kanji fonts will appear in Preview when I generate PDFs with PDF::API2 (specifically, Kozuka Mincho Pro and Ricoh Seikaisho), but for a while I was willing to work with that limitation. Yesterday, however, I broke down and bought a copy of the OpenType version of DynaFont’s Kyokasho, specifically to use it in my kanji writing practice. As I sort-of expected, it didn’t work.

[Why buy this font, which wasn’t cheap? Mincho is a Chinese style used in books, magazines, etc; it doesn’t show strokes the way you’d write them by hand. Kaisho is a woodblock style that shows strokes clearly, but they’re not always the same strokes. Kyoukasho is the official style used to teach kanji writing in primary-school textbooks in Japan. (I’d link to the nice page at sci.lang.japan FAQ that shows all of them at once, but it’s not there any more, and several of the new pages are just editing stubs; I’ll have to make a sample later)]

Anyway, what I discovered was that if you open the un-Preview-able PDF in the full version of Adobe Acrobat, save it as PostScript, and then let Preview convert it back to PDF, not only does it work (see?), the file size has gone from 4.2 megabytes to 25 kilobytes. And it only takes a few seconds to perform this pair of conversions.

Wouldn’t it be great to automate this task using something like AppleScript? Yes, it would. Unfortunately, Preview is not scriptable. Thanks, guys. Fortunately, Acrobat Distiller is scriptable and just as fast.

On the subject of “why I’m doing this in the first place,” I’ve decided that the only useful order to learn new kanji in is the order they’re used in the textbooks I’m stuck with for the next four quarters. The authors don’t seem to have any sensible reasons for the order they’ve chosen, but they average 37 new kanji per lesson, so at least they’re keeping track. Since no one else uses the same order, and the textbooks provide no support for actually learning kanji, I have to roll my own.

There are three Perl scripts involved, which I’ll clean up and post eventually: the first reads a bunch of vocabulary lists and figures out which kanji are new to each lesson, sorted by stroke count and dictionary order; the second prints out the practice PDF files; the third is for vocabulary flashcards, which I posted a while back. I’ve already gone through the first two lessons with the Kaisho font, but I’m switching to the Kyoukasho now that I’ve got it working.

Putting it all together, my study sessions look like this. For each new kanji, look it up in The Kanji Learner’s Dictionary to get the stroke order, readings, and meaning; trace the Kyoukasho sample several times while mumbling the readings; write it out 15 more times on standard grid paper; write out all the readings on the same grid paper, with on-yomi in katakana and kun-yomi in hiragana, so that I practice both. When I finish all the kanji in a lesson, I write out all of the vocabulary words as well as the lesson’s sample conversation. Lather, rinse, repeat.

My minimum goal is to catch up on everything we used in the previous two quarters (~300 kanji), and then keep up with each lesson as I go through them in class. My stretch goal is to get through all of the kanji in the textbooks by the end of the summer (~1000), giving me an irregular but reasonably large working set, and probably the clearest handwriting I’ve ever had. :-)

"What do you do with a B6 notebook?"

(note: for some reason, my brain keeps trying to replace the last two words in the subject with “drunken sailor”; can’t imagine why)

Kyokuto makes some very nice notebooks. Sturdy covers in leather or plastic, convenient size, and nicely formatted refill pages. I found them at MaiDo Stationery, but Kinokuniya carries some of them as well. I like the B6 size best for portability; B5 is more of an office/classroom size, and A5 just seems to be both too big and too small. B6 is also the size that Kodansha publishes all their Japanese reference books in, including my kanji dictionary, which is a nice bonus.

[This is, by the way, the Japanese B6 size rather than the rarely-used ISO B-series. When Japan adopted the ISO paper standard, the B-series looked just a wee bit too small, so they redefined it to have 50% larger area than the corresponding A-series size. Wikipedia has the gory details.]

I really like the layout of Kyokuto’s refill paper. So much so, in fact, that I used PDF::API2::Lite to clone it. See? The script is a little rough at the moment, mostly because it also does 5mm grid paper, 20x20 tategaki report paper, and B8/3 flashcards, and I’m currently adding kanji practice grids with the characters printed in gray in my Kyoukasho-tai font. I’ll post it later after it’s cleaned up.
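For flavor, here’s a stripped-down sketch of the grid-paper side of it; the page size, gray level, and line weight are my illustrative choices, not the script’s actual values:

```perl
#!/usr/bin/perl
# Stripped-down 5mm grid generator with PDF::API2::Lite.
# Page size, color, and line width are illustrative choices.
use strict;
use warnings;
use PDF::API2::Lite;

use constant MM => 72 / 25.4;          # PostScript points per millimeter
my ($w, $h) = (182 * MM, 257 * MM);    # JIS B5

my $pdf = PDF::API2::Lite->new;
$pdf->page($w, $h);
$pdf->strokecolor('#999999');
$pdf->linewidth(0.25);
for (my $x = 0; $x <= $w; $x += 5 * MM) {   # vertical rules
    $pdf->move($x, 0);
    $pdf->line($x, $h);
}
for (my $y = 0; $y <= $h; $y += 5 * MM) {   # horizontal rules
    $pdf->move(0, $y);
    $pdf->line($w, $y);
}
$pdf->stroke;
$pdf->saveas('grid.pdf');
```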

Why, yes, I was stuck in the office today watching a server upgrade run. However did you guess?

On a related note, am I the only person in the world who thinks that it’s silly to spend $25+ on one of those gaudy throwaway “journals” that are pretty much the only thing you can find in book and stationery stores these days? Leather/wood/fancy cover, magnet/strap/sticks to hold it shut, handmade/decorated (possibly even scented) papers, etc, etc. No doubt the folks who buy these things also carry a fountain pen with which to engrave their profound thoughts upon the page.

Or just to help them impress other posers.

“Need a clue, take a clue,
 got a clue, leave a clue”