Friday, November 24 2006

JMdict + XML::Twig + DBD::SQLite = ?

In theory, I’m still working at Digeo. In practice, not so much. As we wind our way closer to the layoff date, I have less and less actual work to do, and more and more “anticipating the future failures of our replacements”. On the bright side, I’ve had a lot of time to study Japanese and prepare for Level 3 of the JLPT, which is next weekend.

I’m easily sidetracked, though, and the latest side project is importing the freely-distributed JMdict/EDICT and KANJIDIC dictionaries into a database and wrapping it with Perl, so that I can more easily incorporate them into my PDF-generating scripts.

Unfortunately, all of the tree-based XML parsing libraries for Perl create massive memory structures (I killed the script after it got past a gigabyte of memory), and the stream-based ones don’t make a lot of sense to me. XML::Twig’s documentation is oriented toward transforming XML rather than importing it, but it’s capable of doing the right thing without requiring ridiculously unPerly code:

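use XML::Twig;
my $twig = XML::Twig->new(
        twig_handlers => { entry => \&parse_entry });
# called once per <entry>, with the twig and the element as arguments
sub parse_entry {
        my $ref = $_[1]->simplify;
        print "Entry ID=",$ref->{ent_seq},"\n";
        $_[0]->purge;   # discard the parsed entry so memory stays flat
}
$twig->parsefile('JMdict');   # path to the dictionary file (assumed)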

SQLite was the obvious choice for a back-end database. It’s fast, free, stable, serverless, and Apple’s supporting it as part of Core Data.

Putting it all together meant finally learning how to read XML DTDs, getting a crash course in SQL database design to lay out the tables in a sensible way, and writing useful and efficient queries. I’m still working on that last part, but I’ve gotten far enough that my lookup tool has basic functionality: given a complete or partial search term in kanji, kana, or English, it returns the key parts of the matching entries. Getting all of the available data assembled requires both joins and multiple queries, which is tedious to sort out.
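To make the "multiple queries" point concrete, here is a minimal sketch of one way such a lookup could work. It is not the actual schema or tool: the table layout, the sample rows, and the lookup function are all hypothetical, and it uses a UNION to find matching entries where the real thing uses joins. Only the keb, reb, and gloss names come from the JMdict DTD.

```perl
use strict;
use warnings;
use utf8;
use DBI;

binmode STDOUT, ':encoding(UTF-8)';

# In-memory database for the sketch; sqlite_unicode makes DBD::SQLite
# hand back decoded Perl strings.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
        { RaiseError => 1, sqlite_unicode => 1 });

# Hypothetical schema: one child table per repeating JMdict element.
$dbh->do('CREATE TABLE kanji   (ent_seq INTEGER, keb   TEXT)');
$dbh->do('CREATE TABLE reading (ent_seq INTEGER, reb   TEXT)');
$dbh->do('CREATE TABLE gloss   (ent_seq INTEGER, gloss TEXT)');

# Made-up sample rows; the ent_seq values are not real JMdict IDs.
my @sample = (
        [1, '食べる', 'たべる',   'to eat'],
        [2, '飲む',   'のむ',     'to drink'],
        [3, '食べ物', 'たべもの', 'food'],
);
for my $row (@sample) {
        my ($id, $keb, $reb, $gloss) = @$row;
        $dbh->do('INSERT INTO kanji   VALUES (?, ?)', undef, $id, $keb);
        $dbh->do('INSERT INTO reading VALUES (?, ?)', undef, $id, $reb);
        $dbh->do('INSERT INTO gloss   VALUES (?, ?)', undef, $id, $gloss);
}

# Step 1: find every entry whose kanji, kana, or English contains the term.
# Step 2: run one query per child table to collect that entry's parts.
sub lookup {
        my ($term) = @_;
        my $like = "%$term%";
        my $ids = $dbh->selectcol_arrayref(q{
                SELECT ent_seq FROM kanji   WHERE keb   LIKE ?
                UNION
                SELECT ent_seq FROM reading WHERE reb   LIKE ?
                UNION
                SELECT ent_seq FROM gloss   WHERE gloss LIKE ?
                ORDER BY ent_seq
        }, undef, $like, $like, $like);

        my @results;
        for my $id (@$ids) {
                push @results, {
                        ent_seq => $id,
                        kanji   => $dbh->selectcol_arrayref(
                                'SELECT keb FROM kanji WHERE ent_seq = ?', undef, $id),
                        reading => $dbh->selectcol_arrayref(
                                'SELECT reb FROM reading WHERE ent_seq = ?', undef, $id),
                        gloss   => $dbh->selectcol_arrayref(
                                'SELECT gloss FROM gloss WHERE ent_seq = ?', undef, $id),
                };
        }
        return @results;
}

for my $e (lookup('食べ')) {
        print join(' | ', $e->{ent_seq}, "@{$e->{kanji}}",
                "@{$e->{reading}}", join('; ', @{$e->{gloss}})), "\n";
}
```

Keeping kanji, readings, and glosses in separate tables means an entry with several of each does not multiply into a pile of duplicate rows, which is one reason a single big join gets awkward here.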

I started with JMdict, which is a lot bigger and more complicated, so importing KANJIDIC2 is going to be easy when I get around to it. That won’t be this week, though, because the JLPT comes but once a year, and finals week comes right after.

[side note: turning off auto-commit and manually committing after every 500th entry cut the import time by 2/3]
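
A rough sketch of that commit pattern with DBI, for reference. The one-table layout, the store_entry helper, and the generated data are placeholders for illustration, not the real import code; in practice the helper would be called from the XML::Twig handler.

```perl
use strict;
use warnings;
use DBI;

# In-memory database for the sketch; a real import would use a file.
# AutoCommit => 0 means nothing is written until commit is called.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
        { RaiseError => 1, AutoCommit => 0 });

# Hypothetical table, just enough to show the batching.
$dbh->do('CREATE TABLE entry (ent_seq INTEGER PRIMARY KEY, gloss TEXT)');

my $insert = $dbh->prepare('INSERT INTO entry (ent_seq, gloss) VALUES (?, ?)');
my $batch_size = 500;
my ($count, $commits) = (0, 0);

# Stand-in for the per-entry handler: insert one row, and commit
# whenever another full batch has accumulated.
sub store_entry {
        my ($ent_seq, $gloss) = @_;
        $insert->execute($ent_seq, $gloss);
        if (++$count % $batch_size == 0) {
                $dbh->commit;
                $commits++;
        }
}

store_entry($_, "gloss $_") for 1 .. 1234;   # made-up data
$dbh->commit;                                # flush the final partial batch

my ($rows) = $dbh->selectrow_array('SELECT COUNT(*) FROM entry');
print "$rows rows, $commits intermediate commits\n";
$dbh->disconnect;
```

The final commit after the loop matters: without it, whatever arrived after the last full batch of 500 would be rolled back when the handle closes.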