Sunday, September 21 2008

Dictionaries as toys

There are dozens of front-ends for Jim Breen’s Japanese-English and Kanji dictionaries, both online and offline. Basically, if it’s a software-based dictionary that wasn’t published in Japan, the chance that it’s using some version of his data is somewhere above 99%.

Many of the tools, especially the older or free ones, use the original Edict format, which is compact and fairly easy to parse, but omits a lot of useful information. It has a lot of words you won’t find in affordable J-E dictionaries, but the definitions and usage information can be misleading. One of my Japanese teachers recommends avoiding it for anything non-trivial, because the definitions are extremely terse, context-free, and often “off”.
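To show how little there is to the format, here is a minimal sketch in Python (not the Perl I actually use for this project). It assumes the classic one-entry-per-line layout of headword, optional bracketed reading, and slash-separated glosses. The function name parse_edict_line is made up, and the sample line is written by hand for illustration rather than copied from the file.

```python
import re

# Assumed layout: HEADWORD [READING] /gloss 1/gloss 2/
# The bracketed reading is optional (kana-only words have none).
EDICT_LINE = re.compile(
    r"^(?P<head>\S+)(?:\s+\[(?P<reading>[^\]]+)\])?\s+/(?P<glosses>.*)/\s*$"
)

def parse_edict_line(line):
    """Split one Edict-style line into headword, reading, and glosses."""
    m = EDICT_LINE.match(line)
    if m is None:
        return None  # blank, comment, or malformed line
    return {
        "headword": m.group("head"),
        # Fall back to the headword when no separate reading is given.
        "reading": m.group("reading") or m.group("head"),
        "glosses": [g for g in m.group("glosses").split("/") if g],
    }

# Hand-written sample line, for illustration only.
sample = "一か八か [いちかばちか] /(exp) sink or swim/"
entry = parse_edict_line(sample)
```

Note that everything about usage has to be squeezed into the gloss text itself, which is a big part of why the definitions come out so terse.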

A good example is 一か八か, which is translated as simply “sink or swim”. This is one possible usage, and you’ll find it in printed dictionaries (along with “do or die”), but she insisted it felt wrong. I did some digging, and it turns out that it’s an old gambling expression, meaning “to risk your entire stake on one throw of the dice”. In usage, it appears to retain this gambling flavor, making the poker expression “going all-in” a better match (at least in this modern era where even James Bond is reduced to playing poker).

Edict has been through a lot of revisions, and now that it’s generated from the XML-based JMdict file, it’s improving almost daily. Unfortunately, a lot of the JMdict content simply can’t be represented in the Edict format, so tools based on it are inherently less accurate than ones that parse JMdict directly.
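As a rough illustration of what gets lost, here is a Python sketch that streams entries out of a JMdict-style file and keeps the restriction fields (re_restr ties a reading to specific kanji forms; stagk and stagr tie a sense to specific kanji forms or readings). The element names come from JMdict, but the sample document is hand-written, not real dictionary data, and it leaves out the entity-encoded abbreviations the real file declares in its DTD. The function name iter_entries and the dictionary keys are my own invention.

```python
import io
import xml.etree.ElementTree as ET

# Hand-written sample using JMdict element names; not real dictionary data.
SAMPLE = """<JMdict>
<entry>
<ent_seq>1</ent_seq>
<k_ele><keb>漢字A</keb></k_ele>
<k_ele><keb>漢字B</keb></k_ele>
<r_ele><reb>よみ</reb><re_restr>漢字A</re_restr></r_ele>
<sense><stagk>漢字B</stagk><gloss>first meaning</gloss></sense>
<sense><gloss>second meaning</gloss></sense>
</entry>
</JMdict>"""

def iter_entries(source):
    """Yield one plain dict per <entry> element as the parser reaches it."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "entry":
            continue
        yield {
            "seq": int(elem.findtext("ent_seq")),
            "kanji": [k.text for k in elem.iter("keb")],
            "readings": [
                {"kana": r.findtext("reb"),
                 "only_for": [x.text for x in r.findall("re_restr")]}
                for r in elem.findall("r_ele")
            ],
            "senses": [
                {"only_for_kanji": [x.text for x in s.findall("stagk")],
                 "only_for_reading": [x.text for x in s.findall("stagr")],
                 "glosses": [g.text for g in s.findall("gloss")]}
                for s in elem.findall("sense")
            ],
        }
        elem.clear()  # release this entry's children once it has been used

entries = list(iter_entries(io.BytesIO(SAMPLE.encode("utf-8"))))
```

A flat Edict line has nowhere to put the "only_for" lists, so a tool built on it has to either drop them or guess.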

[side note: Kanjidic is also now based on an XML file, but that one’s not actively maintained; much of the XML schema is collecting dust, waiting for someone with a whole bunch of patience to clean up the data]

The wwwjdic site uses the full data (along with some nice supplemental sources), but presents it in an extremely compact form that just throws a lot of it away. It also doesn’t let you link to search results, which the Animelab dictionary (based on the same data) does.

This meant, of course, that I had to write my own.

Nearly two years ago, I knocked together a set of functional command-line tools that imported and searched Edict and Kanjidic, and made a stab at accurately importing the XML JMdict. I wasn’t satisfied with the JMdict-based database, though, because I had made the same mistake the current JMdictDB project is making: directly converting the XML DTD into a SQL schema. Mine was less insane than theirs, but still unworkably complex.

A few weeks ago, I threw all that code away and started from scratch, designing the database schema around searchability. You’re only ever going to print complete entries, so store them as JSON-encoded blobs for portability, and add just enough tables to support “things someone would actually search for”. For now, it’s hosted on an old Shuttle PC at my house that I’m using to prototype my in-progress Movable Type upgrade; eventually, it’ll be moved here.
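The blob-plus-lookup-tables idea can be sketched roughly like this, in Python with an in-memory SQLite database. This is not my actual schema: the table names (entry, search_key), the column names, and the helper functions are all invented for the example, and the sample entry is hand-written.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entry (
    id   INTEGER PRIMARY KEY,
    body TEXT NOT NULL            -- the complete entry, JSON-encoded
);
CREATE TABLE search_key (
    key      TEXT NOT NULL,       -- something a person would type
    kind     TEXT NOT NULL,       -- 'kanji' or 'reading' in this sketch
    entry_id INTEGER NOT NULL REFERENCES entry(id)
);
CREATE INDEX search_key_idx ON search_key (key, kind);
""")

def add_entry(conn, entry_id, entry):
    """Store the whole entry as one blob, plus one row per searchable key."""
    conn.execute(
        "INSERT INTO entry (id, body) VALUES (?, ?)",
        (entry_id, json.dumps(entry, ensure_ascii=False)),
    )
    keys = [(k, "kanji") for k in entry["kanji"]]
    keys += [(r, "reading") for r in entry["readings"]]
    conn.executemany(
        "INSERT INTO search_key (key, kind, entry_id) VALUES (?, ?, ?)",
        [(key, kind, entry_id) for key, kind in keys],
    )

def lookup(conn, key):
    """Return the decoded entries that have an exact match on any key."""
    rows = conn.execute(
        "SELECT DISTINCT e.id, e.body FROM entry e "
        "JOIN search_key s ON s.entry_id = e.id WHERE s.key = ?",
        (key,),
    )
    return [json.loads(body) for _id, body in rows]

# Hand-written sample entry, for illustration only.
add_entry(conn, 1, {
    "kanji": ["一か八か"],
    "readings": ["いちかばちか"],
    "glosses": ["sink or swim"],
})
found = lookup(conn, "いちかばちか")
```

The point of the design is that the search tables only need to find entry ids; the display code never has to reassemble an entry from a dozen joins.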

I’m using recent versions of both JMdict and Kanjidic2, with support for wildcards and output filters, and everything has a permalink. The entire project comes to about 1300 lines of code, including the XML import scripts, the CGI lookup script, and a custom dictionary sort. Plus a whole bunch of supporting libraries, of course.
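One plausible way to support wildcards on top of SQLite is to translate them into a LIKE pattern. The Python sketch below assumes shell-style `*` and `?` as the wildcard characters, which is a guess for the sake of the example; the function name wildcard_to_like and the one-column table are likewise hypothetical.

```python
import sqlite3

def wildcard_to_like(pattern, escape="\\"):
    """Translate * and ? into SQL LIKE wildcards, escaping literal % and _."""
    out = []
    for ch in pattern:
        if ch == "*":
            out.append("%")          # any run of characters
        elif ch == "?":
            out.append("_")          # exactly one character
        elif ch in ("%", "_", escape):
            out.append(escape + ch)  # keep these literal
        else:
            out.append(ch)
    return "".join(out)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search_key (key TEXT)")
conn.executemany(
    "INSERT INTO search_key VALUES (?)",
    [("一か八か",), ("一番",), ("八百屋",)],
)
rows = conn.execute(
    "SELECT key FROM search_key WHERE key LIKE ? ESCAPE '\\'",
    (wildcard_to_like("一*"),),
).fetchall()
matches = sorted(row[0] for row in rows)
```

Passing the pattern as a bound parameter, rather than pasting it into the SQL string, keeps user input out of the query text.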

The formatting needs work. I haven’t done much CSS styling yet, focusing on presenting all of the available data (well, not all of the Kanjidic cross-references yet). Also, I’m currently only expanding the abbreviations as tool-tips, and the presentation of the re_restr, stagk, and stagr fields probably only makes sense to me. The frequency-of-use abbreviations don’t even have expanded versions (not entity-encoded in the original data), so I’ll have to make some.

Implementation notes:

  • The CGI::Fast Perl module works really well, but can introduce some hard-to-find bugs: you have to manually twiddle variables in the CGI package to emulate its pragmas, and if you don’t use the object-oriented methods, you have to be careful never to use formatting calls outside of your while (new CGI::Fast) block.
  • XML::Twig is still the simplest way to parse a really large XML file, and has some very powerful tools for simplifying and normalizing a data structure.
  • SQLite is a good backend database, and is quite fast if you already have an open db handle (hence the use of mod_fastcgi).
  • Sphinx is a very cool full-text search engine. Some time soon I’ll start using its full-featured import format, which will make the English-to-Japanese search much more useful.
  • Kakasi is an excellent kana-fication and romanization tool with some limited word-splitting functionality, but the documentation is basically impenetrable, and it doesn’t support Unicode.
  • The Text::Kakasi Perl library does support Unicode, and the incantation to convert kanji/kana strings to romaji is:
    Text::Kakasi->new(qw(-s -Ea -Ka -Ha -Ja -iutf8 -outf8))
  • The Lingua::JA::Kana library is good for converting romaji to kana (once you add the missing じゃ row to its internal tables), but quite useless for going the other direction (it took me a moment to figure out what “fuxakku” was supposed to be…).
  • Punycode is an excellent method for creating permalinks to strings of kanji and kana, and has the advantage of being an Internet standard (RFC 3492).
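
For the Punycode permalinks, the round trip looks roughly like this in Python, whose standard library ships a "punycode" codec. This is only a sketch of the encoding step: the function names to_slug and from_slug are made up, and how the slug gets embedded in the actual URL is not shown.

```python
def to_slug(text):
    """Encode a kanji/kana string as an ASCII-only Punycode slug."""
    return text.encode("punycode").decode("ascii")

def from_slug(slug):
    """Recover the original string from a Punycode slug."""
    return slug.encode("ascii").decode("punycode")

word = "一か八か"
slug = to_slug(word)        # ASCII letters, digits, and hyphens only
restored = from_slug(slug)  # the original string again
```

Because the encoding is reversible, the lookup script can decode the slug and search for the original string without keeping a separate table of permalinks.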