Monday, April 27 2009

Dictionary update

[Update update: I’ve made a small change to add the full JMnedict name dictionary; a lot of things that used to be in Edict/JMdict have been moved over to this much-larger secondary dictionary, and I finally got around to integrating it. The English translations aren’t searchable yet, mostly because I need to rework the form and add the kanji dictionary to Xapian as well, so that I have J↔E, N↔E, and K↔E.]

One downside of moving a lot of stuff onto my new shared-hosting account is that I have to give up a lot of control over what’s running. Not only do I have to work through an Apache .htaccess file instead of reconfiguring the server directly, but I can’t run my own servers on their machine.

So, goodbye Sphinx search engine, hello Xapian (thanks, Pixy). While it suffers from a lack of documentation between “baby’s first search” and “211-page C++ API document”, it has a lot to offer, and doesn’t require a server. One thing it has is a full-featured query parser, so you can create searches like “pos:noun usage:common lunch -keyword:vulgar” to get common lunch-related nouns that don’t include sexual slang (such as the poorly-attributed usage of ekiben as a sexual position). That allows me to use the same tagging for the E-J searches that I use in SQLite for the J-E searches. [note: everything’s just filed under “keyword:” in this first pass, and the valid values are the same as the advanced-search checkboxes]
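To make the semantics of a query like that concrete, here is a toy sketch in plain Python. It is not Xapian's query parser and does not use Search::Xapian; it only assumes that every positive term must be present and every term prefixed with "-" must be absent. The entry names and their tags are made up for illustration.

```python
def parse_query(query):
    """Split a query into (required, excluded) sets of (prefix, value) pairs."""
    required, excluded = set(), set()
    for token in query.split():
        target = required
        if token.startswith("-"):
            # A leading minus marks a term that must NOT appear.
            target = excluded
            token = token[1:]
        prefix, sep, value = token.partition(":")
        if not sep:
            # No "prefix:" part, so treat the token as plain body text.
            prefix, value = "", prefix
        target.add((prefix, value.lower()))
    return required, excluded


def matches(entry_terms, query):
    """True if the entry has every required term and no excluded term."""
    required, excluded = parse_query(query)
    return required <= entry_terms and not (excluded & entry_terms)


# Hypothetical entries; the tags are invented, not taken from JMdict.
ENTRIES = {
    "lunch_box": {("pos", "noun"), ("usage", "common"), ("", "lunch")},
    "slang_sense": {("pos", "noun"), ("usage", "common"), ("", "lunch"),
                    ("keyword", "vulgar")},
    "nap": {("pos", "noun"), ("usage", "common"), ("", "nap")},
}

QUERY = "pos:noun usage:common lunch -keyword:vulgar"
hits = sorted(name for name, terms in ENTRIES.items() if matches(terms, QUERY))
print(hits)
```

A real query parser also handles stemming, phrases, and boolean operators, none of which this sketch attempts.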

I need a full-text search to do English-Japanese, because the JMdict data isn’t really designed for it. There are hooks in the XML schema, but they’re not used yet. As a result, my search results are a bit half-assed, which makes the new query support useful for refining the results. I can also split out the French, German, and Russian glosses into their own correctly-stemmed searches; with Sphinx, there was one primary body field to search, so all the glosses were lumped together. With a small code change, I can tag each gloss with its ISO language code and index it with the matching stemmer.
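Splitting the glosses out by language could look roughly like the sketch below. It assumes each gloss carries an xml:lang attribute with a three-letter code and that a missing attribute means English; the sample entry is hand-written for illustration, not copied from the dictionary, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree expands the built-in "xml:" prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-written sample entry (an assumption, not real dictionary data).
SAMPLE = """
<entry>
  <sense>
    <gloss>lunch</gloss>
    <gloss xml:lang="fre">déjeuner</gloss>
    <gloss xml:lang="ger">Mittagessen</gloss>
  </sense>
</entry>
"""


def glosses_by_language(entry_xml, default_lang="eng"):
    """Group gloss text by language code, defaulting untagged glosses."""
    by_lang = defaultdict(list)
    for gloss in ET.fromstring(entry_xml).iter("gloss"):
        by_lang[gloss.get(XML_LANG, default_lang)].append(gloss.text)
    return dict(by_lang)


grouped = glosses_by_language(SAMPLE)
print(grouped)
```

Each language's list could then be fed to the indexer with its own stemmer and its own prefix, so that a French search never touches the German glosses.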

The new version is now live on jgreely.net/dict, which means I should be able to move that domain over to the shared-hosting account soon.

Once I figured out how to use Xapian (through the Search::Xapian Perl module, of course), replacing Sphinx and adding the keyword support took a few minutes and maybe half a page of code, total. In theory, I could use it for the J-E searches as well, but I’d lose the ability to put wildcards anywhere in the search string, which comes in handy when I’m trying to track down obscure or obsolete words.
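For comparison, wildcards anywhere in the search string are easy to get from SQLite with LIKE. The sketch below uses Python's sqlite3 module and a made-up two-column table; the real schema, and the choice of "*" and "?" as the user-facing wildcard characters, are assumptions.

```python
import sqlite3

# In-memory database with a hypothetical table layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (reading TEXT, gloss TEXT)")
conn.executemany(
    "INSERT INTO entry VALUES (?, ?)",
    [
        ("bentou", "boxed lunch"),
        ("ekiben", "station boxed lunch"),
        ("hirune", "nap"),
    ],
)


def wildcard_search(conn, pattern):
    """Match readings against a pattern where * is any run and ? is one char."""
    # Translate shell-style wildcards to LIKE's % and _.
    # Literal % or _ in the pattern are not escaped in this sketch.
    like = pattern.replace("*", "%").replace("?", "_")
    rows = conn.execute(
        "SELECT reading FROM entry WHERE reading LIKE ? ORDER BY reading",
        (like,),
    )
    return [row[0] for row in rows]


print(wildcard_search(conn, "*ben*"))
print(wildcard_search(conn, "?ir*"))
```

A pattern with a leading wildcard forces a full table scan, which is tolerable for a dictionary-sized table but is exactly the kind of query a term-based index is not built for.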

One thing I haven’t figured out is why I can’t use add_term with kanji arguments; both Xapian and Perl are working entirely in Unicode, but passing non-ASCII arguments to add_term throws an error. The workaround is to set the stemmer to “none” and use index_text, and that’s fast enough that I don’t need to worry about it right now.

The most annoying thing about the Xapian documentation is how well-hidden the prefix support is. The details aren’t in the API documentation at all; you can learn how to add prefixes to a term generator or query parser, but the really useful explanation is over in the Omega docs.