Saturday, July 24 2010

Using MongoDB

Suppose you had a big XML file in an odd, complicated structure (such as JMdict_e, a Japanese-English dictionary), and you wanted to load it into a database for searching and editing. You could faithfully replicate the XML schema in a relational database, with carefully chosen foreign keys and precisely specified joins, and you might end up with something like this.

Go ahead, look at it. I’ll wait. Seriously, it deserves a look. All praise to Stuart for making it actually work, but damn.

Done? Okay, now let’s slurp the whole thing into MongoDB:

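#!/usr/bin/perl -CADS
# Load each JMdict <entry> element into MongoDB as one document.
use strict;
use warnings;
use XML::Twig;
use MongoDB;

my $mcoll = MongoDB::Connection->new()
  ->get_database('test')->get_collection('dict');

# Elements that should always be folded into arrays, even when
# they occur only once in an entry.
my @arrays = qw(k_ele ke_pri ke_inf r_ele re_pri re_inf re_restr
  sense pos field misc dial gloss ant s_inf xref stagk stagr lsource
  audit bibl etym links example pri trans name_type trans_det);
XML::Twig->new(
  keep_encoding => 1,
  twig_handlers => {
    entry => sub {
      my ($twig,$element) = @_;
      # Convert the entry to a nested data structure and store it.
      $mcoll->insert($element->simplify(forcearray=>\@arrays));
      # Discard the parsed entry so memory use stays flat.
      $element->delete;
    },
    # re_nokanji is an empty flag element; store it as 1.
    re_nokanji => sub { $_->set_text(1) },
  },
)->parsefile("JMdict_e");

# Index the fields used for lookups.
$mcoll->ensure_index({'k_ele.keb' => 1});
$mcoll->ensure_index({'r_ele.reb' => 1});
$mcoll->ensure_index({'sense.gloss' => 1});
$mcoll->ensure_index({'trans.trans_det' => 1});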
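
The forcearray list is what keeps every document the same shape. XML::Twig's simplify follows the XML::Simple conventions, where a child element that appears once folds to a single value and one that repeats folds to an array, unless its name is listed in forcearray. Here is a rough model of that folding rule in plain JavaScript. The fold function and the sample values are made up for illustration; this is not the module's code.

```javascript
// Simplified model of XML::Simple-style folding (illustrative only).
// children: array of [tagName, value] pairs in document order,
// where each value is a string or a plain object.
function fold(children, forceArray = []) {
  const out = {};
  for (const [tag, value] of children) {
    if (Object.prototype.hasOwnProperty.call(out, tag)) {
      // Repeated tag: promote the existing value to an array if needed.
      const prev = out[tag];
      out[tag] = Array.isArray(prev) ? [...prev, value] : [prev, value];
    } else {
      // First occurrence: wrap it only when the tag is in the forced list.
      out[tag] = forceArray.includes(tag) ? [value] : value;
    }
  }
  return out;
}

// Placeholder values, not real dictionary data.
const one = [["gloss", "smuggling"]];
const two = [["gloss", "first meaning"], ["gloss", "second meaning"]];

console.log(JSON.stringify(fold(one)));            // {"gloss":"smuggling"}
console.log(JSON.stringify(fold(one, ["gloss"]))); // {"gloss":["smuggling"]}
console.log(JSON.stringify(fold(two, ["gloss"]))); // {"gloss":["first meaning","second meaning"]}
```

Without the forced list, an entry with one gloss and an entry with two would come out with different types in the same field, which is a nuisance for any code that reads the documents back.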

The import takes about four minutes to run on my Mac. Now let’s query it from the mongo shell:

% mongo
> use test
 
> db.dict.find({ "r_ele.reb" : "みつぼうえき" })
{
  "_id" : ObjectId("4c4b3b086b5cd7f958581201"),
  "k_ele" : [ { "keb" : "密貿易" } ], 
  "r_ele" : [ { "reb" : "みつぼうえき" } ],
  "sense" : [ {
    "gloss" : [ "smuggling" ],
    "pos" : [ "&n;", "&vs;" ] 
  } ],
  "ent_seq" : "1731560" 
}
 
> db.dict.find({ "k_ele.keb" : /^密貿/ }) 
{
  "_id" : ObjectId("4c4b3b086b5cd7f958581201"),
  "k_ele" : [ { "keb" : "密貿易" } ], 
  "r_ele" : [ { "reb" : "みつぼうえき" } ],
  "sense" : [ {
    "gloss" : [ "smuggling" ],
    "pos" : [ "&n;", "&vs;" ] 
  } ],
  "ent_seq" : "1731560" 
}
 
>  db.dict.find({ "sense.gloss" : /smuggling/, "k_ele.keb" : /貿/ })
{
  "_id" : ObjectId("4c4b3b086b5cd7f958581201"),
  "k_ele" : [ { "keb" : "密貿易" } ], 
  "r_ele" : [ { "reb" : "みつぼうえき" } ],
  "sense" : [ {
    "gloss" : [ "smuggling" ],
    "pos" : [ "&n;", "&vs;" ] 
  } ],
  "ent_seq" : "1731560" 
}
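
Those queries work because a dotted path such as "r_ele.reb" reaches through arrays: the document matches if any array element along the path matches. A minimal sketch of that rule in plain JavaScript follows, using the entry shown above. The valuesAt and matches functions are hypothetical names, and this is a simplified model (exact strings and regular expressions only, no numeric array positions), not how MongoDB actually implements matching.

```javascript
// Simplified model of dotted-path matching (illustrative only).
// Collect every value reachable at a dotted path, descending into arrays.
function valuesAt(node, parts) {
  if (Array.isArray(node)) {
    // An array contributes the values found through each of its elements.
    return node.flatMap((item) => valuesAt(item, parts));
  }
  if (parts.length === 0) return [node];
  if (node === null || typeof node !== "object") return [];
  const [head, ...rest] = parts;
  return Object.prototype.hasOwnProperty.call(node, head)
    ? valuesAt(node[head], rest)
    : [];
}

// Every condition in the query must be satisfied by at least one value.
function matches(doc, query) {
  return Object.entries(query).every(([path, cond]) =>
    valuesAt(doc, path.split(".")).some((value) =>
      cond instanceof RegExp
        ? typeof value === "string" && cond.test(value)
        : value === cond
    )
  );
}

const entry = {
  k_ele: [{ keb: "密貿易" }],
  r_ele: [{ reb: "みつぼうえき" }],
  sense: [{ gloss: ["smuggling"], pos: ["&n;", "&vs;"] }],
  ent_seq: "1731560",
};

console.log(matches(entry, { "r_ele.reb": "みつぼうえき" }));                       // true
console.log(matches(entry, { "sense.gloss": /smuggling/, "k_ele.keb": /貿/ })); // true
console.log(matches(entry, { "k_ele.keb": /^貿/ }));                            // false
```

The last call fails because the pattern is anchored and the headword starts with 密, not 貿.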

It would be trivial to build a full-featured dictionary tool using any of the languages with a MongoDB library, and you could import the large name dictionary JMnedict and the full multi-language version of JMdict as well.
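
As a rough idea of where such a tool might start, here is a sketch of a helper that turns a search term into a query document for the collection above, picking the field from the script the term is written in. It is written for a modern JavaScript runtime such as Node.js (it relies on Unicode property escapes), and buildQuery, escapeRegExp, and the choice of prefix matching are all my own assumptions, not part of any MongoDB library.

```javascript
// Hypothetical helper for a dictionary front end (illustrative only).

// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a query document for the "dict" collection from a search term.
function buildQuery(term) {
  const prefix = new RegExp("^" + escapeRegExp(term));
  // Any kanji in the term: search headwords by prefix.
  if (/\p{Script=Han}/u.test(term)) {
    return { "k_ele.keb": prefix };
  }
  // Kana only (plus the long-vowel mark): search readings by prefix.
  if (/^[\p{Script=Hiragana}\p{Script=Katakana}ー]+$/u.test(term)) {
    return { "r_ele.reb": prefix };
  }
  // Anything else: case-insensitive substring search of the glosses.
  return { "sense.gloss": new RegExp(escapeRegExp(term), "i") };
}

console.log(buildQuery("密貿"));
console.log(buildQuery("みつぼう"));
console.log(buildQuery("smuggling"));
```

The resulting object can be handed to find() in the shell or in a driver. Anchored, case-sensitive prefix patterns are the ones that can take advantage of the indexes created by the import script, which is why the two Japanese branches use them.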

(Note: you can’t actually run that Perl script as-is on Snow Leopard, because Apple shipped the last version of Perl released before the “-C” bug was fixed. You have to remove the “-CADS” from the first line of the script and set it in the shell instead, with export PERL_UNICODE=ADS. Naturally, I’ve glossed over some other small issues as well…)