Suppose you had a big XML file in an odd, complicated structure (such as JMdict_e, a Japanese-English dictionary), and you wanted to load it into a database for searching and editing. You could faithfully replicate the XML schema in a relational database, with carefully chosen foreign keys and precisely specified joins, and you might end up with something like this.
Go ahead, look at it. I’ll wait. Seriously, it deserves a look. All praise to Stuart for making it actually work, but damn.
…
Done? Okay, now let’s slurp the whole thing into MongoDB:
#!/usr/bin/perl -CADS
use strict;
use XML::Twig;
use MongoDB;

# Connect to the local server and pick the target collection.
my $mcoll = MongoDB::Connection->new()
    ->get_database('test')->get_collection('dict');

# Elements to always store as arrays, even when they occur only once.
my @arrays = qw(k_ele ke_pri ke_inf r_ele re_pri re_inf re_restr
    sense pos field misc dial gloss ant s_inf xref stagk stagr lsource
    audit bibl etym links example pri trans name_type trans_det);

XML::Twig->new(
    keep_encoding => 1,
    twig_handlers => {
        # Insert each completed <entry> as one document, then free its memory.
        entry => sub {
            my ($twig, $element) = @_;
            $mcoll->insert($element->simplify(forcearray => \@arrays));
            $element->delete;
        },
        # re_nokanji is an empty element; give it the value 1 so it is stored as a simple flag.
        re_nokanji => sub { $_->set_text(1) },
    },
)->parsefile("JMdict_e");

# Index the fields we want to search on.
$mcoll->ensure_index({'k_ele.keb' => 1});
$mcoll->ensure_index({'r_ele.reb' => 1});
$mcoll->ensure_index({'sense.gloss' => 1});
$mcoll->ensure_index({'trans.trans_det' => 1});
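The forcearray list in the script is what keeps every stored entry the same shape. Here is a rough sketch of why, in JavaScript so it can be run without Perl or a database. It is a toy version of XML::Simple-style folding, not XML::Twig's actual code, and the { tag, text, children } node shape and the simplify/has helper names are assumptions made up for this illustration.

```javascript
// Illustrative only: a toy version of XML::Simple-style folding,
// not XML::Twig's implementation. A node is { tag, text, children },
// a hypothetical shape chosen for this sketch.
const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

function simplify(node, forceArray) {
  // A leaf element collapses to its text content.
  if (!node.children || node.children.length === 0) {
    return node.text;
  }
  const out = {};
  for (const child of node.children) {
    const value = simplify(child, forceArray);
    if (forceArray.has(child.tag)) {
      // Listed tags are always arrays, even with a single occurrence.
      if (!has(out, child.tag)) out[child.tag] = [];
      out[child.tag].push(value);
    } else if (!has(out, child.tag)) {
      // First occurrence of an unlisted tag stays a bare value.
      out[child.tag] = value;
    } else if (Array.isArray(out[child.tag])) {
      out[child.tag].push(value);
    } else {
      // Second occurrence of an unlisted tag silently turns it into an array.
      out[child.tag] = [out[child.tag], value];
    }
  }
  return out;
}

// The entry from the query examples, written as a hand-built tree.
const entry = {
  tag: "entry",
  children: [
    { tag: "ent_seq", text: "1731560" },
    { tag: "k_ele", children: [{ tag: "keb", text: "密貿易" }] },
    { tag: "r_ele", children: [{ tag: "reb", text: "みつぼうえき" }] },
    {
      tag: "sense",
      children: [
        { tag: "pos", text: "&n;" },
        { tag: "pos", text: "&vs;" },
        { tag: "gloss", text: "smuggling" },
      ],
    },
  ],
};

const forced = simplify(entry, new Set(["k_ele", "r_ele", "sense", "pos", "gloss"]));
const unforced = simplify(entry, new Set());

console.log(JSON.stringify(forced));
console.log(JSON.stringify(unforced));
```

Without the list, gloss comes out as a bare string for this entry but as an array for any entry with two glosses, so code reading the collection would have to handle both cases. Passing the full @arrays list in the import script avoids that.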
The import takes about four minutes to run on my Mac. Now let’s query the result, using the mongo shell syntax:
% mongo
> use test
> db.dict.find({ "r_ele.reb" : "みつぼうえき" })
{
    "_id" : ObjectId("4c4b3b086b5cd7f958581201"),
    "k_ele" : [ { "keb" : "密貿易" } ],
    "r_ele" : [ { "reb" : "みつぼうえき" } ],
    "sense" : [ {
        "gloss" : [ "smuggling" ],
        "pos" : [ "&n;", "&vs;" ]
    } ],
    "ent_seq" : "1731560"
}
> db.dict.find({ "k_ele.keb" : /^密貿/ })
{
    "_id" : ObjectId("4c4b3b086b5cd7f958581201"),
    "k_ele" : [ { "keb" : "密貿易" } ],
    "r_ele" : [ { "reb" : "みつぼうえき" } ],
    "sense" : [ {
        "gloss" : [ "smuggling" ],
        "pos" : [ "&n;", "&vs;" ]
    } ],
    "ent_seq" : "1731560"
}
> db.dict.find({ "sense.gloss" : /smuggling/, "k_ele.keb" : /貿/ })
{
    "_id" : ObjectId("4c4b3b086b5cd7f958581201"),
    "k_ele" : [ { "keb" : "密貿易" } ],
    "r_ele" : [ { "reb" : "みつぼうえき" } ],
    "sense" : [ {
        "gloss" : [ "smuggling" ],
        "pos" : [ "&n;", "&vs;" ]
    } ],
    "ent_seq" : "1731560"
}
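What makes those queries work is that a dotted path like "sense.gloss" reaches through arrays: a document matches if any element along the way matches. The following is a rough in-memory sketch of that rule in plain JavaScript. It only mimics the behavior shown above and is not how MongoDB implements it; valuesAt and matches are made-up helper names, and the sketch handles only exact values and regular expressions.

```javascript
// Illustrative only: mimics the matching rule for dotted paths,
// not MongoDB's implementation. valuesAt and matches are hypothetical names.

// Collect every value reachable by a dotted path, descending into arrays.
function valuesAt(doc, path) {
  let current = [doc];
  for (const key of path.split(".")) {
    const next = [];
    for (const item of current) {
      // An array along the path is searched element by element.
      const candidates = Array.isArray(item) ? item : [item];
      for (const c of candidates) {
        if (c !== null && typeof c === "object" &&
            Object.prototype.hasOwnProperty.call(c, key)) {
          next.push(c[key]);
        }
      }
    }
    current = next;
  }
  // A final array value matches if any one of its elements matches.
  return current.flatMap((v) => (Array.isArray(v) ? v : [v]));
}

// Every clause in the query must be satisfied by at least one value.
function matches(doc, query) {
  return Object.entries(query).every(([path, cond]) =>
    valuesAt(doc, path).some((v) =>
      cond instanceof RegExp ? typeof v === "string" && cond.test(v) : v === cond
    )
  );
}

// The stored entry from the shell output, minus the _id field.
const doc = {
  k_ele: [{ keb: "密貿易" }],
  r_ele: [{ reb: "みつぼうえき" }],
  sense: [{ gloss: ["smuggling"], pos: ["&n;", "&vs;"] }],
  ent_seq: "1731560",
};

console.log(matches(doc, { "r_ele.reb": "みつぼうえき" }));
console.log(matches(doc, { "k_ele.keb": /^密貿/ }));
console.log(matches(doc, { "sense.gloss": /smuggling/, "k_ele.keb": /貿/ }));
```

One practical difference the sketch hides: as far as I know, only an anchored prefix pattern like /^密貿/ lets the server scan a narrow range of the index, while an unanchored pattern like /smuggling/ still has to be tested against every indexed value.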
It would be trivial to build a full-featured dictionary tool using any of the languages with a MongoDB library, and you could import the large name dictionary JMnedict and the full multi-language version of JMdict as well.
(Note: you can’t actually run that Perl script as-is on Snow Leopard, because Apple shipped the last version of Perl released before the “-C” bug was fixed. You have to remove the “-CADS” from the first line of the script and set it in the shell instead, with export PERL_UNICODE=ADS. Naturally, I’ve glossed over some other small issues as well…)