Jim Breen’s KANJIDIC includes cross-references for various printed kanji dictionaries, and KANJIDIC2 adds more. I’ve imported KANJIDIC into a SQLite database for use by my Perl scripts, and it’s quite handy (and much faster than repeatedly slurping in the original file and parsing it…).
However, it’s missing two cross-reference indexes that would be quite useful for me: JLPT level and White Rabbit Kanji Flashcards card number.
Most of the online JLPT references predate the 2002 test specifications, so the only reliable source I’ve found is The JLPT Study Page. The creator of that site is working from the latest edition of the test content specs, so apart from the occasional typo in the vocabulary, it’s solid data. It just wasn’t in a form directly useful to me, so I screen-scraped it and generated a simple text file, UTF-8 encoded.
The White Rabbit folks have an online lookup tool so you can generate your own cross-reference lists, but by the time I’d found it, I’d already read the forum article that explains their numbering scheme: Unicode sort order within JLPT level. A few seconds at the shell, and I had another simple text file (extended to include the planned Level 1 card set).
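For the curious, here is a minimal sketch of that numbering scheme, written in Python with only the standard library. The function name `card_numbers` and the sample data are my own illustration, and I am assuming the numbering runs continuously from Level 4 (the easiest) up to Level 1; the forum article only says "Unicode sort order within JLPT level", so treat the continuous numbering as a guess.

```python
def card_numbers(kanji_levels):
    """Map each kanji to a card number.

    kanji_levels: dict of kanji -> JLPT level (4 = easiest, 1 = hardest).
    Assumption: cards are numbered continuously, Level 4 first.
    """
    # Sort by level (4 before 3 before 2 before 1), then by Unicode code point.
    ordered = sorted(kanji_levels, key=lambda k: (-kanji_levels[k], ord(k)))
    return {kanji: n for n, kanji in enumerate(ordered, start=1)}


# Illustrative sample only; a real run would read the scraped JLPT text file.
sample = {"日": 4, "一": 4, "人": 4, "会": 3, "以": 3}
numbers = card_numbers(sample)
for kanji, n in sorted(numbers.items(), key=lambda kv: kv[1]):
    print(n, kanji, f"U+{ord(kanji):04X}")
```

If the real cards restart their numbering at each level, the only change needed is to enumerate each level's kanji separately.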
In theory, I’m still working at Digeo. In practice, not so much. As we wind our way closer to the layoff date, I have less and less actual work to do, and more and more “anticipating the future failures of our replacements”. On the bright side, I’ve had a lot of time to study Japanese and prepare for Level 3 of the JLPT, which is next weekend.
I’m easily sidetracked, though, and the latest side project is importing the freely-distributed JMdict/EDICT and KANJIDIC dictionaries into a database and wrapping it with Perl, so that I can more easily incorporate them into my PDF-generating scripts.
Unfortunately, all of the tree-based XML parsing libraries for Perl create massive memory structures (I killed the script after it got past a gig), and the stream-based ones don’t make a lot of sense to me. XML::Twig’s documentation is oriented toward transforming XML rather than importing it, but it’s capable of doing the right thing without writing ridiculously unPerly code:
use XML::Twig;

# Call parse_entry for each <entry> as soon as it has been parsed.
my $twig = XML::Twig->new(
    twig_handlers => { entry => \&parse_entry },
);
$twig->parsefile('JMdict');

sub parse_entry {
    my ($t, $entry) = @_;
    my $ref = $entry->simplify;
    print "Entry ID=", $ref->{ent_seq}, "\n";
    #...
    $entry->delete;    # free the entry so memory use stays flat
}
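The same handle-one-entry-then-throw-it-away idea can be sketched in Python with the standard library's `xml.etree.ElementTree.iterparse`. This is only an illustration of the technique, not a port of my script: the `iter_entries` name, the dictionary layout, and the two-entry sample (with made-up `ent_seq` values) are assumptions, though the element names are the ones JMdict uses.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the real JMdict file; the ent_seq values are made up.
SAMPLE = """<JMdict>
<entry><ent_seq>1</ent_seq>
  <k_ele><keb>日本語</keb></k_ele>
  <r_ele><reb>にほんご</reb></r_ele>
  <sense><gloss>Japanese (language)</gloss></sense>
</entry>
<entry><ent_seq>2</ent_seq>
  <r_ele><reb>かな</reb></r_ele>
  <sense><gloss>kana</gloss></sense>
</entry>
</JMdict>""".encode("utf-8")


def iter_entries(source):
    """Yield one dict per <entry>, discarding each element after use."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "entry":
            yield {
                "ent_seq": int(elem.findtext("ent_seq")),
                "kanji": [e.text for e in elem.iter("keb")],
                "readings": [e.text for e in elem.iter("reb")],
                "glosses": [e.text for e in elem.iter("gloss")],
            }
            # Drop the entry's children and text. The emptied element
            # itself stays attached to the root.
            elem.clear()


entries = list(iter_entries(io.BytesIO(SAMPLE)))
for entry in entries:
    print("Entry ID=", entry["ent_seq"], sep="")
```

To run it against the real file, pass the file name instead of the `io.BytesIO` wrapper.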
SQLite was the obvious choice for a back-end database. It’s fast, free, stable, serverless, and Apple’s supporting it as part of Core Data.
Putting it all together meant finally learning how to read XML DTDs, getting a crash course in SQL database design to lay out the tables in a sensible way, and writing useful and efficient queries. I’m still working on that last part, but I’ve gotten far enough that my lookup tool has basic functionality: given a complete or partial search term in kanji, kana, or English, it returns the key parts of the matching entries. Getting all of the available data assembled requires both joins and multiple queries, which is tedious to sort out.
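To give a feel for the "joins and multiple queries" part, here is a rough Python sketch using the standard `sqlite3` module. The four-table layout, the column names, and the `lookup` function are hypothetical stand-ins rather than my actual schema, and the two sample entries use made-up IDs. One joined query finds the matching entries, then one small query per table collects the parts of each.

```python
import sqlite3

SCHEMA = """
CREATE TABLE entry   (ent_seq INTEGER PRIMARY KEY);
CREATE TABLE kanji   (ent_seq INTEGER, text TEXT);
CREATE TABLE reading (ent_seq INTEGER, text TEXT);
CREATE TABLE gloss   (ent_seq INTEGER, text TEXT);
"""


def lookup(db, term):
    """Return entries whose kanji, kana, or English contains term."""
    # Note: % and _ inside term act as LIKE wildcards; they are not escaped.
    pattern = f"%{term}%"
    ids = [row[0] for row in db.execute(
        """
        SELECT DISTINCT e.ent_seq
        FROM entry e
        LEFT JOIN kanji   k ON k.ent_seq = e.ent_seq
        LEFT JOIN reading r ON r.ent_seq = e.ent_seq
        LEFT JOIN gloss   g ON g.ent_seq = e.ent_seq
        WHERE k.text LIKE ? OR r.text LIKE ? OR g.text LIKE ?
        ORDER BY e.ent_seq
        """, (pattern, pattern, pattern))]
    results = []
    for ent_seq in ids:
        found = {"ent_seq": ent_seq}
        # Table names come from this fixed tuple, never from user input.
        for table in ("kanji", "reading", "gloss"):
            rows = db.execute(
                f"SELECT text FROM {table} WHERE ent_seq = ? ORDER BY rowid",
                (ent_seq,))
            found[table] = [r[0] for r in rows]
        results.append(found)
    return results


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
db.executemany("INSERT INTO entry VALUES (?)", [(1,), (2,)])
db.executemany("INSERT INTO kanji VALUES (?, ?)", [(1, "日本語"), (2, "日本")])
db.executemany("INSERT INTO reading VALUES (?, ?)",
               [(1, "にほんご"), (2, "にほん")])
db.executemany("INSERT INTO gloss VALUES (?, ?)",
               [(1, "Japanese (language)"), (2, "Japan")])
db.commit()

for hit in lookup(db, "日本"):
    print(hit)
```

The three-way join multiplies rows for entries with many readings and glosses, which is why the sketch uses `DISTINCT` and fetches the details separately.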
I started with JMdict, which is a lot bigger and more complicated, so importing KANJIDIC2 is going to be easy when I get around to it. That won’t be this week, though, because the JLPT comes but once a year, and finals week comes right after.
[side note: turning off auto-commit and manually committing after every 500th entry cut the import time by 2/3]
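The batching trick in that side note looks roughly like this in Python's standard `sqlite3` module, which opens a transaction implicitly before an `INSERT` and leaves the commit to you. The `import_entries` function, the single `gloss` table, and the generated rows are placeholders of mine, not the real import.

```python
import sqlite3


def import_entries(db, entries, batch_size=500):
    """Insert (ent_seq, text) pairs, committing once per batch_size rows."""
    db.execute("CREATE TABLE IF NOT EXISTS gloss (ent_seq INTEGER, text TEXT)")
    count = 0
    for count, (ent_seq, text) in enumerate(entries, start=1):
        db.execute("INSERT INTO gloss VALUES (?, ?)", (ent_seq, text))
        if count % batch_size == 0:
            db.commit()  # one commit per batch instead of one per row
    db.commit()  # pick up the final partial batch
    return count


db = sqlite3.connect(":memory:")  # use a file path for a real import
fake_rows = ((i, f"gloss {i}") for i in range(1200))
total = import_entries(db, fake_rows)
print("rows imported:", total)
```

An in-memory database will not show the speedup; the savings come from not forcing a write to disk after every single row.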
The names given to Japanese bands are often peculiar. The folks at Hello!Project have a good track record in this regard, with their 2005 “shuffle” projects having the names セクシーオトナジャン, エレジーズ, and プリプリピンク.
Despite the use of katakana in the names, only one of these is a true loanword, “Elegies”. The other two translate to, respectively, “Sexy Grownups?” and either “Angry Pink” or “Stinky Pink”. They may not be angry, and they probably smell nice, but they’re definitely pink.
But that’s not the puripuri I’m writing about today. I tripped across a completely different use of the word this morning at Kinokuniya. PuriPuri – The Premature Priest:
Five volumes (so far!) of Catholic-school fan service. I don’t see a catgirl, but there’s a meganekko witch on the cover of volume two, and volume three apparently features The Three Musketeers, er, Lust-a-teers.
The artist’s official web site includes this nice sample from the volume three cover.
Pete linked to this Aya Matsuura fansite in my comments. It’s nicely laid out, and seems to cover her career quite comprehensively. My favorite part is the DVD review section, which includes a 評価グラフ that rates each release by:
Since it’s a purely subjective scale, many of the ratings go beyond 100%. This one’s Ayaya-do is off the charts. I can’t imagine why…
But Brian, isn’t it always?
Oh, you meant the online store. Apparently it’s sending people all over the world today, and not just those who took the 7.0.2 update. I’d guess that their region-guessing heuristic goes by netblocks, and the database got corrupted somehow.
[I was tempted to title this entry 「また会う日まで」, in response to Brian’s Mr. Roboto quote, but it doesn’t really work, even though it’s the refrain in one of the songs pictured]
Given an old ClarisWorks 4 Japanese document and a working copy of ClarisWorks 4 Japanese Edition, in theory you can just export it to Word, and the Japanese text will be translated correctly. If you are running on an Intel-based Mac, this won’t work. Actually, unless you can do a clean reinstall of CW4JP under a copy of Classic that has the Japanese language kit installed, it might not work at all.
AppleWorks 6, however, which is still available in stores and runs nicely under Rosetta, will. At least, with a little help.
If your AppleWorks install is damaged or incomplete, it won’t be able to open CW4JP docs; you’ll have to reinstall. The English version that was bundled with PowerPC-based Macs for years works just fine, so if you’ve got one, you don’t need to buy the retail package. I use the copy that came with my PowerBook, which was automatically transferred to my new MacBook.
Note that DataViz MacLinkPlus, for all its virtues, hasn’t the slightest idea what to do with Japanese editions of the software formats it supports.
San Francisco is looking to invigorate its Japantown with an infusion of pop culture. I’ve been insisting for a while now that Otaku Tokyo is one of the few colorful themes left unlicensed for a major Las Vegas casino, so perhaps this will help show the money-men the power of kawaii.