Sunday, February 20 2011

Bridging The Gulf of Kanji

Imagine that you’re reading something in a foreign language that you’ve been studying for a while, and you hit an unfamiliar word. You know how to pronounce it, so you can often tell if it’s a place or a person’s name, and you’re pretty sure how many words you’re looking at, so if you need to look them up, you can.

When studying Japanese, the most frustrating thing about trying to graduate from reading “student material” to “real stuff” is not being able to do that. You’re reading along, feeling pretty good about yourself, and you run smack into a wall of kanji. Maybe it’s someone’s name, maybe it’s a city you’d recognize if you could pronounce it, or maybe it’s something like 厳重機密保持体制.

Taken individually, you know most or all of the characters, but together, wtf? Is it safe to skip over and work out from context, or do you need to carefully look up each character, crossing your fingers and hoping that it’s a straightforward collection of two-character nouns? (It is, by the way: literally “strictly-classified-preservation-system”, or, more loosely, “seriously top-secret”.) Every time you stop to look something like this up, you lose continuity, and instead of reading, you’re deciphering.

I am far from the first person to notice this, and there are some well-developed tools for helping you read a Japanese web site, of which perhaps the best-known is Rikai. I don’t use it. I will occasionally use the built-in pop-up J-E dictionary on my Mac, which is a simpler version of the same thing, but what I really want to do is read books and short stories.

I’ve painstakingly scrawled unfamiliar kanji to look words up in the most useful handheld dictionary available (aka Nintendo DS Lite with Kanji Sonomama), I’ve scanned and OCR’d pages in Abbyy FineReader (quite successfully), and so on. That got me through a small number of relatively easy stories and novels, but it’s still agonizingly slow, and a slight transcription error can leave you scratching your head for hours.

Back when Foothill still had a group reading class, I started building a set of tools to produce custom student editions of stories as cleanly-formatted PDF files with matching vocabulary lists (using Microsoft Word’s excellent HTML import and Japanese layout support to produce vertical text with furigana). Every once in a while, I’ve gone back and extended the tools a bit, but I was still missing two key things: a good source of texts, and something to help me break up those walls of kanji into separate words.

Aozora Bunko is a source of texts, and there’s some really great stuff there, but 80-year-old literature isn’t at the top of my reading list. I wanted something more contemporary, and I eventually discovered an active online community in Japan that scans, OCRs, and proofreads light novels in large quantity. I refuse to hide behind freetard philosophy or poor budgeting skills, so as a general rule, I buy the printed books, and then download them.

Morphological analysis tools like Chasen and MeCab looked like exactly what I needed for breaking down the walls, but every time I tried to use one of them, I got poor results that I could only improve by reading documentation written in technically-oriented Japanese. Chicken, egg. Catch-22. Etc. A few weeks back, I got stubborn about it, and tinkered with MeCab until it all made sense, and then it was time to put the pieces together.

We begin with a file containing a complete story or novel, formatted as plain text with the simple furigana markup pioneered by Aozora Bunko (漢字《かんじ》 or |漢字《かんじ》; the semi-freeform [#…] markup is out of scope). This will likely be encoded as Windows Code Page 932, which we will magically convert to UTF-8. I hate the AB markup, so we’ll also be replacing that with this: {漢字|かんじ}, and adding a third field to uniquely identify vocabulary words: {漢字|かんじ@id000405}.
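As a rough illustration of that conversion step, here is a minimal Python sketch (the real scripts are Perl and not yet released, so this is not them). The function names are made up for the example, and the character range used to find the kanji run in front of an implicit 《…》 is an assumption that will miss rarer characters.

```python
import re

# Assumption: a "kanji run" is CJK unified ideographs plus 々 and 〇.
KANJI = r"[\u4e00-\u9fff\u3005\u3007]+"
# Explicit form: a bar (ASCII or full-width) marks where the base text starts.
EXPLICIT = re.compile(r"[|｜]([^|｜《》]+)《([^《》]+)》")
# Implicit form: the reading applies to the kanji run right before 《.
IMPLICIT = re.compile(r"(" + KANJI + r")《([^《》]+)》")


def decode_cp932(raw: bytes) -> str:
    """Decode Windows Code Page 932 bytes into a Python (Unicode) string."""
    return raw.decode("cp932")


def convert_ruby(text: str) -> str:
    """Rewrite Aozora-style furigana as {base|reading}."""
    # Handle the explicit form first so its bar is consumed before the
    # implicit pattern gets a chance to look at the same 《…》.
    text = EXPLICIT.sub(r"{\1|\2}", text)
    text = IMPLICIT.sub(r"{\1|\2}", text)
    return text
```

Writing the result back out as UTF-8 is then just a matter of opening the output file with encoding="utf-8". The @id field is not added here, since assigning ids depends on the vocabulary pass described next.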

All markup needs to be stripped to get the best results out of MeCab, so the original (likely minimal) furigana will be saved away in a config file during the conversion, for later comparison and tweaking. My earlier MeCab braindump explains the next step in detail; it usually takes three or four passes to find all the exceptions and overrides in a novel.
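A sketch of the "strip and save" idea, again in Python rather than the Perl actually used. It assumes the text is already in the {漢字|かんじ} form described above; the function name and the dictionary it returns are invented for the example, and the real config file format is not shown in this post.

```python
import re

# Matches {base|reading} with an optional @id field after the reading.
RUBY = re.compile(r"\{([^{}|]+)\|([^{}|@]+)(?:@[^{}]*)?\}")


def strip_ruby(text):
    """Return (plain_text, readings).

    plain_text has every {base|reading} replaced by its base, ready to be
    fed to a morphological analyzer; readings maps each base to the list
    of distinct readings seen for it, in order of first appearance.
    """
    readings = {}

    def keep_base(match):
        base, reading = match.group(1), match.group(2)
        seen = readings.setdefault(base, [])
        if reading not in seen:
            seen.append(reading)
        return base

    plain = RUBY.sub(keep_base, text)
    return plain, readings
```

Keeping a list per base, rather than a single reading, leaves room for words the original author glossed differently in different places, which is the sort of thing you want to see during the later comparison pass.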

The output of that process is a text file containing the vocabulary list, and a LaTeX document containing very simple markup:

\usepackage[paperwidth=11.5cm, paperheight=9cm, top=0.25cm, bottom=0.25cm, left=0.25cm, right=0.25cm]{geometry}
\hypersetup{pdfinfo={Title={Ame no Naka ni Shinu},Author={Kyoutarou Nishimura}}}
 雨が\label{id000003}\kana{降}{\special{::tag id000003}ふ}っていた。
 冷たい冬の雨である。\label{id000004}みぞれに\label{id000005}\kana{近}{\special{::tag id000005}ちか}かった。

This will not work as-is in any standard TeX distribution. I’ve managed to reproduce the steps required to get it to work with TeX Live 2010, but I haven’t typed them up yet. Short version: hack on dvipdfmx.cfg to get it to see your TrueType/OpenType kanji fonts, use an old version of geometry.sty, and convert furikana.sty (found here) from JIS to UTF-8.
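For the curious, the body lines in that sample can be produced from the {漢字|かんじ@id000405} form with a single substitution. This is a hypothetical Python sketch that only reproduces the markup visible in the sample; the function name is made up, and it ignores words like みぞれ that get a bare \label with no \kana.

```python
import re

# Matches {base|reading@idNNNNNN}; all three fields are required here.
RUBY = re.compile(r"\{([^{}|]+)\|([^{}|@]+)@(id\d+)\}")


def _to_tex(match):
    base, reading, word_id = match.groups()
    # \label records the page; \special smuggles the id into the DVI
    # stream alongside the furigana so it can be found again later.
    return "\\label{%s}\\kana{%s}{\\special{::tag %s}%s}" % (
        word_id, base, word_id, reading)


def to_latex(text):
    """Convert {base|reading@id} markup to the \\label/\\kana form."""
    return RUBY.sub(_to_tex, text)
```

Fed 雨が{降|ふ@id000003}っていた。 it gives back the first body line of the sample.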

The key elements are the \kana{}{} and \label{} tags, which tell TeX to typeset the furigana while keeping track of the page number each id# was used on. The TeX .aux file will contain a bunch of easily-parsed (id#,page#) pairs which we’ll cross-reference with the vocabulary file from the previous step, producing another LaTeX document.
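Here is roughly what that cross-reference looks like, sketched in Python. It assumes the .aux lines have the usual LaTeX shape, \newlabel{id000003}{{}{1}}, possibly with extra brace groups appended when hyperref is loaded, and that pages are plain arabic numbers. The vocabulary is passed in as a simple id-to-entry dictionary, which is a stand-in for the real vocabulary file.

```python
import re
from collections import defaultdict

# \newlabel{idNNNNNN}{{<label text>}{<page>}...}; anything after the page
# group (hyperref adds more groups) is ignored.
NEWLABEL = re.compile(r"\\newlabel\{(id\d+)\}\{\{[^{}]*\}\{(\d+)\}")


def pages_by_id(aux_text):
    """Return a list of (id, page) pairs in the order they appear."""
    pairs = []
    for line in aux_text.splitlines():
        match = NEWLABEL.match(line.strip())
        if match:
            pairs.append((match.group(1), int(match.group(2))))
    return pairs


def vocab_by_page(pairs, vocab):
    """Group vocabulary entries by page, listing each entry once per page."""
    by_page = defaultdict(list)
    for word_id, page in pairs:
        if word_id in vocab and vocab[word_id] not in by_page[page]:
            by_page[page].append(vocab[word_id])
    return dict(by_page)
```

From the per-page dictionary it is a short step to printing one vocabulary section per page of the story.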

The end result is a pair of PDF files designed to be viewed side-by-side. Sadly, Kindles and iPads and such are built by people who think one open document at a time is enough, making it tedious even to switch back and forth, so I’m actually putting the books on a Kindle and the vocabulary sheets on an old Sony Reader. Tape them together and it’s almost like a real book!

But wait, what about those funky-looking \special{::tag id#} tags that are actually embedded into the furigana? This is where we go from tinkering to madness, and the weapon we shall wield is Jin-Hwan Cho’s dviasm, which converts TeX output into a more… “molestable” format. We’re going to use it to remove some of the furigana we just persuaded TeX to typeset. Furigana is a wonderful crutch, but it’s still a crutch, and if we’re going to learn to read without it, we have to occasionally see new words without it.

The common practice in Japan is to include furigana the first time an unusual word appears, until either the next logical section of the document, or the next two-page spread. For Kindle-y use, once per page makes the most sense, which also works well with our vocabulary list. How to get rid of them? Take a look at the dviasm output:

[page 1 0 0 0 0 0 0 0 0 0]
    right: -11.546585pt
    xxx: '::tag id000005'
    fnt: tmin6 at 6pt
    set: 'ち'
    right: 0.000015pt
    set: 'か'

The “xxx:” field is our \special, and when we see it, it means we’re inside of a push/pop pair that contains that entire string of furigana. So, if we’ve already seen that id# on the current page, we can strip out everything between the innermost push and pop. The rest can safely be spat out and converted back to the native TeX output format with dviasm, and then converted to PDF without otherwise changing the layout.
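A minimal Python sketch of that filter, working on the dviasm text dump one line at a time. It assumes push and pop show up as lines starting with "push:" and "pop:", and that pages start with a "[page" line as in the sample above; the function name is invented. It keeps the push and pop lines themselves and only removes what sits between them.

```python
import re

# The \special as it appears in the dump: xxx: '::tag id000005'
TAG = re.compile(r"xxx:\s*'::tag (id\d+)'")


def strip_repeated_furigana(lines):
    """Drop furigana for ids that were already shown on the current page."""
    out = []
    stack = []      # one [index of push line in out, drop flag] per open push
    seen = set()    # ids already typeset with furigana on this page
    for line in lines:
        op = line.strip()
        if op.startswith("[page"):
            seen.clear()                    # once-per-page rule: start over
            out.append(line)
        elif op.startswith("push:"):
            stack.append([len(out), False])
            out.append(line)
        elif op.startswith("pop:") and stack:
            start, drop = stack.pop()
            if drop:
                del out[start + 1:]         # keep the push, lose its contents
            out.append(line)
        else:
            match = TAG.match(op)
            if match and stack:
                if match.group(1) in seen:
                    stack[-1][1] = True     # mark the innermost push for removal
                else:
                    seen.add(match.group(1))
            out.append(line)
    return out
```

Because push and pop save and restore the position, emptying a pair should not move anything that comes after it, which is why the rest of the layout survives.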

The whole process, from downloaded novel to custom “student edition”, takes about 20 seconds on my laptop. I’ve read several short stories this week, and am currently working my way through a novel at a quite pleasant pace. Not only does the easy vocabulary lookup maintain the flow of reading, but knowing that any word not on the list is one that I have personally declared “known vocabulary” is a real confidence-builder. And the words I declare “known” in one book carry over to the next one, or to a revised edition of the current one. At this point, my biggest problem is my relentless desire to tinker with the scripts rather than read the books they create…

Now, in case anyone managed to read all the way to the bottom, and knows more Japanese than I do, what the hell does 空辣/くうらつ mean? Seriously, wtf? It’s not in any dictionary, but it shows up in casual use in books and blogs as if everyone should just know. My best guess is that it’s abbreviated from 空_辣腕, with some unknown 空-word as the first half, but that doesn’t really help. I can use the literal meaning of the two kanji, but that may be way off.

Oh, and yes, I’ll release the Perl scripts sometime soon. I need to sort out the various dependencies in pTeX, dvipdfmx, MeCab, and Perl, and make some sense of the font configuration.