This is a public braindump, to help out anyone who might want to parse Japanese text without being sufficiently fluent in Japanese to read technical documentation. Our weapon of choice will be the morphological analysis tool MeCab, and for sanity’s sake, we’ll do everything in UTF8-encoded Unicode.
The goal will be to take a plain text file containing Japanese text, one paragraph per line, and extract every word in its dictionary form, with the correct reading, with little or no manual intervention.
Step one: get or build UTF8 versions of mecab and mecab-ipadic. There are binary packages available for most platforms, but it’s easy to build your own with the --with-charset=utf8 option to configure, and there’s only one extra step to convert the dictionary index before installing:
.../libexec/mecab/mecab-dict-index -f euc-jp -t utf-8
At this point, you can run mecab $file and see one “word” per line, followed by a comma-separated list containing kanji, kana, and probably a lot of “*”. If you can’t read it, either the file, the dictionary, or your terminal is not set for UTF8. You’ll need to fix that. [Note: I have no idea how to get a standard Windows command shell to display kanji correctly; if you figure it out, let me know.]
Step two: make sense of what we’re looking at. A typical line will look like this:

美しく	形容詞,自立,*,*,形容詞・イ段,連用テ接続,美しい,ウツクシク,ウツクシク
The part before the tab, 美しく, is the word as it actually appears in the text; we’ll be calling it the surface, because that’s what everyone else calls it. The nine comma-separated fields are: pos, pos1, pos2, pos3 (the part of speech and three levels of sub-category), rule (the conjugation type, 活用型), conj (the conjugation form, 活用形), the dictionary form, the reading, and the pronunciation (the last two in katakana).
We don’t care about pos2, pos3, or pronunciation. We need surface, pos, pos1, and reading for all words; for verbs and adjectives, we need conj, and for verbs, we also need rule. Note that there are no lines for the standard ASCII space character; if you want to reassemble the original text from the output of MeCab, you’ll need to preserve spaces (I convert them to underscores and then undo it later).
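The underscore trick can be sketched in two tiny helpers (the names are mine, not from the original script, and the round trip is only safe when the source text contains no underscores of its own):

```python
# Sketch of the space-preservation trick: MeCab emits no tokens for
# ASCII spaces, so swap them for a placeholder that survives parsing
# and undo the swap after reassembling the text.
def protect_spaces(text: str) -> str:
    return text.replace(" ", "_")

def restore_spaces(text: str) -> str:
    return text.replace("_", " ")
```

Run protect_spaces over each line before feeding it to MeCab, and restore_spaces over the reassembled output.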
Step three: understanding the fields. I ran several complete novels through MeCab, and here’s what I found out about each major pos. You’ll be pleased to know that we don’t really care about most of it for this project; I’m just including it for future reference. The bottom line is that you can use the reading field for everything but verbs and i-adjectives (or surface if reading is * or empty), and for those you just need a few simple rules, not a complete understanding of this table.
[Updated: for ichidan verbs, I had missed the case where verbs in conjunctive form (連用形) ended in れ, and incorrectly stripped it before adding る]
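Those “few simple rules” might look something like this sketch. The function name and its minimal coverage are mine, not the author’s; it handles only the two ipadic cases discussed above, for illustration:

```python
# A minimal de-inflection sketch for dictionary-form *readings*.
# MeCab's reading field describes the surface as written, so verbs and
# i-adjectives need a little help; everything else passes through.
def dict_reading(reading: str, pos: str, rule: str, conj: str) -> str:
    if pos == "動詞" and rule == "一段" and conj == "連用形":
        # Ichidan verbs: append ル to the conjunctive reading, even
        # when it already ends in レ (e.g. ナガレ -> ナガレル).
        return reading + "ル"
    if pos == "形容詞" and conj == "連用テ接続":
        # I-adjectives: the te-form stem ends in ク; swap it for イ
        # (ウツクシク -> ウツクシイ).
        return reading[:-1] + "イ"
    return reading
```

A real version also needs rules for godan verbs keyed off the rule field (五段・カ行イ音便 and friends), and for the other adjective conjugation forms.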
[side note: unusual mimetic words and sound effects may get parsed as verb conjugations, interjections, adjectives, etc; for light novels in particular, there’s a good chance that something that looks like an error in your de-inflection code is actually nonsense in the source.]
Step four: overriding the parser without learning too much about it. For any non-trivial work, you’ll discover that some of the words came out wrong. For instance, 火澄 will be parsed as two words, hi and kiyoshi, instead of the name Kasumi. くノ一 will be parsed as three words, not as kunoichi (“female ninja”). We need to create a user dictionary and pass it to MeCab with the -u option. The file format looks suspiciously familiar:

火澄,1291,1291,100,名詞,固有名詞,人名,名,*,*,火澄,カスミ,カスミ
くノ一,1285,1285,100,名詞,一般,*,*,*,*,くノ一,クノイチ,クノイチ
Put this into a file named user.csv and run this command (substituting in the correct directories for the script and the installed dictionary):
.../libexec/mecab/mecab-dict-index -d .../lib/mecab/dic/ipadic -f utf-8 -t utf-8 user.csv
We now have a file named user.dic, which can be passed to mecab with the -u option, and girl ninja Kasumi will appear correctly. You can use exactly the same data in most fields; just put your word in fields 1 and 11, and put katakana versions of the reading into fields 12 and 13.
What do all the other fields mean? Basically, “proper noun, wherever nouns can occur”, and the low value in field 4 gives it higher priority than almost anything else in the database.
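If you end up adding a lot of names, a tiny helper can stamp out the rows. This is a hypothetical convenience, not part of the author’s toolchain; the 1291/1291 context IDs and the 人名 fields follow the pattern of the person-name override in the list at the end of this post:

```python
# Hypothetical helper to generate user.csv rows for name overrides.
# The word goes in fields 1 and 11, the katakana reading in fields 12
# and 13; a low (even negative) field-4 cost raises the priority.
def name_entry(word: str, katakana: str, cost: int = 100) -> str:
    fields = [word, "1291", "1291", str(cost),
              "名詞", "固有名詞", "人名", "名", "*", "*",
              word, katakana, katakana]
    return ",".join(fields)
```

Calling name_entry("火澄", "カスミ") yields a ready-to-paste user.csv row; negative costs work here, too.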
Sadly, this does not work with verbs. You can kinda-sorta fudge it sometimes if all that’s wrong is the reading it comes up with, but most likely you need to put different numbers into fields 2 and 3 to get the context right, or your entry will be ignored. I have no idea what the correct values should be. Fortunately, I don’t need to.
For this next trick, we’re going to need an unpacked copy of the MeCab ipadic source distribution. Download it, unpack it, and convert all of the CSV files from the EUC-JP encoding to UTF-8. Linux and Mac users will already have iconv installed somewhere, and can run:
iconv -f euc-jp -t utf-8 $file > $file-utf8
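If iconv isn’t handy (on Windows, say), Python can do the same batch conversion. A sketch, using a function and file-naming scheme of my own invention; run it from the unpacked ipadic directory:

```python
# Re-encode every matching CSV from EUC-JP to UTF-8, writing each
# result next to the original as "<name>.csv-utf8" (the same job as
# the iconv one-liner).
import glob

def convert_csvs(pattern: str = "*.csv") -> list[str]:
    written = []
    for path in glob.glob(pattern):
        with open(path, encoding="euc-jp") as src:
            text = src.read()
        out = path + "-utf8"
        with open(out, "w", encoding="utf-8") as dst:
            dst.write(text)
        written.append(out)
    return written
```

Call convert_csvs() once and every CSV in the directory gets a UTF-8 twin.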
Now, let’s suppose our text includes the common phrase お腹が空いている (“I’m hungry!”). This should be read as “onaka ga suite iru”, but MeCab is going to come back with “…aite iru”. Why? Because when we search Verbs.csv for 空い, we find:

空い,687,687,6316,動詞,自立,*,*,五段・カ行イ音便,連用タ接続,空く,アイ,アイ
空い,687,687,9701,動詞,自立,*,*,五段・カ行イ音便,連用タ接続,空く,スイ,スイ
The lower the number in field 4, the more likely it will be used, so “ai” beats “sui”. We need to copy the second line into our user dictionary and replace the 9701 in field 4 with something lower than 6316. Negatives work nicely.
Step five: hey, we’re done! Using whatever language you’re fond of, you can now parse the output of MeCab into a nice set of (word, reading) pairs, which can be looked up in any convenient online dictionary. Since we’re doing this the easy way, grab a copy of EDICT and convert it from EUC-JP to UTF-8. (Processing the complete JMdict schema is more than a little out of scope; it accounts for ~160 lines of my script.)
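Parsing EDICT itself is pleasantly simple: each line is “kanji [reading] /gloss/gloss/”, with the bracketed reading omitted for kana-only words. A minimal sketch (parse_edict_line is my name for it, and it ignores EDICT’s finer points like priority markers):

```python
import re

# One EDICT line -> (word, reading, glosses). Kana-only entries have
# no bracketed reading, so the word doubles as its own reading.
EDICT_LINE = re.compile(r"^(\S+)(?: \[(\S+)\])? /(.+)/$")

def parse_edict_line(line):
    m = EDICT_LINE.match(line.strip())
    if not m:
        return None
    word, reading, glosses = m.groups()
    return word, reading or word, glosses.split("/")
```

Load the converted file line by line into a dict keyed on word (and on reading, if you want kana lookups), and you have your vocabulary lookup table.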
For now, I’ll spare you the 900 lines of Perl that converts complete novels from plain text to Kindle-sized PDF files with full furigana and matching per-page vocabulary lists; that’s a topic for another day.
[Update: here’s a short list of overrides to clean up common annoyances in MeCab output. Yes, you want to use them. I’ll probably be adding more to this list…]
空い,687,687,-5000,動詞,自立,*,*,五段・カ行イ音便,連用タ接続,空く,スイ,スイ
他,1285,1285,4000,名詞,副詞可能,*,*,*,*,他,ホカ,ホカ
一寸,1285,1285,4000,名詞,一般,*,*,*,*,一寸,チョット,チョット
身体,1285,1285,3000,名詞,一般,*,*,*,*,身体,カラダ,カラダ
達,1291,1291,-1000,名詞,固有名詞,人名,名,*,*,達,タチ,タチ
間,1314,1285,5000,名詞,一般,*,*,*,*,間,アイダ,アイダ
台詞,1285,1285,5000,名詞,一般,*,*,*,*,台詞,セリフ,セリフ
(the last one isn’t terribly common, but it’s a real WTF to see 台詞 read as だいし)