This is a public braindump, to help out anyone who might want to parse Japanese text without being sufficiently fluent in Japanese to read technical documentation. Our weapon of choice will be the morphological analysis tool MeCab, and for sanity’s sake, we’ll do everything in UTF8-encoded Unicode.
The goal will be to take a plain text file containing Japanese text, one paragraph per line, and extract every word in its dictionary form, with the correct reading, with little or no manual intervention.
Step one: get or build UTF8 versions of mecab and mecab-ipadic. There are binary packages available for most platforms, but it’s easy to build your own with the --with-charset=utf8 option to configure, and there’s only one extra step to convert the dictionary index before installing:
.../libexec/mecab/mecab-dict-index -f euc-jp -t utf-8
At this point, you can run mecab $file and see one “word” per line, followed by a comma-separated list containing kanji, kana, and probably a lot of “*”. If you can’t read it, either the file, the dictionary, or your terminal is not set for UTF8. You’ll need to fix that. [Note: I have no idea how to get a standard Windows command shell to display kanji correctly; if you figure it out, let me know.]
Step two: make sense of what we’re looking at. A typical line will look like this:

美しく	形容詞,自立,*,*,形容詞・イ段,連用テ接続,美しい,ウツクシク,ウツクシク
The part before the tab, 美しく, is the word as it actually appears in the text; we’ll be calling it the surface, because that’s what everyone else calls it. The nine comma-separated fields are: pos, pos1, pos2, pos3 (the part of speech and three levels of sub-category), rule (the conjugation type, 活用型), conj (the conjugation form, 活用形), the dictionary form, the reading, and the pronunciation (the last two in katakana).
We don’t care about pos2, pos3, or pronunciation. We need surface, pos, pos1, and reading for all words; for verbs and adjectives, we need conj, and for verbs, we also need rule. Note that there are no lines for the standard ASCII space character; if you want to reassemble the original text from the output of MeCab, you’ll need to preserve spaces (I convert them to underscores and then undo it later).
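The underscore trick can be sketched in two tiny helpers (the names are mine, not from the original script, and the round trip is only safe when the source text contains no underscores of its own):

```python
# Sketch of the space-preservation trick: MeCab emits no tokens for
# ASCII spaces, so swap them for a placeholder that survives parsing
# and undo the swap after reassembling the text.
def protect_spaces(text: str) -> str:
    return text.replace(" ", "_")

def restore_spaces(text: str) -> str:
    return text.replace("_", " ")
```

Run protect_spaces over each line before feeding it to MeCab, and restore_spaces over the reassembled output.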
Step three: understanding the fields. I ran several complete novels through MeCab, and here’s what I found out about each major pos. You’ll be pleased to know that we don’t really care about most of it for this project; I’m just including it for future reference. The bottom line is that you can use the reading field for everything but verbs and i-adjectives (or surface if reading is * or empty), and for those you just need a few simple rules, not a complete understanding of this table.
[Updated: for ichidan verbs, I had missed the case where verbs in conjunctive form (連用形) ended in れ, and incorrectly stripped it before adding る]
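Those “few simple rules” might look something like this sketch. The function name and its minimal coverage are mine, not the author’s; it handles only the two ipadic cases discussed above, for illustration:

```python
# A minimal de-inflection sketch for dictionary-form *readings*.
# MeCab's reading field describes the surface as written, so verbs and
# i-adjectives need a little help; everything else passes through.
def dict_reading(reading: str, pos: str, rule: str, conj: str) -> str:
    if pos == "動詞" and rule == "一段" and conj == "連用形":
        # Ichidan verbs: append ル to the conjunctive reading, even
        # when it already ends in レ (e.g. ナガレ -> ナガレル).
        return reading + "ル"
    if pos == "形容詞" and conj == "連用テ接続":
        # I-adjectives: the te-form stem ends in ク; swap it for イ
        # (ウツクシク -> ウツクシイ).
        return reading[:-1] + "イ"
    return reading
```

A real version also needs rules for godan verbs keyed off the rule field (五段・カ行イ音便 and friends), and for the other adjective conjugation forms.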
[side note: unusual mimetic words and sound effects may get parsed as verb conjugations, interjections, adjectives, etc; for light novels in particular, there’s a good chance that something that looks like an error in your de-inflection code is actually nonsense in the source.]
Step four: overriding the parser without learning too much about it. For any non-trivial work, you’ll discover that some of the words came out wrong. For instance, 火澄 will be parsed as two words, hi and kiyoshi, instead of the name Kasumi. くノ一 will be parsed as three words, not as kunoichi (“female ninja”). We need to create a user dictionary and pass it to MeCab with the -u option. The file format looks suspiciously familiar:

火澄,1291,1291,100,名詞,固有名詞,人名,名,*,*,火澄,カスミ,カスミ
くノ一,1285,1285,100,名詞,一般,*,*,*,*,くノ一,クノイチ,クノイチ
Put this into a file named user.csv and run this command (substituting in the correct directories for the script and the installed dictionary):
.../libexec/mecab/mecab-dict-index -d .../lib/mecab/dic/ipadic -f utf-8 -t utf-8 user.csv
We now have a file named user.dic, which can be passed to mecab with the -u option, and girl ninja Kasumi will appear correctly. You can use exactly the same data in most fields; just put your word in fields 1 and 11, and put katakana versions of the reading into fields 12 and 13.
What do all the other fields mean? Basically, “proper noun, wherever nouns can occur”, and the low value in field 4 gives it higher priority than almost anything else in the database.
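If you end up adding a lot of names, a tiny helper can stamp out the rows. This is a hypothetical convenience, not part of the author’s toolchain; the 1291/1291 context IDs and the 人名 fields follow the pattern of the person-name override in the list at the end of this post:

```python
# Hypothetical helper to generate user.csv rows for name overrides.
# The word goes in fields 1 and 11, the katakana reading in fields 12
# and 13; a low (even negative) field-4 cost raises the priority.
def name_entry(word: str, katakana: str, cost: int = 100) -> str:
    fields = [word, "1291", "1291", str(cost),
              "名詞", "固有名詞", "人名", "名", "*", "*",
              word, katakana, katakana]
    return ",".join(fields)
```

Calling name_entry("火澄", "カスミ") yields a ready-to-paste user.csv row; negative costs work here, too.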
Sadly, this does not work with verbs. You can kinda-sorta fudge it sometimes if all that’s wrong is the reading it comes up with, but most likely you need to put different numbers into fields 2 and 3 to get the context right, or your entry will be ignored. I have no idea what the correct values should be. Fortunately, I don’t need to.
For this next trick, we’re going to need an unpacked copy of the MeCab ipadic source distribution. Download it, unpack it, and convert all of the CSV files from the EUC-JP encoding to UTF-8. Linux and Mac users will already have iconv installed somewhere, and can run:
iconv -f euc-jp -t utf-8 $file > $file-utf8
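If iconv isn’t handy (on Windows, say), Python can do the same batch conversion. A sketch, using a function and file-naming scheme of my own invention; run it from the unpacked ipadic directory:

```python
# Re-encode every matching CSV from EUC-JP to UTF-8, writing each
# result next to the original as "<name>.csv-utf8" (the same job as
# the iconv one-liner).
import glob

def convert_csvs(pattern: str = "*.csv") -> list[str]:
    written = []
    for path in glob.glob(pattern):
        with open(path, encoding="euc-jp") as src:
            text = src.read()
        out = path + "-utf8"
        with open(out, "w", encoding="utf-8") as dst:
            dst.write(text)
        written.append(out)
    return written
```

Call convert_csvs() once and every CSV in the directory gets a UTF-8 twin.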
Now, let’s suppose our text includes the common phrase お腹が空いている (“I’m hungry!”). This should be read as “onaka ga suite iru”, but MeCab is going to come back with “…aite iru”. Why? Because when we search Verbs.csv for 空い, we find:

空い,687,687,6316,動詞,自立,*,*,五段・カ行イ音便,連用タ接続,空く,アイ,アイ
空い,687,687,9701,動詞,自立,*,*,五段・カ行イ音便,連用タ接続,空く,スイ,スイ
The lower the number in field 4, the more likely it will be used, so “ai” beats “sui”. We need to copy the second line into our user dictionary and replace the 9701 in field 4 with something lower than 6316. Negatives work nicely.
Step five: hey, we’re done! Using whatever language you’re fond of, you can now parse the output of MeCab into a nice set of (word, reading) pairs, which can be looked up in any convenient online dictionary. Since we’re doing this the easy way, grab a copy of EDICT and convert it from EUC-JP to UTF-8. (Processing the complete JMdict schema is more than a little out of scope; it accounts for ~160 lines of my script.)
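Parsing EDICT itself is pleasantly simple: each line is “kanji [reading] /gloss/gloss/”, with the bracketed reading omitted for kana-only words. A minimal sketch (parse_edict_line is my name for it, and it ignores EDICT’s finer points like priority markers):

```python
import re

# One EDICT line -> (word, reading, glosses). Kana-only entries have
# no bracketed reading, so the word doubles as its own reading.
EDICT_LINE = re.compile(r"^(\S+)(?: \[(\S+)\])? /(.+)/$")

def parse_edict_line(line):
    m = EDICT_LINE.match(line.strip())
    if not m:
        return None
    word, reading, glosses = m.groups()
    return word, reading or word, glosses.split("/")
```

Load the converted file line by line into a dict keyed on word (and on reading, if you want kana lookups), and you have your vocabulary lookup table.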
For now, I’ll spare you the 900 lines of Perl that converts complete novels from plain text to Kindle-sized PDF files with full furigana and matching per-page vocabulary lists; that’s a topic for another day.
[Update: here’s a short list of overrides to clean up common annoyances in MeCab output. Yes, you want to use them. I’ll probably be adding more to this list…]
空い,687,687,-5000,動詞,自立,*,*,五段・カ行イ音便,連用タ接続,空く,スイ,スイ
他,1285,1285,4000,名詞,副詞可能,*,*,*,*,他,ホカ,ホカ
一寸,1285,1285,4000,名詞,一般,*,*,*,*,一寸,チョット,チョット
身体,1285,1285,3000,名詞,一般,*,*,*,*,身体,カラダ,カラダ
達,1291,1291,-1000,名詞,固有名詞,人名,名,*,*,達,タチ,タチ
間,1314,1285,5000,名詞,一般,*,*,*,*,間,アイダ,アイダ
台詞,1285,1285,5000,名詞,一般,*,*,*,*,台詞,セリフ,セリフ
(the last one isn’t terribly common, but it’s a real WTF to see 台詞 read as だいし)