Notes on finishing a novel

Tue 3/8/11 12:50am

Projects

A novel in Japanese, that is: one converted into a custom “student edition” at precisely my reading level, as described previously.

Speed and comprehension are good; once I resolved the worst typos, parsing errors, and bugs in my scripts, I was able to read at a comfortable pace with only occasional confusion. Words that didn't get looked up correctly are generally isolated and easy to work out from context. Most of the sentences I had to stop and reread several times turned out to be odd because the thing they were describing was odd (such as what the guard does before allowing the original Kino to enter the city in Natural Rights). Of course, it helps to have a general knowledge of the material.
Coliseum was changed significantly for the animated version of Kino's Journey; the original story leaves most of the opponents shallow and one-dimensional, and spends way too much time on the mechanical details of Kino's surprise (both the preparation the night before, and the detailed description of the physical impact and aftermath). Mother's Love, on the other hand, is a pretty straight adaptation.
Casual speech and dialect don't cause as many problems as you might expect. MeCab handles a lot of the common forms, and recovers well from the ones it has to punt on. They didn't confuse me too often, either. After a while. :-)
MeCab sometimes goes wrong when a writer uses the pre-masu form (the verb stem) instead of the te-form to list a series of actions. I don't have a good example at the moment, but I ran into several cases where it punted and looked for a noun.
The groups that scan, OCR, and proofread novels tend to miss some simple errors where the software guessed the wrong kanji. A good example is writing 兵士 as 兵土 or 兵上. Light novels generally aren't that complicated, and if a word looks rare or out of place, it may well be an OCR error.
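A rough sketch of how this kind of error might be flagged automatically: if a word isn't in the dictionary, try swapping in look-alike kanji and see whether a real word falls out. Both `LOOKALIKES` and `KNOWN_WORDS` here are hypothetical stand-ins (a hand-built confusion table and a tiny word list), not part of MeCab or of any OCR tool.

```python
from itertools import product

# Hypothetical table: kanji the OCR may have produced -> kanji it may have misread.
LOOKALIKES = {
    "土": ["士"],
    "上": ["士"],
}

# Hypothetical stand-in for a real dictionary lookup.
KNOWN_WORDS = {"兵士", "土", "上"}


def suggest_ocr_fixes(word, known=KNOWN_WORDS, lookalikes=LOOKALIKES):
    """Return known words reachable from `word` by swapping look-alike kanji."""
    if word in known:
        return []  # already a real word, nothing to flag
    # For each character, the options are the character itself plus its look-alikes.
    options = [[ch] + lookalikes.get(ch, []) for ch in word]
    candidates = ("".join(chars) for chars in product(*options))
    return sorted({c for c in candidates if c != word and c in known})


if __name__ == "__main__":
    for w in ["兵土", "兵上", "兵士"]:
        print(w, suggest_ocr_fixes(w))
```

The number of candidates grows quickly with word length and table size, so this is only reasonable for short words and a small table, which is about what light novels need.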
The IPA dictionary used by MeCab has some quirks that make it suboptimal for use with modern fiction. Reading 空く as あく, 他 as た, 一寸 as いっすん, 間 as ま, 縁 as えん, and 身体 as しんたい is correct sometimes, but not in some common contexts where their Ipadic priority causes MeCab to guess wrong. Worse, it has a number of relatively high-priority entries that are not in any dictionary I've found: 台詞 as だいし, 胡坐 as こざ, 面す and 脱す as verbs that are more common than 面する and 脱する, etc. It also has no entries for みぞれ, 呆ける, 粘度, 街路樹, and a bunch of others. Oddest of all, there are occasions where it reads 達 as いたる. This is a valid name reading, but name+達 is far more likely to be たち than いたる. I assume it's some quirk of how MeCab calculates the appropriate left/right contexts when evaluating alternatives, an aspect of the dictionary files that I definitely don't understand.
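Short of understanding the context costs, the 達 case could be patched after the fact with a blunt rule. This is only a sketch: the `Token` tuple and its `"name"` and `"suffix"` labels are hypothetical simplifications, not MeCab's actual output format, and the rule overrides the parse rather than fixing the dictionary.

```python
from typing import NamedTuple


class Token(NamedTuple):
    surface: str
    pos: str      # simplified part-of-speech label (hypothetical)
    reading: str  # reading in hiragana


def fix_tachi(tokens):
    """Re-read 達 as たち when it directly follows a personal name."""
    fixed = []
    for tok in tokens:
        prev = fixed[-1] if fixed else None
        if (tok.surface == "達" and tok.reading == "いたる"
                and prev is not None and prev.pos == "name"):
            # name+達 is far more likely to be the plural suffix.
            tok = tok._replace(pos="suffix", reading="たち")
        fixed.append(tok)
    return fixed


if __name__ == "__main__":
    parsed = [Token("キノ", "name", "きの"), Token("達", "name", "いたる")]
    for tok in fix_tachi(parsed):
        print(tok.surface, tok.reading)
```

A 達 with no name in front of it is left alone, so a character actually named いたる would survive as long as he starts a sentence.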
I need to make better use of the original furigana when evaluating MeCab output. I'm preserving it, but not using it to automatically detect some of the above errors, mostly because I don't want the scripts to become too interactive. Perhaps just flagging the major differences would be sufficient.
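Flagging could be as simple as the sketch below, which compares the reading printed in the book with the reading the parser chose and reports the disagreements without stopping to ask. The input formats are assumptions for illustration: a dict of word to printed furigana, and a list of (word, reading) pairs with readings in katakana. Keying on the word alone is a simplification, since the same word can carry different furigana in different places.

```python
def kata_to_hira(text):
    """Convert katakana to hiragana so readings can be compared directly."""
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch
        for ch in text
    )


def flag_reading_mismatches(furigana, parsed):
    """Return (word, book_reading, parser_reading) for each disagreement.

    furigana: dict of word -> reading printed in the book (hypothetical format)
    parsed:   list of (word, reading) pairs from the parser (hypothetical format)
    """
    mismatches = []
    for word, reading in parsed:
        book = furigana.get(word)
        if book is None:
            continue  # no furigana in the book, nothing to compare against
        book_h, parser_h = kata_to_hira(book), kata_to_hira(reading)
        if book_h != parser_h:
            mismatches.append((word, book_h, parser_h))
    return mismatches


if __name__ == "__main__":
    book = {"身体": "からだ", "兵士": "へいし"}
    parser = [("身体", "シンタイ"), ("兵士", "ヘイシ")]
    for word, want, got in flag_reading_mismatches(book, parser):
        print(f"{word}: book has {want}, parser chose {got}")
```

Writing the mismatches to a log keeps the scripts non-interactive; the list can be reviewed once after the whole book has been processed.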
On to book 2!