Saturday, March 12 2011

Revisiting Louie

Nearly three years ago, I had my first real success at reading Japanese prose written for a native audience. Getting through 30 pages of the first Rune Soldier Louie novel was a big accomplishment, given that I had to look up more than 600 new vocabulary words by painstakingly writing the kanji on my DS Lite or looking them up in printed dictionaries. It took nearly a month, an hour or two at a time.

That was before the demise of my group reading class, and my Japanese hasn’t improved very much since then. I’ve been treading water while waiting for Ooma to grow out of the startup lifestyle, and, yeah, that ain’t happened yet. My new scripts made it possible to read a complete novel in a reasonable time, but while the Rune Soldier novels have been scanned in, no one has gotten around to OCRing them. So I’m doing it.

  1. A ~1200x1800 PNG is adequate, but not great, for Japanese OCR with ABBYY FineReader Pro (Windows only; the shiny new Mac App Store version does not include Japanese). It flags almost all of its possible errors, but there are maybe a dozen kanji per page that have to be checked, and the low resolution results in a number of small-kana errors and random guesswork.
  2. JPEG just sucks for OCR; I really wouldn’t want to proof a series that was only available as JPEGs.
  3. The scans for some series that haven’t been OCRd are only ~800x1200; even as PNG, those can’t be fun to OCR. Time to build a DIY Bookscanner!
  4. My scripts currently don’t handle oddball furigana well; in Rune Soldier, a number of ordinary words are given phonetically written English readings, some quite long, and they create layout problems in pLaTeX.
  5. I need to figure out how to tell pLaTeX to break lines more aggressively; the small page size and tight margins of the Kindle mean that a sloppy line break can leave an entire character offscreen. It’s rare, but annoying.
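
For the line-breaking problem in item 5, one possible starting point is to loosen TeX's paragraph builder and give the glue between kanji more room to stretch. This is an untested sketch, not a known fix: \sloppy, \emergencystretch, and \hfuzz are standard LaTeX/TeX, and \kanjiskip is a pTeX parameter, but the values are guesses, and some Japanese document classes may reset \kanjiskip on font-size changes, in which case it would have to be set after \begin{document}.

```latex
% Preamble sketch; all values are assumptions and would need tuning
% against the actual Kindle page size and margins.
\sloppy                                % very high \tolerance: accept loose lines
\setlength{\emergencystretch}{2em}     % extra stretch for TeX's last-resort pass
\kanjiskip=0pt plus 0.5pt minus 0.5pt  % pTeX: glue inserted between adjacent kanji/kana
\hfuzz=0pt                             % log every overfull box, however small
```

With \hfuzz at zero, every line that sticks out past the margin shows up as an "Overfull \hbox" warning in the log, which at least makes the offending lines easy to find before loading the file onto the Kindle.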

That said, I successfully OCRd and proofed those same thirty pages that I read three years ago, ran them through my scripts, and read the story. It took about two hours to prep, and another two hours to read. I found some more errors that need correcting, but the first pass was perfectly readable.

I’ve also formatted and re-read Nishimura’s Ame no Naka ni Shinu, and the Kino stories Kioku no Kuni and Watashi no Kuni. I’m going to hold off on OCRing the rest of Rune Soldier 1 for a while, though, and focus on reading what I’ve got, which includes the second Kino novel and Tsutsui’s well-known Toki o Kakeru Shoujo. Oh, and I just remembered that copy of Kanjousen that Pete typed in; that one’s already prepped for formatting.