Friday, June 19 2009

Using Abbyy FineReader Pro for Japanese OCR

[Update: if you save your work in the FineReader-specific format, then changes you make after that point will automatically be saved when you exit the application; this is not clear from the documentation, and while it’s usually what you want, it may lead to unpleasant surprises if you decide to abandon changes made during that session.]

After several days with the free demo, in which I tested it with sources of varying quality and tinkered with the options, I bought a license for FineReader Pro 9.0 (at the competitive upgrade price for anyone who owns any OCR product). I then spent a merry evening working through an album where the liner notes were printed at various angles on a colored, patterned background. Comments follow:

  • Turn off all the auto features when working with Japanese text.
  • In the advanced options, disable all target fonts for font-matching except MS Mincho and Times New Roman. Don’t let it export as MS Gothic; you’ll never find all of the ー/一 errors.
  • Get the cleanest 600-dpi scan you can. This is sufficient for furigana-sized text on a white background.
  • Set the target language to Japanese-only if your source is noisy or you’re sure there’s no random English in the text. Otherwise, it’s safe to leave English turned on.
  • Manually split and deskew pages if the separation isn’t clean in the scan.
  • Adjust the apparent resolution of scans to set the output font size, before you tell it to recognize the text.
  • Manually draw recognition areas if there’s anything unusual about your layout.
  • Rearrange the windows to put the scan and the recognized text side-by-side.
  • Don’t bother with the spell-checker; it offers plausible alternative characters based on shape, but if the correct choice isn’t there, you have to correct it in the main window anyway. Just right-click as you work through the document to see the same data in context.
  • You can explicitly save in a FineReader-specific format that preserves the entire state of your work, but it creates a new bundle each time, and it won’t overwrite an existing one with the same name. This makes it very annoying when you want to simply save your progress as you work through a long document; each new save includes a complete copy of the scans, which adds up fast.
  • If you figure out how to get it to stop deleting every full-width kanji whitespace character, let me know; it’s damned annoying when you’re trying to preserve the layout of a song.
  • Once you’ve told it to recognize the text, search the entire document for these common errors:
    • っ interpreted as つ and vice-versa
    • ー interpreted as 一 and vice-versa; check all other nearby katakana for “small-x as x” errors while you’re at it
    • 日 interpreted as 曰
    • Any English-style punctuation other than “!”, “:”, “…”, or “?”; most likely, they should be the katakana center-dot, but it might have torn a character apart into random fragments (rare, unless your background is noisy).
    • The digits 0-9; if your source is noisy, random kanji and kana can be interpreted as digits, even when English recognition is disabled.
  • Delete any furigana it happens to recognize, unless you’re exporting to PDF; it just makes a mess in Word.
  • In general, export to Word as Formatted Text, with the “Keep line breaks” and “Highlight uncertain characters” options turned on.
  • If your text is on a halftoned background and you’re getting a lot of errors, load up the scan in Photoshop, use the Strong Contrast setting in Curves, then try out the various settings under Black & White until you find one that gets rid of most of the remaining halftone dots (I had good luck with Neutral Density). After that, you can Despeckle to get rid of most of the remaining noise, and use Curves again to force the text to a solid black.
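The apparent-resolution trick above can be scripted instead of done in an image editor: rewrite only the DPI metadata, leaving the pixels alone, and the scan’s “physical” size changes, which in turn changes the font size FineReader assigns on export. This is a minimal sketch using the Pillow library; the paths and DPI value are placeholders, not anything FineReader-specific.

```python
from PIL import Image

def set_apparent_dpi(src, dst, dpi, fmt=None):
    """Copy an image, changing only its DPI metadata; pixel data is
    untouched, so only the apparent physical size changes."""
    img = Image.open(src)
    img.save(dst, format=fmt, dpi=(dpi, dpi))

# e.g. set_apparent_dpi("scan.png", "scan-300.png", 300)
```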
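None of the error-sweeping above requires scripting, but if you’d rather have a list of suspect lines to work through than run repeated searches, here is a minimal sketch that flags lines in the exported UTF-8 text containing the characters listed above. The character sets and labels are my own shorthand, not anything from FineReader.

```python
# Flag lines in FineReader's exported text that contain characters
# commonly confused in Japanese OCR, so each can be checked by hand.
import re
import sys

SUSPECTS = [
    ("small vs. large tsu", re.compile("[っつッツ]")),
    ("long-vowel bar vs. kanji one", re.compile("[ー一]")),
    ("日 vs. 曰", re.compile("[日曰]")),
    ("stray ASCII digit", re.compile("[0-9]")),
    ("ASCII punctuation to review", re.compile(r"[,.;'\"()\[\]-]")),
]

def scan_lines(lines):
    """Return (line_number, label, text) for every line worth reviewing."""
    hits = []
    for lineno, line in enumerate(lines, 1):
        for label, pattern in SUSPECTS:
            if pattern.search(line):
                hits.append((lineno, label, line.rstrip()))
    return hits

if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            for lineno, label, text in scan_lines(f):
                print(f"{path}:{lineno}: {label}: {text}")
```

Note that this deliberately over-reports: every ー and っ gets flagged, since the point is to eyeball each occurrence against the scan, not to auto-correct.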
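If you have a stack of halftoned pages, the Photoshop steps above can be approximated in a batch script. This is a rough Pillow equivalent, assuming a contrast boost stands in for the Strong Contrast curve, a grayscale conversion for Black & White, and a median filter for Despeckle; the contrast factor and threshold are guesses to tune per scan, not values from any preset.

```python
from PIL import Image, ImageEnhance, ImageFilter

def clean_image(img, contrast=2.0, threshold=160):
    """Approximate the Photoshop clean-up: contrast boost, grayscale,
    despeckle, then force the text to solid black."""
    img = img.convert("L")                              # ≈ Black & White
    img = ImageEnhance.Contrast(img).enhance(contrast)  # ≈ Strong Contrast
    img = img.filter(ImageFilter.MedianFilter(3))       # ≈ Despeckle
    return img.point(lambda p: 0 if p < threshold else 255)  # ≈ final Curves

def clean_scan(path, out_path):
    clean_image(Image.open(path)).save(out_path)
```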