I’ve gotten pretty good at transcribing Japanese stories and articles, using my DS Lite and Kanji sonomama to figure out unfamiliar kanji words, but it’s still a slow, error-prone process that can send me on half-hour detours to figure out a name or obsolete character. So, after googling around for a while, I downloaded the free 15-day demo of FineReader Pro and took it for a spin. Sadly, this is Windows software, so I had to run it in a VMware session; the only Mac product that claims kanji support has terrible reviews and shows no sign of recent updates.
First test: I picked up a book (Nishimura’s murder mystery collection Ame no naka ni shinu), scanned a two-page spread at 600-dpi grayscale, and imported it into FineReader. I had to shut off the auto-analysis features, turn on page-splitting, and tell it the text was Japanese. It then correctly located the two vertically-written pages and the horizontally-written header, deskewed the columns (neither page was straight), recognized the text, and exported to Word. Then I found the option to have it mark suspect characters in the output, and exported to Word again. :-)
Results? Out of 901 total characters, there were 10 errors: 6 cases of っ as つ, one あ as ぁ, one 「 as ー, one 呟 as 眩, and one 駆 recognized as 蚯. There were also two extra “.” inserted due to marks on the page, and a few places where text was randomly interpreted as boldface. Both of the actual kanji errors were flagged as suspect, so they were easy to find, and the small-tsu error is so common that you might as well check all large-tsu in the text (in this case, the correct count should have been 28 っ and 4 つ). It also managed to locate and correctly recognize 3 of the 9 instances of furigana in the scan, ignoring the others.
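Checking every large-tsu is easy to script once the OCR output is saved as plain text. Here is a minimal sketch, not anything FineReader provides: the function name `find_large_tsu` and the sample string are made up for illustration, and the sample is not text from the book.

```python
def find_large_tsu(text, context=5):
    """Return (index, snippet) for every large つ found in text.

    Each snippet shows `context` characters on either side, so the hit
    can be compared against the printed page by eye.
    """
    hits = []
    for i, ch in enumerate(text):
        if ch == "つ":
            start = max(0, i - context)
            hits.append((i, text[start:i + context + 1]))
    return hits


# Made-up sample standing in for OCR output (assumption, not from the book).
sample = "行つてしまつた。いつか待つている。"

for index, snippet in find_large_tsu(sample, context=3):
    print(index, snippet)

# Tallies of small and large tsu, for comparison with a hand count.
print("small:", sample.count("っ"), "large:", sample.count("つ"))
```

Each hit still needs a human decision, since a large つ is often correct; the script only shortens the list of places to look.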
I’d already typed in that particular section, so I diffed mine against theirs until I had found every error. In addition to FineReader’s ten errors, I found two of mine, where I’d accepted the wrong kanji conversion for words. They were valid kanji for those words, but not the correct ones, and multiple proofreadings hadn’t caught them.
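For anyone who wants to repeat the comparison, a character-level diff can be sketched with Python's standard `difflib` module. This is an assumption about tooling, not the diff tool used above; the function name `list_differences` is hypothetical, and the two strings are invented examples built from the kinds of confusions reported here.

```python
import difflib


def list_differences(mine, ocr):
    """Return (position, my_text, ocr_text) for each span where the two differ."""
    # autojunk=False keeps difflib from treating frequent characters
    # (common kana, punctuation) as junk in longer texts.
    matcher = difflib.SequenceMatcher(None, mine, ocr, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((i1, mine[i1:i2], ocr[j1:j2]))
    return diffs


# Invented strings for illustration (not text from the book).
mine = "彼は呟いて、走っていった。"
ocr = "彼は眩いて、走つていつた。"

for position, typed, recognized in list_differences(mine, ocr):
    print(position, repr(typed), "vs", repr(recognized))
```

A diff like this only shows where the two versions disagree; deciding which side is wrong still means going back to the page, which is how mistakes on either side turn up.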
The second test was a PDF containing scanned pages from another book, whose title might be loosely translated as “My Youth with Ultraman”, by the actress who played the female team member in the original series. I’d started with 600-dpi scans, carefully tweaked the contrast until they printed cleanly, then used Mac OS X Preview to convert them to a PDF. Preview apparently downsampled them to something like 243 dpi, but FineReader was still able to recognize the text, with similar accuracy. Once again, the most common error was small-tsu, the kanji errors were flagged as suspect, and the others were easy to find.
For amusement, I tried Adobe Acrobat Pro 9.1’s language-aware OCR on the same PDF. It claimed success and looked good on-screen, but every attempt to export the results produced complete garbage.
Both tests were nearly best-case scenarios, with clean scans, simple layouts, and modern fonts at a reasonable size. I intend to throw some more difficult material at it before the trial expires, but I’m pretty impressed. Overall, the accuracy was 98.9%, but when you exclude the small-tsu errors, it rises to 99.6%, and it reaches about 99.8% when you count only actual kanji errors.
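The percentages follow directly from the first test's counts (901 characters, 10 errors, 6 of them small-tsu, 2 of them kanji). A short sketch of the arithmetic; the helper name `accuracy` and the dictionary keys are just labels chosen for this example.

```python
def accuracy(total_chars, error_count):
    """Percentage of characters recognized correctly."""
    return 100 * (total_chars - error_count) / total_chars


TOTAL_CHARS = 901

# Error counts from the first test, grouped by type.
errors = {
    "small_tsu": 6,   # っ read as つ
    "small_a": 1,     # あ read as ぁ
    "bracket": 1,     # 「 read as ー
    "kanji": 2,       # 呟 as 眩, 駆 as 蚯
}

all_errors = sum(errors.values())
without_tsu = all_errors - errors["small_tsu"]
kanji_only = errors["kanji"]

print(f"all errors:        {accuracy(TOTAL_CHARS, all_errors):.1f}%")
print(f"ignoring small-tsu: {accuracy(TOTAL_CHARS, without_tsu):.1f}%")
print(f"kanji errors only:  {accuracy(TOTAL_CHARS, kanji_only):.1f}%")
```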
List price is $400, but there’s a $180 competitive upgrade available for customers with a valid license for any OCR software. Since basically every scanner sold comes with low-quality OCR software, there’s no reason for most people to spend the extra $220. They use an activation scheme to prevent multiple installs, but it works flawlessly in a VMware session, so even if I didn’t own a Mac, that’s how I’d install it.
Update 1: I decided to try something considerably rougher, namely a heavily-structured dictionary page with white-on-black headers, etc., held in the sunlight with one hand while snapping a photo with the other. It was a glorious mess, with variable brightness, curved text lines (since I was only holding it from one edge), and soft focus in some areas. I expected FineReader to choke, and it did. A clean scan of the same spread worked quite well, except that the analyzer couldn’t find the left margin in the heavily-curved second page, and constructed a complex text region that excluded most of the first two columns. Replacing that with a simple rectangle gave at least 95% recognition, preserving the formatting and structure except for random conversions for 【, 】, ●, and the unusual boxed digits this dictionary uses to refer back to individual senses in a definition. It has training features, so I suspect I could improve the results further by adding those oddballs in.
I also threw a clean scan of a recipe at it, and it correctly detected the horizontally-written table of ingredients as well as the vertically-written instructions, including the circled digits used to mark individual steps. It even did a pretty good job with the flowery kanji script used for the recipe title and subtitles.
Update 2: I had a scan of halftoned red lyrics on white and white lyrics on halftoned red, at roughly 200 dpi; I’d done some cleanup to minimize the halftone artifacts, but the characters were still slightly distorted. FineReader complained about the resolution, but still managed to achieve its usual results. On the first try, it completely mangled the English that was mixed in with the Japanese, but once I selected both languages, it handled everything correctly.
Update 3: How about a page of lyrics that not only suffers from slightly low resolution, but that compounds the problem by printing them rotated on a patterned background? This one’s interesting enough that I’m posting the original image as well as a picture of the results. FineReader correctly rotated the two sections of the page, but doing so blurred some kanji beyond recognition, as you can see. Suspect characters are marked in blue, and you can see that it had low confidence in some of the things it got right, and most of the things it got wrong. When I dug out the original 600-dpi scan of this page, FineReader scored 100% on both songs.