Japan

Using Abbyy FineReader Pro for Japanese OCR


[Update: if you save your work in the FineReader-specific format, then changes you make after that point will automatically be saved when you exit the application; this is not clear from the documentation, and while it’s usually what you want, it may lead to unpleasant surprises if you decide to abandon changes made during that session.]

After several days with the free demo, in which I tested it with sources of varying quality and tinkered with the options, I bought a license for FineReader Pro 9.0 (at the competitive upgrade price for anyone who owns any OCR product). I then spent a merry evening working through an album where the liner notes were printed at various angles on a colored, patterned background. Comments follow:

  • Turn off all the auto features when working with Japanese text.
  • In the advanced options, disable all target fonts for font-matching except MS Mincho and Times New Roman. Don't let it export as MS Gothic; you'll never find all of the ー/一 errors.
  • Get the cleanest 600-dpi scan you can. This is sufficient for furigana-sized text on a white background.
  • Set the target language to Japanese-only if your source is noisy or you're sure there's no random English in the text. Otherwise, it's safe to leave English turned on.
  • Manually split and deskew pages if the separation isn't clean in the scan.
  • Adjust the apparent resolution of scans to set the output font size, before you tell it to recognize the text.
  • Manually draw recognition areas if there's anything unusual about your layout.
  • Rearrange the windows to put the scan and the recognized text side-by-side.
  • Don't bother with the spell-checker; it offers plausible alternative characters based on shape, but if the correct choice isn't there, you have to correct it in the main window anyway. Just right-click as you work through the document to see the same data in context.
  • You can explicitly save in a FineReader-specific format that preserves the entire state of your work, but it creates a new bundle each time, and it won't overwrite an existing one with the same name. This makes it very annoying when you want to simply save your progress as you work through a long document; each new save includes a complete copy of the scans, which adds up fast.
  • If you figure out how to get it to stop deleting every full-width kanji whitespace character, let me know; it's damned annoying when you're trying to preserve the layout of a song.
  • Once you've told it to recognize the text, search the entire document for these common errors:
    • っ interpreted as つ and vice-versa
    • ー interpreted as 一 and vice-versa; check all other nearby katakana for "small-x as x" errors while you're at it
    • 日 interpreted as 曰
    • Any English-style punctuation other than "!", ":", "…", or "?"; most likely, they should be the katakana center-dot, but it might have torn a character apart into random fragments (rare, unless your background is noisy).
    • The digits 0-9; if your source is noisy, random kanji and kana can be interpreted as digits, even when English recognition is disabled.
  • Delete any furigana it happens to recognize, unless you're exporting to PDF; it just makes a mess in Word.
  • In general, export to Word as Formatted Text, with the "Keep line breaks" and "Highlight uncertain characters" options turned on.
  • If your text is on a halftoned background and you're getting a lot of errors, load up the scan in Photoshop, use the Strong Contrast setting in Curves, then try out the various settings under Black & White until you find one that gets rid of most of the remaining halftone dots (I had good luck with Neutral Density). After that, you can Despeckle to get rid of most of the remaining noise, and use Curves again to force the text to a solid black.
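The search-for-common-errors step in the list above can be partly automated once the text is exported. The following is a minimal sketch, not anything FineReader provides: the `SUSPECTS` table and the `find_suspects` helper are hypothetical names, and it assumes the export has been saved as plain Unicode text. It only flags characters for a human to check; it cannot tell which つ or ー is actually wrong.

```python
import string

# Characters the checklist says to review by hand (illustrative table).
SUSPECTS = {
    "っ": "small tsu; may really be つ",
    "つ": "large tsu; may really be っ",
    "ー": "long-vowel mark; may really be 一",
    "一": "kanji 'one'; may really be ー",
    "曰": "probably a misread 日",
}

# ASCII punctuation other than ! : ? is suspicious, and so are ASCII digits.
ODD_ASCII = set(string.punctuation) - set("!:?") | set(string.digits)


def find_suspects(text):
    """Return (line, column, char, reason) for each character worth a second look."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SUSPECTS:
                hits.append((line_no, col, ch, SUSPECTS[ch]))
            elif ch in ODD_ASCII:
                hits.append((line_no, col, ch, "ASCII digit or punctuation"))
    return hits


if __name__ == "__main__":
    # Made-up sample with typical OCR damage, not real FineReader output.
    sample = "ずつと待つていた\nコ一ヒ一を2杯."
    for line_no, col, ch, reason in find_suspects(sample):
        print(f"line {line_no}, col {col}: {ch} ({reason})")
```

Flagging every つ and 一 is deliberately noisy; in a short text it is quicker to glance at all of them than to guess which ones are wrong.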

Abbyy FineReader Pro 9.0, quick tests


I’ve gotten pretty good at transcribing Japanese stories and articles, using my DS Lite and Kanji Sonomama to figure out unfamiliar kanji words, but it’s still a slow, error-prone process that can send me on half-hour detours to figure out a name or obsolete character. So, after googling around for a while, I downloaded the free 15-day demo of FineReader Pro and took it for a spin. Sadly, this is Windows software, so I had to run it in a VMware session; the only company that claims to have a kanji-capable Mac product has terrible reviews and shows no sign of recent updates.

First test: I picked up a book (Nishimura’s murder mystery collection Ame no naka ni shinu), scanned a two-page spread at 600-dpi grayscale, and imported it into FineReader. I had to shut off the auto-analysis features, turn on page-splitting, and tell it the text was Japanese. It then correctly located the two vertically-written pages and the horizontally-written header, deskewed the columns (neither page was straight), recognized the text, and exported to Word. Then I found the option to have it mark suspect characters in the output, and exported to Word again. :-)

Results? Out of 901 total characters, there were 10 errors: 6 cases of っ as つ, one あ as ぁ, one 「 as ー, one 呟 as 眩, and one 駆 recognized as 蚯. There were also two extra “.” inserted due to marks on the page, and a few places where text was randomly interpreted as boldface. Both of the actual kanji errors were flagged as suspect, so they were easy to find, and the small-tsu error is so common that you might as well check all large-tsu in the text (in this case, the correct count should have been 28 っ and 4 つ). It also managed to locate and correctly recognize 3 of the 9 instances of furigana in the scan, ignoring the others.

I’d already typed in that particular section, so I diffed mine against theirs until I had found every error. In addition to FineReader’s ten errors, I found two of mine, where I’d accepted the wrong kanji conversion for words. They were valid kanji for those words, but not the correct ones, and multiple proofreadings hadn’t caught them.
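For anyone who wants to repeat this comparison, a character-level diff is more useful than a line-based one, since Japanese text has no word breaks. Here is a minimal sketch using Python's standard `difflib`; the `char_diffs` helper and the sample sentence are made up for illustration, and this is not necessarily the tool used for the comparison described above.

```python
import difflib


def char_diffs(mine, ocr):
    """List (operation, my_text, ocr_text) for every spot where the two versions disagree."""
    matcher = difflib.SequenceMatcher(None, mine, ocr, autojunk=False)
    return [
        (tag, mine[i1:i2], ocr[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Invented sentence containing the two kanji confusions from the first test.
    typed = "彼は小さく呟いて、駆けだした。"
    scanned = "彼は小さく眩いて、蚯けだした。"
    for tag, a, b in char_diffs(typed, scanned):
        print(f"{tag}: typed {a!r} vs OCR {b!r}")
```

A diff only shows where the two versions disagree, not which one is right, which is how errors in the hand-typed copy turn up alongside the OCR errors.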

The second test was a PDF containing scanned pages from another book, whose title might be loosely translated as “My Youth with Ultraman”, by the actress who played the female team member in the original series. I’d started with 600-dpi scans, carefully tweaked the contrast until they printed cleanly, then used Mac OS X Preview to convert them to a PDF. It apparently downsampled them to something like 243 dpi, but FineReader was still able to successfully recognize the text, with similar accuracy. Once again, the most common error was small-tsu, the kanji errors were flagged as suspect, and the others were easy to find.

For amusement, I tried Adobe Acrobat Pro 9.1’s language-aware OCR on the same PDF. It claimed success and looked good on-screen, but every attempt to export the results produced complete garbage.

Both tests were nearly best-case scenarios, with clean scans, simple layouts, and modern fonts at a reasonable size. I intend to throw some more difficult material at it before the trial expires, but I’m pretty impressed. Overall, the accuracy was 98.9%, but when you exclude the small-tsu error, it rises to 99.6%, and reaches 99.8% when you just count actual kanji errors.
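The percentages can be recomputed from the first test's counts (901 characters, 10 errors, 6 of them small-tsu, 2 of them kanji). This sketch assumes the figures refer to that test alone, since no counts were given for the second one; the grouping of the errors and the `accuracy` helper are illustrative.

```python
TOTAL_CHARS = 901

# Error counts from the first test, grouped for convenience.
errors = {
    "small_tsu": 6,   # っ read as つ
    "other_kana": 2,  # あ read as ぁ, 「 read as ー
    "kanji": 2,       # 呟 read as 眩, 駆 read as 蚯
}


def accuracy(error_count, total=TOTAL_CHARS):
    """Percentage of characters recognized correctly."""
    return 100.0 * (1 - error_count / total)


all_errors = sum(errors.values())
overall = accuracy(all_errors)
without_tsu = accuracy(all_errors - errors["small_tsu"])
kanji_only = accuracy(errors["kanji"])

if __name__ == "__main__":
    print(f"overall:          {overall:.1f}%")
    print(f"ignoring small-tsu: {without_tsu:.1f}%")
    print(f"kanji errors only:  {kanji_only:.1f}%")
```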

List price is $400, but a $180 competitive upgrade is available to anyone with a valid license for any OCR software. Since basically every scanner sold comes with low-quality OCR software, there’s no reason for most people to spend the extra $220. They use an activation scheme to prevent multiple installs, but it works flawlessly in a VMware session, so even if I didn’t own a Mac, that’s how I’d install it.

[updates after the jump]

Engrish Pop Quiz


I saw something at Daiso a while back that I thought would make an amusing gift for my sister. On the back was found this label:

Caution: Engrish In Use

Now, what’s the product?

Hello!Project Sing-along Project


Kusumi: “…and I’m too sexy for my hat, Too sexy for my hat, what do you think about that?”

Tenso reshipping service, preliminary report


I’ve heard mixed reports about the various companies that act as reshipping agents in Japan, allowing you to order from companies who only ship domestically. Danny Choo recently had a prominent link to Tenso, along with a contest where the prize included shopping and shipping. There were relatively few comments about the service, but they were positive. I haven’t found many other comments about them, either, but I thought I’d give it a shot.

Several times a year, I place large orders with Amazon Japan. They only ship by air, so the order needs to be large to bring the shipping cost down to a less heartbreaking percentage of the price. They won’t ship software or consumer electronics internationally, and the marketplace vendors won’t ship anything overseas, so it’s been a limited-but-useful way to get stuff.

Tenso ships EMS, charging by weight, and in some cases this may end up being higher than Amazon’s air shipping; now that I have a few invoices to compare, I’ll have to figure out when it makes sense to use them for new goods, figuring in the cost of Amazon Prime to get free domestic shipping to their warehouse. For this test, I took advantage of Amazon’s free one-month trial of Prime. [Update: found Amazon’s rate page again; ¥1700-2700 shipping depending on the contents, plus a fixed ¥300 handling per item]

For used goods? No contest. A lot of marketplace dealers charge a nominal ¥1 + handling for used books and CDs that aren’t in high demand, and I found a single dealer who had three items that I wanted, -azb-アマゾン店. Their handling charge added ¥1020, and Tenso charged ¥2350 for shipping and handling, for a grand total of ¥3373. The original retail cost of the three items? ¥4819. Speed? I ordered on the 12th, Tenso received it on the 16th, shipped it on the 17th, and it was waiting for me at the office today, the 21st. It may actually have arrived yesterday; I was out.
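The total is easy to check. This quick sketch assumes each of the three items was listed at the nominal ¥1 mentioned above, which is an inference from the ¥3373 total rather than something stated outright.

```python
# All amounts in yen.
ITEM_PRICES = [1, 1, 1]   # assumed: three used items at the nominal ¥1 each
DEALER_HANDLING = 1020    # marketplace dealer's handling charge
TENSO_CHARGE = 2350       # Tenso shipping and handling
ORIGINAL_RETAIL = 4819    # combined original retail price of the three items

total = sum(ITEM_PRICES) + DEALER_HANDLING + TENSO_CHARGE
saved = ORIGINAL_RETAIL - total
share_of_retail = round(100 * total / ORIGINAL_RETAIL)

if __name__ == "__main__":
    print(f"total paid: ¥{total}")
    print(f"saved vs. original retail: ¥{saved}")
    print(f"paid about {share_of_retail}% of original retail")
```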

Setting up my account with Tenso was easy, except for the credit card. Neither my Visa nor my Mastercard was accepted, despite my having used both with Amazon, but my American Express card worked fine. The error page for this was the only place I noticed where their mostly-competent English was replaced by Japanese. Some of Danny Choo’s commenters reported the same difficulty and ended up using PayPal. They give clear instructions on how to enter your personal address at online stores, and promptly notify you when you need to approve a shipment.

Will I use them again? Definitely for used goods through Amazon, likely for software/games that Amazon won’t ship directly, possibly for other stores if I find something I really want.

Were the three used items worth it? Hell, yeah.

The limits of Kanji Sonomama


For the first time in a long time, I had to pull out my other electronic dictionary. Why? Because the short essay I was trying to read was filled with place-names. On the DS, I had to write one character at a time, hope it was used at the start of some word (Kanji Sonomama doesn’t have a true kanji dictionary), and then type each one in on my Mac and look them up in Enamdict.

My other dictionary, a Sharp Papyrus, has clumsier stylus input and a generally less useful interface, but a much wider variety of dictionaries, including names and places.

Even with both handhelds and my JMdict search tools, it’s still a tough slog, because Ikuma Dan writes in colorful, literary language, using pre-war orthography. For instance, 眼 for “me”, 筈 for “hazu”, 儘 for “mama”, 未だ for “mada”, 又 for “mata”, and my favorite, 何處 for “doko” (處 being an obsolete variant of 処).

It’s been an interesting experience, but one I won’t repeat any time soon; in the time it takes me to decipher two pages of his writing, I could read thirty pages of a children’s or young-adult book.

Queen of the Undead


[update: I don’t know why I read 魔装 as 魔法; I guess I just assumed it was 魔法少女, and didn’t realize that it’s a created word, masou, with a meaning like “dressed as a witch” (from 和装 “dressed Japanese-style” and similar)]

A little something recommended by Amazon: 「これはゾンビですか?」 volume 1, 「はい、魔装少女です」.

Is this a Zombie?

Mt. Fuji from Misaki Port, Miura Peninsula, Kanagawa, Japan


Indirectly, this picture comes from my sister:

Mt. Fuji from Misaki Port, Miura Peninsula, Kanagawa, Japan

(from Panoramio user とも21, who has a site devoted to taking pictures of Fuji)

How I found it:

  1. Nellie called me a while back to get reacquainted.
  2. I flew out to Chicago to spend some time with her.
  3. She took me to a used book store.
  4. In the basement, I found a Japanese paperback book with an interesting title, も一つパイプのけむり (Another 'Pipe Smoke').
  5. It was written by famous composer Ikuma Dan, and contained a number of short essays.
  6. Many of them looked too challenging to take to my reading class, but since I missed class this week, I'm emailing the teacher a written translation.
  7. The fourth essay is titled 霧 (Fog). He writes about moving to a new house on the Miura Peninsula, along the Misaki Highway, and mentions that one of the things he can see is Mt. Fuji.
  8. I looked the area up in Google Earth and clicked on some pictures. This was the fourth picture I clicked.
