I’ve gotten pretty good at transcribing Japanese stories and articles, using my DS Lite and Kanji sonomama to figure out unfamiliar kanji words, but it’s still a slow, error-prone process that can send me on half-hour detours to figure out a name or obsolete character. So, after googling around for a while, I downloaded the free 15-day demo of FineReader Pro and took it for a spin. Sadly, this is Windows software, so I had to run it in a VMware session; the only vendor that claims to have a kanji-capable Mac product has terrible reviews and shows no sign of recent updates.
First test: I picked up a book (Nishimura’s murder mystery collection Ame no naka ni shinu), scanned a two-page spread at 600-dpi grayscale, and imported it into FineReader. I had to shut off the auto-analysis features, turn on page-splitting, and tell it the text was Japanese. It then correctly located the two vertically-written pages and the horizontally-written header, deskewed the columns (neither page was straight), recognized the text, and exported to Word. Then I found the option to have it mark suspect characters in the output, and exported to Word again. :-)
Results? Out of 901 total characters, there were 10 errors: 6 cases of っ as つ, one あ as ぁ, one 「 as ー, one 呟 as 眩, and one 駆 recognized as 蚯. There were also two extra “.” inserted due to marks on the page, and a few places where text was randomly interpreted as boldface. Both of the actual kanji errors were flagged as suspect, so they were easy to find, and the small-tsu error is so common that you might as well check all large-tsu in the text (in this case, the correct count should have been 28 っ and 4 つ). It also managed to locate and correctly recognize 3 of the 9 instances of furigana in the scan, ignoring the others.
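Checking every large-tsu is easy to script once the OCR output is saved as plain text. The following is a minimal sketch, not a FineReader feature; the function name `tsu_report` and the sample string are made up for illustration.

```python
def tsu_report(text, context=3):
    """Count large/small tsu and list each large-tsu with nearby context."""
    counts = {"large": text.count("つ"), "small": text.count("っ")}
    hits = []
    for i, ch in enumerate(text):
        if ch == "つ":
            start = max(0, i - context)
            # Keep a few characters on either side so the hit is easy to judge.
            hits.append((i, text[start:i + context + 1]))
    return counts, hits


# Made-up sample with two large-tsu that should probably be small.
sample = "ちよつと待つて、ゆっくり歩いた。"
counts, hits = tsu_report(sample)
print(counts)
for position, snippet in hits:
    print(position, snippet)
```

For the page above, a correct transcription should report 4 large and 28 small; anything else means there are still hits to review.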
I’d already typed in that particular section, so I diffed mine against theirs until I had found every error. In addition to FineReader’s ten errors, I found two of mine, where I’d accepted the wrong kanji conversion for words. They were valid kanji for those words, but not the correct ones, and multiple proofreadings hadn’t caught them.
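For anyone who wants to repeat the comparison, here is one way to do a character-level diff with Python’s standard `difflib`. It’s a sketch of the general idea, not necessarily the tool I used, and the two sample strings are invented (they just reuse the two kanji confusions from the test).

```python
import difflib


def char_diffs(mine, theirs):
    """Return (tag, my_text, their_text) for every span where the two differ."""
    # autojunk=False keeps difflib from ignoring very common characters
    # (kana, punctuation) when the texts are long.
    matcher = difflib.SequenceMatcher(None, mine, theirs, autojunk=False)
    return [
        (tag, mine[i1:i2], theirs[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Invented example: hand-typed text versus OCR output.
typed = "彼は呟いて駆け出した"
ocr = "彼は眩いて蚯け出した"
for tag, a, b in char_diffs(typed, ocr):
    print(tag, a, b)
```

A difference only says the two versions disagree, not which one is right, which is how my own two conversion errors turned up.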
The second test was a PDF containing scanned pages from another book, whose title might be loosely translated as “My Youth with Ultraman”, by the actress who played the female team member in the original series. I’d started with 600-dpi scans, carefully tweaked the contrast until they printed cleanly, then used Mac OS X Preview to convert them to a PDF. It apparently downsampled them to something like 243 dpi, but FineReader was still able to successfully recognize the text, with similar accuracy. Once again, the most common error was small-tsu, the kanji errors were flagged as suspect, and the others were easy to find.
For amusement, I tried Adobe Acrobat Pro 9.1’s language-aware OCR on the same PDF. It claimed success and looked good on-screen, but every attempt to export the results produced complete garbage.
Both tests were nearly best-case scenarios, with clean scans, simple layouts, and modern fonts at a reasonable size. I intend to throw some more difficult material at it before the trial expires, but I’m pretty impressed. Overall, the accuracy was 98.9%, but when you exclude the small-tsu error, it rises to 99.6%, and approaches 99.9% when you just count actual kanji errors.
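The arithmetic behind those percentages, using only the first test’s counts (901 characters, 10 errors), works out as in the sketch below. The variable names are mine. Note that the kanji-only figure from the first test alone comes to about 99.8%; I didn’t record separate counts for the second test, so they aren’t included here.

```python
def accuracy(total, errors):
    """Percentage of characters recognized correctly."""
    return 100.0 * (total - errors) / total


TOTAL = 901  # characters in the first test
errors = {"small_tsu": 6, "small_a": 1, "bracket": 1, "kanji": 2}

all_errors = sum(errors.values())
overall = accuracy(TOTAL, all_errors)
without_tsu = accuracy(TOTAL, all_errors - errors["small_tsu"])
kanji_only = accuracy(TOTAL, errors["kanji"])

print(f"overall:       {overall:.1f}%")
print(f"excluding tsu: {without_tsu:.1f}%")
print(f"kanji only:    {kanji_only:.1f}%")
```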
List price is $400, but there’s a $180 competitive upgrade available for customers with a valid license for any OCR software. Since basically every scanner sold comes with low-quality OCR software, there’s no reason for most people to spend the extra $220. FineReader uses an activation scheme to prevent multiple installs, but it works flawlessly in a VMware session, so even if I didn’t own a Mac, that’s how I’d install it.
I saw something at Daiso a while back that I thought would make an amusing gift for my sister. On the back was this label:
Now, what’s the product?
Kusumi: “…and I’m too sexy for my hat, Too sexy for my hat, what do you think about that?”
I’ve heard mixed reports about the various companies that act as reshipping agents in Japan, allowing you to order from companies that only ship domestically. Danny Choo recently had a prominent link to Tenso, along with a contest where the prize included shopping and shipping. There were relatively few comments about the service, but they were positive. I haven’t found many other comments about them, either, but I thought I’d give it a shot.
Several times a year, I place large orders with Amazon Japan. They only ship by air, so the order needs to be large to bring the shipping cost down to a less heartbreaking percentage of the price. They won’t ship software or consumer electronics internationally, and the marketplace vendors won’t ship anything overseas, so it’s been a limited-but-useful way to get stuff.
Tenso ships EMS, charging by weight, and in some cases this may end up being higher than Amazon’s air shipping; now that I have a few invoices to compare, I’ll have to figure out when it makes sense to use them for new goods, figuring in the cost of Amazon Prime to get free domestic shipping to their warehouse. For this test, I took advantage of Amazon’s free one-month trial of Prime. [Update: found Amazon’s rate page again; ¥1700-2700 shipping depending on the contents, plus a fixed ¥300 handling per item]
For used goods? No contest. A lot of marketplace dealers charge a nominal ¥1 + handling for used books and CDs that aren’t in high demand, and I found a single dealer who had three items that I wanted, -azb-アマゾン店. Their handling charge added ¥1020, and Tenso charged ¥2350 for shipping and handling, for a grand total of ¥3373. The original retail cost of the three items? ¥4819. Speed? I ordered on the 12th, Tenso received it on the 16th, shipped it on the 17th, and it was waiting for me at the office today, the 21st. It may actually have arrived yesterday; I was out.
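Here’s how that total breaks down, as a quick sketch. It assumes the nominal ¥1 price applied to each of the three items, which is what the ¥3373 total implies; the variable names are just for illustration.

```python
ITEMS = 3
item_price = 1           # yen each; assumed nominal price per used item
dealer_handling = 1020   # yen, marketplace dealer's handling charge
tenso_fee = 2350         # yen, Tenso shipping and handling
original_retail = 4819   # yen, combined original retail price

total = ITEMS * item_price + dealer_handling + tenso_fee
share_of_retail = 100.0 * total / original_retail

print(f"total paid: {total} yen")
print(f"share of original retail: {share_of_retail:.0f}%")
```

So the whole order, delivered, came to roughly 70% of what the items originally cost new.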
Setting up my account with Tenso was easy, except for the credit card. Neither my Visa nor my Mastercard was accepted, even though I’d used both with Amazon, but my American Express card worked fine. The error page for this was the only place I noticed where their mostly-competent English was replaced by Japanese. Some of Danny Choo’s commenters reported the same difficulty and ended up using PayPal. They give clear instructions on how to enter your personal address on online stores, and promptly notify you when you need to approve a shipment.
Will I use them again? Definitely for used goods through Amazon, likely for software/games that Amazon won’t ship directly, possibly for other stores if I find something I really want.
Were the three used items worth it? Hell, yeah.
For the first time in a long time, I had to pull out my other electronic dictionary. Why? Because the short essay I was trying to read was filled with place-names. On the DS, I had to write one character at a time, hope it was used at the start of some word (Kanji Sonomama doesn’t have a true kanji dictionary), and then type each one in on my Mac and look them up in Enamdict.
My other dictionary, a Sharp Papyrus, has clumsier stylus input and a generally less useful interface, but a much wider variety of dictionaries, including names and places.
Even with both handhelds and my JMdict search tools, it’s still a tough slog, because Ikuma Dan writes in colorful, literary language, using pre-war orthography. For instance, 眼 for “me”, 筈 for “hazu”, 儘 for “mama”, 未だ for “mada”, 又 for “mata”, and my favorite, 何處 for “doko” (處 being an obsolete variant of 処).
It’s been an interesting experience, but one I won’t repeat any time soon; in the time it takes me to decipher two pages of his writing, I could read thirty pages of a children’s or young-adult book.
[update: I don’t know why I read 魔装 as 魔法; I guess I just assumed it was 魔法少女, and didn’t realize that it’s a created word, masou, with a meaning like “dressed as a witch” (from 和装 “dressed Japanese-style” and similar)]
A little something recommended by Amazon: 「これはゾンビですか?」 volume 1, 「はい、魔装少女です」.
Indirectly, this picture comes from my sister:
(from Panoramio user とも21, who has a site devoted to taking pictures of Fuji)
How I found it: