“Emacs itself was one of about half-a-dozen dispatch-vector-driven editors developed circa 1971-1972, and is known to the world at large primarily because it absorbed the functionality of all the others before one of them could successfully absorb it. Emacs has been much like an amoeba from the very beginning.”

— Lum Johnson

Abbyy FineReader Pro 9.0, quick tests


I’ve gotten pretty good at transcribing Japanese stories and articles, using my DS Lite and Kanji sonomama to figure out unfamiliar kanji words, but it’s still a slow, error-prone process that can send me on half-hour detours to figure out a name or obsolete character. So, after googling around for a while, I downloaded the free 15-day demo of FineReader Pro and took it for a spin. Sadly, this is Windows software, so I had to run it in a VMware session; the only company that claims to have a kanji-capable Mac product has terrible reviews and shows no sign of recent updates.

First test: I picked up a book (Nishimura’s murder mystery collection Ame no naka ni shinu), scanned a two-page spread at 600-dpi grayscale, and imported it into FineReader. I had to shut off the auto-analysis features, turn on page-splitting, and tell it the text was Japanese. It then correctly located the two vertically-written pages and the horizontally-written header, deskewed the columns (neither page was straight), recognized the text, and exported to Word. Then I found the option to have it mark suspect characters in the output, and exported to Word again. :-)

Results? Out of 901 total characters, there were 10 errors: 6 cases of っ as つ, one あ as ぁ, one 「 as ー, one 呟 as 眩, and one 駆 recognized as 蚯. There were also two extra “.” inserted due to marks on the page, and a few places where text was randomly interpreted as boldface. Both of the actual kanji errors were flagged as suspect, so they were easy to find, and the small-tsu error is so common that you might as well check all large-tsu in the text (in this case, the correct count should have been 28 っ and 4 つ). It also managed to locate and correctly recognize 3 of the 9 instances of furigana in the scan, ignoring the others.
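Checking every large-tsu by hand is easy to script. What follows is only a rough sketch of the idea, not part of FineReader: the function name `tsu_report` and the sample sentence are made up for illustration. It counts both forms and lists each large つ with a little context, so a misread small っ stands out.

```python
def tsu_report(text, context=3):
    """Count small/large tsu and list each large tsu with nearby characters."""
    counts = {"っ": text.count("っ"), "つ": text.count("つ")}
    hits = [
        (i, text[max(0, i - context): i + context + 1])
        for i, ch in enumerate(text)
        if ch == "つ"
    ]
    return counts, hits


# Hypothetical OCR output, not taken from the scanned book.
sample = "ゆつくり歩いて、いつもの店に行つた。"
counts, hits = tsu_report(sample)
print(counts)
for position, snippet in hits:
    print(position, snippet)
```

Each printed snippet still has to be judged by eye; the script only narrows down where to look.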

I’d already typed in that particular section, so I diffed mine against theirs until I had found every error. In addition to FineReader’s ten errors, I found two of mine, where I’d accepted the wrong kanji conversion for words. They were valid kanji for those words, but not the correct ones, and multiple proofreadings hadn’t caught them.
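For anyone who wants to repeat that comparison, here is a minimal sketch of a character-level diff using Python’s standard `difflib`. It isn’t necessarily the tool used for the test above, and the two sample strings are invented stand-ins for a hand-typed line and its OCR counterpart.

```python
import difflib


def char_diffs(mine, theirs):
    """Return (position, my_text, their_text) for every span that differs."""
    matcher = difflib.SequenceMatcher(None, mine, theirs, autojunk=False)
    return [
        (i1, mine[i1:i2], theirs[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical sample lines: hand-typed text vs. OCR output.
mine = "彼はゆっくりと呟いた。"
theirs = "彼はゆつくりと眩いた。"

for position, typed, scanned in char_diffs(mine, theirs):
    print(f"{position}: {typed} / {scanned}")
```

A diff like this only shows where the two versions disagree; deciding which side is wrong still takes a look at the page, which is how typing mistakes surface alongside the OCR errors.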

The second test was a PDF containing scanned pages from another book, whose title might be loosely translated as “My Youth with Ultraman”, by the actress who played the female team member in the original series. I’d started with 600-dpi scans, carefully tweaked the contrast until they printed cleanly, then used Mac OS X Preview to convert them to a PDF. It apparently downsampled them to something like 243 dpi, but FineReader was still able to successfully recognize the text, with similar accuracy. Once again, the most common error was small-tsu, the kanji errors were flagged as suspect, and the others were easy to find.

For amusement, I tried Adobe Acrobat Pro 9.1’s language-aware OCR on the same PDF. It claimed success and looked good on-screen, but every attempt to export the results produced complete garbage.

Both tests were nearly best-case scenarios, with clean scans, simple layouts, and modern fonts at a reasonable size. I intend to throw some more difficult material at it before the trial expires, but I’m pretty impressed. Overall, the accuracy was 98.9%, but when you exclude the small-tsu error, it rises to 99.6%, and approaches 99.9% when you just count actual kanji errors.
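The percentages are simple arithmetic. The sketch below reruns them using only the first test’s counts (901 characters, 10 errors); the dictionary keys are just labels chosen for illustration. On those numbers alone the kanji-only figure comes out a little under the rounded figure quoted above.

```python
# Error counts from the first test: 901 characters, 10 errors.
TOTAL_CHARS = 901
ERRORS = {
    "small_tsu": 6,  # っ read as つ
    "small_a": 1,    # あ read as ぁ
    "bracket": 1,    # 「 read as ー
    "kanji": 2,      # 呟 read as 眩, 駆 read as 蚯
}


def accuracy(error_count, total=TOTAL_CHARS):
    """Percentage of characters recognized correctly."""
    return 100.0 * (total - error_count) / total


all_errors = sum(ERRORS.values())
overall = accuracy(all_errors)
without_tsu = accuracy(all_errors - ERRORS["small_tsu"])
kanji_only = accuracy(ERRORS["kanji"])

print(f"overall:          {overall:.1f}%")      # 98.9%
print(f"excluding っ/つ:  {without_tsu:.1f}%")  # 99.6%
print(f"kanji only:       {kanji_only:.1f}%")   # 99.8%
```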

List price is $400, but there’s a $180 competitive upgrade available to anyone with a valid license for any OCR software. Since basically every scanner sold comes with low-quality OCR software, there’s no reason for most people to spend the extra $220. They use an activation scheme to prevent multiple installs, but it works flawlessly in a VMware session, so even if I didn’t own a Mac, that’s how I’d install it.

[updates after the jump]


How to sell comic books...


…in Japan. Not precisely safe for work…


Dear Apple,


When you mark a bug “closed as a duplicate of bug X”, it would be nice if I were able to actually see bug X. Apparently, if I want to know the status of a fix, I have to send email saying “I reported Y, can you tell me if there’s been any action on X?”.

In this case, X has ID 5647954 and Y is 6770720, suggesting that X has been gathering dust for quite a while, and is unlikely to be fixed in an upcoming release. Japanese keyboard support in general seems to be pretty dusty, and I doubt you’ll get those keyboards working with Boot Camp any time soon, either.

[yes, this is because XP and Vista are stupid about keyboard layouts, and it affects VMware, too, but so what? You wrote the drivers for Boot Camp, and tout it as a feature, and it doesn’t work with some of your keyboards.]

The problem with "Letterman's rape joke"...


…is that he never made one. In all the outraged coverage of the incident, you’d think someone would have bothered to mention that little nugget of information.

I watched the clip everyone’s linking to. He made a joke that implied that a player for the Yankees (who is good-looking, quite successful, and has been caught fooling around in the past) left the field in the middle of the game and had sex with Sarah Palin’s daughter. The exact words were “her daughter was knocked up by Alex Rodriguez”. Not a single word about rape, and, in fact, one could easily reverse the outrage by pointing out that these people are insisting that sex between a light-skinned female and a dark-skinned male must be rape.

Letterman and his writers obviously thought they were talking about Palin’s adult daughter, an unwed mother who was knocked up by an athlete. They’re guilty of being too lazy to check which daughter attended the game, or perhaps of not even knowing that there was another daughter. But that’s all.

If you happened to know (as Letterman and his equally-clueless writers obviously did not) that the daughter at the game was 14 years old, you could interpret it as a statutory rape joke, but I haven’t seen anyone say that. Unless something’s been edited out of the clip (and I got it from a site that was feeling the rage), there’s just no rape in this “rape joke”.

Dear Apple,


Why does Mail.app keep segfaulting in this method call:

[MetadataManager getAllCalendarStoreData]

I’ve turned off data detectors, rebuilt my iCal database, rebuilt my Mail indexes, and pretty much everything else I can think of, and it still crashes anywhere between 2 minutes and one hour after I start it up.

Mind you, I have no idea why your email client is importing all of my calendars in the first place…

[Update: various forum posts suggest that this is tied to Leopard’s merger of iCal to-do list functionality into Mail, which works by syncing your local to-do lists up with your IMAP server. Except that I don’t use iCal for to-do lists, and wouldn’t want them on my mail servers if I did. So, a feature I’ve never used that does something I don’t want has inexplicably started causing my email to crash at random intervals, and since the bug has been around since at least 10.5.2, it’s unlikely to be fixed deliberately. One can only hope that there’s enough mail-related cleanup in Snow Leopard that it starts working there…]

Riddle me this...


So, in a story about a well-placed State Department official on trial for spending the last 30 years spying for Cuba, what sort of direct quote do they lead off with?

"We were all appalled by the Bush years"

Because, y’know, that puts everything in perspective. If proven guilty, what we have here is someone who turned traitor because he started hating America during the Carter administration, but somehow, it’s still all about Bush. Fits the established narrative better, y’see.

Whatever happened to Stone Clouds?


Every once in a while, I’d visit the old Radioactive Panda site and see if there was any word on Eric Johnson’s next comic. The answer was always no (in the form of deafening silence, unless you visited the forums), but he has now returned with an official update, revealing a new start date and the reason for his three-year absence: respectively, “August 2009” and “World of Warcraft”.

Yeah, I can see that.

Engrish Pop Quiz


I saw something at Daiso a while back that I thought would make an amusing gift for my sister. On the back, I found this label:

Caution: Engrish In Use

Now, what’s the product?


“Need a clue, take a clue,
 got a clue, leave a clue”