Tools

Ah, Emacs...


Emacs 23 natively uses Unicode. This means I can run it in a Terminal window, like God intended, and still have full Japanese support. Previous versions did funky Shift-JIS conversions that made its behavior… “eccentric” on a Mac, especially with cut-and-paste.

Now all I have to do is strip out all of the cruft from the elisp directory, and I’ll have the perfect text editor. Actually, it’ll be easier to delete everything and just add back the non-cruft as needed. There’s not much that I don’t consider cruft, so it will be pretty darn small.

[side note: a release comment says something to the effect that the internal encoding is a superset of Unicode with four times the space, which would make it a 34-bit system. WTF? Update: ah, I see; UTF-32 has a lot of empty space, with only a bit over 20 bits allocated in the Unicode standard. UTF-8 was also designed with considerable headroom, which is no surprise, given that it was invented during dinner by Ken Thompson.]
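
For the record, the arithmetic behind both the confusion and the correction (my numbers, not anything from the release notes): four times a full 32-bit space really would be 34 bits, but four times the codespace Unicode actually defines still fits comfortably in 23 bits.

    \[
    4 \times 2^{32} = 2^{34},
    \qquad
    4 \times \mathrm{0x110000} = \mathrm{0x440000} \approx 2^{22.1}.
    \]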

Dear Bryan O'Sullivan,


Here’s your definitive manual’s complete comparison of Perforce to Mercurial:

    Perforce has a centralised client/server architecture, with no client-side caching of any data. Unlike modern revision control tools, Perforce requires that a user run a command to inform the server about every file they intend to edit.

    The performance of Perforce is quite good for small teams, but it falls off rapidly as the number of users grows beyond a few dozen. Modestly large Perforce installations require the deployment of proxies to cope with the load their users generate.

In order, I say, “bullshit”, “feature”, “buy a server, dude”, and “you’re doing it wrong”.

In fairness, the author admits up front that his comments about other tools are based only on his personal experience and biases, and the inline comments for this section point out its flaws. Still, it’s clear that his personal experience with Perforce was… limited. Also, he’s either not aware of the features it has that Mercurial lacks, or simply discounts them as “not relevant to the way Our Kind Of People work”.

I’m not criticizing the tool itself, mind you; I’ve tried out several distributed SCMs in the past few years, and Mercurial seems to be fast, stable, easily extensible, and well-supported. I’m switching several of my Japanese projects to it from Bazaar, and it cleanly imported them. It also handles Unicode file names and large files a lot better; both were causing me grief in the other tool.

There are things I can’t do in Mercurial that I do in Perforce, though, and some of them will likely never be possible, given the design of the tool. [Update: for-instance deleted; it appears that if you always use the -q option to hg status, you avoid walking the file system, and you can set it as a default option on a per-repository basis. If the rest of the commands play nice, that will work. The real value of explicit checkouts, even in that example, is the information-sharing, something that devs often value less than Operations does.]
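
For reference, the per-repository default mentioned above is just a couple of lines in that repository’s .hg/hgrc; the [defaults] section supplies default command-line options (a sketch, assuming hg behaves the way the hgrc man page describes):

    [defaults]
    # always run "hg status" as "hg status -q" in this repository
    status = -q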

ORK4


Just for amusement…

[image: ORK4 QR code]

Ding!


VLC hits version 1.0. Now they can start working on the user interface!

Using Abbyy FineReader Pro for Japanese OCR


[Update: if you save your work in the FineReader-specific format, then changes you make after that point will automatically be saved when you exit the application; this is not clear from the documentation, and while it’s usually what you want, it may lead to unpleasant surprises if you decide to abandon changes made during that session.]

After several days with the free demo, in which I tested it with sources of varying quality and tinkered with the options, I bought a license for FineReader Pro 9.0 (at the competitive upgrade price for anyone who owns any OCR product). I then spent a merry evening working through an album where the liner notes were printed at various angles on a colored, patterned background. Comments follow:

  • Turn off all the auto features when working with Japanese text.
  • In the advanced options, disable all target fonts for font-matching except MS Mincho and Times New Roman. Don't let it export as MS Gothic; you'll never find all of the ー/一 errors.
  • Get the cleanest 600-dpi scan you can. This is sufficient for furigana-sized text on a white background.
  • Set the target language to Japanese-only if your source is noisy or you're sure there's no random English in the text. Otherwise, it's safe to leave English turned on.
  • Manually split and deskew pages if the separation isn't clean in the scan.
  • Adjust the apparent resolution of scans to set the output font size, before you tell it to recognize the text.
  • Manually draw recognition areas if there's anything unusual about your layout.
  • Rearrange the windows to put the scan and the recognized text side-by-side.
  • Don't bother with the spell-checker; it offers plausible alternative characters based on shape, but if the correct choice isn't there, you have to correct it in the main window anyway. Just right-click as you work through the document to see the same data in context.
  • You can explicitly save in a FineReader-specific format that preserves the entire state of your work, but it creates a new bundle each time, and it won't overwrite an existing one with the same name. This makes it very annoying when you want to simply save your progress as you work through a long document; each new save includes a complete copy of the scans, which adds up fast.
  • If you figure out how to get it to stop deleting every full-width kanji whitespace character, let me know; it's damned annoying when you're trying to preserve the layout of a song.
  • Once you've told it to recognize the text, search the entire document for these common errors (a quick script for this kind of pass is sketched after this list):
    • っ interpreted as つ and vice-versa
    • ー interpreted as 一 and vice-versa; check all other nearby katakana for "small-x as x" errors while you're at it
    • 日 interpreted as 曰
    • Any English-style punctuation other than "!", ":", "…", or "?"; most likely, they should be the katakana center-dot, but it might have torn a character apart into random fragments (rare, unless your background is noisy).
    • The digits 0-9; if your source is noisy, random kanji and kana can be interpreted as digits, even when English recognition is disabled.
  • Delete any furigana it happens to recognize, unless you're exporting to PDF; it just makes a mess in Word.
  • In general, export to Word as Formatted Text, with the "Keep line breaks" and "Highlight uncertain characters" options turned on.
  • If your text is on a halftoned background and you're getting a lot of errors, load up the scan in Photoshop, use the Strong Contrast setting in Curves, then try out the various settings under Black & White until you find one that gets rid of most of the remaining halftone dots (I had good luck with Neutral Density). After that, you can Despeckle to get rid of most of the remaining noise, and use Curves again to force the text to a solid black.
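
The character-level checks in that list are easy to script once you’ve exported to plain text. Here’s a rough sketch in Perl (the script name and the exact character list are mine, not something FineReader provides); run it over a UTF-8 export and eyeball every line it flags:

    #!/usr/bin/perl
    # flag-suspects.pl -- rough sketch, not part of FineReader: print any
    # line of a UTF-8 text export that contains one of the error-prone
    # characters listed above, so it can be checked against the scan.
    use strict;
    use warnings;
    use utf8;
    use open qw(:std :utf8);

    # large/small tsu, long-vowel mark vs. the kanji "one", 曰, stray
    # digits, and ASCII punctuation other than ! : ? (… isn't ASCII, so
    # the range below can't match it); extend with the small katakana
    # ァィゥェォ etc. if you like
    my $suspect = qr/[っつー一曰0-9]|(?![!:?])[\x21-\x2f\x3a-\x40\x5b-\x60\x7b-\x7e]/;

    while (my $line = <>) {
        print "$.: $line" if $line =~ $suspect;
    }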

Abbyy FineReader Pro 9.0, quick tests


I’ve gotten pretty good at transcribing Japanese stories and articles, using my DS Lite and Kanji sonomama to look up unfamiliar kanji words, but it’s still a slow, error-prone process that can send me on half-hour detours to figure out a name or obsolete character. So, after googling around for a while, I downloaded the free 15-day demo of FineReader Pro and took it for a spin. Sadly, this is Windows software, so I had to run it in a VMware session; the only Mac product that claims to handle kanji has terrible reviews and shows no sign of recent updates.

First test: I picked up a book (Nishimura’s murder mystery collection Ame no naka ni shinu), scanned a two-page spread at 600-dpi grayscale, and imported it into FineReader. I had to shut off the auto-analysis features, turn on page-splitting, and tell it the text was Japanese. It then correctly located the two vertically-written pages and the horizontally-written header, deskewed the columns (neither page was straight), recognized the text, and exported to Word. Then I found the option to have it mark suspect characters in the output, and exported to Word again. :-)

Results? Out of 901 total characters, there were 10 errors: 6 cases of っ as つ, one あ as ぁ, one 「 as ー, one 呟 as 眩, and one 駆 recognized as 蚯. There were also two extra “.” inserted due to marks on the page, and a few places where text was randomly interpreted as boldface. Both of the actual kanji errors were flagged as suspect, so they were easy to find, and the small-tsu error is so common that you might as well check all large-tsu in the text (in this case, the correct count should have been 28 っ and 4 つ). It also managed to locate and correctly recognize 3 of the 9 instances of furigana in the scan, ignoring the others.

I’d already typed in that particular section, so I diffed mine against theirs until I had found every error. In addition to FineReader’s ten errors, I found two of mine, where I’d accepted the wrong kanji conversion for words. They were valid kanji for those words, but not the correct ones, and multiple proofreadings hadn’t caught them.

The second test was a PDF containing scanned pages from another book, whose title might be loosely translated as “My Youth with Ultraman”, by the actress who played the female team member in the original series. I’d started with 600-dpi scans, carefully tweaked the contrast until they printed cleanly, then used Mac OS X Preview to convert them to a PDF. It apparently downsampled them to something like 243 dpi, but FineReader was still able to successfully recognize the text, with similar accuracy. Once again, the most common error was small-tsu, the kanji errors were flagged as suspect, and the others were easy to find.

For amusement, I tried Adobe Acrobat Pro 9.1’s language-aware OCR on the same PDF. It claimed success and looked good on-screen, but every attempt to export the results produced complete garbage.

Both tests were nearly best-case scenarios, with clean scans, simple layouts, and modern fonts at a reasonable size. I intend to throw some more difficult material at it before the trial expires, but I’m pretty impressed. Overall, the accuracy was 98.9%; excluding the small-tsu errors, it rises to 99.6%, and counting only the actual kanji errors, it comes to 99.8%.

List price is $400, but there’s a $180 competitive upgrade available to anyone with a valid license for any OCR software. Since basically every scanner sold comes with low-quality OCR software, there’s no reason for most people to spend the extra $220. They use an activation scheme to prevent multiple installs, but it works flawlessly in a VMware session, so even if I didn’t own a Mac, that’s how I’d install it.


Things that suck


  1. 10% packet loss on my DSL line.
  2. Three hours diagnosing the problem so that I could convince tech support it wasn't my equipment. And, yes, I even had a spare DSL modem lying around.
  3. At least four hours on the phone with support at various levels, mostly spent listening to Muzak and repeating parts of item #2.
  4. Being told that it will be 11 days before someone can physically come out and check the lines, since resetting the DSLAM didn't fix it.
  5. Discovering that every other service provider in the area (cable, wireless, etc) has at least a 5-day lead time, and juicy up-front costs for the required gear.

Dictionary update


[Update update: I’ve made a small change to add the full JMnedict name dictionary; a lot of things that used to be in Edict/JMdict have been moved over to this much-larger secondary dictionary, and I finally got around to integrating it. The English translations aren’t searchable yet, mostly because I need to rework the form and add the kanji dictionary to Xapian as well, so that I have J↔E, N↔E, and K↔E.]

One downside of moving a lot of stuff onto my new shared-hosting account is that I have to give up a lot of control over what’s running. Not only do I have to work through an Apache .htaccess file instead of reconfiguring the server directly, but I can’t run my own servers on their machine.

So, goodbye Sphinx search engine, hello Xapian (thanks, Pixy). While it suffers from a lack of documentation between “baby’s first search” and “211-page C++ API document”, it has a lot to offer, and doesn’t require a server. One thing it has is a full-featured query parser, so you can create searches like “pos:noun usage:common lunch -keyword:vulgar” to get common lunch-related nouns that don’t include sexual slang (such as the poorly-attributed usage of ekiben as a sexual position). That allows me to use the same tagging for the E-J searches that I use in SQLite for the J-E searches. [note: everything’s just filed under “keyword:” in this first pass, and the valid values are the same as the advanced-search checkboxes]
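
For the curious, the keyword support amounts to about this much code with Search::Xapian. This is a sketch rather than my production script; the database path and the internal “XK” term prefix are placeholders:

    #!/usr/bin/perl
    # Sketch of the query handling described above; the path and the
    # "XK" term prefix are placeholders, not my real schema.
    use strict;
    use warnings;
    use Search::Xapian (':all');

    my $db = Search::Xapian::Database->new('/path/to/edict-index');

    my $qp = Search::Xapian::QueryParser->new();
    $qp->set_database($db);
    $qp->set_stemmer(Search::Xapian::Stem->new('english'));
    $qp->set_default_op(OP_AND);

    # map the user-visible "keyword:" filter onto an internal term prefix
    $qp->add_boolean_prefix('keyword', 'XK');

    my $query = $qp->parse_query('lunch -keyword:vulgar',
                                 FLAG_BOOLEAN | FLAG_LOVEHATE);

    my $enquire = Search::Xapian::Enquire->new($db);
    $enquire->set_query($query);

    foreach my $match ($enquire->matches(0, 20)) {
        printf "%3d%% %s\n", $match->get_percent,
            $match->get_document->get_data;
    }

The add_boolean_prefix call is what makes “-keyword:vulgar” act as a filter instead of just another search term.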

I need a full-text search to do English-Japanese, because the JMdict data isn’t really designed for it. There are hooks in the XML schema, but they’re not used yet. As a result, my search results are a bit half-assed, which makes the new query support useful for refining the results. I can also split out the French, German, and Russian glosses into their own correctly-stemmed searches; with Sphinx, there was one primary body field to search, so all the glosses were lumped together. With a small code change, I can tag each gloss with the correct ISO language code and index them correctly.
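
The “small code change” is essentially giving each gloss its own stemmer and term prefix at indexing time. A sketch, where the “XG”+language prefixes, the entry layout, and the sample glosses are all invented for illustration:

    # Sketch of per-language gloss indexing; the "XG" prefixes, the
    # entry layout, and the database path are assumptions, not my schema.
    use strict;
    use warnings;
    use utf8;
    use Search::Xapian (':all');

    my %stem = (
        en => Search::Xapian::Stem->new('english'),
        fr => Search::Xapian::Stem->new('french'),
        de => Search::Xapian::Stem->new('german'),
        ru => Search::Xapian::Stem->new('russian'),
    );

    my $db = Search::Xapian::WritableDatabase->new('/path/to/edict-index',
                                                   DB_CREATE_OR_OPEN);
    my $tg = Search::Xapian::TermGenerator->new();

    # $entry: { data => '...', glosses => [ [ lang, text ], ... ] }
    sub index_entry {
        my ($entry) = @_;
        my $doc = Search::Xapian::Document->new();
        $doc->set_data($entry->{data});
        $tg->set_document($doc);
        for my $gloss (@{ $entry->{glosses} }) {
            my ($lang, $text) = @$gloss;
            $tg->set_stemmer($stem{$lang});
            # file each gloss under its own language-specific prefix
            $tg->index_text($text, 1, 'XG' . uc $lang);
        }
        $db->add_document($doc);
    }

    # example call with made-up glosses
    index_entry({
        data    => '駅弁 【えきべん】',
        glosses => [ [ en => 'boxed lunch sold at a station' ],
                     [ fr => 'panier-repas vendu dans une gare' ] ],
    });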

The new version is now live on jgreely.net/dict, which means I should be able to move that domain over to the shared-hosting account soon.

Once I figured out how to use Xapian (through the Search::Xapian Perl module, of course), replacing Sphinx and adding the keyword support took a few minutes and maybe half a page of code, total. In theory, I could use it for the J-E searches as well, but I’d lose the ability to put wildcards anywhere in the search string, which comes in handy when I’m trying to track down obscure or obsolete words.

One thing I haven’t figured out is why I can’t use add_term with kanji arguments; both Xapian and Perl are working entirely in Unicode, but passing non-ASCII arguments to add_term throws an error. The workaround is to set the stemmer to “none” and use index_text, and that’s fast enough that I don’t need to worry about it right now.
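
Concretely, the workaround looks something like this (the “XJ” prefix, the path, and the document data are placeholders):

    # The add_term workaround, roughly: run the kanji through a
    # TermGenerator with stemming disabled instead of calling
    # $doc->add_term('XJ駅弁') directly. "XJ" and the path are placeholders.
    use strict;
    use warnings;
    use utf8;
    use Search::Xapian (':all');

    my $db  = Search::Xapian::WritableDatabase->new('/path/to/edict-index',
                                                    DB_CREATE_OR_OPEN);
    my $doc = Search::Xapian::Document->new();
    $doc->set_data('駅弁 【えきべん】 boxed lunch sold at a station');

    my $tg = Search::Xapian::TermGenerator->new();
    $tg->set_stemmer(Search::Xapian::Stem->new('none'));  # no stemming
    $tg->set_document($doc);
    $tg->index_text('駅弁', 1, 'XJ');

    $db->add_document($doc);

With the stemmer set to “none”, index_text just posts the prefixed terms as-is, which for a short kanji headword amounts to what add_term would have done.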

The most annoying thing about the Xapian documentation is how well-hidden the prefix support is. The details aren’t really covered in the API docs at all; you can learn how to add them to a term generator or query parser, but the really useful explanation is over in the Omega docs.

“Need a clue, take a clue,
 got a clue, leave a clue”