August 2013

Eternal sweetness on Amazon

One kilo of pure Sucralose powder, for ~$200.

This is either a lifetime supply, or a lifetime supply, much like the kilo of pure caffeine, which is about a hundred lethal doses.


(via the NSFW BC Ikusani)

Yomitori 1.0

About two and a half years ago, I threw together a set of Perl scripts that converted Japanese novels into nicely-formatted custom student editions with additional furigana and per-page vocabulary lists. I said I’d release it, but the code was pretty raw, the setup required hacking at various packages in ways I only half-remembered, and the output had some quirks. It was good enough for me to read nearly two dozen novels with decent comprehension, but not good enough to share.

When I ran out of AsoIku novels to read, I decided it was time to start over. I set fire to my toolchain, kept only snippets of the old code, and made it work without hacking on anyone else’s packages. Along the way, I switched to a much better parsing dictionary, significantly improved lookup of phrases and expressions, and made the process Unicode-clean from start to finish, with no odd detours through S-JIS.

Still some work to do (including that funny little thing called “documentation”…), but it makes much better books than the old one, and there are only a few old terrors left in the code. So now I’m sharing it.

Carve it in stone, that it shall never be forgotten

"If you don't have the social skills to phrase a polite question, Slashdot is perhaps not the ideal place to go looking for advice..."

(via, where the person quoted is actually answering the wrong question…)

Yomitori sample output

I’ve made quite a few improvements since putting the code up on GitHub. Just having it out in public made me clean it up a lot, but trying to produce a decent sample made an even bigger difference. QAing the output of my scripts has smoked out a number of errors in the original texts, as well as some interesting errors and inconsistencies in Unidic and JMdict.

The sample I chose to include in the GitHub repo is a short story we went through in my group reading class, Lafcadio Hearn’s Houmuraretaru Himitsu (“A Dead Secret”). The PDF files (text and vocab) are designed to be read side-by-side (I use two Kindles, but they fit nicely on most laptop screens), while the HTML version uses jQuery-based tooltips to show the vocabulary on mouseover.

For use as a sample, I left in a lot of words that I know. If I were generating this story for myself, I’d use a much larger known-words list.

For future reference...

…the black cable controls fan speed. I’ll need this information again soon.

Girls and guns

So, in addition to Preferential Measure Organization Stella Women’s Academy High School Division Class C3 (which apparently flopped as a manga, but is one of the few watchable anime this season), there’s at least one other schoolgirl survival-game manga running, this one successfully: Survival Game Club.

I had happened to pick up a volume of this one when I was in Osaka, and while the general idea is the same, this one opens a bit differently, with New Girl At School getting her introduction to the world of survival games by being rescued from a train molester by a stylishly-dressed but mildly insane blonde schoolgirl carrying what looks like a Beretta 9mm pistol. …who is promptly hauled off by the station employees, to her loud protests. Of course, this is the beginning of a new life for Our Heroine, in more of a wacky-antics universe than C3.

I was reminded of this series when I happened to click the Google translate button while looking at reviews of C3, and saw a more hilarious than usual mistranslation. The official title of Survival Game Club is さばげぶっ! (saba-ge-bu). Google helpfully translated this as “Sabage Bukkake!”. I can only be grateful that they didn’t go with “savage”.

Yomitori expressions

Mecab/Unidic has fairly strict ideas about morphemes. For instance, in the older Ipadic morphological dictionary, 日本語 is one unit, “Nihongo = Japanese language”, while in Unidic, it’s 日本+語, “Nippon + go = Japan + language”. This has some advantages, but there are two significant disadvantages, both related to trying to look up the results in the JMdict J-E dictionary.

First, I frequently need to concatenate N adjacent morphemes before looking them up, because the resulting lexeme may have a different meaning. For instance, Unidic considers the common expression 久しぶりに to consist of the adjective 久しい, the suffix 振り, and the particle に. It also thinks that the noun 日曜日 is a compound of the two nouns 日曜 and 日, neglecting the consonant shift on the second one.

Second, there are a fair number of cases where Unidic considers something a morpheme that JMdict doesn’t even consider a distinct word. For instance, Unidic has 蹌踉き出る as a single morpheme, while JMdict and every other dictionary I’ve got consider it to be the verb 蹌踉めく (to stagger) plus the auxiliary verb 出る (to come out). The meaning is obvious if you know what 蹌踉めく means, but I can’t automatically break 蹌踉き出る into two morphemes and look them up separately, because Unidic thinks it’s only one.

To fix the second problem, I’m going to need to add a bit of code to the end of my lookup routines that says, “if I still haven’t found a meaning for this word, and it’s a verb, and it ends in a common auxiliary verb, then strip off the auxiliary, try to de-conjugate it, and run it through Mecab again”. I haven’t gotten to this yet, because it doesn’t happen too often in a single book.
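A rough Python sketch of that fallback, since none of it exists yet. It is not the Perl in Yomitori: the auxiliary list, the ending table, and the name `split_compound` are all made up for illustration, and the candidates it returns are only guesses that would still have to go back through Mecab and JMdict.

```python
from typing import List, Optional, Tuple

# Hypothetical short list of auxiliary verbs; a real list would be longer.
AUXILIARIES = ("出る", "出す", "始める", "続ける", "込む")

# Final kana of a godan verb's continuative stem -> dictionary-form ending.
GODAN_ENDINGS = {
    "い": "う", "き": "く", "ぎ": "ぐ", "し": "す", "ち": "つ",
    "に": "ぬ", "び": "ぶ", "み": "む", "り": "る",
}


def split_compound(verb: str) -> Optional[Tuple[List[str], str]]:
    """Split off a trailing auxiliary and guess dictionary forms of the rest.

    Returns (candidate_main_verbs, auxiliary), or None if the verb does not
    end in a known auxiliary. The candidates are guesses, not verified words.
    """
    for aux in AUXILIARIES:
        if verb.endswith(aux) and len(verb) > len(aux):
            stem = verb[: -len(aux)]
            candidates = [stem + "る"]  # ichidan guess: stem + る
            ending = GODAN_ENDINGS.get(stem[-1])
            if ending is not None:
                # godan guess goes first: swap the i-row kana for its u-row form
                candidates.insert(0, stem[:-1] + ending)
            return candidates, aux
    return None


# Kana spelling of the example above, to keep the orthography simple.
print(split_compound("よろめき出る"))
```

Each candidate would be tried in turn, and the first one the parser and dictionary both recognize wins.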

To fix the first problem, I start by trying to look up an entire clause as a single word, then trim off the last morpheme and try again, repeating until I either get a match or run out of morphemes. I had built up an elaborate and mostly-successful set of heuristics to prevent false positives, but they had the side effect of also eliminating many perfectly good compounds and expressions. And JMdict actually has quite a few lengthy expressions and set phrases, so while half-assed heuristics were worthwhile, making them better would pay off quickly.
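The trim-and-retry loop is simple enough to sketch in a few lines of Python. This is an illustration rather than the Perl in Yomitori: `longest_match` and `TOY_DICT` are invented names, the glosses are placeholders, and the heuristics for rejecting false positives are left out entirely.

```python
from typing import Callable, Optional, Sequence, Tuple


def longest_match(
    morphemes: Sequence[str],
    lookup: Callable[[str], Optional[str]],
) -> Optional[Tuple[int, str, str]]:
    """Try the whole clause, then drop the last morpheme and retry.

    Returns (morphemes_used, surface, gloss) for the longest prefix
    found in the dictionary, or None if no prefix matches.
    """
    for end in range(len(morphemes), 0, -1):
        surface = "".join(morphemes[:end])
        gloss = lookup(surface)
        if gloss is not None:
            return end, surface, gloss
    return None


# Hypothetical two-entry dictionary with placeholder glosses.
TOY_DICT = {
    "久しぶりに": "after a long time",
    "久しぶり": "a long time (since the last time)",
}

clause = ["久し", "ぶり", "に", "会っ", "た"]
result = longest_match(clause, TOY_DICT.get)
print(result)
```

Because the loop starts from the full clause, the three-morpheme expression is found before the shorter 久しぶり gets a chance to match.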

Today, while working on something else entirely, I realized that there was a simple way (conceptually simple, that is; implementation took a few tries) to eliminate a lot of false positives and a lot of the heuristics: pass the search results back through Mecab and see if it produces the same number of morphemes with the same parts of speech.

So, given a string of morphemes like: (い, て, も, 立っ, て, も, い, られ, ない, ほど, に, なっ, て, いる), on the sixth pass I look up いても立ってもいられない and find a match (居ても立っても居られない, “unable to contain oneself”) that breaks down into the same number of morphemes of the same type. There’s still a chance it could have chosen a wrong word somewhere (and, in fact, it did; the initial い was parsed as 行く rather than 居る, so a stricter comparison would have failed), but as a heuristic, it works much better than everything else I’ve tried, and has found some pretty impressive matches:

  • 口が裂けても言えない (I) won't say anything no matter what
  • たまったものではない intolerable; unbearable
  • 痛くもかゆくもない of no concern at all; no skin off my nose
  • 火を見るより明らか perfectly obvious; plain as daylight
  • 言わんこっちゃない I told you so
  • どちらかと言えば if pushed I'd say
  • と言えなくもない (one) could even say that
  • 似ても似つかない not bearing the slightest resemblance
  • 痛い目に遭わせる to make (a person) pay for (something)
  • と言って聞かない insisting
  • 聞き捨てならない can't be allowed to pass (without comment)

I’ve updated the samples I created a few weeks ago to use the new parsing. Even on that short story, it makes a few sections clearer.

[Update: one of the rare false positives from this approach: 仲には and 中には break down the same, and since the second one is in JMdict as an expression, doing a kana-only lookup on なかには will falsely apply the meaning “among (them)” to 仲には. Because of variations in orthography, I have to allow for kana-only lookups, especially for expressions, but fortunately this sort of false match is rare and easy to catch while reading.]
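To make the false positive concrete, here is a small Python sketch of a lookup that falls back to the reading. It is an illustration only: the `Entry` and `Hit` types and the one-entry dictionary are invented, and the `kana_only` flag is just one possible way to mark such hits for a second look, not something Yomitori is claimed to do.

```python
from typing import Dict, List, NamedTuple, Optional


class Entry(NamedTuple):
    written: List[str]  # kanji spellings of the headword
    reading: str        # kana reading
    gloss: str


class Hit(NamedTuple):
    gloss: str
    kana_only: bool     # True when the match came from the reading alone


# Hypothetical one-entry dictionary.
ENTRIES = [Entry(["中には"], "なかには", "among (them)")]

BY_WRITTEN: Dict[str, Entry] = {w: e for e in ENTRIES for w in e.written}
BY_READING: Dict[str, Entry] = {e.reading: e for e in ENTRIES}


def lookup(surface: str, reading: str) -> Optional[Hit]:
    """Prefer an exact match on the written form, then fall back to kana."""
    entry = BY_WRITTEN.get(surface)
    if entry is not None:
        return Hit(entry.gloss, kana_only=False)
    entry = BY_READING.get(reading)
    if entry is not None:
        return Hit(entry.gloss, kana_only=True)
    return None


print(lookup("中には", "なかには"))
print(lookup("仲には", "なかには"))
```

The second call returns the gloss for 中には even though the text said 仲には, which is exactly the mismatch described above.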

Why did Bradley Manning suddenly become Chelsea?

Why make a big fuss about announcing what everyone knew well before the trial started? Because Leavenworth is an all-male prison. This makes it look less like a courageous stance by a transgender individual, and more like a cynical ploy to avoid spending the next 7-35 years in Leavenworth.

(cynical quotes around certain words in the previous paragraph have been omitted to avoid discussing the general issue of gender as a fluid concept disconnected from biology)

Thinking bad thoughts

Perhaps I’ve been away from teenage girls for too long, but my automatic reaction to this product suggests I should be kept away from them…

Duct tape? At their age?

"Shut up and take my money"

Jeff Atwood and Weyman Kwong are making a sturdy programmer’s keyboard with silent mechanical keyswitches. While I enjoy the ear-shattering clatter of my current mechanical keyboard, I’m less fond of the shoddy physical construction and poor multi-keypress handling (and, of course, I’d swallow broken glass before dealing with the assholes at Matias ever again), so this is definitely on my must-buy list.

Name that warrior...

“If it weren’t for her plump breasts, most opponents would probably mistake her for a man. Therefore she dressed daringly, exposing most of her flesh. Her skin was swarthy, with a strange design inked on her left cheek. An informed observer would recognize the design as a warding spell of the Arido hill tribes.”

That settles the question of why competent women in fantasy show so much skin: they’re feminist pioneers, working twice as hard to prove they’re just as good as men!

[this message brought to you by my attempt to figure out the word 呪払い, which apparently should be read as “noroibarai”; it’s not explained anywhere, but is in common use in the online fantasy community to refer to warding magic. See also here.]

“Need a clue, take a clue,
 got a clue, leave a clue”