Web

A for effort, guys, but...


…I don’t sit twenty feet away from my laptop, and the subject line gives it away as spam anyway:

Please, crash for me


I have never been more annoyed at an application’s failure to fail.

We have this service daemon that performs various actions on incoming images. Recently, it’s been crashing at random intervals, leaving behind a core file that tells us precisely what function it segfaulted in, but includes nothing to tell us where the image came from. All we know is that somewhere out on the Internet, there are JPEG images that crash our copy of the IJG JPEG library in jpeg_idct_ifast().

Since this was affecting customer performance, we really wanted to know where those images were coming from, so we cranked up the logging on one of the thirty-two affected servers to capture the incoming request URLs. And it hasn’t crashed since.

Four days of crashes every hour or so, and now nothing. The good news is that our customers are less unhappy. The bad news is that our developers don’t have a test case to code a fix against.

So now I’m trolling the web, looking for corrupt JPEGs. I strongly suspect that the images that caused our problem were intended to exploit holes in a certain other OS, but I can’t be sure until I find some and feed them to our server. Sigh.

Comment spam


Someone finally got around to automating a comment-spamming tool that evaded my trivial protections (rename MT CGI scripts, force preview before post). Naturally, they decided to send six different comments to three or four different articles, about a dozen times each.

Sadly for them, they put their web site into the commenter’s URL field, which I don’t display, so their efforts were in vain. Even worse, from their point of view, they sent them all from the same IP address, which meant it took about thirty seconds to clean things up. And another five to ban their entire netblock at the firewall. I didn’t even need to rebuild, since the comment pages aren’t cached (another trivial change from the defaults).

I think for the next pass, I’ll change the comment URL from /mt/hasturhasturhastur to /murfle/gleep. The best defense against automation is diversity.

My Evil Twin


I didn’t know I had one, but then he ordered some Mac stuff from a Yahoo store and accidentally entered my .Mac email address instead of his very similar one. Since the shipping and billing addresses were in Boca Raton, Florida, and I’m in California, this looked an awful lot like identity theft, which makes for a lovely way to spend a Friday evening. After calling all of my credit-card vendors to check for suspicious charges, changing several passwords, and other financial fire-drilling, I thought to look up the phone numbers from the invoice with anywho. Sure enough, there’s a Jay Greely in Boca Raton, and he lives at that address.

Who knew?

Update: Just talked to Jay’s wife, and it turns out that they bought their first Mac yesterday, and he apparently misremembered their shiny new .Mac email address.

"Hey, where did the pictures go?"


I took the munitions.com web site down for the night. We’re trying to diagnose an odd TCP error that keeps some people from seeing any of my sites, and the current suspect is the packet filter.

Of course, no packet filter means no bandwidth throttling, and no bandwidth throttling means that all those pictures of happy smiling Playboy models get downloaded a lot more. This gets expensive rather quickly…

Update: back now. Definitely something in either PF or my ruleset that’s interacting badly with Fedora’s latest update to TCP window scaling. The only thing I can think of is the scrub rule, so I’ve commented it out for now.

Why I prefer Perl to JavaScript, reason #3154


For amusement, I decided that my next Dashboard gadget should be a tool for looking up characters in KANJIDIC using Jack Halpern’s SKIP system.

SKIP is basically a hash-coding system for ideographs that doesn’t rely on extensive knowledge of how they’re constructed. Once you’ve figured out how to count strokes reliably, you simply break the character into two parts according to one of several patterns, and count the number of strokes in each part. It’s not quite that simple, but almost, and it’s a lot more novice-friendly than traditional lookup methods.

Downside? The simplicity of the system results in a large number of hash collisions (only 584 distinct SKIP codes for the 6,355 characters in KANJIDIC). In the print dictionaries the system was designed for, this is handled by grouping together entries that share the same first part. Conveniently, Unicode sorting seems to produce much the same effect, although a program can’t identify the groups without additional information. A simple supplementary index can easily be constructed for the relatively few SKIP codes with an absurd collision count (1-3-8 is the biggest, at 161), so it’s feasible to create a DHTML form that lets you locate any unknown kanji by just selecting from a few pulldown menus.
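To make the collision problem concrete, here is a rough sketch of how a program might rank SKIP codes by how crowded they are. It is not the code behind the gadget. The function name `collisionCounts` is made up for illustration, and the table layout is an assumption read off the sample literal later in this post: array index + 1 is taken as the pattern number, and the two levels of `sN` keys as the stroke counts of each part.

```javascript
// Walk a nested SKIP table and count the characters that share each code.
// Assumed layout (not stated in the post): array index + 1 is the pattern
// number, and the two levels of sN keys are the stroke counts of each part.
function collisionCounts(table) {
  const counts = [];
  table.forEach((byFirst, patternIndex) => {
    for (const [firstKey, bySecond] of Object.entries(byFirst)) {
      for (const [secondKey, chars] of Object.entries(bySecond)) {
        const code = [patternIndex + 1, firstKey.slice(1), secondKey.slice(1)].join('-');
        counts.push({ code, count: chars.length });
      }
    }
  });
  // Most crowded codes first: those are the ones that would need a
  // supplementary index.
  return counts.sort((a, b) => b.count - a.count);
}

// A fragment of the sample table shown later in this post.
const fragment = [
  { s1: { s1: ['儿', '八'], s2: ['小', '巛', '川'], s7: ['承'] } },
];
const ranked = collisionCounts(fragment);
```

Any code whose count is too large for a single pulldown would be a candidate for the supplementary index; where to draw that line is a design choice the post doesn't specify.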

For various reasons, it just wasn’t a good idea to attempt to parse KANJIDIC directly from JavaScript (among other things, everything is encoded in EUC-JP instead of UTF-8), so I quickly knocked together a Perl script that read the dictionary into a SKIP-indexed data structure, and wrote it back out as a JavaScript array initialization.

Which didn’t work the first time, because JavaScript, unlike Perl, doesn’t allow trailing commas in array or object literals. That is, this is illegal:

var skipcode = [
  {
    s1:{
      s1:['儿','八',],
      s2:['小','巛','川',],
      s3:['心','水',],
      s4:['必','旧',],
      s7:['承',],
      s8:['胤',],
      s11:['順',],
    },
  },
];

Do you know how annoying it is to have to insert extra code for “add a comma unless you’re the last item at this level” when you’re pretty-printing a complex data structure? Yes, I’m sure there are all sorts of good reasons why you shouldn’t allow those commas to exist, but gosh-darnit, they’re convenient!
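One common way around the "unless you’re the last item" bookkeeping is to render each child first and then join the pieces, so separators only ever land between items. The sketch below shows the idea in JavaScript rather than Perl, so it is not the script described above; `toLiteral` is a hypothetical helper, and it assumes every object key is a valid identifier that can be written unquoted. (ECMAScript 5 and later accept a trailing comma in object literals, so this mostly matters for older engines.)

```javascript
// Serialize nested arrays/objects as a JavaScript literal with no trailing
// commas. Joining already-rendered children puts separators only between
// items, so there is no "am I the last one?" check anywhere.
function toLiteral(value, indent = '') {
  const inner = indent + '  ';
  if (Array.isArray(value)) {
    // Arrays are kept on one line, like the character lists in the sample.
    return '[' + value.map((v) => toLiteral(v, inner)).join(',') + ']';
  }
  if (value !== null && typeof value === 'object') {
    // Assumption: keys are valid identifiers, so they are emitted unquoted.
    const members = Object.keys(value).map(
      (key) => inner + key + ':' + toLiteral(value[key], inner)
    );
    return '{\n' + members.join(',\n') + '\n' + indent + '}';
  }
  // Strings come out double-quoted via JSON.stringify, not single-quoted.
  return JSON.stringify(value);
}

const sample = [{ s1: { s1: ['儿', '八'], s7: ['承'] } }];
const literal = 'var skipcode = ' + toLiteral(sample) + ';';
```

The same join-the-children trick works in Perl with `join`, which is probably the least painful fix for a pretty-printer.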

No wonder BabelFish has problems with Japanese...


I mentioned in a recent comment thread that I had developed some sympathy for BabelFish’s entertaining but mostly useless translations from Japanese to English.

It started with Mahoromatic, a manga and anime series that I’m generally quite fond of. The official web site for the anime includes a lot of merchandise, and I was interested in finding out more about some of the stuff that hasn’t been officially imported to the US market. So I asked BabelFish to translate the pages, expecting to be able to make at least a little sense of the results.

It was worse than I expected, and it took me a while to figure out why. At the time, the translation engine left intact everything it had been unable to convert to English, which gave me some important clues. It was also possible to paste Unicode text into the translation window and get direct translations, which helped me narrow down the problems. [Sadly, both of these features disappeared a few months ago, making BabelFish a lot less useful.]

The clues started with the name of the show and its main character, both of which are written in hiragana on the site. BabelFish reliably converted まほろまてぃっく (Mahoromatic) to “ま top wait the ぃっく” and まほろ (Mahoro) to “ま top”. The one that really got me, though, was the live concert DVD, whose title went from 「まほろまてぃっく らいぶ!&Music Clips」 to “ま top wait the ぃっく leprosy ぶ! & in the midst of Music Clips”.

With apologies to Vernor Vinge, it was a case of “leprosy as the key insight”. It was so absurd, so out of place, that it had to be important. Fortunately, the little kana “turds” that BabelFish left behind told me exactly which hiragana characters it had translated as leprosy: らい (rai).

But rai doesn’t mean leprosy. Raibyou (らいびょう or 癩病) does, but on its own, rai is one of “since”, “defeat”, or the English loanword “lie” (which should properly be written in katakana, as ライ). So where did it come from? Rai is the pronunciation of the kanji 癩, which means leprosy. Except that it doesn’t, quite.

Here’s where it gets complicated. Every kanji character has one or more meanings and pronunciations. Some pronunciations came along for the ride when the character was borrowed from China (the ON-reading), others are native inventions (the KUN-reading), but neither is necessarily a Japanese word. There are plenty of words that consist of a single kanji, such as 犬 (inu, “dog”), but not all single kanji are words.

Our friend rai is one of the latter. It has only a single ON-reading, which means leprosy, but the Japanese word for leprosy is formed by appending another kanji, 病 (byou, “sick”). So while rai really does mean leprosy, it’s not the word for leprosy. BabelFish, convinced that anything written in hiragana must be a native Japanese word, is simply trying too hard.

So what was it supposed to be? That little leftover kana at the end (ぶ) was “bu”, making the complete word “raibu”. Say it out loud, remembering that the Japanese have trouble pronouncing “l” and “v”, and it becomes “live”. The correct translation of the title should be “Mahoromatic Live! & Music Clips”; neither of the words in hiragana should be translated, because they’re not Japanese words.

In fairness to BabelFish, the folks responsible for Mahoromatic have played a dirty trick on it. It’s actually a pretty good rule of thumb that something written in hiragana is Japanese and something written in katakana is not, and, sure enough, if you feed in ライブ instead of らいぶ, it will correctly come back as “live”.
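That katakana retry is easy to automate, because the Unicode hiragana block (U+3041 to U+3096) lines up with the katakana block exactly 0x60 code points higher. The following is only a sketch of the conversion; `hiraganaToKatakana` is a made-up name, and whether any translation engine would then do the right thing with the result is not something the code can promise.

```javascript
// Shift each hiragana code point (U+3041 to U+3096) up by 0x60 to reach the
// matching katakana (U+30A1 to U+30F6); everything else passes through.
function hiraganaToKatakana(text) {
  return Array.from(text, (ch) => {
    const code = ch.codePointAt(0);
    return code >= 0x3041 && code <= 0x3096
      ? String.fromCodePoint(code + 0x60)
      : ch;
  }).join('');
}

// The word from the DVD title, rewritten the way a loanword is normally spelled.
const retry = hiraganaToKatakana('らいぶ'); // 'ライブ'
```

Punctuation, kanji, and Latin text are left alone, so a whole title can be passed through in one go.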

I fell for this, too, when I tried to figure out the full title of the Mahoromatic adventure game 「まほろまてぃっく☆あどべんちゃ」. The part after the star (adobencha) is written in hiragana, so I tried to interpret it as Japanese. I knew I’d gotten it wrong when I came up with “conveniently leftover tea”, but I didn’t realize I’d been BabelFished until I said it out loud.

Isn’t Japanese fun? My latest surprise came when my Rosetta Stone self-study course threw up the word ビーだま (biidama, “marble” (the toy kind)). I thought it was a typo at first, this word that was half-katakana, half-hiragana, but switching the software over to the full kanji mode converted it to ビー玉, and, sure enough, that’s what it looked like in my dictionary.

A little digging with JEdict provided the answer. Back in the days when the Portuguese started trading with Japan, their word vidro (“windowpane”) was adopted as the generic word for glass, becoming biidoro (ビードロ). The native word for sphere is dama (だま or 玉). Mash them together, and you’ve got a “glass sphere”, or a marble. Don’t go looking for other words based on biidoro, though, because it fell out of fashion a few centuries back; modern loanwords are based on garasu.

Apple's Dashboard: sample gadget


I’m not really a programmer; I’ve been a Perl hacker since ’88, though, after discovering v1.010 and asking Larry Wall where the rest of the patches were (his reply: “wait a week for 2.0”). If I’m anything, I’m a toolsmith; I mostly write small programs to solve specific problems, and usually avoid touching large projects unless they’re horribly broken in a way that affects me, and no one else can be persuaded to fix them on my schedule.

So what does this have to do with learning Japanese? Everything. I’m in the early stages of a self-study course (the well-regarded Rosetta Stone software; “ask me how to defeat their must-insert-CD-to-run copy-protection”), and authorities agree that you must learn to read using the two phonetic alphabets, Hiragana (ひらがな, used for native Japanese words) and Katakana (カタカナ, used for foreign words). A course that’s taught using Rōmaji (phonetic transcriptions using roman characters) gives you habits that will have no value in real life; Rōmaji is not used for much in Japan.

So how do you learn two complete sets of 46 symbols plus their variations and combinations, as well as their correct pronunciations? Flashcards!

The best software I’ve found for this is a Classic-only Mac application called Kana Lab (link goes direct to download), which has a lot of options for introducing the character sets, and includes recordings of a native speaker pronouncing each one. I’ve also stumbled across a number of Java and JavaScript kana flashcards, but the only one that stood out was LanguageBug, which works on Java cellphones (including my new Motorola v600).

When the misconceptions about Apple’s upcoming Dashboard feature in OS X 10.4 were cleared up (sorry, Konfabulator, it will kill your product not by being a clone, but simply by being better), I acquired a copy of the beta (why, yes, I am a paid-up member of the Apple Developer Connection) and took a look, with the goal of building a functional, flexible flashcard gadget.

Unfortunately, I’ve spent the past few years stubbornly refusing to learn JavaScript and how it’s used to manipulate HTML using the DOM, so I had to go through a little remedial course. I stopped at a Barnes & Noble on Sunday afternoon and picked up the O’Reilly JavaScript Pocket Reference and started hacking out a DHTML flashcard set, using Safari 1.2 under Panther as the platform.

Note: TextEdit and Safari do not a great DHTML IDE make. It worked, but it wasn’t fast or pretty, especially for someone who was new to JavaScript and still making stupid coding errors.

I got it working Tuesday morning, finished off the configuration form Wednesday afternoon, and squashed a few annoying bugs Wednesday night. Somewhere in there I went to work. If you’re running Safari, you can try it out here; I’ve made no attempt to cater to non-W3C DOM models, so it won’t work in Explorer or Mozilla.

There’s a lot more it could do, but right now you can select which character sets to compare, which subsets of them to include in the quiz, and you can make your guesses either by clicking with the mouse or pressing the 1-4 keys on the keyboard. I’ve deliberately kept the visual design simple, not just because I’m not a graphic designer, but also to show how Apple’s use of DHTML as the basis for gadgets makes it possible for any experienced web designer to come in and supply the chrome.
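For anyone curious about the mechanics, the core of a four-choice quiz like this can be quite small. What follows is a minimal sketch under my own assumptions, not the code behind the page: `makeQuestion`, `checkGuess`, and the card format are all hypothetical, and the random source is passed in so the logic can be exercised predictably.

```javascript
// Build one multiple-choice question: a prompt character plus four candidate
// readings, exactly one of which is correct. `rng` is injectable for testing.
function makeQuestion(cards, rng = Math.random) {
  if (cards.length < 4) throw new Error('need at least four cards');
  // Fisher-Yates shuffle on a copy, then take the first four cards.
  const pool = cards.slice();
  for (let i = pool.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  const choices = pool.slice(0, 4);
  const answer = Math.floor(rng() * 4); // index of the correct choice
  return {
    prompt: choices[answer].kana,
    choices: choices.map((c) => c.romaji),
    answer,
  };
}

// Keys '1' to '4' map to choice indexes 0 to 3; anything else returns null.
function checkGuess(question, key) {
  const index = Number(key) - 1;
  if (!Number.isInteger(index) || index < 0 || index > 3) return null;
  return index === question.answer;
}

// Illustrative card data: the five hiragana vowels.
const hiragana = [
  { kana: 'あ', romaji: 'a' },
  { kana: 'い', romaji: 'i' },
  { kana: 'う', romaji: 'u' },
  { kana: 'え', romaji: 'e' },
  { kana: 'お', romaji: 'o' },
];
const question = makeQuestion(hiragana);
```

A keypress handler would pass the pressed key to `checkGuess`, and a mouse click would pass the clicked choice’s position plus one; the subset selection from the configuration form would simply filter `cards` before the call.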

So what does it take to turn my little DHTML web page into a Dashboard gadget?

“Need a clue, take a clue,
 got a clue, leave a clue”