Saturday, July 31 2004

Why I prefer Perl to JavaScript, reason #3154

For amusement, I decided that my next Dashboard gadget should be a tool for looking up characters in KANJIDIC using Jack Halpern’s SKIP system.

SKIP is basically a hash-coding system for ideographs that doesn’t rely on extensive knowledge of how they’re constructed. Once you’ve figured out how to count strokes reliably, you simply break the character into two parts according to one of several patterns, and count the number of strokes in each part. It’s not quite that simple, but almost, and it’s a lot more novice-friendly than traditional lookup methods.

Downside? The simplicity of the system results in a large number of hash collisions (only 584 distinct SKIP codes for the 6,355 characters in KANJIDIC). In the print dictionaries the system was designed for, this is handled by grouping together entries that share the same first part. Conveniently, unicode sorting seems to produce much the same effect, although a program can’t identify the groups without additional information. A simple supplementary index can easily be constructed for the relatively few SKIP codes with an absurd collision count (1-3-8 is the biggest, at 161), so it’s feasible to create a DHTML form that lets you locate any unknown kanji by just selecting from a few pulldown menus.

For various reasons, it just wasn’t a good idea to attempt to parse KANJIDIC directly from JavaScript (among other things, everything is encoded in EUC-JP instead of UTF-8), so I quickly knocked together a Perl script that read the dictionary into a SKIP-indexed data structure, and wrote it back out as a JavaScript array initialization.

Which didn’t work the first time, because, unlike Perl, you can’t have trailing commas in array or object literals. That is, this is illegal:

var skipcode = [
  {
    s1:{
      s1:['儿','八',],
      s2:['小','巛','川',],
      s3:['心','水',],
      s4:['必','旧',],
      s7:['承',],
      s8:['胤',],
      s11:['順',],
    },
  },
];

Do you know how annoying it is to have to insert extra code for “add a comma unless you’re the last item at this level” when you’re pretty-printing a complex data structure? Yes, I’m sure there are all sorts of good reasons why you shouldn’t allow those commas to exist, but gosh-darnit, they’re convenient!