Saturday, November 30 2002


Is there a reason I should care what scripting language your site is implemented in this week?

Is there a reason I should care what variable names your script uses this week?

Is there a reason I should care what directory you store your script in this week?

Is there a reason why I should see any implementation details at all, or be forced to try to cut and paste a 494-byte URL when I want to recommend your site to a friend?

And should it be harder to make a sensible URL than a ludicrous one?


Friday, December 27 2002

Baby’s First Perl Module

My blushes. I’ve been hacking in Perl since version 2.0 was released. In some ways, it’s my native programming language. It’s certainly the one I use the most, and the tool I reach for first when I need to solve a problem.

But I haven’t kept up. Until quite recently, I was really still writing Perl3 with some Perl4 features, and regarded many of the Perl5-isms with horror. It felt like the Uh-Oh programmers had crufted up my favorite toy, and it didn’t help that the largest example of the “New Perl” that I had to deal with was massive, sluggish, and an unbelievable memory hog (over 9,000 lines of Perl plus an 8,000 line C++ library, and it uses up 3 Gigabytes of physical memory and 3 dedicated CPUs for its 25-hour runtime (“Hi, Andrew!”)).


Sunday, July 13 2003


MTRoundRobin

Simple little MT plugin, created as a generalized alternative to FlipFlop.

Given a list of keywords to be substituted into the template, each call to <MTRoundRobin> returns a different item from the list, in order, wrapping back around to the beginning when it hits the end. Examples follow.



Teresa Nielsen Hayden of Making Light has a charming way of dealing with obnoxious commenters: she disemvowels them. This seems to be far more effective than simply trying to delete their comments or ban their IP addresses. She apparently does it by hand, in BBEdit. Bryant liked the idea enough to make a plugin that automatically strips the vowels out of comments coming from a specific set of IP addresses.

I don’t have any comments to deal with at the moment, but the concept amused me, and I wanted to start tinkering with the guts of MT, so I quickly knocked together a plugin that allows you to mark individual entries for disemvoweling. While I was at it, I included another way to molest obnoxious comments.
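The vowel-stripping itself is trivial; in shell terms it’s just a tr delete, equivalent to Perl’s tr/aeiouAEIOU//d (the sample comment is invented):

```shell
# disemvowel: delete every vowel from an obnoxious comment
echo "You are not making sense" | tr -d 'aeiouAEIOU'
# → Y r nt mkng sns
```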


D20 initiative cards

A lot of folks track combat order in D&D with index cards. I don’t know who the first person was to think of making custom index cards with a pre-printed form on them, but I first saw it at The Game Mechanics web site (great people, unfortunate choice of names).

I had just gotten back from a con where we’d run a four-party adventure with a total of five DMs, 24 players, and umpteen monsters, and the freeform index cards we used just weren’t good enough. I didn’t like the actual layout of the TGM cards, but the concept is great, and the rotate-for-character-status mechanic really improves the flow of a large combat.

My response was, of course, to come up with my own layout, adding fields and spot color to make them more useful. Along the way, I decided to increase the size from 3×5 to 4×6, greatly increasing the available space. TGM’s original cards, along with instructions on how to use them, can be found here; their forums also have several lengthy discussions on the subject.

My latest version is here. Several people have argued for a double-sided 3×5 version, and I’ve prototyped one here.

Printing Note: Acrobat has two settings that can make it annoying to print odd-sized documents: “shrink oversized pages” and “enlarge small pages.” Turn them both off if you want the cards to come out the right size.

Custom RoboRally boards

I think everyone who ever played RoboRally has toyed with the idea of making their own boards. Indeed, a quick Google will turn up dozens of sites devoted to fan-made boards and editing tools. I tried using a few of them, but the tools were clumsy and the results uninspiring.

So I did it in Adobe Illustrator, and my first original board looks like this.


Monday, July 21 2003

Live Munitions!

The most popular content is now back online: my large photo archive, consisting mostly of fully-clothed Playboy models. It’s in serious need of a complete overhaul, including rescanning every image to get rid of the worst mistakes that my flaky LS-2000 inflicted, but it’s back.

Of course, the whole collection was apparently posted to Usenet again last week, and I’m sure that a bunch of the pictures are being fraudulently sold on eBay this week, either as “real prints from the negative” or “copyright-free image CDs.” This, however, is their home, and having it back online makes it easier for me to file copyright infringement claims with ISPs.

Thursday, July 24 2003

Color combinations for web sites

I’ve stumbled across two interesting tools recently. The first is the Mac application ColorDesigner, which generates random color combinations with lots of tweakable knobs. The second is Cal Henderson’s online color-blindness simulator, designed to show you how your choice of text and background colors will appear to someone with color-deficient vision.

I decided to try to merge the two together into a single web page, using Mason and Cal’s Graphics::ColorDeficiency Perl module. It’s not done yet, but it’s already fun to play with: random web colors.

Right now, the randomizer is hardcoded, but as soon as I get a chance, I’ll be adding a form to expose all of the parameters.

Monday, July 28 2003

HTML forms suck

It didn’t shock me to discover this, but it was one of those things about the web that I hadn’t really played with seriously. Then I started trying to expose all of the parameters for my random web colors page, so people could tinker with the color-generation rules.

Not only did the form add 24K to the page size, it increased the rendering time by about an order of magnitude.


Friday, August 8 2003

PDF bullseye target generator

Spent two days this week at an Operations forum up north, and since most of the sessions had very little to do with the service I operate, I was able to do some real work while casually keeping track of the discussions.

My online target archive contains a bunch of bullseye targets I built using a Perl script. The native output format was PostScript, ’cause I like PostScript, but PDF is generally more useful today, and not everyone uses 8.5×11 paper. I hand-converted some of them, but never finished the job.

The correct solution was to completely rewrite the code using the PDF::API2::Lite Perl module, and generalize it for different paper sizes and multiple targets per page. It’s still a work in progress, but already pretty useful.

Sunday, August 17 2003

Generating random colors

My web color scheme generator is currently set up to reflect my own biases. The results are almost always readable, even for people with various forms of color-blindness, but who’s to say that my way is best?

Well, me, of course, but once or twice a year I’m willing to admit that I might be wrong about something. In recognition of that possibility, I’ll explain the syntax for the mini-language I created for the generator.


Thursday, July 8 2004

Apple’s Dashboard: sample gadget

I’m not really a programmer, though I’ve been a Perl hacker since ’88, after discovering v1.010 and asking Larry Wall where the rest of the patches were (his reply: “wait a week for 2.0”). If I’m anything, I’m a toolsmith; I mostly write small programs to solve specific problems, and usually avoid touching large projects unless they’re horribly broken in a way that affects me, and no one else can be persuaded to fix them on my schedule.

So what does this have to do with learning Japanese? Everything. I’m in the early stages of a self-study course (the well-regarded Rosetta Stone software; “ask me how to defeat their must-insert-CD-to-run copy-protection”), and authorities agree that you must learn to read using the two phonetic alphabets, Hiragana (ひらがな, used for native Japanese words) and Katakana (カタカナ, used for foreign words). A course that’s taught using Rōmaji (phonetic transcriptions using roman characters) gives you habits that will have no value in real life; Rōmaji is not used for much in Japan.

So how do you learn two complete sets of 46 symbols plus their variations and combinations, as well as their correct pronunciations? Flashcards!

The best software I’ve found for this is a Classic-only Mac application called Kana Lab (link goes direct to download), which has a lot of options for introducing the character sets, and includes recordings of a native speaker pronouncing each one. I’ve also stumbled across a number of Java and JavaScript kana flashcards, but the only one that stood out was LanguageBug, which works on Java cellphones (including my new Motorola v600).

When the misconceptions about Apple’s upcoming Dashboard feature in OS X 10.4 were cleared up (sorry, Konfabulator, it will kill your product not by being a clone, but simply by being better), I acquired a copy of the beta (why, yes, I am a paid-up member of the Apple Developer Connection) and took a look, with the goal of building a functional, flexible flashcard gadget.

Unfortunately, I’ve spent the past few years stubbornly refusing to learn JavaScript and how it’s used to manipulate HTML using the DOM, so I had to go through a little remedial course. I stopped at a Barnes & Noble on Sunday afternoon and picked up the O’Reilly JavaScript Pocket Reference and started hacking out a DHTML flashcard set, using Safari 1.2 under Panther as the platform.

Note: TextEdit and Safari do not a great DHTML IDE make. It worked, but it wasn’t fast or pretty, especially for someone who was new to JavaScript and still making stupid coding errors.

I got it working Tuesday morning, finished off the configuration form Wednesday afternoon, and squashed a few annoying bugs Wednesday night. Somewhere in there I went to work. If you’re running Safari, you can try it out here; I’ve made no attempt to cater to non-W3C DOM models, so it won’t work in Explorer or Mozilla.

There’s a lot more it could do, but right now you can select which character sets to compare, which subsets of them to include in the quiz, and you can make your guesses either by clicking with the mouse or pressing the 1-4 keys on the keyboard. I’ve deliberately kept the visual design simple, not just because I’m not a graphic designer, but also to show how Apple’s use of DHTML as the basis for gadgets makes it possible for any experienced web designer to come in and supply the chrome.

So what does it take to turn my little DHTML web page into a Dashboard gadget?


Thursday, September 2 2004

Keeping it simple

Some of my friends are starting to wear pro-Bush t-shirts more often, which has produced some hilarious results when they’re out in public. My favorite was at a gaming convention a few months back, when a hotel employee took one look at what Rory was wearing and said “you’re not serious, are you?”.

Unfortunately, I haven’t been able to find a design that I liked. So I’m working on my own. First candidate:

Bush 8, Kerry 0

Thursday, October 21 2004

Traveller PDF mapping

Sunday was a pretty slow day, so I wrote a Perl script that generated PDF hex-maps for use in the Traveller RPG (we’re starting a D20 Traveller campaign soon). I also integrated star-system data from the standard SEC format that’s been passed around on the Internet for many years, and I’m adding an assortment of features as I find time.

Currently it prints at the sector, quadrant, and subsector level, in color and b&w, on paper sizes ranging from 4×6 to 11×17. All the heavy lifting is done with the PDF::API2::Lite module from CPAN, which has a straightforward interface.

Update: I seem to have pretty good page-rank with Google, so just in case there’s anyone else out there who’s trying to set a clipping region with PDF::API2::Lite, the magic words are:

#create some kind of path, like so
#clip to it
#start a new path

Note that this doesn’t seem to work with the alpha 0.40 versions of the PDF::API2 distribution. I’m using 0.3r77.

Wednesday, November 10 2004

sec2pdf: getting started

I often say that I’m not a programmer, I’m a problem-solver who occasionally writes code to eliminate annoyances. One recent annoyance was what passes for “state of the art” in creating star maps for the Traveller RPG.


Friday, January 14 2011


[Update: significantly improved the Perl script]

The hardest part of my cunning plan isn’t “making a screensaver”; I think every OS comes with a simple image-collage module that will do that. The fun was in collecting the images and cleaning them up for use.

Amazon’s static preview images are easy to copy; just click and save. For the zoom-enabled previews, it will fall back to static images if JavaScript is disabled, so that’s just click and save as well. Unfortunately, there are an increasing number of “look inside” previews (even in the naughty-novels genre) which have no fallback, and which are not easily extracted; for now, I’ll write those off, even though some of them have gorgeous cover art.

[Update: turns out it’s easy to extract previews for the “look inside” style; just open the thumbnail in a new window, and replace everything after the first “.” with “_SS500_.jpg”.]
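The rewrite is mechanical enough to script. A sketch with an invented filename (real thumbnails follow the same ID-dot-modifier shape):

```shell
# swap the thumbnail's size modifier for the 500px one:
# everything after the first "." becomes "_SS500_.jpg"
echo '51abcXYZ.01._AA75_.jpg' | sed 's/\..*/._SS500_.jpg/'
# → 51abcXYZ._SS500_.jpg
```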

A bit of clicking around assembled a set of 121 pleasant images at more-or-less the same scale, with only one problem: large white borders. I wanted to crop these out, but a simple pass through ImageMagick’s convert -trim filter would recompress the JPEGs, reducing the quality. You can trim off almost all of the border losslessly with jpegtran -crop, but it doesn’t auto-crop; you have to give it precise sizes and offsets.

So I used both:

crop=$(convert -trim $f -format "%wx%h%X%Y" info:-)
jpegtran -crop $crop $f > crop/$f

So, what does a collage of naughty-novel cover-art look like? Here are some samples (warning: 280K 1440x900 JPEGs): 1, 2, 3, 4, 5. [Update: larger covers, and full-sized]

These were not, in fact, generated by taking screenshots of the screensaver. It takes a long time for it to fill up all the blank spots on the screen, so I wrote a small Perl script that uses the GD library to create a static collage using the full set of images. If I desaturated the results a bit, it would make a rather lively background screen. For home use.

Monday, January 17 2011

Cover sampler

First, yet another update to the collage-making Perl script. I added a number of tweaks so that I could reduce blank spots and overlap. If my interest holds up, I may add code to search for the largest empty areas more intelligently, but the current version works pretty well, and only took 13 seconds to generate a 1024x4096 collage of 481 naughty-book covers at 36% of their original size.

Warning! Clicking on the thumbnail loads a 1 megabyte JPEG that is unlikely to be work-safe:

Naughty-book cover sampler

Yes, I’m up to 481 covers. Some of them are second-rate, and some of the titles reflect subject matter less innocent than the cover images, but there’s a lot of terrific cheesecake in there. I’m probably going to track down the names of several of the artists so I can look for collections of their work.

One thing I didn’t do as I was idly gathering covers was keep track of the original pages on Amazon, so if I want to do something with the book titles, I have to type them back in by hand. This isn’t too bad, since I can sketch most unfamiliar kanji on my laptop’s trackpad, and find most of the remaining oddballs with Ben Bullock’s Multi-radical search tool, but it turns out that for many of the books, the really spicy bits are in the sub-titles, which are often too small to read. In some cases, the large title is just the series name.

Still, I’ve done enough to confidently state that at least one of the following words appears on the cover of almost every naughty novel in Japan. Sometimes three at once. Cut out and save this handy guide!


Monday, February 14 2011

Parsing Japanese with MeCab

This is a public braindump, to help out anyone who might want to parse Japanese text without being sufficiently fluent in Japanese to read technical documentation. Our weapon of choice will be the morphological analysis tool MeCab, and for sanity’s sake, we’ll do everything in UTF8-encoded Unicode.

The goal will be to take a plain text file containing Japanese text, one paragraph per line, and extract every word in its dictionary form, with the correct reading, with little or no manual intervention.
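For reference, MeCab’s default output (with the standard IPA dictionary) is one morpheme per line: the surface form, a tab, then comma-separated features ending with the dictionary form, reading, and pronunciation. Roughly:

```
これ	名詞,代名詞,一般,*,*,*,これ,コレ,コレ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
ペン	名詞,一般,*,*,*,*,ペン,ペン,ペン
です	助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS
```

Pulling out the dictionary form and reading means grabbing the seventh and eighth features from each line.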


Sunday, February 20 2011

Bridging The Gulf of Kanji

Imagine that you’re reading something in a foreign language that you’ve been studying for a while, and you hit an unfamiliar word. You know how to pronounce it, so you can often tell if it’s a place or a person’s name, and you’re pretty sure how many words you’re looking at, so if you need to look them up, you can.

When studying Japanese, the most frustrating thing about trying to graduate from reading “student material” to “real stuff” is not being able to do that. You’re reading along, feeling pretty good about yourself, and you run smack into a wall of kanji. Maybe it’s someone’s name, maybe it’s a city you’d recognize if you could pronounce it, or maybe it’s something like 厳重機密保持体制.

Taken individually, you know most or all of the characters, but together, wtf? Is it safe to skip over and work out from context, or do you need to carefully look up each character, crossing your fingers and hoping that it’s a straightforward collection of two-character nouns (which it is, by the way; literally “strictly-classified-preservation-system”, or, more loosely, “seriously top-secret”). Every time you stop to look something like this up, you lose continuity, and instead of reading, you’re deciphering.

I am far from the first person to notice this, and there are some well-developed tools for helping you read a Japanese web site, of which perhaps the best-known is Rikai. I don’t use it. I will occasionally use the built-in pop-up J-E dictionary on my Mac, which is a simpler version of the same thing, but what I really want to do is read books and short stories.


Tuesday, March 8 2011

Notes on finishing a novel

A novel in Japanese, that is, converted into a custom “student edition” at precisely my reading level, as described previously.

  1. Speed and comprehension are good; once I resolved the worst typos, parsing errors, and bugs in my scripts, I was able to read at a comfortable pace with only occasional confusion. Words that didn’t get looked up correctly are generally isolated and easy to work out from context, and most of the cases where I had to stop and read a sentence several times turned out to be odd because the thing they were describing was odd (such as what the guard does before allowing the original Kino to enter the city in Natural Rights). Of course, it helps to have a general knowledge of the material.
  2. Coliseum was changed significantly for the animated version of Kino’s Journey; the original story leaves most of the opponents shallow and one-dimensional, and spends way too much time on the mechanical details of Kino’s surprise (both the preparation the night before, and the detailed description of the physical impact and aftermath). Mother’s Love, on the other hand, is a pretty straight adaptation.
  3. Casual speech and dialect don’t cause as much trouble as you might expect. MeCab handles a lot of the common ones, and recovers well from the ones it has to punt on. They didn’t confuse me too often, either. After a while. :-)
  4. One thing that MeCab sometimes gets wrong is when a writer uses pre-masu form instead of te-form when listing a series of actions. I don’t have a good example at the moment, but I ran into several where it punted and looked for a noun.
  5. The groups that scan, OCR, and proofread novels tend to miss some simple errors where the software guessed the wrong kanji. A good example is writing 兵士 as 兵土 or 兵上. Light novels generally aren’t that complicated, and if a word looks rare or out of place, it may well be an OCR error.
  6. The IPA dictionary used by MeCab has some quirks that make it sub-optimal for use with modern fiction. Reading 空く as あく, 他 as た, 一寸 as いっすん, 間 as ま, 縁 as えん, and 身体 as しんたい are all correct sometimes, but not in some common contexts where their Ipadic priority causes MeCab to guess wrong. Worse, it has a number of relatively high-priority entries that are not in any dictionary I’ve found: 台詞 as だいし, 胡坐 as こざ, 面す and 脱す as verbs that are more common than 面する and 脱する, etc. It also has no entries for みぞれ, 呆ける, 粘度, 街路樹, and a bunch of others. Oddest of all, there are occasions where it reads 達 as いたる; this is a valid name reading, but name+達 is far more likely to be たち than いたる; some quirk of how it calculates the appropriate left/right contexts when evaluating alternatives, an aspect of the dictionary files that I definitely don’t understand.
  7. I need to make better use of the original furigana when evaluating MeCab output. I’m preserving it, but not using it to automatically detect some of the above errors. Mostly because I don’t want the scripts to become too interactive. Perhaps just flagging the major differences is sufficient.
  8. On to book 2!

Sunday, August 4 2013

Yomitori 1.0

About two and a half years ago, I threw together a set of Perl scripts that converted Japanese novels into nicely-formatted custom student editions with additional furigana and per-page vocabulary lists. I said I’d release it, but the code was pretty raw, the setup required hacking at various packages in ways I only half-remembered, and the output had some quirks. It was good enough for me to read nearly two dozen novels with decent comprehension, but not good enough to share.

When I ran out of AsoIku novels to read, I decided it was time to start over. I set fire to my toolchain, kept only snippets of the old code, and made it work without hacking on anyone else’s packages. Along the way, I switched to a much better parsing dictionary, significantly improved lookup of phrases and expressions, and made the process Unicode-clean from start to finish, with no odd detours through S-JIS.

Still some work to do (including that funny little thing called “documentation”…), but it makes much better books than the old one, and there are only a few old terrors left in the code. So now I’m sharing it.

Sunday, August 11 2013

Yomitori sample output

I’ve made quite a few improvements since putting the code up on GitHub. Just having it out in public made me clean it up a lot, but trying to produce a decent sample made an even bigger difference. QAing the output of my scripts has smoked out a number of errors in the original texts, as well as some interesting errors and inconsistencies in Unidic and JMdict.

The sample I chose to include in the GitHub repo is a short story we went through in my group reading class, Lafcadio Hearn’s Houmuraretaru Himitsu (“A Dead Secret”). The PDF files (text and vocab) are designed to be read side-by-side (I use two Kindles, but they fit nicely on most laptop screens), while the HTML version uses jQuery-based tooltips to show the vocabulary on mouseover.

For use as a sample, I left in a lot of words that I know. If I were generating this story for myself, I’d use a much larger known-words list.

Monday, August 26 2013

Yomitori expressions

MeCab/Unidic has fairly strict ideas about morphemes. For instance, in the older Ipadic morphological dictionary, 日本語 is one unit, “Nihongo = Japanese language”, while in Unidic, it’s 日本+語, “Nippon + go = Japan + language”. This has some advantages, but there are two significant disadvantages, both related to trying to look up the results in the JMdict J-E dictionary.

First, I frequently need to concatenate N adjacent morphemes before looking them up, because the resulting lexeme may have a different meaning. For instance, Unidic considers the common expression 久しぶりに to consist of the adjective 久しい, the suffix 振り, and the particle に. It also thinks that the noun 日曜日 is a compound of the two nouns 日曜 and 日, neglecting the consonant shift on the second one.

Second, there are a fair number of cases where Unidic considers something a morpheme that JMdict doesn’t even consider a distinct word. For instance, Unidic has 蹌踉き出る as a single morpheme, while JMdict and every other dictionary I’ve got considers it to be the verb 蹌踉めく (to stagger) plus the auxiliary verb 出る (to come out). The meaning is obvious if you know what 蹌踉めく means, but I can’t automatically break 蹌踉き出る into two morphemes and look them up separately, because Unidic thinks it’s only one.

To fix the second problem, I’m going to need to add a bit of code to the end of my lookup routines that says, “if I still haven’t found a meaning for this word, and it’s a verb, and it ends in a common auxiliary verb, then strip off the auxiliary, try to de-conjugate it, and run it through Mecab again”. I haven’t gotten to this yet, because it doesn’t happen too often in a single book.

To fix the first problem, I start by trying to look up an entire clause as a single word, then trim off the last morpheme and try again, repeating until I either get a match or run out of morphemes. I had built up an elaborate and mostly-successful set of heuristics to prevent false positives, but they had the side effect of also eliminating many perfectly good compounds and expressions. And JMdict actually has quite a few lengthy expressions and set phrases, so while half-assed heuristics were worthwhile, making them better would pay off quickly.

Today, while working on something else entirely, I realized that there was a simple way (conceptually simple, that is; implementation took a few tries) to eliminate a lot of false positives and a lot of the heuristics: pass the search results back through Mecab and see if it produces the same number of morphemes with the same parts of speech.

So, given a string of morphemes like: (い, て, も, 立っ, て, も, い, られ, ない, ほど, に, なっ, て, いる), on the sixth pass I look up いても立ってもいられない and find a match (居ても立っても居られない, “unable to contain oneself”) that breaks down into the same number of morphemes of the same type. There’s still a chance it could have chosen a wrong word somewhere (and, in fact, it did; the initial い was parsed as 行く rather than 居る, so a stricter comparison would have failed), but as a heuristic, it works much better than everything else I’ve tried, and has found some pretty impressive matches:

  • 口が裂けても言えない (I) won’t say anything no matter what
  • たまったものではない intolerable; unbearable
  • 痛くもかゆくもない of no concern at all; no skin off my nose
  • 火を見るより明らか perfectly obvious; plain as daylight
  • 言わんこっちゃない I told you so
  • どちらかと言えば if pushed I’d say
  • と言えなくもない (one) could even say that
  • 似ても似つかない not bearing the slightest resemblance
  • 痛い目に遭わせる to make (a person) pay for (something)
  • と言って聞かない insisting
  • 聞き捨てならない can’t be allowed to pass (without comment)

I’ve updated the samples I created a few weeks ago to use the new parsing. Even on that short story, it makes a few sections clearer.

[Update: one of the rare false positives from this approach: 仲には and 中には break down the same, and since the second one is in JMdict as an expression, doing a kana-only lookup on なかには will falsely apply the meaning “among (them)” to 仲には. Because of variations in orthography, I have to allow for kana-only lookups, especially for expressions, but fortunately this sort of false match is rare and easy to catch while reading.]

Saturday, September 28 2013

Yomitori for Windows

The hardest part of getting my Japanese-novel-hacking scripts working on Windows was figuring out how to build the Text::MeCab Perl module. Strawberry Perl includes a complete development environment, but the supplied library file libmecab.lib simply didn’t work. I found some instructions on how to build MeCab from source with MinGW, but that was not a user-friendly install.

However, there were enough clues scattered around that I was able to figure out how to use the libmecab.DLL file that was hidden in another directory:

copy "\Program Files\MeCab\sdk\mecab.h" \strawberry\c\include
copy "\Program Files\MeCab\bin\libmecab.dll" \strawberry\perl\bin
cd \strawberry\perl\bin
pexports libmecab.dll > libmecab.def
dlltool -D libmecab.dll -l libmecab.a -d libmecab.def
move libmecab.a ..\..\c\lib
del libmecab.def
cpanm --interactive Text::MeCab

The interactive install is necessary because there’s no mecab-config shell script in the Windows distribution. When cpanm asks, the manual configuration values are:

version: 0.996
compiler flags: -DWIN32
linker flags: -lmecab
encoding: utf-8