Friday, December 27 2002

Baby’s First Perl Module

My blushes. I’ve been hacking in Perl since version 2.0 was released. In some ways, it’s my native programming language. It’s certainly the one I use the most, and the tool I reach for first when I need to solve a problem.

But I haven’t kept up. Until quite recently, I was really still writing Perl3 with some Perl4 features, and regarded many of the Perl5-isms with horror. It felt like the Uh-Oh programmers had crufted up my favorite toy, and it didn’t help that the largest example of the “New Perl” that I had to deal with was massive, sluggish, and an unbelievable memory hog (over 9,000 lines of Perl plus an 8,000 line C++ library, and it uses up 3 Gigabytes of physical memory and 3 dedicated CPUs for its 25-hour runtime (“Hi, Andrew!”)).


Sunday, July 13 2003


Simple little MT plugin, created as a generalized alternative to FlipFlop.

Given a list of keywords to be substituted into the template, each call to <MTRoundRobin> returns a different item from the list, in order, wrapping back around to the beginning when it hits the end. Examples follow.
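The heart of such a tag is nothing more than a counter that wraps; a minimal sketch of the rotation logic (the keyword list is made up, and the real plugin hooks this into MT's template-tag API rather than a bare subroutine):

```perl
use strict;

# Cycle through a list of keywords, wrapping back to the start;
# each call returns the next item, as <MTRoundRobin> does per use.
my @keywords = qw(odd even);
my $index    = 0;

sub next_item {
    my $item = $keywords[ $index % @keywords ];
    $index++;
    return $item;
}

print next_item(), "\n" for 1 .. 5;   # odd even odd even odd
```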



Teresa Nielsen Hayden of Making Light has a charming way of dealing with obnoxious commenters: she disemvowels them. This seems to be far more effective than simply trying to delete their comments or ban their IP addresses. She apparently does it by hand, in BBEdit. Bryant liked the idea enough to make a plugin that automatically strips the vowels out of comments coming from a specific set of IP addresses.

I don’t have any comments to deal with at the moment, but the concept amused me, and I wanted to start tinkering with the guts of MT, so I quickly knocked together a plugin that allows you to mark individual entries for disemvoweling. While I was at it, I included another way to molest obnoxious comments.
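The disemvoweling transformation itself is tiny; my guess at the core of it (not Bryant's actual code) is a single tr///:

```perl
use strict;

# Drop every vowel from a comment; tr///d deletes matched characters.
sub disemvowel {
    my ($text) = @_;
    $text =~ tr/aeiouAEIOU//d;
    return $text;
}

print disemvowel("This comment is obnoxious"), "\n";   # Ths cmmnt s bnxs
```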


Thursday, July 24 2003

Color combinations for web sites

I’ve stumbled across two interesting tools recently. The first is the Mac application ColorDesigner, which generates random color combinations with lots of tweakable knobs. The second is Cal Henderson’s online color-blindness simulator, designed to show you how your choice of text and background colors will appear to someone with color-deficient vision.

I decided to try to merge the two together into a single web page, using Mason and Cal’s Graphics::ColorDeficiency Perl module. It’s not done yet, but it’s already fun to play with: random web colors.

Right now, the randomizer is hardcoded, but as soon as I get a chance, I’ll be adding a form to expose all of the parameters.

Monday, July 28 2003

HTML forms suck

It didn’t shock me to discover this, but it was one of those things about the web that I hadn’t really played with seriously. Then I started trying to expose all of the parameters for my random web colors page, so people could tinker with the color-generation rules.

Not only did the form add 24K to the page size, it increased the rendering time by about an order of magnitude.


Friday, August 8 2003

PDF bullseye target generator

Spent two days this week at an Operations forum up north, and since most of the sessions had very little to do with the service I operate, I was able to do some real work while casually keeping track of the discussions.

My online target archive contains a bunch of bullseye targets I built using a Perl script. The native output format was PostScript, ’cause I like PostScript, but PDF is generally more useful today, and not everyone uses 8.5×11 paper. I hand-converted some of them, but never finished the job.

The correct solution was to completely rewrite the code using the PDF::API2::Lite Perl module, and generalize it for different paper sizes and multiple targets per page. It’s still a work in progress, but already pretty useful.

Saturday, August 9 2003

The Perl Script From Hell

I’ve been working with Perl since about two weeks before version 2.0 was released. Over those fifteen years, I’ve seen a lot of hairy Perl scripts, many of them mine.

None of them can compare to the monster that lurks in the depths of our service, though. Over 8,000 lines of Perl plus an 8,000-line C++ module, written in a style that’s allegedly Object Oriented, but which I would describe as Obscenely Obfuscated (“Hi, Andrew!”).

We have five large servers devoted to running it. Each contributes three CPUs, three gigabytes of memory, and 25 hours of runtime to the task (independently; we need the redundancy if one of them crashes). Five years ago, I swore a mighty oath to never, ever get involved with the damned thing.

Then it broke. In a way that involved tens of thousands of unhappy customers.


Wednesday, September 10 2003

iTMS weekly reports

No, I didn’t buy another big batch of music from the iTunes Music Store yet, although I probably will soon, to stock up the iPod for my next road trip to Las Vegas. I have been keeping an eye on the store, though, and after corresponding with Brian Tiemann, I decided to investigate an oddity we’d both noticed: the week-by-week “Just Added” report ain’t no such thing.


Saturday, July 31 2004

Why I prefer Perl to JavaScript, reason #3154

For amusement, I decided that my next Dashboard gadget should be a tool for looking up characters in KANJIDIC using Jack Halpern’s SKIP system.

SKIP is basically a hash-coding system for ideographs that doesn’t rely on extensive knowledge of how they’re constructed. Once you’ve figured out how to count strokes reliably, you simply break the character into two parts according to one of several patterns, and count the number of strokes in each part. It’s not quite that simple, but almost, and it’s a lot more novice-friendly than traditional lookup methods.

Downside? The simplicity of the system results in a large number of hash collisions (only 584 distinct SKIP codes for the 6,355 characters in KANJIDIC). In the print dictionaries the system was designed for, this is handled by grouping together entries that share the same first part. Conveniently, unicode sorting seems to produce much the same effect, although a program can’t identify the groups without additional information. A simple supplementary index can easily be constructed for the relatively few SKIP codes with an absurd collision count (1-3-8 is the biggest, at 161), so it’s feasible to create a DHTML form that lets you locate any unknown kanji by just selecting from a few pulldown menus.
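The grouping is easy to see in code. In this sketch the sample lines imitate KANJIDIC's one-line-per-character format (the SKIP code is the field prefixed with "P"), but the lines themselves are made up for illustration:

```perl
use strict;

# Index characters by SKIP code; colliding characters pile up in the
# same bucket, just as they share a page in the print dictionaries.
my @lines = (
    "亜 U4e9c P4-7-1",
    "唱 U5531 P1-3-8",
    "排 U6392 P1-3-8",
);

my %skip;
for my $line (@lines) {
    my ($kanji, @fields) = split ' ', $line;
    my ($code) = map { /^P([\d-]+)$/ ? $1 : () } @fields;
    push @{ $skip{$code} }, $kanji if defined $code;
}

for my $code (sort keys %skip) {
    printf "%s: %d characters\n", $code, scalar @{ $skip{$code} };
}
```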

For various reasons, it just wasn’t a good idea to attempt to parse KANJIDIC directly from JavaScript (among other things, everything is encoded in EUC-JP instead of UTF-8), so I quickly knocked together a Perl script that read the dictionary into a SKIP-indexed data structure, and wrote it back out as a JavaScript array initialization.

Which didn’t work the first time because, unlike Perl, JavaScript doesn’t allow trailing commas in array or object literals. That is, this is illegal:

var skipcode = [
    "1-3-8",
    "1-3-9",
];

Do you know how annoying it is to have to insert extra code for “add a comma unless you’re the last item at this level” when you’re pretty-printing a complex data structure? Yes, I’m sure there are all sorts of good reasons why you shouldn’t allow those commas to exist, but gosh-darnit, they’re convenient!
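The cleanest dodge I know of is to stop emitting separators item-by-item and let join() put commas only between elements:

```perl
use strict;

# join() puts a comma between elements but never after the last one,
# so the emitted JavaScript has no trailing comma to trip over.
my @items = ('"1-3-8"', '"1-3-9"', '"1-4-2"');
my $js = "var skipcode = [\n"
       . join(",\n", map { "    $_" } @items)
       . "\n];\n";
print $js;
```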

Friday, June 17 2005


Every time I include some Japanese text in a blog entry, I’m torn between adding furigana and, well, not. It’s extremely useful for people who don’t read kanji well, but it’s tedious to do by hand in HTML. At the same time, I find myself wishing that my Rosetta Stone courseware included furigana, so that I could hover the mouse over a word and see the pronunciation instead of switching from kanji to kana mode and back. I’d also like to see their example phrases in a better font, at higher resolution.

80 lines of Perl later:



[update: I tested this under IE on my Windows machine at work, and it correctly displayed the pop-up furigana, but ignored the CSS that highlighted the word it applied to; apparently my machine has extra magic installed, because the pop-up doesn’t use a Unicode font for some people. Sigh. Found the answer: “fixing tooltips in IE” (about halfway down the page)]


Thursday, September 29 2005

The cleansing power of Quartz

So my new Japanese class has started (lousy classroom, good teacher, reasonable textbook, nice group of students, unbelievably gorgeous teacher’s assistant (I will never skip class…)), and, as expected, the teacher is pushing us to master hiragana quickly. I did that quite a while ago, so while everyone else is trying to learn it from scratch, I can focus on improving my handwriting.

One thing she suggested was a set of flash cards. I had mine with me, and mentioned that they were available for a quite reasonable price at Kinokuniya. Her response was along the lines of “yes, I know, but nobody ever buys optional study materials; do you think you could photocopy them so I can make handouts?”

I could, but that wouldn’t be nearly as much fun as making my own set. The first step was finding a decent kana font. Mac OS X ships with several Unicode fonts that include the full Japanese kana and kanji sets, but they didn’t meet my needs: looking good at display sizes, and clearly showing the boundaries between strokes. I found Adobe Ryo Display Standard. TextEdit seems to be a bit confused about its line-height, but I wasn’t planning to create the cards in that app anyway.

How to generate the card images? Perl, of course, with the PDF::API2::Lite module. I could have written a script that calculated the correct size of cards to fill the page, but I was feeling lazy, so I wrote a 12-line script to put one large character in the middle of a page, loaded the results into Preview, set the print format to 16-up with a page border, and printed to a new PDF file. Instant flash cards.
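Something like this sketch (a core font and Latin letters stand in here so it runs anywhere; the real script used a kana-capable OpenType font via ttfont, and my sizes here are guesses):

```perl
use strict;
use PDF::API2::Lite;

# One large glyph centered on each page; Preview's 16-up printing
# does the rest. Swap corefont for ttfont('RyoDispStd.otf') to get kana.
my $pdf  = PDF::API2::Lite->new;
my $font = $pdf->corefont('Helvetica-Bold');

for my $glyph (qw(A I U E O)) {
    $pdf->page(612, 792);                  # US Letter, in points
    # print(font, size, x, y, rotation, justification, text);
    # justification 1 should center the text on x.
    $pdf->print($font, 400, 306, 250, 0, 1, $glyph);
}
$pdf->saveas('kana-cards.pdf');
```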

For many people, this would be sufficient, but one of the things sensei liked about the cards I had brought was the numbers and arrows that indicated the correct stroke order. There was no lazy way to do this, so I used Adobe Acrobat’s drawing and stamping tools. The stamping tool lets you quickly decorate a PDF file with images in many formats, so I just modified my previous script to create PDF files containing single numbers, and imported them into the stamp tool. The line-drawing tool let me make arrows, although I couldn’t figure out a simple way to set my own line-width and have it remembered (1pt was too thin after the 16-up, and 2pt had too-big arrowheads).

So why is this post titled “the cleansing power of Quartz”? Because the one-per-page annotated output from Acrobat was more than six times larger than the same file printed 16-up from Preview. Just printing the original file back to PDF shrank it by a factor of four, which, coincidentally, is almost exactly what you get when you run gzip on it…

The final results are here.

Monday, November 7 2005

Flashcards revisited

A while back, I made quick and dirty hiragana flashcards, using the Mac OS X print dialog to print single-word pages 16-up. As my Japanese class moves along, though, there’s a need for something more sophisticated. Each lesson in our book includes a number of kanji words that will be on the test, and while my previous method will work, the hard-coded font sizes and word placement get messy to maintain.

If I’m going to write an honest-to-gosh flashcard generator, though, I might as well go whole-hog and make it capable of printing study words vertically, the way they’d be printed in a book or newspaper. Learning to recognize horizontal text might get me through the test, but it’s not enough for real Japanese literacy.

Here’s the Perl script (requires PDF::API2::Lite), a horizontal example, and vertical example. You’ll need to supply the name of your own TrueType/OpenType font that includes the kanji, unless you happen to have a copy of Kozuka Mincho Pro Regular around the house.

Note that the above PDF files have been significantly reduced in size (by an order of magnitude!) by using Mac OS X’s Preview app and saving them with the Quartz filter “Reduce File Size”. The words in the sample are from the review sheet for this week’s lesson…

Update: One problem with my vertical-printing solution quickly became obvious, and I don’t have a good solution for it. The short version is “Unicode is meaning, not appearance”, so variant glyphs can’t be easily selected, even if they’re present in your font. Specifically, the katakana prolonged-sound mark 「ー」 should be a vertical line when you’re writing vertically. Also, all of the small kana 「ぁぃぅぇぉっ」 should be offset up and to the right, and good fonts include correct variants, but I can cheat on that one; I just need to move the glyph, not change its shape.

No one seems to have figured out the necessary font-encoding tricks to pull this off with PDF::API2. At least, it’s not turning up in any google incantation I try, which leaves me with one conceptually disgusting workaround: rotate and flip. Calligraphers and type-weenies will cringe, but at text sizes it will pass. The correct character is on the left:

vertical kana hack

Now to write the code for both workarounds…

[side note: Adobe’s premier software suite is remarkably fragile; I just got it into a state where I couldn’t run Photoshop. How? I started Illustrator, which opened Adobe’s software-update tool in the background, then quit Illustrator. When I started Photoshop, it tried to open the update tool again, couldn’t, and crashed.]

Saturday, April 8 2006

“What do you do with a B6 notebook?”

(note: for some reason, my brain keeps trying to replace the last two words in the subject with “drunken sailor”; can’t imagine why)

Kyokuto makes some very nice notebooks. Sturdy covers in leather or plastic, convenient size, and nicely formatted refill pages. I found them at MaiDo Stationery, but Kinokuniya carries some of them as well. I like the B6 size best for portability; B5 is more of an office/classroom size, and A5 just seems to be both too big and too small. B6 is also the size that Kodansha publishes all their Japanese reference books in, including my kanji dictionary, which is a nice bonus.

[This is, by the way, the Japanese B6 size rather than the rarely-used ISO B-series. When Japan adopted the ISO paper standard, the B-series looked just a wee bit too small, so they redefined it to have 50% larger area than the corresponding A-series size. Wikipedia has the gory details.]

I really like the layout of Kyokuto’s refill paper. So much so, in fact, that I used PDF::API2::Lite to clone it. See? The script is a little rough at the moment, mostly because it also does 5mm grid paper, 20x20 tategaki report paper, and B8/3 flashcards, and I’m currently adding kanji practice grids with the characters printed in gray in my Kyoukasho-tai font. I’ll post it after it’s cleaned up.
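The 5mm grid is the simplest piece of that; a sketch, assuming PDF::API2::Lite passes the usual content methods (move, line, stroke) through to the page, with margins and gray level chosen arbitrarily:

```perl
use strict;
use PDF::API2::Lite;

# 5 mm grid on A4; 1 mm = 72/25.4 points.
my $mm   = 72 / 25.4;
my ($w, $h) = (210 * $mm, 297 * $mm);      # A4 in points
my $step = 5 * $mm;

my $pdf = PDF::API2::Lite->new;
$pdf->page($w, $h);
$pdf->linewidth(0.25);
$pdf->strokecolor('#999999');

for (my $x = $step; $x < $w; $x += $step) {   # vertical rules
    $pdf->move($x, 0);
    $pdf->line($x, $h);
}
for (my $y = $step; $y < $h; $y += $step) {   # horizontal rules
    $pdf->move(0, $y);
    $pdf->line($w, $y);
}
$pdf->stroke;
$pdf->saveas('grid-5mm.pdf');
```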

Why, yes, I was stuck in the office today watching a server upgrade run. However did you guess?

On a related note, am I the only person in the world who thinks that it’s silly to spend $25+ on one of those gaudy throwaway “journals” that are pretty much the only thing you can find in book and stationery stores these days? Leather/wood/fancy cover, magnet/strap/sticks to hold it shut, handmade/decorated (possibly even scented) papers, etc, etc. No doubt the folks who buy these things also carry a fountain pen with which to engrave their profound thoughts upon the page.

Or just to help them impress other posers.

Thursday, May 18 2006

PDF::API2, kanji fonts, and me

I’d love to know why this PDF file displays its text correctly in Acrobat Reader, but not in Preview (compare to this one, which does). Admittedly, the application generating it is including the entire font, not just the subset containing the characters used (which is why it’s so bloody huge), but it’s a perfectly reasonable thing to do in PDF. A bit rude to the bandwidth-impaired, perhaps, but nothing more.

While I’m on the subject of flaws in Preview, let me point out two more. One that first shipped with Tiger is the insistence on displaying and printing Aqua data-entry fields in PDF files containing Acrobat forms, even when no data has been entered. Compare and contrast with Acrobat, which only displays the field boundaries while that field has focus. Result? Any page element that overlaps a data-entry field is obscured, making it impossible to view or print the blank form. How bad could it be? This bad (I’ll have to make a screenshot for the users…).

The other problem is something I didn’t know about until yesterday (warning: long digression ahead). I’ve known for some time that only certain kanji fonts will appear in Preview when I generate PDFs with PDF::API2 (specifically, Kozuka Mincho Pro and Ricoh Seikaisho), but for a while I was willing to work with that limitation. Yesterday, however, I broke down and bought a copy of the OpenType version of DynaFont’s Kyokasho, specifically to use it in my kanji writing practice. As I sort-of expected, it didn’t work.

[Why buy this font, which wasn’t cheap? Mincho is a Chinese style used in books, magazines, etc; it doesn’t show strokes the way you’d write them by hand. Kaisho is a woodblock style that shows strokes clearly, but they’re not always the same strokes. Kyoukasho is the official style used to teach kanji writing in primary-school textbooks in Japan. (I’d link to the nice page at sci.lang.japan FAQ that shows all of them at once, but it’s not there any more, and several of the new pages are just editing stubs; I’ll have to make a sample later)]

Anyway, what I discovered was that if you open the un-Preview-able PDF in the full version of Adobe Acrobat, save it as PostScript, and then let Preview convert it back to PDF, not only does it work (see?), the file size has gone from 4.2 megabytes to 25 kilobytes. And it only takes a few seconds to perform this pair of conversions.

Wouldn’t it be great to automate this task using something like AppleScript? Yes, it would. Unfortunately, Preview is not scriptable. Thanks, guys. Fortunately, Acrobat Distiller is scriptable and just as fast.

On the subject of “why I’m doing this in the first place,” I’ve decided that the only useful order to learn new kanji in is the order they’re used in the textbooks I’m stuck with for the next four quarters. The authors don’t seem to have any sensible reasons for the order they’ve chosen, but they average 37 new kanji per lesson, so at least they’re keeping track. Since no one else uses the same order, and the textbooks provide no support for actually learning kanji, I have to roll my own.

There are three Perl scripts involved, which I’ll clean up and post eventually: the first reads a bunch of vocabulary lists and figures out which kanji are new to each lesson, sorted by stroke count and dictionary order; the second prints out the practice PDF files; the third is for vocabulary flashcards, which I posted a while back. I’ve already gone through the first two lessons with the Kaisho font, but I’m switching to the Kyoukasho now that I’ve got it working.
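The core of that first script is just a running set-difference over the lessons; a toy version with made-up vocabulary:

```perl
use strict;
use utf8;
binmode STDOUT, ':utf8';

# A kanji is "new" to the first lesson whose vocabulary uses it;
# %seen carries everything from earlier lessons forward.
my @lessons = (
    [qw(日 本 語 学)],
    [qw(日 学 生 先)],
);

my (%seen, @new_by_lesson);
for my $lesson (@lessons) {
    push @new_by_lesson, [ grep { !$seen{$_}++ } @$lesson ];
}

for my $i (0 .. $#lessons) {
    print 'Lesson ', $i + 1, ': ', "@{ $new_by_lesson[$i] }\n";
}
```

The real script also sorts each lesson's new kanji by stroke count and dictionary order, which this sketch omits.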

Putting it all together, my study sessions look like this. For each new kanji, look it up in The Kanji Learner’s Dictionary to get the stroke order, readings, and meaning; trace the Kyoukasho sample several times while mumbling the readings; write it out 15 more times on standard grid paper; write out all the readings on the same grid paper, with on-yomi in katakana and kun-yomi in hiragana, so that I practice both. When I finish all the kanji in a lesson, I write out all of the vocabulary words as well as the lesson’s sample conversation. Lather, rinse, repeat.

My minimum goal is to catch up on everything we used in the previous two quarters (~300 kanji), and then keep up with each lesson as I go through them in class. My stretch goal is to get through all of the kanji in the textbooks by the end of the summer (~1000), giving me an irregular but reasonably large working set, and probably the clearest handwriting I’ve ever had. :-)

Thursday, June 1 2006


I want a better text editor. What I really, really want, I think, is Gnu-Emacs circa 1990, with Unicode support and a fairly basic Cocoa UI. What I’ve got now is the heavily-crufted modern Gnu-Emacs supplied with Mac OS X, running in Terminal, and TextEdit when I need to type kanji into a plain-text file.

So I’ve been trying out TextWrangler recently, whose virtues include being free and supporting a reasonable subset of Emacs key-bindings. Unfortunately, the default configuration is J-hostile, and a number of settings can’t be changed for the current document, only for future opens, and its many configuration options are “less than logically sorted”.

What don’t I like?

First, the “Documents Drawer” is a really stupid idea, and turning it off involves several checkboxes in different places. What’s it like? Tabbed browsing with invisible tabs; it’s possible to have half a dozen documents open in the same window, with no visual indication that closing that window will close them all, and the default “close” command does in fact close the window rather than a single document within it.

Next, I find the concept of a text editor that needs a “show invisibles” option nearly as repulsive as a “show invisibles” option that doesn’t actually show all of the invisible characters. Specifically, if you select the default Unicode encoding, a BOM character is silently inserted at the beginning of your file. “Show invisibles” won’t tell you; I had to use /usr/bin/od to figure out why my furiganizer was suddenly off by one character.

Configuring it to use the same flavor of Unicode as TextEdit and other standard Mac apps is easy once you find it in the preferences, but fixing damaged text files is a bit more work. TextWrangler won’t show you this invisible BOM character, and /usr/bin/file doesn’t differentiate between Unicode flavors. I’m glad I caught it early, before I had dozens of allegedly-text files with embedded 文字化け. The fix is to do a “save as…”, click the Options button in the dialog box, and select the correct encoding.
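A quick way to check for (and strip) the stealth BOM, since neither TextWrangler nor /usr/bin/file will show it; this sketch handles the UTF-8 form:

```perl
use strict;

# A UTF-8 BOM is the byte sequence EF BB BF at the start of a file;
# decoded, it's the single character U+FEFF -- exactly the sort of
# thing that silently shifts every offset by one.
sub strip_bom {
    my ($bytes) = @_;
    my $had = $bytes =~ s/^\xEF\xBB\xBF//;
    return ($had, $bytes);
}

my ($had, $text) = strip_bom("\xEF\xBB\xBFfoo");
print $had ? "BOM stripped\n" : "no BOM\n";
```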

Basically, over the course of several days, I discovered that a substantial percentage of the default configuration settings either violated the principle of least surprise or just annoyed the living fuck out of me. I think I’ve got it into a “mostly harmless” state now, but the price was my goodwill; where I used to be lukewarm about the possibility of buying their higher-end editor, BBEdit, now I’m quite cool: what other unpleasant surprises have they got up their sleeves?

By contrast, I’m quite fond of their newest product, Yojimbo, a mostly-free-form information-hoarding utility. It was well worth the price, even with its current quirks and limitations.

Speaking of quirks, my TextWrangler explorations yielded a fun one. One of its many features, shared with BBEdit, is a flexible syntax-coloring scheme for programming languages. Many languages are supported by external modules, but Perl is built in, and their support for it is quite mature.

Unfortunately for anyone writing an external parser, Perl’s syntax evolved over time, and was subjected to some peculiar influences. I admit to doing my part in this, as one of the first people to realize that the arguments to the grep() function were passed by reference, and that this was really cool and deserved to be blessed. I think I was also the first to try modifying $a and $b in a sort function, which was stupid, but made sense at the time. By far the worst, however, from the point of view of clarity, was Perl poetry. All those pesky quotes around string literals were distracting, you see, so they were made optional.

This is still the case, and while religious use of use strict; will protect you from most of them, there are places where unquoted string literals are completely unambiguous, and darn convenient as well. Specifically, when an unquoted string literal appears in list context followed by the syntactic sugar “=>” [ex: (foo => "bar")], and when it appears in scalar context surrounded by braces [ex: $x{foo}].

TextWrangler and BBEdit are blissfully unaware of these “bareword” string literals, and make no attempt to syntax-color them. I think that’s a reasonable behavior, whether deliberate or accidental, but it has one unpleasant side-effect: interpreting barewords as operators.

Here’s the stripped-down example I sent them, hand-colored to match TextWrangler’s incorrect parsing:


use strict;

my %foo;
$foo{a} = 1;
$foo{x} = 0;

my %bar = (y=>1,z=>1,x=>1);

$foo{y} = f1() + f2() + f3();

sub f1 {return 0}
sub f2 {return 1}

sub f3 {return 2}

Wednesday, November 15 2006

Spam-guarding email addresses

I’ve been playing with jQuery recently. The major project I’m just about ready to roll out is a significant improvement to my pop-up furigana-izer. If native tooltips actually worked reliably in browsers, it would be fine, but they don’t, so I spent a day sorting out all of the issues, and while I was at it, I also added optional kana-to-romaji conversion.

I’ll probably roll that out this weekend, updating a bunch of my old Japanese entries to use it, but while I was finishing it up I had another idea: effective spam-protection for email addresses on my comment pages. The idea is simple. Replace all links that look like this: <a href="mailto:…">email</a> with something like this: <a mailto="i4t p5au l@4 h6p 6la 7wbt" href="">email</a>, and use jQuery to reassemble the real address from a hash table when the page is loaded, inserting it into the HREF attribute.
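The generator side might look something like this sketch; the chunk size, key format, and function names are my own invention, not the script I actually used:

```perl
use strict;

# Break an address into chunks, file each chunk under a random key,
# and emit the munged anchor; the page's jQuery would look each key
# up in the emitted table and rebuild the href on load.
my %table;

sub obfuscate {
    my ($addr) = @_;
    my @keys;
    for my $chunk ($addr =~ /(.{1,4})/gs) {       # 4-character chunks
        my $key;
        do { $key = sprintf '%04x', int rand 65536 } while exists $table{$key};
        $table{$key} = $chunk;
        push @keys, $key;
    }
    return qq{<a mailto="@keys" href="">email</a>};
}

my $anchor = obfuscate('user@example.com');
my $json   = '{' . join(',', map { qq{"$_":"$table{$_}"} } sort keys %table) . '}';
print "$anchor\n$json\n";
```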

A full working example looks like this:


Friday, November 24 2006

JMdict + XML::Twig + DBD::SQLite = ?

In theory, I’m still working at Digeo. In practice, not so much. As we wind our way closer to the layoff date, I have less and less actual work to do, and more and more “anticipating the future failures of our replacements”. On the bright side, I’ve had a lot of time to study Japanese and prepare for Level 3 of the JLPT, which is next weekend.

I’m easily sidetracked, though, and the latest side project is importing the freely-distributed JMdict/EDICT and KANJIDIC dictionaries into a database and wrapping it with Perl, so that I can more easily incorporate them into my PDF-generating scripts.

Unfortunately, all of the tree-based XML parsing libraries for Perl create massive memory structures (I killed the script after it got past a gig), and the stream-based ones don’t make a lot of sense to me. XML::Twig’s documentation is oriented toward transforming XML rather than importing it, but it’s capable of doing the right thing without writing ridiculously unPerly code:

my $twig = new XML::Twig(
        twig_handlers => { entry => \&parse_entry });
$twig->parsefile("JMdict");

sub parse_entry {
        my $ref = $_[1]->simplify;
        print "Entry ID=", $ref->{ent_seq}, "\n";
        $_[0]->purge;   # free the parsed entry so memory stays flat
}

SQLite was the obvious choice for a back-end database. It’s fast, free, stable, serverless, and Apple’s supporting it as part of Core Data.

Putting it all together meant finally learning how to read XML DTDs, getting a crash course in SQL database design to lay out the tables in a sensible way, and writing useful and efficient queries. I’m still working on that last part, but I’ve gotten far enough that my lookup tool has basic functionality: given a complete or partial search term in kanji, kana, or English, it returns the key parts of the matching entries. Getting all of the available data assembled requires both joins and multiple queries, which is tedious to sort out.

I started with JMdict, which is a lot bigger and more complicated, so importing KANJIDIC2 is going to be easy when I get around to it. That won’t be this week, though, because the JLPT comes but once a year, and finals week comes right after.

[side note: turning off auto-commit and manually committing after every 500th entry cut the import time by 2/3]
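In DBI terms, the side note's batching looks like this (in-memory database and toy schema for illustration; the real import writes jmdict.db with JMdict's tables):

```perl
use strict;
use DBI;

# Commit every 500th insert instead of every insert; with AutoCommit
# on, SQLite pays for a full transaction (and fsync) per row.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do('CREATE TABLE entry (ent_seq INTEGER PRIMARY KEY, gloss TEXT)');

my $sth = $dbh->prepare('INSERT INTO entry (ent_seq, gloss) VALUES (?, ?)');
my $count = 0;
for my $seq (1 .. 1234) {
    $sth->execute($seq, "gloss $seq");
    $dbh->commit if ++$count % 500 == 0;
}
$dbh->commit;    # catch the final partial batch

my ($rows) = $dbh->selectrow_array('SELECT COUNT(*) FROM entry');
print "$rows entries imported\n";
```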

Sunday, July 8 2007


Restoring Chizumatic’s sidebar to its rightful place was a task worth pursuing, but since the Minx templates generate tag soup, standard validation tools produced too many errors to help much (the W3C validator produced ~700 errors, compared to this page’s 16, 14 of which are parser errors in Amazon search URLs).

So I tried a different approach:

while (<STDIN>) {
    next unless /<\/?div/i;
    if (@x = /<div/gi) {
        print "[$l] $. ", @x + 0, " $_";
        $l += @x;
    }
    if (@x = /<\/div/gi) {
        print "[$l] $. ", @x + 0, " $_";
        $l -= @x;
    }
}
print "[$l]\n";

Skimming through the output, I saw that the inline comments started at level 6, until I reached comment 8 in the “Shingu 20” entry, which started at level 7. Sure enough, what should have been a (pardon my french) </div></p></div> in the previous comment was just a </p></div>.

[Update: fixing one bad Amazon URL removed 14 of the 16 validation errors on this page, and correcting a Movable Type auto-formatting error got rid of the other two. See, validation is easy! :-)]

Monday, October 15 2007

Sony Reader 505

So I bought the second-generation Sony Reader. Thinner, faster, crisper screen, cleaned-up UI, USB2 mass storage for easy import, and some other improvements over the previous one. It still has serious limitations, and in a year or two it will be outclassed at half the price, but I actually have a real use for a book-sized e-ink reader right now: I’m finally going to Japan, and we’ll be playing tourist.

My plan is to dump any and all interesting information onto the Reader, and not have to keep track of travel books, maps, etc. It has native support for TXT, PDF, PNG, and JPG, and there are free tools for converting and resizing many other formats.

Letter and A4-sized PDFs are generally hard to read, but I have lots of experience creating my own custom-sized stuff with PDF::API2::Lite, so that’s no trouble at all. The PDF viewer has no zoom, but the picture viewer does, so I’ll be dusting off my GhostScript-based pdf2png script for maps and other one-page documents that need to be zoomed.
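That pdf2png wrapper more or less boils down to a single GhostScript invocation; a sketch (the resolution is an arbitrary choice here, and the output-naming helper is mine):

```perl
use strict;

# Rasterize a PDF into per-page PNGs with GhostScript; gs expands
# the %03d in the output name to the page number.
sub out_name {
    my ($pdf) = @_;
    (my $png = $pdf) =~ s/\.pdf$/-%03d.png/i;
    return $png;
}

sub pdf2png {
    my ($pdf, $res) = @_;
    $res ||= 170;    # dots per inch
    system('gs', '-dBATCH', '-dNOPAUSE', '-sDEVICE=png16m',
           "-r$res", '-sOutputFile=' . out_name($pdf), $pdf) == 0
        or die "ghostscript failed: $?";
}

pdf2png($ARGV[0]) if @ARGV;
```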

I’ll write up a detailed review soon, but so far there’s only one real annoyance: very limited kanji support. None at all in the book menus, which didn’t surprise me, and silent failure in the PDF viewer, which did. Basically, any embedded font in a PDF file is limited to 512 characters; if it has more, characters in that font simply won’t appear in the document at all.

The English Wikipedia and similar sites tend to work fine, because a single document will only have a few words in Japanese. That’s fine for the trip, but now that I’ve got the thing, I want to put some reference material on it. I have a script that pulls data from EDICT and KANJIDIC and generates a PDF kanji dictionary with useful vocabulary, but I can’t use it on the Reader.

…unless I embed multiple copies of the same font, and keep track of how many characters I’ve used from each one. This turns out to be trivial with PDF::API2::Lite, but it does significantly increase the size of the output file, and I can’t clean it up in Acrobat Distiller, because that application correctly collapses the duplicates down to one embedded font.
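The bookkeeping wraps up nicely in a closure; a sketch under the assumption that each ttfont() call on the same file embeds a fresh copy (the helper name and structure are mine):

```perl
use strict;

# Hand out the current font copy until the next string would push it
# past the glyph limit, then "embed" a fresh copy via $load.
sub make_font_picker {
    my ($load, $limit) = @_;
    my ($font, %seen);
    return sub {
        my ($text) = @_;
        my %would = (%seen, map { $_ => 1 } split //, $text);
        if (!defined $font or keys %would > $limit) {
            $font = $load->();    # fresh copy, fresh glyph budget
            %seen = map { $_ => 1 } split //, $text;
        } else {
            %seen = %would;
        }
        return $font;
    };
}

# With PDF::API2::Lite it would be used roughly as:
#   my $pick = make_font_picker(sub { $pdf->ttfont('kanji.ttf') }, 512);
#   my $font = $pick->($line);   # then print $line with $font
```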

I haven’t checked to see if the native Librie format handles font-embedding properly. I’ll have to install the free Windows software at some point and give it a try.

[Update: I couldn’t persuade Distiller to leave the multiple copies of the font alone, because OpenType CID fonts apparently embed a unique ID in several places. FontForge was perfectly happy to convert it to a non-CID TrueType font, and then I only had to rename one string to create distinct fonts for embedding. My test PDF works fine on the Reader now.]

Monday, May 19 2008

Importing furigana into Word

Aozora Bunko is, more or less, the Japanese version of Project Gutenberg. As I’ve mentioned before, they have a simple markup convention to handle phonetic guides and textual notes. The notes can get a bit complicated, referring to obsolete kanji and special formatting, but the phonetic part is simple to parse.

I can easily convert it to my pop-up furigana for online use (which I think is more useful than the real thing at screen resolution), but for my reading class, it would be nice to make real furigana to print out. A while back I started tinkering with using Word’s RTF import for this, but gave up because it was a pain in the ass. Among other problems, the RTF parser is very fragile, and syntax errors can send it off into oblivion.

Tonight, while I was working on something else, I remembered that Word has an allegedly reasonable HTML parser, and IE was the first browser to support the HTML tags for furigana. So I stripped the RTF code out of my script, generated simple HTML, and sent it to Word. Success! Also a spinning beach-ball for a really long time, but only for the first document; once Word loaded whatever cruft it needed, that session would convert subsequent HTML documents quickly. It even obeys simple CSS, so I could set the main font size and line spacing, as well as the furigana size.
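The heart of the HTML generation is two substitutions on Aozora's ruby markup. A sketch, covering only the common ｜base《reading》 and kanji《reading》 forms (the real aozora-ruby script handles more than this):

```perl
use strict;
use warnings;
use utf8;

# Turn Aozora-style readings into HTML ruby markup that Word (and IE)
# will render as furigana.
sub aozora2ruby {
    my ($line) = @_;
    # explicit base text, marked with a fullwidth bar: ｜山手線《やまのてせん》
    $line =~ s{｜([^《｜]+)《([^》]+)》}{<ruby>$1<rt>$2</rt></ruby>}g;
    # implicit base text: the run of kanji just before the reading
    $line =~ s{(\p{Han}+)《([^》]+)》}{<ruby>$1<rt>$2</rt></ruby>}g;
    return $line;
}
```

A small CSS block in the same HTML file then sets the base font size, line spacing, and rt size.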

Two short Perl scripts: shiftjis2utf8 and aozora-ruby.

[Note that Aozora Bunko actually supplies XHTML versions of their texts with properly-tagged furigana, but they also do some other things to the text that I don’t want to try to import into Word, like replacing references to obsolete kanji with PNG files.]

Wednesday, July 30 2008

Make More People!

I’m doing some load-testing for our service, focusing first on the all-important Christmas Morning test: what happens when 50,000 people unwrap their presents, find your product, and try to hook it up. This was a fun one at WebTV, where every year we rented CPUs and memory for our Oracle server, and did a complicated load-balancing dance to support new subscribers while still giving decent response to current ones. [Note: it is remarkably useful to be able to throw your service into database-read-only mode and point groups of hosts at different databases.]

My first problem was deciphering the interface. I’ve never worked with WSDL before, and it turns out that the Perl SOAP::WSDL package has a few quirks related to namespaces in XSD schemas. Specifically, all of the namespaces in the XSD must be declared in the definitions section of the WSDL to avoid “unbound prefix” errors, and then you have to write a custom serializer to reinsert the namespaces after SOAP::WSDL gleefully strips them all out for you.

Once I could register one phony subscriber on the test service, it was time to create thousands of plausible names, addresses, and (most importantly) phone numbers scattered around the US. Census data gave me a thousand popular first and last names, as well as a comprehensive collection of city/state/zip values. Our CCMI database gave me a full set of valid area codes and prefixes for those zips. The only thing I couldn’t find a decent source for was street names; I’m just using a thousand random last names for now.

I’m seeding the random number generator with the product serial number, so that 16728628 will always be Elisa Wallace on W. Westrick Shore in Crenshaw, MS 38621, with a number in the 662 area code.
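In other words, the serial number is the only state. A sketch of the idea, with tiny stand-in lists where the script uses the census and CCMI data:

```perl
use strict;
use warnings;

# Hypothetical sample data; the real lists come from census and CCMI dumps.
my @first   = qw(Elisa James Maria Robert Linda);
my @last    = qw(Wallace Smith Garcia Johnson Lee);
my @streets = qw(Westrick Main Oak Jackson Hill);

# Same serial number in, same phony subscriber out, every time.
sub subscriber_for {
    my ($serial) = @_;
    srand($serial);
    return sprintf '%s %s, %d %s St.',
        $first[rand @first], $last[rand @last],
        100 + int(rand 9900), $streets[rand @streets];
}
```

Any host in the load-test cluster can regenerate the same subscriber from the same serial without a shared database, which keeps runs reproducible.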

Over the next few days, I’m going to find out how many new subscribers I can add at a time without killing the servers, as well as how many total they can support without exploding. It should be fun.

Meanwhile, I can report that Mac OS X 10.5.4 cheerfully handles converting a 92,600-page PostScript file into PDF. It took about fifteen minutes, plus a few more to write it back out to disk. I know this because I just generated half a million phony subscribers, and I wanted to download the list to my Sony Reader so I could scan through the output. I know that they all have unique phone numbers, but I wanted to see how plausible they look. So far, not bad.

The (updated! yeah!) Sony Reader also handles the 92,600-page PDF file very nicely.

[Update: I should note that the “hook it up” part I’m referring to here is the web-based activation process. The actual “50,000 boxes connect to our servers and start making phone calls” part is something we can predict quite nicely based on the data from the thousands of boxes already in the field.]

Saturday, August 16 2008

Why is it suddenly too late for CADS?

For some time now, I’ve been writing Perl scripts to work with kanji. While Perl supports transparent conversion from pretty much any character encoding, internally it’s all Unicode, and since that’s also the native encoding for a Mac, all is well.

Except that for backwards compatibility, Perl doesn’t default to interpreting all input and output as Unicode. There are still lots of scripts out there that assume all operations work in terms of 8-bit characters, and you don’t want to silently corrupt their results.

The solution created in 5.8 was the -C option, which takes half a dozen or so flags for deciding exactly what should be treated as Unicode. I use -CADS, which Uni-fies @ARGV (A), all open calls for input and output (D), and STDIN, STDOUT, and STDERR (S).

Until today. My EEE is running Fedora 9, and when I copied over my recently-rewritten dictionary tools, they refused to run, insisting that it’s ‘Too late for “-CADS” option at lookup line 1’. In Perl 5.10, you can no longer specify Unicode compatibility level on a script-by-script basis. It’s all or nothing, using the PERL_UNICODE environment variable.

That, as they say, sucks.

The claim in the release notes is:

The -C option can no longer be used on the #! line. It wasn’t working there anyway, since the standard streams are already set up at this point in the execution of the perl interpreter. You can use binmode() instead to get the desired behaviour.

But this isn’t true: I’ve been running gigs of kanji in and out of Perl this way for years, and my scripts didn’t work without either putting -CADS on the #! line or crufting up the code with explicitly-specified encodings. Obviously, I preferred the first solution.
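For comparison, this is the boilerplate that now has to go at the top of every script to get what -CADS did in one flag (the D and S parts come from the open pragma; the A part you do by hand):

```perl
use strict;
use warnings;
# D + S: default encoding layers for open(), plus STDIN/STDOUT/STDERR
use open qw(:std :encoding(UTF-8));
# A: decode the argument list ourselves
use Encode qw(decode_utf8);
@ARGV = map { decode_utf8($_) } @ARGV;
```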

In fact, just yesterday I knocked together a quick script on my Mac to locate some non-kanji characters that had crept into someone’s kanji vocabulary list, using \p{InCJKUnifiedIdeographs}, \p{InHiragana}, and \p{InKatakana}. I couldn’t figure out why it wasn’t working correctly until I looked at the #! line and realized I’d forgotten the -CADS.
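The check itself boils down to a single character class. A sketch of the test I was trying to run (whitespace is allowed here; tighten to taste):

```perl
use strict;
use warnings;
use utf8;

# True if the string contains anything that isn't kanji, kana, or whitespace.
sub has_stray {
    my ($text) = @_;
    return $text =~ /[^\p{InCJKUnifiedIdeographs}\p{InHiragana}\p{InKatakana}\s]/ ? 1 : 0;
}
```

Without -CADS (or the equivalent boilerplate), the input never gets decoded, so the character classes silently match against raw bytes instead of characters.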

What I think the developer means in the release notes is that it didn’t work on systems with a non-Unicode default locale. So they broke it everywhere.

Thursday, September 4 2008

Upgrading Movable Type

The machine this site runs on hasn’t been updated in a while. The OS is old, but it’s OpenBSD, so it’s still secure. Ditto for Movable Type; I’m running an old, stable version that has some quirks, but hasn’t needed much maintenance. I don’t even get any comment spam, thanks to a few simple tricks.

There are some warts, though. Rebuild times are getting a bit long, my templates are a bit quirky, and Unicode support is just plain flaky, both in the old version of Perl and in the MT scripts. This also bleeds over into the offline posting tool I use, Ecto, which occasionally gets confused by MT and converts kanji into garbage.

Fixing all of that on the old OS would be harder than just upgrading to the latest version of OpenBSD. That’s a project that requires a large chunk of uninterrupted time, and we’re building up to a big holiday season at work, so “not right now”.

I need an occasional diversion from work and Japanese practice, though, and redesigning this blog on a spare machine will do nicely. I can also move all of my Mason apps over, and take advantage of the improved Unicode support in modern Perl to do something interesting. (more on that later)

(Continued on Page 3098)

Sunday, September 21 2008

Dictionaries as toys

There are dozens of front-ends for Jim Breen’s Japanese-English and Kanji dictionaries, on and offline. Basically, if it’s a software-based dictionary that wasn’t published in Japan, the chance that it’s using some version of his data is somewhere above 99%.

Many of the tools, especially the older or free ones, use the original Edict format, which is compact and fairly easy to parse, but omits a lot of useful information. It has a lot of words you won’t find in affordable J-E dictionaries, but the definitions and usage information can be misleading. One of my Japanese teachers recommends avoiding it for anything non-trivial, because the definitions are extremely terse, context-free, and often “off”.

(Continued on Page 3125)

Tuesday, September 23 2008

More toying with dictionaries

[Update: the editing form is now hooked up to the database, in read-only mode. I’ve linked some sample entries on it. …and now there’s a link from the dictionary page; it’s still read-only, but you can load the results of any search into the form]

I feel really sorry for anyone who edits XML by hand. I feel slightly less sorry for people who use editing tools that can parse DTDs and XSDs and validate your work, but still, it just strikes me as a bad idea. XML is an excellent way to get structured data out of one piece of software and into a completely different one, but it’s toxic to humans.

JMdict is well-formed XML, maintained with some manner of validating editor (update: turns out there’s a simple text format based on the DTD that’s used to generate valid XML), but editing it is still a pretty manual job, and getting new submissions into a usable format can’t be fun. The JMdictDB project aims to help out with this, storing everything in a database and maintaining it with a web front-end.

Unfortunately, the JMdict schema is a poor match for standard HTML forms, containing a whole bunch of nested optional repeatable fields, many of them entity-encoded. So they punted, and relied on manually formatting a few TEXTAREA fields. Unless you’re new here, you’ll know that I can’t pass up a scripting problem that’s just begging to be solved, even if no one else in the world will ever use my solution.

So I wrote a jQuery plugin that lets you dynamically add, delete, and reorder groups of fields, and built a form that accurately represents the entire JMdict schema. It’s not hooked up to my database yet, and submitting it just dumps out the list of fields and values. It’s also ugly, with crude formatting and cryptic field names (taken from the schema), but the basic idea is sound. I was pleased that it only took one page of JavaScript to add the necessary functionality.

[It took hours to debug that script, but what can you do?]

Friday, August 28 2009

Win one for the Zipper!

Zip files are created with the user’s local character-set encoding. For most people in Japan, this means Shift-JIS. For most people outside of Japan, this means that you’ll get garbage file names when you unzip the file, unless you use a tool that supports manually overriding the encoding.

I couldn’t find a decent character-set-aware unzip for the Mac, so I installed the Archive::Zip module from CPAN and used the Encode module to do the conversion. Bare-bones Perl script follows.
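The core of it is one decode() per member name; something along these lines (a sketch: a careful version would also re-encode the decoded names for the local filesystem):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Decode a Shift-JIS member name into a proper Unicode string.
sub decode_name { return decode('shiftjis', $_[0]) }

if (@ARGV) {
    require Archive::Zip;
    my $zip = Archive::Zip->new($ARGV[0])
        or die "can't read $ARGV[0]\n";
    for my $member ($zip->members) {
        # extract under the decoded name instead of the raw bytes
        $zip->extractMember($member, decode_name($member->fileName));
    }
}
```

The shiftjis alias covers the common case; cp932 may match Windows-created archives more exactly.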

(Continued on Page 3402)

Monday, October 11 2010

From PDF::API2::Lite to PDF::Haru

There are no circles in PDF. Thought you saw a circle in a PDF file? Nope, you saw a series of Bézier cubic splines approximating a circle. This has never been a problem for me before, and I’ve cheerfully instructed the PDF::API2::Lite module to render dozens, nay hundreds of these “circles” on a single page, and it has always worked.
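The approximation everyone uses is four cubic segments with control points offset by kappa = 4(√2−1)/3 ≈ 0.5523 of the radius; a module-free sketch of the math:

```perl
use strict;
use warnings;

# Control-point offset that makes a cubic Bezier hug a quarter circle.
my $kappa = 4 * (sqrt(2) - 1) / 3;   # about 0.5523

# Return four cubic segments [x0,y0, x1,y1, x2,y2, x3,y3] approximating
# a circle of radius $r centered on ($cx, $cy), counter-clockwise.
sub circle_beziers {
    my ($cx, $cy, $r) = @_;
    my $k = $kappa * $r;
    return (
        [ $cx + $r, $cy,      $cx + $r, $cy + $k, $cx + $k, $cy + $r, $cx,      $cy + $r ],
        [ $cx,      $cy + $r, $cx - $k, $cy + $r, $cx - $r, $cy + $k, $cx - $r, $cy      ],
        [ $cx - $r, $cy,      $cx - $r, $cy - $k, $cx - $k, $cy - $r, $cx,      $cy - $r ],
        [ $cx,      $cy - $r, $cx + $k, $cy - $r, $cx + $r, $cy - $k, $cx + $r, $cy      ],
    );
}
```

Whatever module you use, every “circle” turns into twelve control points like these; generating tens of thousands of them in pure Perl is where the time goes.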

Then I tried rendering a few tens of thousands of them, and heard the fans on my laptop spin up. PDF::API2 is a pure-perl module, you see, and Perl is not, shall we say, optimized for trig. PDF::Haru, on the other hand, is a thin wrapper around Haru, which is written in C. Conversion took only a few minutes, which is about a tenth of the time the script would have needed to finish rendering, and the new version took 15 seconds to render a 1:50,000,000 scale Natural Earth basemap and all the data points.

I ended up abandoning “circles” for squares anyway, though, because PDF viewers aren’t happy with them in those quantities, either. Still faster to do it with PDF::Haru, so the time wasn’t wasted.

As a bonus, Haru has support for rendering vertical text in Japanese. I can think of a few uses for that in my other projects.

(I should note that PDF::Haru isn’t useful for every project; in most respects PDF::API2 is a more complete implementation of the spec, but for straightforward image rendering, Haru is a lot faster. Haru development seems to have mostly stopped as well, so if it doesn’t do what you want today, it likely never will.)

Friday, January 14 2011


[Update: significantly improved the Perl script]

The hardest part of my cunning plan isn’t “making a screensaver”; I think every OS comes with a simple image-collage module that will do that. The fun was in collecting the images and cleaning them up for use.

Amazon’s static preview images are easy to copy; just click and save. For the zoom-enabled previews, it will fall back to static images if Javascript is disabled, so that’s just click and save as well. Unfortunately, there are an increasing number of “look inside” previews (even in the naughty-novels genre) which have no fallback, and which are not easily extracted; for now, I’ll write those off, even though some of them have gorgeous cover art.

[Update: turns out it’s easy to extract previews for the “look inside” style; just open the thumbnail in a new window, and replace everything after the first “.” with “_SS500_.jpg”.]

A bit of clicking around assembled a set of 121 pleasant images at more-or-less the same scale, with only one problem: large white borders. I wanted to crop these out, but a simple pass through ImageMagick’s convert -trim filter would recompress the JPEGs, reducing the quality. You can trim off almost all of the border losslessly with jpegtran -crop, but it doesn’t auto-crop; you have to give it precise sizes and offsets.

So I used both:

crop=$(convert -trim "$f" -format "%wx%h%X%Y" info:-)
jpegtran -crop "$crop" "$f" > "crop/$f"

So, what does a collage of naughty-novel cover-art look like? Here are some samples (warning: 280K 1440x900 JPEGs): 1, 2, 3, 4, 5. [Update: larger covers, and full-sized]

These were not, in fact, generated by taking screenshots of the screensaver. It takes a long time for it to fill up all the blank spots on the screen, so I wrote a small Perl script that uses the GD library to create a static collage using the full set of images. If I desaturated the results a bit, it would make a rather lively background screen. For home use.
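The collage script itself is short; roughly this shape, assuming GD is installed and that the paths, canvas size, and covers-smaller-than-canvas all hold:

```perl
use strict;
use warnings;

# Build a static collage by pasting each cover at a random spot.
sub make_collage {
    my ($outfile, @files) = @_;
    require GD;
    my ($W, $H) = (1440, 900);
    my $canvas = GD::Image->new($W, $H, 1);   # truecolor canvas
    for my $file (@files) {
        my $img = GD::Image->newFromJpeg($file, 1) or next;
        my ($w, $h) = $img->getBounds;
        # assumes every cover fits inside the canvas
        $canvas->copy($img, int(rand($W - $w)), int(rand($H - $h)),
                      0, 0, $w, $h);
    }
    open my $out, '>', $outfile or die "$outfile: $!";
    binmode $out;
    print {$out} $canvas->jpeg(90);
    close $out;
}
```

Called as make_collage('collage.jpg', glob 'crop/*.jpg'); desaturating would be a separate GD pass over the finished canvas.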

Thursday, April 28 2011

Clean, simple code

I don’t know why people say Perl is hard to understand…

$\=$/ and print join $",reverse map scalar reverse,qw,foo bar baz,

Saturday, June 18 2011

Converting FLAC audio to MP3 in an MKV video

Let’s say that you have somehow acquired a video in MKV format, where for no particularly good reason the creator has chosen to encode the audio as FLAC (we shall neglect for the moment their poor taste in embedded fonts for animated karaoke and special-effect subtitling).

If for device-compatibility reasons you would prefer a better-supported audio format like MP3, and you’d really rather not re-encode the video to MP4 with hardsubs, the simplest solution is to extract the FLAC audio with mkvextract (part of Mkvtoolnix), decode it to WAV with Flac, encode it to MP3 with Lame, and then reinsert it with mkvmerge.

You also have to figure out which audio track, if any, is FLAC-encoded, but mkvinfo will do that for you, in a relatively-sane format. I have of course automated the whole task with a small Perl script.
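The script boils down to four commands run in order; a minimal sketch, with the track ID assumed already parsed out of mkvinfo and the scratch file names made up:

```perl
use strict;
use warnings;

sub run {
    system(@_) == 0 or die "failed: @_\n";
}

# Re-mux $in with its FLAC track (id $track) converted to MP3.
sub flac_to_mp3 {
    my ($in, $track, $out) = @_;
    run('mkvextract', 'tracks', $in, "$track:audio.flac");
    run('flac', '-d', '-f', 'audio.flac', '-o', 'audio.wav');
    run('lame', '-V2', 'audio.wav', 'audio.mp3');
    # keep everything but the old audio, and merge in the MP3
    run('mkvmerge', '-o', $out, '--no-audio', $in, 'audio.mp3');
}
```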

Finding a video player that can smoothly scrub forward and backward through an MKV video for screenshots is left as an exercise in frustration for the reader.

Wednesday, September 7 2016

Geo::OLC Perl module

[Update: now on CPAN]

There are multiple competing algorithms for converting latitude/longitude locations into something easier for humans to work with. Some of them are proprietary, which makes them pretty useless offline or after that company goes out of business.

Google came up with a plausible rationale for inventing their own Open Location Code rather than adopting geohash or something similar. It’s Open Source up on Github, and they supply APIs for several common languages.

But not Perl. Naturally, I had to fix that, so here’s Geo::OLC (compressed tarball). It has a full test suite, a simple command-line tool, and a CGI script that generates a dynamic labeled grid for Google Earth. I think I’ve about got it cleaned up enough to put it on CPAN.

For amusement, the sample location I use in the POD documentation, 8Q6QMG93+742, is Tenka Gyoza in Osaka, which is exactly the sort of place that you’d have trouble finding with standard addressing methods. Actually, you’d have trouble finding it with Google Maps on your phone, as my sister learned.

I tried not to go overboard with Perlisms, but the code still ended up fairly compact, largely because most of the existing APIs were written while the formatting was still in flux, so they’re more generic than necessary.
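As a sanity check on the algorithm: the first five digit-pairs are just repeated base-20 division. This toy version is not the module; it skips the single-digit grid refinement after the tenth digit and all validation:

```perl
use strict;
use warnings;

# The 20-character OLC digit set (no vowels, no easily-confused glyphs).
my @digit = split //, '23456789CFGHJMPQRVWX';

# Encode the first ten digits (a roughly 14m cell) of an Open Location Code.
sub olc_encode10 {
    my ($lat, $lon) = @_;
    $lat += 90;    # shift into 0..180
    $lon += 180;   # shift into 0..360
    my ($code, $res) = ('', 20);
    for (1 .. 5) {
        $code .= $digit[ int($lat / $res) % 20 ];
        $code .= $digit[ int($lon / $res) % 20 ];
        $res /= 20;
    }
    substr($code, 8, 0) = '+';   # the format separator after eight digits
    return $code;
}
```

Feeding it a point inside the Tenka Gyoza cell gives back 8Q6QMG93+74, which is the sample code above minus its final refinement digit.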

[Update: turns out someone did write one, but it never made it to CPAN: Geo::OpenLocationCode. Looks like he converted one of the other APIs instead of writing it from scratch, so he inherited the same bug in recover_nearest.]