Perl

Flashcards revisited


A while back, I made quick and dirty hiragana flashcards, using the Mac OS X print dialog to print single-word pages 16-up. As my Japanese class moves along, though, there’s a need for something more sophisticated. Each lesson in our book includes a number of kanji words that will be on the test, and while my previous method will work, the hard-coded font sizes and word placement get messy to maintain.

If I’m going to write an honest-to-gosh flashcard generator, though, I might as well go whole-hog and make it capable of printing study words vertically, the way they’d be printed in a book or newspaper. Learning to recognize horizontal text might get me through the test, but it’s not enough for real Japanese literacy.

Here’s the Perl script (requires PDF::API2::Lite), a horizontal example, and vertical example. You’ll need to supply the name of your own TrueType/OpenType font that includes the kanji, unless you happen to have a copy of Kozuka Mincho Pro Regular around the house.

Note that the above PDF files have been significantly reduced in size (by an order of magnitude!) by using Mac OS X’s Preview app and saving them with the Quartz filter “Reduce File Size”. The words in the sample are from the review sheet for this week’s lesson…

Update: One problem with my vertical-printing solution quickly became obvious, and I don’t have a good solution for it. The short version is “Unicode is meaning, not appearance”, so variant glyphs can’t be easily selected, even if they’re present in your font. Specifically, the katakana prolonged-sound mark 「ー」 should be a vertical line when you’re writing vertically. Also, all of the small kana 「ぁぃぅぇぉっ」 should be offset up and to the right, and good fonts include correct variants, but I can cheat on that one; I just need to move the glyph, not change its shape.

No one seems to have figured out the necessary font-encoding tricks to pull this off with PDF::API2. At least, it’s not turning up in any Google incantation I try, which leaves me with one conceptually disgusting workaround: rotate and flip. Calligraphers and type-weenies will cringe, but at text sizes it will pass. The correct character is on the left:

[image: vertical kana hack]

Now to write the code for both workarounds…
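Here is a rough sketch of how both workarounds could be expressed as layout code. It is not the script linked above. The function name `layout_column`, the 1/8-em nudge, and the extended list of small kana are assumptions made for illustration, and the matrix uses PDF's `[a b c d e f]` ordering. One way to read "rotate and flip" is a 90-degree counterclockwise rotation followed by a left-to-right mirror, which sends (x, y) to (y, x), so the pair collapses into a single axis swap. Whether that exact combination looks right will depend on the font.

```perl
use strict;
use warnings;
use utf8;

# Small kana that get nudged up and to the right in vertical text.
# (The list beyond the six kana named in the post is an assumption.)
my %SMALL_KANA = map { $_ => 1 } split //, 'ぁぃぅぇぉっゃゅょゎァィゥェォッャュョヮ';

# Rotate 90 degrees counterclockwise, then mirror left-to-right:
# (x, y) -> (-y, x) -> (y, x). Written as a PDF matrix [a b c d e f].
my @ROTATE_FLIP = (0, 1, 1, 0, 0, 0);

# Return one placement record per character for a single vertical column.
# $x is the column's horizontal position, $top its upper edge,
# $size the font size in points (one character per $size of height).
sub layout_column {
    my ($text, $x, $top, $size) = @_;
    my @chars = split //, $text;
    my @placed;
    for my $i (0 .. $#chars) {
        my %g = (
            char   => $chars[$i],
            x      => $x,
            y      => $top - $size * ($i + 1),
            matrix => undef,
        );
        if ($SMALL_KANA{ $chars[$i] }) {
            # Assumed nudge of 1/8 em; tune by eye for the font in use.
            $g{x} += $size / 8;
            $g{y} += $size / 8;
        }
        elsif ($chars[$i] eq 'ー') {
            # The drawing code would apply this around the glyph's center.
            $g{matrix} = [@ROTATE_FLIP];
        }
        push @placed, \%g;
    }
    return @placed;
}
```

The records only describe where each glyph goes; the actual drawing calls are left to whatever PDF library is in use.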

[side note: Adobe’s premier software suite is remarkably fragile; I just got it into a state where I couldn’t run Photoshop. How? I started Illustrator, which opened Adobe’s software-update tool in the background, then quit Illustrator. When I started Photoshop, it tried to open the update tool again, couldn’t, and crashed.]

The cleansing power of Quartz


So my new Japanese class has started (lousy classroom, good teacher, reasonable textbook, nice group of students, unbelievably gorgeous teacher’s assistant (I will never skip class…)), and, as expected, the teacher is pushing us to master hiragana quickly. I did that quite a while ago, so while everyone else is trying to learn it from scratch, I can focus on improving my handwriting.

One thing she suggested was a set of flash cards. I had mine with me, and mentioned that they were available for a quite reasonable price at Kinokuniya. Her response was along the lines of “yes, I know, but nobody ever buys optional study materials; do you think you could photocopy them so I can make handouts?”

I could, but that wouldn’t be nearly as much fun as making my own set. The first step was finding a decent kana font. Mac OS X ships with several Unicode fonts that include the full Japanese kana and kanji sets, but they didn’t meet my needs: looking good at display sizes, and clearly showing the boundaries between strokes. I found Adobe Ryo Display Standard. TextEdit seems to be a bit confused about its line-height, but I wasn’t planning to create the cards in that app anyway.

How to generate the card images? Perl, of course, with the PDF::API2::Lite module. I could have written a script that calculated the correct size of cards to fill the page, but I was feeling lazy, so I wrote a 12-line script to put one large character in the middle of a page, loaded the results into Preview, set the print format to 16-up with a page border, and printed to a new PDF file. Instant flash cards.
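For the less lazy, the card-size calculation is not much work. The following is a minimal sketch of the arithmetic only, assuming a simple grid with a uniform margin; `card_grid` and its parameters are hypothetical names, not part of the original script, and the drawing itself is left out.

```perl
use strict;
use warnings;

# Split a page into $cols x $rows equal cards inside a margin and
# return the center point and size of each card, top row first.
# All measurements are in PDF points (1/72 inch).
sub card_grid {
    my ($page_w, $page_h, $cols, $rows, $margin) = @_;
    my $card_w = ($page_w - 2 * $margin) / $cols;
    my $card_h = ($page_h - 2 * $margin) / $rows;
    my @cards;
    for my $r (0 .. $rows - 1) {
        for my $c (0 .. $cols - 1) {
            push @cards, {
                cx => $margin + $card_w * ($c + 0.5),
                cy => $page_h - $margin - $card_h * ($r + 0.5),
                w  => $card_w,
                h  => $card_h,
            };
        }
    }
    return @cards;
}

# US Letter (612 x 792 pt), 4 x 4 cards, half-inch margin.
my @cards = card_grid(612, 792, 4, 4, 36);
printf "%d cards, each %.1f x %.1f pt\n",
    scalar @cards, $cards[0]{w}, $cards[0]{h};
# prints "16 cards, each 135.0 x 180.0 pt"
```

Each record's `cx`/`cy` is where a centered character would be placed, so the same loop that printed one character per page could print sixteen.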

For many people, this would be sufficient, but one of the things sensei liked about the cards I had brought was the numbers and arrows that indicated the correct stroke order. There was no lazy way to do this, so I used Adobe Acrobat’s drawing and stamping tools. The stamping tool lets you quickly decorate a PDF file with images in many formats, so I just modified my previous script to create PDF files containing single numbers, and imported them into the stamp tool. The line-drawing tool let me make arrows, although I couldn’t figure out a simple way to set my own line-width and have it remembered (1pt was too thin after the 16-up, and 2pt had too-big arrowheads).

So why is this post titled “the cleansing power of Quartz”? Because the one-per-page annotated output from Acrobat was more than six times larger than the same file printed 16-up from Preview. Just printing the original file back to PDF shrank it by a factor of four, which, coincidentally, is almost exactly what you get when you run gzip on it…

The final results are here.

Furigana-blogging


Every time I include some Japanese text in a blog entry, I’m torn between adding furigana and, well, not. It’s extremely useful for people who don’t read kanji well, but it’s tedious to do by hand in HTML. At the same time, I find myself wishing that my Rosetta Stone courseware included furigana, so that I could hover the mouse over a word and see the pronunciation instead of switching from kanji to kana mode and back. I’d also like to see their example phrases in a better font, at higher resolution.

80 lines of Perl later:

女の人泳いでいます。
男の人滑り落ちています。
男の子転んでいます。
男の子泳いでいます。

泳いでいます。
飛んでいます。
雄牛走っています。
泳いでいます。

[update: I tested this under IE on my Windows machine at work, and it correctly displayed the pop-up furigana, but ignored the CSS that highlighted the word it applied to; apparently my machine has extra magic installed, because the pop-up doesn’t use a Unicode font for some people. Sigh. Found: “fixing tooltips in IE” (about halfway down the page).]
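The 80-line script itself isn't shown here, but the core idea can be sketched in far fewer lines. This version assumes the pop-up is a plain `title` attribute on a `span` and that the CSS highlight hangs off a class named `furi`; both of those, and the function names, are guesses for illustration rather than the markup the real script produces.

```perl
use strict;
use warnings;
use utf8;

# Escape the characters that matter inside HTML text and attribute values.
sub escape_html {
    my ($s) = @_;
    my %ent = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $s =~ s/([&<>"])/$ent{$1}/g;
    return $s;
}

# Takes a list of segments. Each one is either a plain string or a
# [ word, reading ] pair that gets a hover tooltip with the reading.
sub furigana_html {
    my @out;
    for my $seg (@_) {
        if (ref $seg eq 'ARRAY') {
            my ($word, $reading) = map { escape_html($_) } @$seg;
            push @out, qq{<span class="furi" title="$reading">$word</span>};
        }
        else {
            push @out, escape_html($seg);
        }
    }
    return join '', @out;
}

binmode STDOUT, ':encoding(UTF-8)';
print furigana_html(['泳', 'およ'], 'いでいます。'), "\n";
```

Keeping the input as explicit word/reading pairs sidesteps the hard part, which is deciding which reading a kanji takes in context.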


Why I prefer Perl to JavaScript, reason #3154


For amusement, I decided that my next Dashboard gadget should be a tool for looking up characters in KANJIDIC using Jack Halpern’s SKIP system.

SKIP is basically a hash-coding system for ideographs that doesn’t rely on extensive knowledge of how they’re constructed. Once you’ve figured out how to count strokes reliably, you simply break the character into two parts according to one of several patterns, and count the number of strokes in each part. It’s not quite that simple, but almost, and it’s a lot more novice-friendly than traditional lookup methods.

Downside? The simplicity of the system results in a large number of hash collisions (only 584 distinct SKIP codes for the 6,355 characters in KANJIDIC). In the print dictionaries the system was designed for, this is handled by grouping together entries that share the same first part. Conveniently, Unicode sorting seems to produce much the same effect, although a program can’t identify the groups without additional information. A simple supplementary index can easily be constructed for the relatively few SKIP codes with an absurd collision count (1-3-8 is the biggest, at 161), so it’s feasible to create a DHTML form that lets you locate any unknown kanji by just selecting from a few pulldown menus.
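Collision counts like these are easy to tally once the dictionary is grouped by code. The sketch below assumes each KANJIDIC entry is one line that starts with the kanji and carries its SKIP code as a whitespace-separated field of the form `Pn-n-n`, and that the lines have already been decoded to Perl strings. `skip_index` and `worst_collisions` are made-up names, and the three sample lines are shortened stand-ins, not real dictionary entries.

```perl
use strict;
use warnings;
use utf8;

# Build { skip_code => [kanji, ...] } from already-decoded dictionary lines.
sub skip_index {
    my %index;
    for my $line (@_) {
        next if $line =~ /^#/;                 # skip comment/header lines
        my ($kanji, @fields) = split ' ', $line;
        next unless defined $kanji;
        for my $f (@fields) {
            push @{ $index{$1} }, $kanji if $f =~ /^P(\d+-\d+-\d+)$/;
        }
    }
    return \%index;
}

# Return up to $n [code, count] pairs, most crowded codes first.
sub worst_collisions {
    my ($index, $n) = @_;
    my @codes = sort {
        @{ $index->{$b} } <=> @{ $index->{$a} } || $a cmp $b
    } keys %$index;
    $n = @codes if $n > @codes;
    return map { [ $_, scalar @{ $index->{$_} } ] } @codes[0 .. $n - 1];
}

# Shortened, illustrative lines only.
my $index = skip_index('儿 S2 P1-1-1', '八 S2 P1-1-1', '小 S3 P1-1-2');
printf "%s: %d\n", @$_ for worst_collisions($index, 2);
```

The codes that come out on top are the ones that would need a supplementary index.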

For various reasons, it just wasn’t a good idea to attempt to parse KANJIDIC directly from JavaScript (among other things, everything is encoded in EUC-JP instead of UTF-8), so I quickly knocked together a Perl script that read the dictionary into a SKIP-indexed data structure, and wrote it back out as a JavaScript array initialization.

Which didn’t work the first time, because, unlike Perl, JavaScript doesn’t allow trailing commas in array or object literals. That is, this is illegal:

var skipcode = [
  {
    s1:{
      s1:['儿','八',],
      s2:['小','巛','川',],
      s3:['心','水',],
      s4:['必','旧',],
      s7:['承',],
      s8:['胤',],
      s11:['順',],
    },
  },
];

Do you know how annoying it is to have to insert extra code for “add a comma unless you’re the last item at this level” when you’re pretty-printing a complex data structure? Yes, I’m sure there are all sorts of good reasons why you shouldn’t allow those commas to exist, but gosh-darnit, they’re convenient!
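One way to dodge the "unless you're the last item" bookkeeping is to build each level as a list and let `join` place the separators, since it only ever puts them between items. The following is a small sketch of that approach, not the script described above; `to_js` is a hypothetical name, and it assumes hash keys are already valid JavaScript identifiers (such as `s1`) and that leaf values are plain strings.

```perl
use strict;
use warnings;
use utf8;

# Render a nested Perl structure as a JavaScript literal. join() puts
# separators only between items, so no trailing comma is ever emitted.
sub to_js {
    my ($data, $depth) = @_;
    $depth ||= 0;
    my $pad   = '  ' x ($depth + 1);   # indentation for child items
    my $close = '  ' x $depth;         # indentation for the closing bracket

    if (ref $data eq 'HASH') {
        return '{}' unless %$data;
        my @items = map { "$pad$_:" . to_js($data->{$_}, $depth + 1) }
                    sort keys %$data;
        return "{\n" . join(",\n", @items) . "\n$close}";
    }
    if (ref $data eq 'ARRAY') {
        # Arrays of plain strings stay on one line, as in the example above.
        return '[' . join(',', map { to_js($_, $depth + 1) } @$data) . ']'
            unless grep { ref } @$data;
        my @items = map { $pad . to_js($_, $depth + 1) } @$data;
        return "[\n" . join(",\n", @items) . "\n$close]";
    }

    (my $s = $data) =~ s/([\\'])/\\$1/g;   # escape backslashes and quotes
    return "'$s'";
}

print 'var skipcode = ', to_js([ { s1 => { s1 => ['a', 'b'], s2 => ['c'] } } ]), ";\n";
```

Keys are sorted as strings, so `s11` lands before `s2`; a numeric sort on the digits would be needed to reproduce the ordering shown earlier.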

iTMS weekly reports


No, I didn’t buy another big batch of music from the iTunes Music Store yet, although I probably will soon, to stock up the iPod for my next road trip to Las Vegas. I have been keeping an eye on the store, though, and after corresponding with Brian Tiemann, I decided to investigate an oddity we’d both noticed: the week-by-week “Just Added” report ain’t no such thing.


The Perl Script From Hell


I’ve been working with Perl since about two weeks before version 2.0 was released. Over those fifteen years, I’ve seen a lot of hairy Perl scripts, many of them mine.

None of them can compare to the monster that lurks in the depths of our service, though. Over 8,000 lines of Perl plus an 8,000-line C++ module, written in a style that’s allegedly Object Oriented, but which I would describe as Obscenely Obfuscated (“Hi, Andrew!”).

We have five large servers devoted to running it. Each contributes three CPUs, three gigabytes of memory, and 25 hours of runtime to the task (independently; we need the redundancy if one of them crashes). Five years ago, I swore a mighty oath to never, ever get involved with the damned thing.

Then it broke. In a way that involved tens of thousands of unhappy customers.


PDF bullseye target generator


Spent two days this week at an Operations forum up north, and since most of the sessions had very little to do with the service I operate, I was able to do some real work while casually keeping track of the discussions.

My online target archive contains a bunch of bullseye targets I built using a Perl script. The native output format was PostScript, ’cause I like PostScript, but PDF is generally more useful today, and not everyone uses 8.5×11 paper. I hand-converted some of them, but never finished the job.

The correct solution was to completely rewrite the code using the PDF::API2::Lite Perl module, and generalize it for different paper sizes and multiple targets per page. It’s still a work in progress, but already pretty useful.

HTML forms suck


It didn’t shock me to discover this, but it was one of those things about the web that I hadn’t really played with seriously. Then I started trying to expose all of the parameters for my random web colors page, so people could tinker with the color-generation rules.

Not only did the form add 24K to the page size, it increased the rendering time by about an order of magnitude.


“Need a clue, take a clue,
 got a clue, leave a clue”