Perl

Collaging


[Update: significantly improved the Perl script]

The hardest part of my cunning plan isn’t “making a screensaver”; I think every OS comes with a simple image-collage module that will do that. The fun was in collecting the images and cleaning them up for use.

Amazon’s static preview images are easy to copy; just click and save. For the zoom-enabled previews, it will fall back to static images if Javascript is disabled, so that’s just click and save as well. Unfortunately, there are an increasing number of “look inside” previews (even in the naughty-novels genre) which have no fallback, and which are not easily extracted; for now, I’ll write those off, even though some of them have gorgeous cover art.

[Update: turns out it’s easy to extract previews for the “look inside” style; just open the thumbnail in a new window, and replace everything after the first “.” with “SS500.jpg”.]
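In Perl terms, that trick is just a substitution on the thumbnail's file name; a sketch, with a made-up thumbnail name standing in for whatever Amazon actually serves:

my $thumb_name = 'B000EXAMPLE.01.THUMBZZZ.jpg';   # hypothetical thumbnail file name
(my $full_size = $thumb_name) =~ s/\..*$/.SS500.jpg/;
# $full_size is now 'B000EXAMPLE.SS500.jpg'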

A bit of clicking around assembled a set of 121 pleasant images at more-or-less the same scale, with only one problem: large white borders. I wanted to crop these out, but a simple pass through ImageMagick’s convert -trim filter would recompress the JPEGs, reducing the quality. You can trim off almost all of the border losslessly with jpegtran -crop, but it doesn’t auto-crop; you have to give it precise sizes and offsets.

So I used both:

# compute the trim geometry with ImageMagick, then apply it losslessly
# with jpegtran (which needs an explicit size and offset)
f=foo.jpg
crop=$(convert "$f" -trim -format "%wx%h%X%Y" info:-)
jpegtran -crop "$crop" "$f" > "crop/$f"

So, what does a collage of naughty-novel cover-art look like? Here are some samples (warning: 280K 1440x900 JPEGs): 1, 2, 3, 4, 5. [Update: larger covers, and full-sized]

These were not, in fact, generated by taking screenshots of the screensaver. It takes a long time for it to fill up all the blank spots on the screen, so I wrote a small Perl script that uses the GD library to create a static collage using the full set of images. If I desaturated the results a bit, it would make a rather lively background screen. For home use.
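The script itself is nothing special; the GD part boils down to something like this sketch (not the original, and it assumes the cropped covers live in crop/ and are smaller than the canvas):

#!/usr/bin/perl
# sketch: paste every cropped cover onto one large canvas at a random spot
use strict;
use warnings;
use GD;

my ($width, $height) = (1440, 900);
my $canvas = GD::Image->new($width, $height, 1);    # truecolor canvas

for my $file (glob 'crop/*.jpg') {
    my $img = GD::Image->newFromJpeg($file, 1) or next;
    my ($w, $h) = $img->getBounds;
    $canvas->copy($img, int rand($width - $w), int rand($height - $h),
                  0, 0, $w, $h);
}

open my $out, '>', 'collage.jpg' or die "collage.jpg: $!";
binmode $out;
print $out $canvas->jpeg(90);
close $out;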

From PDF::API2::Lite to PDF::Haru


There are no circles in PDF. Thought you saw a circle in a PDF file? Nope, you saw a series of Bézier cubic splines approximating a circle. This has never been a problem for me before, and I’ve cheerfully instructed the PDF::API2::Lite module to render dozens, nay hundreds, of these “circles” on a single page, and it has always worked.

Then I tried rendering a few tens of thousands of them, and heard the fans on my laptop spin up. PDF::API2 is a pure-perl module, you see, and Perl is not, shall we say, optimized for trig. PDF::Haru, on the other hand, is a thin wrapper around Haru, which is written in C. Conversion took only a few minutes, which is about a tenth of the time the script would have needed to finish rendering, and the new version took 15 seconds to render a 1:50,000,000 scale Natural Earth basemap and all the data points.
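For anyone curious, the PDF::Haru side of the conversion looks roughly like this (a sketch from memory; the page methods mirror libharu’s HPDF_Page_* calls, so treat the details as approximate):

use PDF::Haru;

# sketch: dump a pile of tiny filled "circles" onto one page
my $pdf  = PDF::Haru::New();
my $page = $pdf->AddPage();
$page->SetWidth(1224);      # points; whatever the basemap needs
$page->SetHeight(792);

$page->SetRGBFill(0.8, 0, 0);
for my $pt (@points) {      # @points: [x, y] pairs computed elsewhere
    $page->Circle($pt->[0], $pt->[1], 1.5);
    $page->Fill();
}

$pdf->SaveToFile('map.pdf');
$pdf->Free();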

I ended up abandoning “circles” for squares anyway, though, because PDF viewers aren’t happy with them in those quantities, either. Still faster to do it with PDF::Haru, so the time wasn’t wasted.

As a bonus, Haru has support for rendering vertical text in Japanese. I can think of a few uses for that in my other projects.

(I should note that PDF::Haru isn’t useful for every project, and in most respects PDF::API2 is a more complete implementation of the spec, but for straightforward image rendering, Haru is a lot faster. Development on Haru seems to have mostly stopped as well, so if it doesn’t do what you want today, it likely never will.)

Win one for the Zipper!


Zip files are created with the user’s local character-set encoding. For most people in Japan, this means Shift-JIS. For most people outside of Japan, this means that you’ll get garbage file names when you unzip the file, unless you use a tool that supports manually overriding the encoding.

I couldn’t find a decent character-set-aware unzip for the Mac, so I installed the Archive::Zip module from CPAN and used the Encode module to do the conversion. Bare-bones Perl script follows.
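Something along these lines does the job (a sketch rather than the original script, with minimal error handling): read the archive with Archive::Zip, decode each stored member name from Shift-JIS with Encode, and extract under the converted name.

#!/usr/bin/perl
# sketch: extract a zip whose member names are Shift-JIS, writing the
# files back out under UTF-8 names
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);
use Encode qw(decode encode);

my $zipfile = shift or die "usage: $0 file.zip\n";
my $zip = Archive::Zip->new();
$zip->read($zipfile) == AZ_OK or die "can't read $zipfile\n";

for my $member ($zip->members) {
    next if $member->isDirectory;
    my $name = encode('UTF-8', decode('shiftjis', $member->fileName));
    $zip->extractMember($member, $name) == AZ_OK
        or warn "failed to extract $name\n";
}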


More toying with dictionaries


[Update: the editing form is now hooked up to the database, in read-only mode. I’ve linked some sample entries on it. …and now there’s a link from the dictionary page; it’s still read-only, but you can load the results of any search into the form]

I feel really sorry for anyone who edits XML by hand. I feel slightly less sorry for people who use editing tools that can parse DTDs and XSDs and validate your work, but still, it just strikes me as a bad idea. XML is an excellent way to get structured data out of one piece of software and into a completely different one, but it’s toxic to humans.

JMdict is well-formed XML, maintained with some manner of validating editor (update: turns out there’s a simple text format based on the DTD that’s used to generate valid XML), but editing it is still a pretty manual job, and getting new submissions into a usable format can’t be fun. The JMdictDB project aims to help out with this, storing everything in a database and maintaining it with a web front-end.

Unfortunately, the JMdict schema is a poor match for standard HTML forms, containing a whole bunch of nested optional repeatable fields, many of them entity-encoded. So they punted, and relied on manually formatting a few TEXTAREA fields. Unless you’re new here, you’ll know that I can’t pass up a scripting problem that’s just begging to be solved, even if no one else in the world will ever use my solution.

So I wrote a jQuery plugin that lets you dynamically add, delete, and reorder groups of fields, and built a form that accurately represents the entire JMdict schema. It’s not hooked up to my database yet, and submitting it just dumps out the list of fields and values. It’s also ugly, with crude formatting and cryptic field names (taken from the schema), but the basic idea is sound. I was pleased that it only took one page of JavaScript to add the necessary functionality.

[hours to debug that script, but what can you do?]

Dictionaries as toys


There are dozens of front-ends for Jim Breen’s Japanese-English and Kanji dictionaries, on and offline. Basically, if it’s a software-based dictionary that wasn’t published in Japan, the chance that it’s using some version of his data is somewhere above 99%.

Many of the tools, especially the older or free ones, use the original Edict format, which is compact and fairly easy to parse, but omits a lot of useful information. It has a lot of words you won’t find in affordable J-E dictionaries, but the definitions and usage information can be misleading. One of my Japanese teachers recommends avoiding it for anything non-trivial, because the definitions are extremely terse, context-free, and often “off”.
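For reference, an Edict entry is a single line of roughly “KANJI [KANA] /gloss/gloss/…/”, with the reading optional, which is why so many tools can get by with one regex. A sketch of that parse:

#!/usr/bin/perl
# sketch: pull the headword, reading, and glosses out of Edict-style lines
use strict;
use warnings;

while (my $line = <>) {
    my ($word, $reading, $rest) =
        $line =~ m{^(\S+)\s+(?:\[([^\]]+)\]\s+)?/(.+)/\s*$} or next;
    my @glosses = split m{/}, $rest;
    $reading = $word unless defined $reading;
    printf "%s (%s): %s\n", $word, $reading, join('; ', @glosses);
}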


Upgrading Movable Type


The machine this site runs on hasn’t been updated in a while. The OS is old, but it’s OpenBSD, so it’s still secure. Ditto for Movable Type; I’m running an old, stable version that has some quirks, but hasn’t needed much maintenance. I don’t even get any comment spam, thanks to a few simple tricks.

There are some warts, though. Rebuild times are getting a bit long, my templates are a bit quirky, and Unicode support is just plain flaky, both in the old version of Perl and in the MT scripts. This also bleeds over into the offline posting tool I use, Ecto, which occasionally gets confused by MT and converts kanji into garbage.

Fixing all of that on the old OS would be harder than just upgrading to the latest version of OpenBSD. That’s a project that requires a large chunk of uninterrupted time, and we’re building up to a big holiday season at work, so “not right now”.

I need an occasional diversion from work and Japanese practice, though, and redesigning this blog on a spare machine will do nicely. I can also move all of my Mason apps over, and take advantage of the improved Unicode support in modern Perl to do something interesting. (more on that later)


Why is it suddenly too late for CADS?


For some time now, I’ve been writing Perl scripts to work with kanji. While Perl supports transparent conversion from pretty much any character encoding, internally it’s all Unicode, and since that’s also the native encoding for a Mac, all is well.

Except that for backwards compatibility, Perl doesn’t default to interpreting all input and output as Unicode. There are still lots of scripts out there that assume all operations work in terms of 8-bit characters, and you don’t want to silently corrupt their results.

The solution created in 5.8 was the -C option, which has half a dozen or so options for deciding exactly what should be treated as Unicode. I use -CADS, which Uni-fies @ARGV, all open calls for input/output, and STDIN, STDOUT, and STDERR.

Until today. My EEE is running Fedora 9, and when I copied over my recently-rewritten dictionary tools, they refused to run, insisting that it’s ‘Too late for “-CADS” option at lookup line 1’. In Perl 5.10, you can no longer specify Unicode compatibility level on a script-by-script basis. It’s all or nothing, using the PERL_UNICODE environment variable.

That, as they say, sucks.

The claim in the release notes is:

The -C option can no longer be used on the #! line. It wasn't working there anyway, since the standard streams are already set up at this point in the execution of the perl interpreter. You can use binmode() instead to get the desired behaviour.

But this isn’t true, since I’ve been running gigs of kanji in and out of Perl for years this way, and they didn’t work without either putting -CADS into the #! line or crufting up the scripts with explicitly specified encodings. Obviously, I preferred the first solution.
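For comparison, here’s roughly what that “cruft” looks like when you spell out the -CADS behavior inside the script itself (a sketch; tune the layers to taste):

#!/usr/bin/perl
# roughly the in-script equivalent of -CADS
use strict;
use warnings;
use open qw( :encoding(UTF-8) :std );   # -CDS: default UTF-8 layers, plus the std handles
use Encode qw(decode);

@ARGV = map { decode('UTF-8', $_) } @ARGV;   # -CA: decode the command-line arguments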

In fact, just yesterday I knocked together a quick script on my Mac to locate some non-kanji characters that had crept into someone’s kanji vocabulary list, using \p{InCJKUnifiedIdeographs}, \p{InHiragana}, and \p{InKatakana}. I couldn’t figure out why it wasn’t working correctly until I looked at the #! line and realized I’d forgotten the -CADS.
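That check is about as small as scripts get; something like this, with -CADS on the #! line (or the equivalent above):

#!/usr/bin/perl -CADS
# sketch: flag anything in the list that isn't kanji, kana, or whitespace
use strict;
use warnings;

while (<>) {
    print "$.: $_"
        if /[^\p{InCJKUnifiedIdeographs}\p{InHiragana}\p{InKatakana}\s]/;
}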

What I think the developer means in the release notes is that it didn’t work on systems with a non-Unicode default locale. So they broke it everywhere.

Make More People!


I’m doing some load-testing for our service, focusing first on the all-important Christmas Morning test: what happens when 50,000 people unwrap their presents, find your product, and try to hook it up. This was a fun one at WebTV, where every year we rented CPUs and memory for our Oracle server, and did a complicated load-balancing dance to support new subscribers while still giving decent response to current ones. [Note: it is remarkably useful to be able to throw your service into database-read-only mode and point groups of hosts at different databases.]

My first problem was deciphering the interface. I’ve never worked with WSDL before, and it turns out that the Perl SOAP::WSDL package has a few quirks related to namespaces in XSD schemas. Specifically, all of the namespaces in the XSD must be declared in the definition section of the WSDL to avoid “unbound prefix” errors, and then you have to write a custom serializer to reinsert the namespaces after wsdl2perl.pl gleefully strips them all out for you.

Once I could register one phony subscriber on the test service, it was time to create thousands of plausible names, addresses, and (most importantly) phone numbers scattered around the US. Census data gave me a thousand popular first and last names, as well as a comprehensive collection of city/state/zip values. Our CCMI database gave me a full set of valid area codes and prefixes for those zips. The only thing I couldn’t find a decent source for was street names; I’m just using a thousand random last names for now.

I’m seeding the random number generator with the product serial number, so that 16728628 will always be Elisa Wallace on W. Westrick Shore in Crenshaw, MS 38621, with a number in the 662 area code.
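The generator itself is nothing fancy; the deterministic part looks something like this sketch, with tiny placeholder lists standing in for the census and CCMI data:

#!/usr/bin/perl
# sketch: seed the RNG with the serial number so a given serial always
# maps to the same phony subscriber; the lists below are placeholders
use strict;
use warnings;

my @first   = qw(Elisa John Maria);       # really ~1,000 census first names
my @last    = qw(Wallace Smith Garcia);   # really ~1,000 census last names
my @streets = @last;                      # random last names standing in for streets
my @places  = ( ['Crenshaw', 'MS', '38621', '662', '301'] );   # city/state/zip/area code/prefix

sub fake_subscriber {
    my ($serial) = @_;
    srand($serial);
    my ($city, $state, $zip, $npa, $nxx) = @{ $places[ int rand @places ] };
    return {
        name    => $first[ int rand @first ] . ' ' . $last[ int rand @last ],
        address => sprintf('%d %s St', 1 + int rand 9999, $streets[ int rand @streets ]),
        city    => $city,
        state   => $state,
        zip     => $zip,
        phone   => sprintf('%s-%s-%04d', $npa, $nxx, int rand 10000),
    };
}

my $sub = fake_subscriber(16728628);      # same output every run
printf "%s, %s, %s, %s %s, %s\n", @{$sub}{qw(name address city state zip phone)};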

Over the next few days, I’m going to find out how many new subscribers I can add at a time without killing the servers, as well as how many total they can support without exploding. It should be fun.

Meanwhile, I can report that Preview.app in Mac OS X 10.5.4 cheerfully handles converting a 92,600-page PostScript file into PDF. It took about fifteen minutes, plus a few more to write it back out to disk. I know this because I just generated half a million phony subscribers, and I wanted to download the list to my Sony Reader so I could scan through the output. I know that all have unique phone numbers, but I wanted to see how plausible they look. So far, not bad.

The (updated! yeah!) Sony Reader also handles the 92,600-page PDF file very nicely.

[Update: I should note that the “hook it up” part I’m referring to here is the web-based activation process. The actual “50,000 boxes connect to our servers and start making phone calls” part is something we can predict quite nicely based on the data from the thousands of boxes already in the field.]

“Need a clue, take a clue,
 got a clue, leave a clue”