This is a public braindump, to help out anyone who might want to parse Japanese text without being sufficiently fluent in Japanese to read technical documentation. Our weapon of choice will be the morphological analysis tool MeCab, and for sanity’s sake, we’ll do everything in UTF-8-encoded Unicode.
The goal will be to take a plain text file containing Japanese text, one paragraph per line, and extract every word in its dictionary form, with the correct reading, with little or no manual intervention.
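As a rough sketch of that goal (not the finished pipeline), the filter below reduces MeCab's default output to dictionary-form/reading pairs. It assumes the IPAdic dictionary, where each output line is the surface form, a tab, and a comma-separated feature list whose 7th entry is the base form and 8th is the katakana reading; other dictionaries lay their features out differently. The name mecab_lemmas and the file name input.txt are made up for illustration.

```shell
# Reduce MeCab output to "dictionary form <TAB> reading" pairs.
# Assumes the IPAdic feature layout: 7th comma-separated feature is
# the base form, 8th is the katakana reading.
mecab_lemmas() {
  awk -F'\t' '
    $1 == "EOS" || NF < 2 { next }    # skip end-of-sentence markers
    {
      n = split($2, feat, ",")
      # unknown words have "*" (or nothing) as base form: keep the surface
      base = (n >= 7 && feat[7] != "*") ? feat[7] : $1
      reading = (n >= 8) ? feat[8] : ""
      print base "\t" reading
    }'
}

# usage (input.txt is a placeholder): mecab < input.txt | mecab_lemmas
```

Note that with this layout the reading column belongs to the word as it appears in the text, not to its dictionary form, so inflected words would still need some cleanup.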
First, yet another update to the collage-making Perl script. I added a number of tweaks so that I could reduce blank spots and overlap. If my interest holds up, I may add code to search for the largest empty areas more intelligently, but the current version works pretty well, and only took 13 seconds to generate a 1024x4096 collage of 481 naughty-book covers at 36% of their original size.
Warning! Clicking on the thumbnail loads a 1 megabyte JPEG that is unlikely to be work-safe:
Yes, I’m up to 481 covers. Some of them are second-rate, and some of the titles reflect subject matter less innocent than the cover images, but there’s a lot of terrific cheesecake in there. I’m probably going to track down the names of several of the artists so I can look for collections of their work.
One thing I didn’t do as I was idly gathering covers was keep track of the original pages on Amazon, so if I want to do something with the book titles, I have to type them back in by hand. This isn’t too bad, since I can sketch most unfamiliar kanji on my laptop’s trackpad, and find most of the remaining oddballs with Ben Bullock’s Multi-radical search tool, but it turns out that for many of the books, the really spicy bits are in the sub-titles, which are often too small to read. In some cases, the large title is just the series name.
Still, I’ve done enough to confidently state that at least one of the following words appears on the cover of almost every naughty novel in Japan. Sometimes three at once, with three at once. Cut out and save this handy guide!
[Update: significantly improved the Perl script]
The hardest part of my cunning plan isn’t “making a screensaver”; I think every OS comes with a simple image-collage module that will do that. The fun was in collecting the images and cleaning them up for use.
Amazon’s static preview images are easy to copy; just click and save. The zoom-enabled previews fall back to static images if JavaScript is disabled, so that’s just click and save as well. Unfortunately, there are an increasing number of “look inside” previews (even in the naughty-novels genre) which have no fallback, and which are not easily extracted; for now, I’ll write those off, even though some of them have gorgeous cover art.
[Update: turns out it’s easy to extract previews for the “look inside” style; just open the thumbnail in a new window, and replace everything after the first “.” with “SS500.jpg”.]
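That rewrite is easy to script. Here is a minimal sketch, assuming "the first ." means the first dot in the file name rather than in the host name; the function name ss500_url and the example URL are invented for illustration.

```shell
# Rewrite a thumbnail URL to its SS500 variant.
# Assumption: the "first ." is the first dot in the file name,
# not one of the dots in the host name.
ss500_url() {
  dir=${1%/*}     # everything before the last slash
  file=${1##*/}   # the file name itself
  printf '%s/%s.SS500.jpg\n' "$dir" "${file%%.*}"
}

# hypothetical thumbnail URL
ss500_url 'http://images.example.com/images/I/51ABCDEF._SL160_.jpg'
```

For the made-up URL above, this prints http://images.example.com/images/I/51ABCDEF.SS500.jpg.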
A bit of clicking around assembled a set of 121 pleasant images at more-or-less the same scale, with only one problem: large white borders. I wanted to crop these out, but a simple pass through ImageMagick’s convert -trim filter would recompress the JPEGs, reducing the quality. You can trim off almost all of the border losslessly with jpegtran -crop, but it doesn’t auto-crop; you have to give it precise sizes and offsets.
So I used both:
f=foo.jpg
crop=$(convert "$f" -trim -format "%wx%h%X%Y" info:-)
jpegtran -crop "$crop" "$f" > "crop/$f"
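With 121 images, the same two steps are worth wrapping in a function and a loop. This is only a sketch: crop_one is an invented name, and it assumes convert and jpegtran are both on the PATH and that the output should land in a crop/ directory as above.

```shell
# Losslessly trim the white border from one JPEG, writing to crop/.
# convert only measures the trimmed box; jpegtran does the actual cut,
# so the image data is never recompressed.
crop_one() {
  f=$1
  mkdir -p crop || return 1
  geom=$(convert "$f" -trim -format '%wx%h%X%Y' info:-) || return 1
  jpegtran -crop "$geom" "$f" > "crop/${f##*/}"
}

# usage: for f in *.jpg; do crop_one "$f"; done
```

The reason this removes "almost all" of the border rather than all of it is that jpegtran moves the crop origin up and left to the nearest JPEG block boundary, so a few pixels of white can survive on those two edges.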
So, what does a collage of naughty-novel cover-art look like? Here are some samples (warning: 280K 1440x900 JPEGs): 1, 2, 3, 4, 5. [Update: larger covers, and full-sized]
These were not, in fact, generated by taking screenshots of the screensaver. It takes a long time for it to fill up all the blank spots on the screen, so I wrote a small Perl script that uses the GD library to create a static collage using the full set of images. If I desaturated the results a bit, it would make a rather lively background screen. For home use.
I often say that I’m not a programmer, I’m a problem-solver who occasionally writes code to eliminate annoyances. One recent annoyance was what passes for “state of the art” in creating star maps for the Traveller RPG.
Sunday was a pretty slow day, so I wrote a Perl script that generated PDF hex-maps for use in the Traveller RPG (we’re starting a D20 Traveller campaign soon). I also integrated star-system data from the standard SEC format that’s been passed around on the Internet for many years, and I’m adding an assortment of features as I find time.
Currently it prints at the sector, quadrant, and subsector level, in color and b&w, on paper sizes ranging from 4x6 to 11x17. All the heavy lifting is done with the PDF::API2::Lite module from CPAN, which has a straightforward interface.
Update: I seem to have pretty good page-rank with Google, so just in case there’s anyone else out there who’s trying to set a clipping region with PDF::API2::Lite, the magic words are:
#create some kind of path, like so
$pdf->rectxy($x1,$y1,$x2,$y2);
#clip to it
$pdf->{hybrid}->clip;
#start a new path
$pdf->{hybrid}->endpath;
Note that this doesn’t seem to work with the alpha 0.40 versions of the PDF::API2 distribution. I’m using 0.3r77.
Some of my friends are starting to wear pro-Bush t-shirts more often, which has produced some hilarious results when they’re out in public. My favorite was at a gaming convention a few months back, when a hotel employee took one look at what Rory was wearing and said “you’re not serious, are you?”.
Unfortunately, I haven’t been able to find a design that I liked. So I’m working on my own. First candidate:
I’m not really a programmer; I’ve been a Perl hacker since ’88, though, after discovering v1.010 and asking Larry Wall where the rest of the patches were (his reply: “wait a week for 2.0”). If I’m anything, I’m a toolsmith; I mostly write small programs to solve specific problems, and usually avoid touching large projects unless they’re horribly broken in a way that affects me, and no one else can be persuaded to fix them on my schedule.
So what does this have to do with learning Japanese? Everything. I’m in the early stages of a self-study course (the well-regarded Rosetta Stone software; “ask me how to defeat their must-insert-CD-to-run copy-protection”), and authorities agree that you must learn to read using the two phonetic alphabets, Hiragana (ひらがな, used for native Japanese words) and Katakana (カタカナ, used for foreign words). A course that’s taught using Rōmaji (phonetic transcriptions using roman characters) gives you habits that will have no value in real life; Rōmaji is not used for much in Japan.
So how do you learn two complete sets of 46 symbols plus their variations and combinations, as well as their correct pronunciations? Flashcards!
The best software I’ve found for this is a Classic-only Mac application called Kana Lab (link goes direct to download), which has a lot of options for introducing the character sets, and includes recordings of a native speaker pronouncing each one. I’ve also stumbled across a number of Java and JavaScript kana flashcards, but the only one that stood out was LanguageBug, which works on Java cellphones (including my new Motorola v600).
When the misconceptions about Apple’s upcoming Dashboard feature in OS X 10.4 were cleared up (sorry, Konfabulator, it will kill your product not by being a clone, but simply by being better), I acquired a copy of the beta (why, yes, I am a paid-up member of the Apple Developer Connection) and took a look, with the goal of building a functional, flexible flashcard gadget.
Unfortunately, I’ve spent the past few years stubbornly refusing to learn JavaScript and how it’s used to manipulate HTML using the DOM, so I had to go through a little remedial course. I stopped at a Barnes & Noble on Sunday afternoon and picked up the O’Reilly JavaScript Pocket Reference and started hacking out a DHTML flashcard set, using Safari 1.2 under Panther as the platform.
Note: TextEdit and Safari do not a great DHTML IDE make. It worked, but it wasn’t fast or pretty, especially for someone who was new to JavaScript and still making stupid coding errors.
I got it working Tuesday morning, finished off the configuration form Wednesday afternoon, and squashed a few annoying bugs Wednesday night. Somewhere in there I went to work. If you’re running Safari, you can try it out here; I’ve made no attempt to cater to non-W3C DOM models, so it won’t work in Explorer or Mozilla.
There’s a lot more it could do, but right now you can select which character sets to compare, which subsets of them to include in the quiz, and you can make your guesses either by clicking with the mouse or pressing the 1-4 keys on the keyboard. I’ve deliberately kept the visual design simple, not just because I’m not a graphic designer, but also to show how Apple’s use of DHTML as the basis for gadgets makes it possible for any experienced web designer to come in and supply the chrome.
So what does it take to turn my little DHTML web page into a Dashboard gadget?
My web color scheme generator is currently set up to reflect my own biases. The results are almost always readable, even for people with various forms of color-blindness, but who’s to say that my way is best?
Well, me, of course, but once or twice a year I’m willing to admit that I might be wrong about something. In recognition of that possibility, I’ll explain the syntax for the mini-language I created for the generator.