Projects

Unicode Code 2 PDF


“Now witness the firepower of this fully armed and operational battlestation enscript replacement”.

Back in the day, when OSU had licenses for Adobe TranScript to drive all the laser printers on campus, I was pretty much the only person who really understood how it worked. So much so that after I left for California, the Physics department’s sysadmin gave me an account on his servers to help him get a new version to work.

Somewhere along the line, Adobe stopped actively supporting PostScript (having given up the rights in order to make it a public standard), and TranScript went away, taking with it the extremely useful enscript text-to-PostScript utility.

Which was reimplemented a-bit-too-faithfully by GNU folks, and then crufted up with useless garbage. Mind you, full compatibility already made it pretty crufty, because the original Knew Way Too Much about how Unix PostScript printer management worked in the late Eighties and early Nineties. What GNU-enscript hasn’t done is keep up with the times: the last release was 12 years ago.

No features. No fixes. No Unicode. No OpenType fonts. No PDF output.

That last bit was particularly grating for me, because a few releases back, Apple abandoned PostScript rendering completely, so the only convenient way to print decades of documents is by shoving them through GhostScript’s ps2pdf, which works, if you’re comfortable with their history of not taking security seriously (grumblegrumble getoffamylawn).

[yes, the free Acrobat Reader still exists, and handles PostScript, but it’s slow to launch and crufted up with Adobe’s attempts at revenue extraction; I have the full Acrobat Pro from the Adobe CC suite, and it’s even slower and cruftier]

I just wanted Unicode text, set in any available fixed-width font, neatly paginated with page numbering and headers/footers, written directly to a PDF file. There are a number of open-source tools that advertise some of these capabilities, but all the ones I’ve tried suck to some degree. Writing my own has been an idea gathering dust in my note-taking apps for several years, but after completing my rewrite of longpass in Python, I decided to finally take a stab at it.

First up, the name: I kept track of all the text-to-PDF tools I came across, brainstormed to find something better, then googled to see which ones had unfortunate connotations. TL/DR: I’m not happy with it, but uc2p is at least short, inoffensive, and fairly distinctive, so that’s been the working name of the project.

Second, the code. Porting the box and paper modules from PDF::Cairo gave me flexible layout and styling, and after abandoning ReportLab’s Platypus subsystem in favor of the lower-level PDFgen, I was quickly able to knock together some prototypes over the past few days.
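For flavor, here’s roughly the shape of a pdfgen prototype. This is a minimal sketch, not the actual uc2p code; the font file, page size, margins, and sizes are placeholders:

from reportlab.lib.pagesizes import letter
from reportlab.lib.units import inch
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfgen import canvas

# any TrueType-flavored monospaced font with decent Unicode coverage
pdfmetrics.registerFont(TTFont("Mono", "SomeMonoFont.ttf"))

PAGE_W, PAGE_H = letter
MARGIN = 0.5 * inch
SIZE, LEADING = 9, 10.8
LINES_PER_PAGE = int((PAGE_H - 2 * MARGIN) / LEADING)

def render(lines, outfile="out.pdf"):
    c = canvas.Canvas(outfile, pagesize=letter)
    for start in range(0, len(lines), LINES_PER_PAGE):
        text = c.beginText(MARGIN, PAGE_H - MARGIN - LEADING)
        text.setFont("Mono", SIZE, LEADING)
        for line in lines[start:start + LINES_PER_PAGE]:
            text.textLine(line.rstrip("\n"))
        c.drawText(text)
        c.showPage()      # finish this page and start a new one
    c.save()

if __name__ == "__main__":
    import sys
    render(open(sys.argv[1], encoding="utf-8").readlines())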

The code (~1,200 lines, including libraries) needs a cleanup pass and a real config file for styles, but here’s a sample page of output in the classic -2rGL66 style (two-up, rotated, gaudy headers, 66 lines/page).

By the way, at least with later versions of Adobe’s enscript and with the GNU clone, that -L 66 doesn’t actually do anything useful; -l auto-resizes the font to put exactly 66 lines on each page or column, but it’s incompatible with any page headers or footers. It was kind of an accident that -2rGL66 ever worked as expected; IIRC, it got broken by a margin change in the template in Adobe’s version, and that was faithfully copied by GNU.

What -L N actually does is ensure that no more than N lines will appear on a page. So you could leave the bottom half of the page blank by setting -L 30, for no good reason. My new script, on the other hand, always fits exactly N lines into the space.
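The arithmetic behind that exact fit is simple; here’s a sketch (the 1.2 size-to-leading ratio is just a common default, not necessarily what the script uses):

def fit_to_lines(usable_height_pts, n_lines, ratio=1.2):
    """Pick a leading and font size so exactly n_lines fill the column."""
    leading = usable_height_pts / n_lines   # baseline-to-baseline distance
    size = leading / ratio                  # e.g. 66 lines in a 684pt column
    return size, leading                    #      comes out to ~8.6pt on ~10.4pt leading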

Anyway, I’m abandoning drop-in compatibility, so I’m currently going through the various options, giving sensible single-letter abbreviations to the most common ones and moving the rest to a catchall -O opt1=val -O opt2 .... Which will match the structure of the config file where I define layout styles.
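In argparse terms, the catchall looks something like this (the key names here are made up for illustration; the real set is still being sorted out):

import argparse

def keyval(s):
    # "-O header=gaudy" -> ("header", "gaudy"); a bare "-O landscape" -> ("landscape", True)
    key, sep, value = s.partition("=")
    return key, (value if sep else True)

parser = argparse.ArgumentParser(prog="uc2p")
parser.add_argument("-O", dest="opts", metavar="KEY[=VALUE]",
                    type=keyval, action="append", default=[])
args = parser.parse_args(["-O", "columns=2", "-O", "landscape"])
options = dict(args.opts)   # {'columns': '2', 'landscape': True}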

Just for fun, a few people out there still have extremely stale web sites, so it’s possible to see what options the Adobe version had in 1992. I love that multi-column printing was under -v, and that there were two completely different one-character options for “send email after job prints” (-w and -m).

And of course everything related to “job burst pages”, something that I haven’t seen in so long that I forgot it was a thing.

Unrelated,

“Dear Amazon, why are you so forcefully recommending a book on talking to small children about sex? I’m pretty sure I’ve never bought anything that would make that relevant for me, or I’d have already gotten a knock on the door from federal agents…”

Jacking up the license plates...


…and changing the car.

Welcome to the first non-trivial update to this blog since 2003. Things are still in flux, but I’m officially retiring the old co-lo WebEngine server in favor of Amazon EC2. After running continuously for fourteen years, its 500MHz Pentium III (with 256MB of RAM and a giant 80GB disk!) can take a well-deserved rest.

The blog is a complete replacement as well, going from MovableType 2.64 to Hugo 0.19, with ‘responsive’ layout by Bootstrap 3.3.7. A few Perl scripts converted the export format over and cleaned it up. Let’s Encrypt allowed me to move everything to SSL, which breaks a few graphics, mostly really old YouTube embeds, but cleanup can be done incrementally as I trip over them.

Comments don’t work right now, because Hugo is a static site generator. I’ve worked out how I want to do it (no, not Disqus), but it might be a week or so before it’s in place. All the old comments are linked correctly, at least.

Do I recommend Hugo? TL/DR: Not today.

Getting out of the co-lo has been on my to-do list for years, but I never got around to it, for two basic reasons:

  1. I was hung up on the idea of upgrading to newer blogging software.

  2. I didn’t feel like running the email server any more, and didn’t like the hosting packages that were compatible with MT and other non-PHP blogging tools.

In the end, I went with G-Suite (“Google Apps for Work”) for $5/month. Unlike the hundreds of vendor-specific email addresses I maintain at jgreely.com, I’ve only ever used one here, and all the other people who used to have accounts moved on during W’s first term.

Next up, working comments!

Update

Actually, next turned out to be getting the top-quote to update randomly. The old method was a cron job that used wget to log into the MT admin page and request an index rebuild; given the tiny little CPU, that had gotten rather slow over the years, so it only ran every 15 minutes.

The site is now published by running hugo on my laptop and rsyncing the output, so it’s neither feasible nor sensible to update the quotes by rebuilding the entire site. Instead, I wrote a tiny Perl script that regexes the current quotes out of all the top-level pages for the site, shuffles them, and reinserts them into those pages. It takes about half a second.
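The real script is Perl, but the idea, sketched here in Python with a made-up quote marker and output directory, is roughly:

import random
import re
from pathlib import Path

# hypothetical markup for the top-quote block in each generated page
QUOTE = re.compile(r'(<div class="topquote">)(.*?)(</div>)', re.S)

pages = {}
for path in sorted(Path("public").rglob("index.html")):   # hypothetical output layout
    html = path.read_text()
    if QUOTE.search(html):
        pages[path] = html

quotes = [QUOTE.search(html).group(2) for html in pages.values()]
random.shuffle(quotes)

for (path, html), quote in zip(pages.items(), quotes):
    path.write_text(QUOTE.sub(lambda m, q=quote: m.group(1) + q + m.group(3),
                              html, count=1))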

Since there are ~350 pages, there will be decent variety even if I don’t post for a few days and regenerate the set. If I wanted to get fancy, I could parse the full quotes page and shuffle that into the indexes, guaranteeing a different quote on each page (as long as the number of quotes exceeds the number of pages, which means I can add about 800 blog entries before I need to add more quotes. :-)

Yomitori for Windows


The hardest part of getting my Japanese-novel-hacking scripts working on Windows was figuring out how to build the Text::MeCab Perl module. Strawberry Perl includes a complete development environment, but the supplied library file libmecab.lib simply didn’t work. I found some instructions on how to build MeCab from source with MinGW, but that was not a user-friendly install.

However, there were enough clues scattered around that I was able to figure out how to use the libmecab.dll file that was hidden in another directory:

copy "\Program Files\MeCab\sdk\mecab.h" \strawberry\c\include
copy "\Program Files\MeCab\bin\libmecab.dll" \strawberry\perl\bin
cd \strawberry\perl\bin
rem build a MinGW import library (libmecab.a) from the DLL's exports
pexports libmecab.dll > libmecab.def
dlltool -D libmecab.dll -l libmecab.a -d libmecab.def
move libmecab.a ..\..\c\lib
del libmecab.def
cpanm --interactive Text::MeCab

The interactive install is necessary because there’s no mecab-config shell script in the Windows distribution. The manual config is version: 0.996, compiler flags: -DWIN32, linker flags: -lmecab, encoding: utf-8.

Yomitori expressions


MeCab/Unidic has fairly strict ideas about morphemes. For instance, in the older Ipadic morphological dictionary, 日本語 is one unit, “Nihongo = Japanese language”, while in Unidic, it’s 日本+語, “Nippon + go = Japan + language”. This has some advantages, but there are two significant disadvantages, both related to trying to look up the results in the JMdict J-E dictionary.

First, I frequently need to concatenate N adjacent morphemes before looking them up, because the resulting lexeme may have a different meaning. For instance, Unidic considers the common expression 久しぶりに to consist of the adjective 久しい, the suffix 振り, and the particle に. It also thinks that the noun 日曜日 is a compound of the two nouns 日曜 and 日, neglecting the consonant shift on the second one.

Second, there are a fair number of cases where Unidic considers something a morpheme that JMdict doesn’t even consider a distinct word. For instance, Unidic has 蹌踉き出る as a single morpheme, while JMdict and every other dictionary I’ve got considers it to be the verb 蹌踉めく (to stagger) plus the auxiliary verb 出る (to come out). The meaning is obvious if you know what 蹌踉めく means, but I can’t automatically break 蹌踉き出る into two morphemes and look them up separately, because Unidic thinks it’s only one.

To fix the second problem, I’m going to need to add a bit of code to the end of my lookup routines that says, “if I still haven’t found a meaning for this word, and it’s a verb, and it ends in a common auxiliary verb, then strip off the auxiliary, try to de-conjugate it, and run it through MeCab again”. I haven’t gotten to this yet, because it doesn’t happen too often in a single book.
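Roughly, that fallback would look something like the sketch below; since this part isn’t written yet, the callables and the auxiliary list are hypothetical stand-ins, not working code from the toolchain:

COMMON_AUXILIARIES = ("出る", "込む", "出す", "始める", "続ける")   # illustrative, not exhaustive

def split_aux_verb(word, pos, lookup, deconjugate, parse):
    """If an unmatched verb ends in a common auxiliary, try the two halves separately."""
    if pos != "verb":
        return None
    for aux in COMMON_AUXILIARIES:
        if word.endswith(aux) and len(word) > len(aux):
            stem = deconjugate(word[:-len(aux)])    # e.g. 蹌踉き -> 蹌踉めく
            if stem and lookup(stem):
                return parse(stem), parse(aux)      # run both halves back through MeCab
    return None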

To fix the first problem, I start by trying to look up an entire clause as a single word, then trim off the last morpheme and try again, repeating until I either get a match or run out of morphemes. I had built up an elaborate and mostly-successful set of heuristics to prevent false positives, but they had the side effect of also eliminating many perfectly good compounds and expressions. And JMdict actually has quite a few lengthy expressions and set phrases, so while half-assed heuristics were worthwhile, making them better would pay off quickly.
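In outline, the longest-match pass looks something like this, with the JMdict search passed in as a callable and assuming each morpheme carries its surface form (a sketch, not the actual Yomitori code):

def longest_match(morphemes, lookup):
    """Try the whole run, then trim the last morpheme and retry until something matches."""
    for end in range(len(morphemes), 0, -1):
        candidate = "".join(m.surface for m in morphemes[:end])
        entry = lookup(candidate)            # JMdict search, injected by the caller
        if entry:
            return candidate, entry, end     # matched the first `end` morphemes
    return None, None, 0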

Today, while working on something else entirely, I realized that there was a simple way (conceptually simple, that is; implementation took a few tries) to eliminate a lot of false positives and a lot of the heuristics: pass the search results back through MeCab and see if it produces the same number of morphemes with the same parts of speech.
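The check itself is short; a sketch, again with the MeCab parse injected as a callable and assuming each morpheme exposes its part of speech:

def same_shape(candidate, original, parse):
    """Accept a dictionary match only if MeCab re-parses it into the same
    number of morphemes with the same parts of speech."""
    reparsed = parse(candidate)
    return (len(reparsed) == len(original) and
            all(r.pos == o.pos for r, o in zip(reparsed, original)))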

So, given a string of morphemes like (い, て, も, 立っ, て, も, い, られ, ない, ほど, に, なっ, て, いる), on the sixth pass I look up いても立ってもいられない and find a match (居ても立っても居られない, “unable to contain oneself”) that breaks down into the same number of morphemes of the same type. There’s still a chance it could have chosen a wrong word somewhere (and, in fact, it did; the initial い was parsed as 行く rather than 居る, so a stricter comparison would have failed), but as a heuristic, it works much better than everything else I’ve tried, and has found some pretty impressive matches:

  • 口が裂けても言えない (I) won't say anything no matter what
  • たまったものではない intolerable; unbearable
  • 痛くもかゆくもない of no concern at all; no skin off my nose
  • 火を見るより明らか perfectly obvious; plain as daylight
  • 言わんこっちゃない I told you so
  • どちらかと言えば if pushed I'd say
  • と言えなくもない (one) could even say that
  • 似ても似つかない not bearing the slightest resemblance
  • 痛い目に遭わせる to make (a person) pay for (something)
  • と言って聞かない insisting
  • 聞き捨てならない can't be allowed to pass (without comment)

I’ve updated the samples I created a few weeks ago to use the new parsing. Even on that short story, it makes a few sections clearer.

[Update: one of the rare false positives from this approach: 仲には and 中には break down the same, and since the second one is in JMdict as an expression, doing a kana-only lookup on なかには will falsely apply the meaning “among (them)” to 仲には. Because of variations in orthography, I have to allow for kana-only lookups, especially for expressions, but fortunately this sort of false match is rare and easy to catch while reading.]

Yomitori sample output


I’ve made quite a few improvements since putting the code up on Github. Just having it out in public made me clean it up a lot, but trying to produce a decent sample made an even bigger difference. QAing the output of my scripts has smoked out a number of errors in the original texts, as well as some interesting errors and inconsistencies in Unidic and JMdict.

The sample I chose to include in the Github repo is a short story we went through in my group reading class, Lafcadio Hearn’s Houmuraretaru Himitsu (“A Dead Secret”). The PDF files (text and vocab) are designed to be read side-by-side (I use two Kindles, but they fit nicely on most laptop screens), while the HTML version uses jQuery-based tooltips to show the vocabulary on mouseover.

For use as a sample, I left in a lot of words that I know. If I were generating this story for myself, I’d use a much larger known-words list.

Yomitori 1.0


About two and a half years ago, I threw together a set of Perl scripts that converted Japanese novels into nicely-formatted custom student editions with additional furigana and per-page vocabulary lists. I said I’d release it, but the code was pretty raw, the setup required hacking at various packages in ways I only half-remembered, and the output had some quirks. It was good enough for me to read nearly two dozen novels with decent comprehension, but not good enough to share.

When I ran out of AsoIku novels to read, I decided it was time to start over. I set fire to my toolchain, kept only snippets of the old code, and made it work without hacking on anyone else’s packages. Along the way, I switched to a much better parsing dictionary, significantly improved lookup of phrases and expressions, and made the process Unicode-clean from start to finish, with no odd detours through S-JIS.

Still some work to do (including that funny little thing called “documentation”…), but it makes much better books than the old one, and there are only a few old terrors left in the code. So now I’m sharing it.

https://github.com/jgreely/yomitori

Notes on finishing a novel


A novel in Japanese, that is, converted into a custom “student edition” at precisely my reading level, as described previously.

  1. Speed and comprehension are good; once I resolved the worst typos, parsing errors, and bugs in my scripts, I was able to read at a comfortable pace with only occasional confusion. Words that didn't get looked up correctly are generally isolated and easy to work out from context, and most of the cases where I had to stop and read a sentence several times turned out to be odd because the thing they were describing was odd (such as what the guard does before allowing the original Kino to enter the city in Natural Rights). Of course, it helps to have a general knowledge of the material.
  2. Coliseum was changed significantly for the animated version of Kino's Journey; the original story leaves most of the opponents shallow and one-dimensional, and spends way too much time on the mechanical details of Kino's surprise (both the preparation the night before, and the detailed description of the physical impact and aftermath). Mother's Love, on the other hand, is a pretty straight adaptation.
  3. Casual speech and dialect don't cause as many problems as you might expect. MeCab handles a lot of the common ones, and recovers well from the ones it has to punt on. They didn't confuse me too often, either. After a while. :-)
  4. One thing that MeCab sometimes gets wrong is when a writer uses pre-masu form instead of te-form when listing a series of actions. I don't have a good example at the moment, but I ran into several where it punted and looked for a noun.
  5. The groups that scan, OCR, and proofread novels tend to miss some simple errors where the software guessed the wrong kanji. A good example is writing 兵士 as 兵土 or 兵上. Light novels generally aren't that complicated, and if a word looks rare or out of place, it may well be an OCR error.
  6. The IPA dictionary used by MeCab has some quirks that make it sub-optimal for use with modern fiction. Reading 空く as あく, 他 as た, 一寸 as いっすん, 間 as ま, 縁 as えん, and 身体 as しんたい are all correct sometimes, but not in some common contexts where their Ipadic priority causes MeCab to guess wrong. Worse, it has a number of relatively high-priority entries that are not in any dictionary I've found: 台詞 as だいし, 胡坐 as こざ, 面す and 脱す as verbs that are more common than 面する and 脱する, etc. It also has no entries for みぞれ, 呆ける, 粘度, 街路樹, and a bunch of others. Oddest of all, there are occasions where it reads 達 as いたる; this is a valid name reading, but name+達 is far more likely to be たち than いたる; some quirk of how it calculates the appropriate left/right contexts when evaluating alternatives, an aspect of the dictionary files that I definitely don't understand.
  7. I need to make better use of the original furigana when evaluating MeCab output. I'm preserving it, but not using it to automatically detect some of the above errors. Mostly because I don't want the scripts to become too interactive. Perhaps just flagging the major differences is sufficient.
  8. On to book 2!

Bridging The Gulf of Kanji


Imagine that you’re reading something in a foreign language that you’ve been studying for a while, and you hit an unfamiliar word. You know how to pronounce it, so you can often tell if it’s a place or a person’s name, and you’re pretty sure how many words you’re looking at, so if you need to look them up, you can.

When studying Japanese, the most frustrating thing about trying to graduate from reading “student material” to “real stuff” is not being able to do that. You’re reading along, feeling pretty good about yourself, and you run smack into a wall of kanji. Maybe it’s someone’s name, maybe it’s a city you’d recognize if you could pronounce it, or maybe it’s something like 厳重機密保持体制.

Taken individually, you know most or all of the characters, but together, wtf? Is it safe to skip over and work out from context, or do you need to carefully look up each character, crossing your fingers and hoping that it’s a straightforward collection of two-character nouns (which it is, by the way; literally “strictly-classified-preservation-system”, or, more loosely, “seriously top-secret”). Every time you stop to look something like this up, you lose continuity, and instead of reading, you’re deciphering.

I am far from the first person to notice this, and there are some well-developed tools for helping you read a Japanese web site, of which perhaps the best-known is Rikai. I don’t use it. I will occasionally use the built-in pop-up J-E dictionary on my Mac, which is a simpler version of the same thing, but what I really want to do is read books and short stories.

