I’ve made quite a few improvements since putting the code up on GitHub. Just having it out in public made me clean it up a lot, but trying to produce a decent sample made an even bigger difference. QAing the output of my scripts has smoked out a number of errors in the original texts, as well as some interesting errors and inconsistencies in UniDic and JMdict.
The sample I chose to include in the GitHub repo is a short story we went through in my group reading class, Lafcadio Hearn’s Houmuraretaru Himitsu (“A Dead Secret”). The PDF files (text and vocab) are designed to be read side-by-side (I use two Kindles, but they fit nicely on most laptop screens), while the HTML version uses jQuery-based tooltips to show the vocabulary on mouseover.
For use as a sample, I left in a lot of words that I know. If I were generating this story for myself, I’d use a much larger known-words list.
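To make the effect of the known-words list concrete, here is a minimal sketch in Python. It is not the code from the repo: the function name `filter_vocab`, the entry fields (`lemma`, `reading`, `gloss`), and the sample words are all assumptions made up for illustration. The idea is just that any word whose dictionary form appears in the known-words list is dropped from the generated vocabulary, so a larger list means a shorter vocab file.

```python
def filter_vocab(entries, known_words):
    """Return the entries whose dictionary form is not in known_words.

    `entries` is a list of dicts with a hypothetical "lemma" key;
    `known_words` is any iterable of dictionary forms the reader knows.
    """
    known = set(known_words)
    return [entry for entry in entries if entry["lemma"] not in known]


# Hypothetical vocabulary entries, as a tokenizer plus dictionary lookup might produce.
vocab = [
    {"lemma": "秘密", "reading": "ひみつ", "gloss": "secret"},
    {"lemma": "葬る", "reading": "ほうむる", "gloss": "to bury"},
    {"lemma": "箪笥", "reading": "たんす", "gloss": "chest of drawers"},
]

# A small list, like the one used for a public sample, leaves most words in.
small_known = {"秘密"}
# A larger personal list filters out more.
large_known = {"秘密", "葬る"}

sample_vocab = filter_vocab(vocab, small_known)
personal_vocab = filter_vocab(vocab, large_known)
```

With the small list, two of the three entries survive; with the larger one, only 箪笥 is left in the vocab output.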