Thursday, September 4 2008

Upgrading Movable Type

The machine this site runs on hasn’t been updated in a while. The OS is old, but it’s OpenBSD, so it’s still secure. Ditto for Movable Type; I’m running an old, stable version that has some quirks, but hasn’t needed much maintenance. I don’t even get any comment spam, thanks to a few simple tricks.

There are some warts, though. Rebuild times are getting a bit long, my templates are a bit quirky, and Unicode support is just plain flaky, both in the old version of Perl and in the MT scripts. This also bleeds over into the offline posting tool I use, Ecto, which occasionally gets confused by MT and converts kanji into garbage.

Fixing all of that on the old OS would be harder than just upgrading to the latest version of OpenBSD. That’s a project that requires a large chunk of uninterrupted time, and we’re building up to a big holiday season at work, so “not right now”.

I need an occasional diversion from work and Japanese practice, though, and redesigning this blog on a spare machine will do nicely. I can also move all of my Mason apps over, and take advantage of the improved Unicode support in modern Perl to do something interesting. (More on that later.)

Step 1: the host. We’re a CentOS shop at work, so even though the real machine will run OpenBSD, it’s simpler and faster to get all the packages I need from a decent Linux distribution. CentOS 5.2 on an old Shuttle I had around the house works fine.

Step 2: the database. MySQL is ridiculous overkill for a one-man blog, and I’ll need all of the SQLite libraries for my other projects anyway, so that’s the way to go. Or it was, until I found out that the MT 4.21 publish queue is not compatible with a SQLite backend. I think it’s quite rude for an application to trap SIGSEGV, ignore it, and then retry the same call, sending the process into a tight little loop of segfaults. SQLite works just fine for static publishing, but blows up spectacularly when combined with TheSchwartz. Oh, well, back to MySQL.

Step 3: the data. There are two ways to preserve all my old permalinks: upgrade my old MySQL database through major revisions of both MySQL and MT, or export everything, publish in a new permalink format, and then use symlinks to point to the new home of the legacy pages. The symlink thing will be less painful.
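For the curious, the symlink approach can be sketched in a few lines. This is a Python illustration rather than anything I actually ran; the function name `link_legacy_pages`, the mapping of old paths to new ones, and the example paths are all hypothetical, and it assumes both the old and new pages live under one document root.

```python
import os
from pathlib import Path


def link_legacy_pages(mapping, site_root):
    """Create a symlink at each legacy path pointing at the new page.

    mapping: dict of old relative path -> new relative path (hypothetical data).
    Returns the list of legacy paths that were actually linked.
    """
    root = Path(site_root)
    created = []
    for old_rel, new_rel in mapping.items():
        old_path = root / old_rel
        new_path = root / new_rel
        if not new_path.exists():
            continue  # the new page has not been published; skip it
        if old_path.is_symlink() or old_path.exists():
            continue  # never overwrite something already in place
        old_path.parent.mkdir(parents=True, exist_ok=True)
        # A relative target keeps the links valid if the whole tree is moved.
        target = os.path.relpath(new_path, start=old_path.parent)
        old_path.symlink_to(target)
        created.append(old_rel)
    return created
```

Calling it with something like `{"archives/000123.html": "2008/09/upgrading.html"}` would leave the old URL working as long as the web server is configured to follow symlinks.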

Step 4: reload the data. As usual, it’s easy to make a backup, but restores are a bitch. The old MT only supported a simple text-based export that didn’t preserve things like templates or entry numbers (hence the need for those symlinks). This imported painlessly, and I cheerfully tinkered with templates and settings. Then I abandoned SQLite, and needed to move all the data to MySQL. Fortunately, there’s now a nice XML-based export format that preserves almost everything, and that can be reloaded into a brand-new copy of MT.

Step 5: debug the data. The downside of an XML-based backup format is that XML is completely, totally, and in all ways relentless and unforgiving. Don’t ever export your data as XML unless you’re absolutely certain that it validates, because any actual XML parser that tries to read it will choke on your errors. And, seriously, it’s no fun being told that column 312 of line 17492 has an error in it, which, when corrected, reveals that column 104 of line 18117 has an error. And so on, and so on.
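A quick well-formedness check before attempting a restore at least finds the first complaint without involving MT. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `first_xml_error` is made up for illustration. Like any real XML parser it stops at the first error, so it has to be rerun after each fix, which is exactly the tedium described above.

```python
import xml.etree.ElementTree as ET


def first_xml_error(text):
    """Return (line, column, message) for the first parse error, or None.

    Only checks well-formedness; it does not validate against a DTD or schema.
    """
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        line, column = err.position  # position of the error as reported by expat
        return line, column, str(err)
    return None
```

An undeclared HTML entity such as `&nbsp;` in the middle of an entry is enough to make this return an error instead of `None`.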

Step 6: reimport the data. What happened? A whole bunch of my early entries had entity-encoded characters in them that didn’t get escaped at all when MT 4.21 exported them to XML. So they sat there in the file, like little time bombs, waiting for me to dare to feed them into the same XML library that wrote them out. “Nyah, nyah,” they cried. The workaround was to delete all the entries from the SQLite version, back it up again to preserve settings and templates, load that into MySQL, decode all of the entities in the original exports with a one-line Perl script and HTML::Entities, and import them again.
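The entity-decoding step is small enough to show. What follows is not the Perl one-liner, just a rough Python equivalent built on the standard library's `html.unescape`; the function names and file handling are assumptions for illustration.

```python
import html


def decode_entities(text):
    """Turn named and numeric character references into real characters."""
    return html.unescape(text)


def decode_file(src_path, dst_path, encoding="utf-8"):
    """Decode every entity in an export file and write the result elsewhere."""
    with open(src_path, encoding=encoding) as src:
        decoded = decode_entities(src.read())
    with open(dst_path, "w", encoding=encoding) as dst:
        dst.write(decoded)
```

One caveat: this also decodes `&amp;` and `&lt;`, so it only makes sense on the plain-text export, where the importer treats the body as text rather than parsing it as XML.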

Step 7: tinker. MT has built-in support for memcached, and there’s a free add-on CacheBlock that has the potential to improve large rebuilds by storing arbitrary formatted chunks of data for reuse. Of course, this isn’t always faster than just formatting it again, but there are some horribly expensive operations in their default templates that benefit from it. My first rebuild with their default templates took over 45 minutes, and just caching the sidebar dropped it below 10.
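The idea behind caching a block is simple: render the expensive chunk once, then hand back the stored copy until it expires. The sketch below is a generic in-memory version in Python, not CacheBlock's or memcached's actual interface; the class name `FragmentCache` and its parameters are invented for illustration.

```python
import time


class FragmentCache:
    """Keep rendered chunks of markup around for a fixed time-to-live."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._store = {}         # key -> (expiry time, rendered html)

    def get_or_render(self, key, render):
        """Return the cached chunk for key, calling render() only on a miss."""
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        html = render()
        self._store[key] = (now + self._ttl, html)
        return html
```

With a sidebar that is identical on every page, something like `cache.get_or_render("sidebar", build_sidebar)` (where `build_sidebar` is a hypothetical rendering function) does the expensive work once per rebuild instead of once per page.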

Step 8: tinker. The basic plan is to design around a completely unstyled page with semantic markup so that it degrades gracefully, then add just enough CSS and jQuery-based JavaScript to tidy it up. For instance, with jQuery, the category listing in the sidebar can be a simple HTML list, but be converted into a pulldown menu if JavaScript is enabled. This is the same thing I do with my pop-up furigana, which appear in the HTML as title attributes on a SPAN tag, so that they work even when JS is off. Ditto for embedding photos, etc.

Step 9: tinker. Out of the box, Movable Type still doesn’t paginate. You can generate all sorts of archive pages, and you get easy links between pages, but what you don’t get is the ability to split up an archive into bite-sized, nicely numbered pages. There are add-ons, which have a variety of problems (including the simple “costs money” in one case, and “uses PHP” in another), but none of them seemed worthwhile. So, naturally, I wrote my own. It is in fact trivial to post-process a page generated from templates and split it into as many pages as you want, even inserting your own navigation arrows. It takes MT several minutes to completely rebuild this blog; it takes about 1.5 seconds to paginate every index file, including the main one with over a thousand entries in it.
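To show how little is involved, here is a minimal sketch of post-processing pagination in Python. It is not my actual script. It assumes the template emits a marker comment before each entry and another before the footer; the marker strings, the `index2.html` naming scheme, and the function names are all made up for the example.

```python
ENTRY_MARK = "<!--entry-->"      # hypothetical marker emitted before each entry
END_MARK = "<!--/entries-->"     # hypothetical marker emitted before the footer


def page_name(n):
    """File name for page n; page 1 keeps the usual index name."""
    return "index.html" if n == 1 else f"index{n}.html"


def paginate(html, per_page=10):
    """Split one long generated page into numbered pages with navigation.

    Returns a dict of file name -> page content. If END_MARK is missing,
    the footer simply stays attached to the last entry.
    """
    body, _, footer = html.partition(END_MARK)
    header, *entries = body.split(ENTRY_MARK)
    chunks = [entries[i:i + per_page] for i in range(0, len(entries), per_page)]
    chunks = chunks or [[]]      # always produce at least one page
    total = len(chunks)
    pages = {}
    for n, chunk in enumerate(chunks, start=1):
        nav = []
        if n > 1:
            nav.append(f'<a href="{page_name(n - 1)}">&larr; newer</a>')
        nav.append(f"page {n} of {total}")
        if n < total:
            nav.append(f'<a href="{page_name(n + 1)}">older &rarr;</a>')
        pager = '<p class="pager">' + " | ".join(nav) + "</p>"
        pages[page_name(n)] = header + "".join(chunk) + pager + footer
    return pages
```

Since this is nothing but string splitting and joining, it stays fast even on an index with a thousand entries; writing the returned pages to disk is one short loop.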

Step 10: tinker. I don’t mean this, the actual blog, by the way. I kind of screwed myself with the current design, which only prints the date header when it changes. I was still able to paginate it, but the code’s kind of messy, and extra effort is required to avoid insanely long rebuilds. The short version is that I really need to have two main index pages, a short one that’s generated automatically when I add a new entry, and one that contains the entire history of the blog, to be paginated. The standard index just needs to know the total number of pages to generate its navigator; pages 2 through N can be generated asynchronously. Which is a lot easier with the publish queue in 4.21; I just use TheSchwartz to request a rebuild of the full page, and whenever it finishes, paginate it.

Step 11: tinker. MT’s categories have a lot of overhead. They’re designed around big, sturdy archive pages that, once again, don’t paginate. My Japan category has over 200 entries in it, and thanks to my vacation, an awful lot of pictures. It really, really needs to be paginated. And, to be honest, it’s a real catchall of various vaguely-related topics that could profitably be split into separate categories (or in modern MT, subcategories), further bloating the list in the sidebar. Or I could make aggressive use of the new, lightweight tagging system. Or a little of both.

Step 12: tinker. Philip Greenspun has a very pretty book on web design in which he argues that most web sites are really databases (not “have databases attached to them”, but conceptually). I’ve always disagreed. They’re not databases, they’re reports. Dynamic sites like stores and forums are ad-hoc reports, and your query tool is an “interesting” URL, and there’s that whole AJAX “subquery” thing, but most of the time, there’s a whole lot of boilerplate in your reports, and that should be cached. Or, as our primitive ancestors did, generated in advance and stored on disk. As, y’know, complete HTML pages that still work when your database server eats itself.

Step 13: The Other Project. I recently went back and redid my Japanese-dictionary lookup tool. JMdict’s maintainers had made enough changes to the DTD for its XML format that it wasn’t worth keeping the old code, and I’ve gotten better at this sort of thing since then anyway, so I started over from scratch. Right now, I’ve got a comprehensive J-E/E-J dictionary that runs lightning-fast from the command line, built with Perl, SQLite, XML::Twig, Sphinx, PDF::API2::Lite, and Kakasi. It already generates very nice PDF files for my Sony Reader, so HTML would be easy. And I’ve just never liked any of the existing online dictionaries.