Computers

Fun with LibreOffice...


Since a new version of the free-as-in-fork LibreOffice package was just released, I thought I’d take a look and see if it’s gotten any easier to import formatted text.

The answer: “kinda”.

Good: It imports simple HTML and CSS.

Bad: …into a special “HTML” document type that must be exported to disk in ODT format, and then reopened. Otherwise, all formatting not available for web use will either disappear from all menus and dialog boxes, silently fail, or be deleted when you save (generally the result of pasting from another document).

[note that the Mac version crashed half a dozen times as I was exploring these behaviors, but it usually managed to open the documents on the second try]

Sadly, furigana are not considered compatible with HTML, so they’re stripped on import, making it rather a moot point that you can’t edit them in HTML mode. The only way to import text marked up with furigana is to generate a real XML-formatted, Zip-archived ODT file.
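For reference, here is roughly what "a real XML-formatted, Zip-archived ODT file" amounts to. This is a minimal sketch, not a tested import pipeline: the function names (`ruby`, `content_xml`, `build_odt`) are invented for illustration, and it assumes LibreOffice will open a package that holds only `mimetype`, `META-INF/manifest.xml`, and an unstyled `content.xml`. The furigana go into `text:ruby` elements, and the `mimetype` entry has to be the first one in the archive, stored uncompressed.

```python
import io
import zipfile
from xml.sax.saxutils import escape

MIMETYPE = "application/vnd.oasis.opendocument.text"

MANIFEST = """<?xml version="1.0" encoding="UTF-8"?>
<manifest:manifest
    xmlns:manifest="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
    manifest:version="1.2">
 <manifest:file-entry manifest:full-path="/"
     manifest:media-type="application/vnd.oasis.opendocument.text"/>
 <manifest:file-entry manifest:full-path="content.xml"
     manifest:media-type="text/xml"/>
</manifest:manifest>
"""

CONTENT_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    office:version="1.2">
 <office:body>
  <office:text>
%s
  </office:text>
 </office:body>
</office:document-content>
"""


def ruby(base, reading):
    """Wrap one base string and its furigana reading in ODF ruby markup."""
    return (
        "<text:ruby><text:ruby-base>%s</text:ruby-base>"
        "<text:ruby-text>%s</text:ruby-text></text:ruby>"
        % (escape(base), escape(reading))
    )


def content_xml(paragraphs):
    """Each paragraph is a list of plain strings and (base, reading) tuples."""
    lines = []
    for para in paragraphs:
        parts = [ruby(*p) if isinstance(p, tuple) else escape(p) for p in para]
        lines.append("   <text:p>" + "".join(parts) + "</text:p>")
    return CONTENT_TEMPLATE % "\n".join(lines)


def build_odt(paragraphs):
    """Return the bytes of a bare-bones ODT package (no styles, no metadata)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        # The mimetype entry must come first and must not be compressed.
        z.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        z.writestr("META-INF/manifest.xml", MANIFEST)
        z.writestr("content.xml", content_xml(paragraphs))
    return buf.getvalue()
```

Writing the result of `build_odt([[("漢字", "かんじ"), "を読む。"]])` to a file ending in `.odt` should give a one-paragraph document with a reading over the first word. Anything fancier (fonts, ruby alignment, vertical layout) needs style definitions, which is where the tedious part starts.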

"Y'see, software has layers"


Just spent a merry, no wait, hellish few hours fighting to get a LaTeX distribution up and running for the sole purpose of running a single script that uses it to convert marked-up Japanese text to PDF in convenient ebook sizes.

I failed. Or, more precisely, I got all the way to a DVI file that could be displayed quite nicely on screen, with all the kanji and furigana intact, but then the PDF converter that was part of the same TeX package that had generated it started barfing all over my screen, and I refused to spend more time on the project. I simply have no desire to navigate the layers and layers and layers of crap that TeX has acquired in its hacked-together support for modern fonts and encodings.

Honestly, if I want to generate cleanly-formatted Japanese text as a PDF, with furigana and vertical layout and custom page sizes, it takes 10,000 times less effort to spit out bog-standard HTML+CSS and feed it to Microsoft Word.

[Note to the MS-allergic: performing the equivalent import into OpenOffice is possible, but not reasonable. Getting basic unstyled plaintext+furigana wasn’t too bad, but anything more complicated would be an exercise in tedious XML debugging.]

[Update: gave it another go, and eventually discovered that running dvipdfmx with KPATHSEA_DEBUG=-1 in the environment returned a completely different search path than the kpsewhich tool used. Copying share/texmf/web2c/texmf.cnf.ptex to etc/texmf/texmf.cnf made all the problems go away. At least until the next time I upgrade something in MacPorts that recursively depends on something that obsoletes a recursive dependency of pTeX and hoses half my tools.

And, no, I can’t use the self-contained and centrally-managed TeX Live distribution (or the matching GUI-enabled MacTeX). That was the first hour I wasted. Its version of pTeX is apparently incompatible with one of the style files I needed.]

Life with SSD


As a slightly-early birthday present to myself, I bought an OWC Data Doubler w/ 240GB SSD in a post-Thanksgiving sale. After making three full backups of my laptop, I installed it, and have been enjoying it quite a bit. This kit installs the SSD as a second drive, replacing the optical drive, allowing you to use it in addition to the standard drive, which in my case is a 500GB Seagate hybrid. I’ve set up the SSD as the boot drive, with the 500GB as /Users/Shared, and moved my iTunes, iPhoto, and Aperture libraries there, as well as my big VMware images and an 80GB Windows partition.

[side note: the Seagate hybrid drives do provide a small-but-visible improvement in boot/launch performance, but the bulk of your data doesn’t gain any benefit from it, and software updates erase the speed boost until the drive adjusts to the new usage pattern. Dual-boot doesn’t help, either. An easy upgrade, but not a big one, IMHO.]

Good:

  • Straightforward installation. There was only one finicky bit where OWC's detailed instructions didn't quite match reality, which just required a little gentle fiddling to get the cable out of the way.
  • Boots faster. Much.
  • All applications launch faster, especially the ones that do annoying things like maintain their own font caches. Resource-intensive apps (pronounced "Photoshop") also get a nice speed boost for many operations, especially when I'm working with 24 megapixel raw images.
  • Apple's gratuitous uncached preview icons render acceptably fast now. Honestly, I got so sick of delays caused by scanning large files to generate custom icons that I turned the feature off a long time ago (except in the magic Dock view of the Downloads folder, where you can't disable it).
  • SuperDuper incremental backups are ~4x faster. 10-15% of this comes from not having to scan the 160GB of stuff that's now on a separate drive, but most of it is due to not seeking around on the disk to see what's changed. I've actually switched my backups from a fast FireWire800 enclosure to a portable USB2 drive, and I still save a lot of time.
  • A little better battery life, a little less heat. The hard drive stays spun down most of the time unless I have iTunes running.
  • External USB DVD drives work fine for ripping, burning, and OS installation.

Bad:

  • Apple's DVD Player app refuses to launch at all unless an internal DVD drive is present; external drives aren't acceptable (unless there's also an internal, in which case you can cheerfully use both, and even have them set to different regions). VLC is a poor substitute. You can get it to work with external drives... by editing the binary. Seriously, you replace all instances of "Internal" with "External" in this file:
    /System/Library/Frameworks/DVDPlayback.framework/Versions/A/DVDPlayback
  • Time Machine backups don't pick up the second drive. Neither does SuperDuper, so I added a one-line script to my SuperDuper config that does:
    rsync -v --delete --exclude .Spotlight-V100 \
        --exclude .Trashes --exclude .fseventsd \
        /Users/Shared/ /Volumes/Back2/
  • Snow Leopard seems to have lost the ability to reliably specify the mount location of a second drive. In previous releases, you could put the UUID or label into /etc/fstab and it worked. Now that file only accepts device names, which are generated dynamically on boot. This works if only two drives are present at boot time, since the boot drive will always get disk0, but having an external drive connected could result in a surprise.
  • (obvious) Can't watch actual DVDs without carrying around an external drive.
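The DVDPlayback edit mentioned in the list above can be sketched in a few lines. A blind search-and-replace has a chance of working on a binary only because "Internal" and "External" are the same length, so nothing else in the file shifts. The helper names here (`swap_strings`, `patch_copy`) are hypothetical, and the sketch writes to a copy rather than touching the framework in place; whether the patched copy still loads on any given OS release is not something this code can promise.

```python
from pathlib import Path


def swap_strings(data, old=b"Internal", new=b"External"):
    """Replace every occurrence of old with new; return (patched, count)."""
    if len(old) != len(new):
        # A different length would shift every offset after the first match.
        raise ValueError("replacement must be the same length as the original")
    return data.replace(old, new), data.count(old)


def patch_copy(src, dst, old=b"Internal", new=b"External"):
    """Read src, write a patched copy to dst, and report how many hits."""
    patched, count = swap_strings(Path(src).read_bytes(), old, new)
    Path(dst).write_bytes(patched)
    return count
```

Running `patch_copy` against a copy of the file and checking the returned count before swapping anything into place keeps the original around for when it goes wrong.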

So, file this little experiment under “expensive but worth it”. I do watch DVDs on my laptop, but only at home or in hotels, so the external drive isn’t a daily-carry accessory. The SSD has a SandForce chipset and 7% over-provisioning, and is less than half full, so there’s no sign of performance degradation, and I don’t expect any. Aperture supports multiple libraries, so I can edit fresh material on the SSD, then move it to the hard drive when I’m done with it. Honestly, unless Apple releases MacBook Pro models that will take more than 8GB of RAM, I really see no need to buy a new one for quite a while.

Dear Developer,


When you send me a SQL statement that updates a 600,000-record table based on a join to a 900,000-record table, please make sure there are indexes involved. Also, please don’t test on a toy database.
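To make the complaint concrete, here is a small sketch using SQLite from Python. The table and column names (`customers`, `orders`, `region`, `idx_customers_id`) are made up, the row counts are deliberately tiny, and the real database was certainly not SQLite; the point is only that asking for the query plan before and after adding an index takes a minute, and the plan tells you what the toy data never will.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the planner's description of how it would run the statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, region TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, region TEXT);
""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, "r%d" % (i % 5)) for i in range(900)],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, NULL)",
    [(i, i % 900) for i in range(600)],
)

# Update one table from a lookup in the other, joined on customer id.
UPDATE_SQL = """
    UPDATE orders
       SET region = (SELECT c.region
                       FROM customers AS c
                      WHERE c.id = orders.customer_id)
"""

plan_before = query_plan(conn, UPDATE_SQL)
conn.execute("CREATE INDEX idx_customers_id ON customers (id)")
plan_after = query_plan(conn, UPDATE_SQL)

conn.execute(UPDATE_SQL)
print("before:", plan_before)
print("after: ", plan_after)
```

Without the index, each of the 600 rows being updated has to hunt through `customers` for its match; with it, each lookup is a direct search. Scale both tables up by a factor of a thousand and that difference is the whole story.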

iGrep


Just got a complaint from a user about a Perl script that wasn’t handling regular expressions correctly. Specifically, when he typed:

ourspecial-cat | grep 'foo\|bar'

he got a match on “foo” or “bar”, but when he typed:

ourspecial-grep 'foo\|bar'

he got nothing at all.

My surprise came from the fact that the normal grep worked, when everyone knows that you need to use egrep for that kind of search, and in any case, since the entire regular expression was in single-quotes, you don’t need the backslash. Removing the backslash made our tool do what he wanted, but broke grep.

Sure enough, if you leave out the backslash, you need to use egrep or grep -E, but if you put it in, you can use grep. What makes it really fun is that they’re the same program, GNU Grep 2.5.1, and running egrep should be precisely the same as running grep -E.
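The behavior makes more sense once you know that GNU grep treats `\|` as alternation in basic regular expressions as an extension, while in extended (and Perl-style) expressions a bare `|` is the alternation operator and `\|` is a literal pipe. The sketch below shells out to whatever `grep` is on the PATH and uses Python's `re` module as a stand-in for the Perl script, since the two agree on this point; the helper names (`grep`, `perl_style`) and the sample text are invented for the demonstration.

```python
import re
import shutil
import subprocess

SAMPLE = "foo\nbar\nbaz\nfoo|bar\n"


def grep(pattern, extended=False, text=SAMPLE):
    """Run the system grep over text and return the matching lines."""
    cmd = ["grep"] + (["-E"] if extended else []) + ["-e", pattern]
    result = subprocess.run(cmd, input=text, capture_output=True, text=True)
    return result.stdout.splitlines()


def perl_style(pattern, text=SAMPLE):
    """Match each line with Python's re, which follows Perl on '|' and '\\|'."""
    return [line for line in text.splitlines() if re.search(pattern, line)]


if shutil.which("grep"):
    for pattern in ("foo|bar", r"foo\|bar"):
        print("grep    %-10r %s" % (pattern, grep(pattern)))
        print("grep -E %-10r %s" % (pattern, grep(pattern, extended=True)))

for pattern in ("foo|bar", r"foo\|bar"):
    print("re      %-10r %s" % (pattern, perl_style(pattern)))
```

With the backslash, the Perl-style matcher only finds the line that contains a literal `foo|bar`, which is why the user's search came back empty.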

Makes me wonder what other little surprises are hidden away in the tools I use every day…

Effective index use in MongoDB


Three basic rules to keep in mind when trying to index that massive crapload of data you just shoved into MongoDB:

  1. All indexes must fit in physical memory for decent performance.
  2. A query can only use one index.
  3. A single compound index on (x,y,z) can be used for queries on (x), (x,y), (x,z), or (x,y,z). However, prior to 1.6, all but the last field used from the index had to be an exact match, or you might get a full table scan instead.
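Rule 3 is easier to keep straight with a toy model. The code below is only a restatement of the rules as written above, not MongoDB's query planner, and the function names (`fields_used`, `safe_before_1_6`) and the "eq"/"range" labels are invented for the sketch. A query is described as a dict mapping each field to the kind of condition placed on it.

```python
def fields_used(index, query):
    """Index fields the query can use, in index order.

    Returns an empty list when the query skips the leading field,
    since (y), (z), and (y,z) are not on the list of supported shapes.
    """
    if not index or index[0] not in query:
        return []
    return [field for field in index if field in query]


def safe_before_1_6(index, query):
    """Apply the pre-1.6 caveat: every used field except the last
    must be an exact match, or a full scan is possible."""
    used = fields_used(index, query)
    if not used:
        return False
    return all(query[field] == "eq" for field in used[:-1])


INDEX = ("x", "y", "z")

for query in (
    {"x": "eq"},
    {"x": "eq", "z": "range"},
    {"x": "range", "y": "eq"},
    {"y": "eq", "z": "eq"},
):
    print(query, fields_used(INDEX, query), safe_before_1_6(INDEX, query))
```

The third query is the one that bites: it uses the index fields in a supported combination, but the range condition on `x` comes before the exact match on `y`, which is the shape the pre-1.6 caveat warns about.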


Lenovo Ideapad S12 clearance


The current generation of S12 with the ION graphics chipset has been discontinued, with all remaining inventory now in the Lenovo Outlet Store for $399. I’ve been quite happy with mine. The ION gives it decent performance for HD video and light gaming, and it has a full-sized keyboard and bright, crisp screen with decent resolution.

[Update: they also have hundreds of brand-new power supplies for $16. For that price, I can have one at home, one at the office, and one in the trunk of the car, and never carry one around. They also have a hundred or so of the 10-inch netbooks in a major scratch-and-dent sale ($220), and some refurbished 10-inch tablet netbooks]

Using MongoDB


Suppose you had a big XML file in an odd, complicated structure (such as JMdict_e, a Japanese-English dictionary), and you wanted to load it into a database for searching and editing. You could faithfully replicate the XML schema in a relational database, with carefully-chosen foreign keys and precisely-specified joins, and you might end up with something like this.

Go ahead, look at it. I’ll wait. Seriously, it deserves a look. All praise to Stuart for making it actually work, but damn.

Done? Okay, now let’s slurp the whole thing into MongoDB:

more...
