Since a new version of the free-as-in-fork LibreOffice package was just released, I thought I’d take a look and see if it’s gotten any easier to import formatted text.
The answer: “kinda”.
Good: It imports simple HTML and CSS.
Bad: …into a special “HTML” document type that must be exported to disk in ODT format and then reopened. Otherwise, any formatting that isn’t available for web use (generally the result of pasting from another document) will disappear from all menus and dialog boxes, silently fail to apply, or be deleted when you save.
[note that the Mac version crashed half a dozen times as I was exploring these behaviors, but it usually managed to open the documents on the second try]
Sadly, furigana are not considered compatible with HTML, so they’re stripped on import, making it rather a moot point that you can’t edit them in HTML mode. The only way to import text marked up with furigana is to generate a real XML-formatted, Zip-archived ODT file.
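For the record, “a real XML-formatted, Zip-archived ODT file” is less exotic than it sounds: it’s a Zip archive with a declared mimetype, a manifest, and a content.xml. Here’s a minimal sketch in Python that hand-builds one containing a single furigana-annotated word; the text:ruby markup and manifest layout follow my reading of the ODF spec, so treat it as a starting point rather than guaranteed-valid output:

import zipfile

MIMETYPE = "application/vnd.oasis.opendocument.text"

CONTENT = """<?xml version="1.0" encoding="UTF-8"?>
<office:document-content
 xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
 xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
 office:version="1.2">
 <office:body><office:text>
  <text:p><text:ruby>
   <text:ruby-base>漢字</text:ruby-base>
   <text:ruby-text>かんじ</text:ruby-text>
  </text:ruby></text:p>
 </office:text></office:body>
</office:document-content>"""

MANIFEST = """<?xml version="1.0" encoding="UTF-8"?>
<manifest:manifest
 xmlns:manifest="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0">
 <manifest:file-entry manifest:full-path="/" manifest:media-type="{}"/>
 <manifest:file-entry manifest:full-path="content.xml" manifest:media-type="text/xml"/>
</manifest:manifest>""".format(MIMETYPE)

with zipfile.ZipFile("furigana.odt", "w") as odt:
    # The mimetype entry must come first and be stored uncompressed.
    odt.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
    odt.writestr("META-INF/manifest.xml", MANIFEST, compress_type=zipfile.ZIP_DEFLATED)
    odt.writestr("content.xml", CONTENT, compress_type=zipfile.ZIP_DEFLATED)

If the markup is right, the result should open as a normal text document, furigana intact.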
Just spent a merry, no wait, hellish few hours fighting to get a LaTeX distribution up and running for the sole purpose of running a single script that uses it to convert marked-up Japanese text to PDF in convenient ebook sizes.
I failed. Or, more precisely, I got all the way to a DVI file that could be displayed quite nicely on screen, with all the kanji and furigana intact, but then the PDF converter that was part of the same TeX package that had generated it started barfing all over my screen, and I refused to spend more time on the project. I simply have no desire to navigate the layers and layers and layers of crap that TeX has acquired in its hacked-together support for modern fonts and encodings.
Honestly, if I want to generate cleanly-formatted Japanese text as a PDF, with furigana and vertical layout and custom page sizes, it takes 10,000 times less effort to spit out bog-standard HTML+CSS and feed it to Microsoft Word.
[Note to the MS-allergic: performing the equivalent import into OpenOffice is possible, but not reasonable. Getting basic unstyled plaintext+furigana wasn’t too bad, but anything more complicated would be an exercise in tedious XML debugging.]
[Update: gave it another go, and eventually discovered that running dvipdfmx with KPATHSEA_DEBUG=-1 in the environment returned a completely different search path than the kpsewhich tool used. Copying share/texmf/web2c/texmf.cnf.ptex to etc/texmf/texmf.cnf made all the problems go away. At least until the next time I upgrade something in MacPorts that recursively depends on something that obsoletes a recursive dependency of pTeX and hoses half my tools.
And, no, I can’t use the self-contained and centrally-managed TeX Live distribution (or the matching GUI-enabled MacTeX). That was the first hour I wasted. Its version of pTeX is apparently incompatible with one of the style files I needed.]
For a slightly-early birthday present, as part of a post-Thanksgiving sale I bought myself an OWC Data Doubler w/ 240GB SSD. After making three full backups of my laptop, I installed it, and have been enjoying it quite a bit. The kit mounts the SSD as a second drive in place of the optical drive, so you can use it in addition to the standard drive, which in my case is a 500GB Seagate hybrid. I’ve set up the SSD as the boot drive, with the 500GB as /Users/Shared, and moved my iTunes, iPhoto, and Aperture libraries there, as well as my big VMware images and an 80GB Windows partition.
[side note: the Seagate hybrid drives do provide a small-but-visible improvement in boot/launch performance, but the bulk of your data doesn’t gain any benefit from it, and software updates erase the speed boost until the drive adjusts to the new usage pattern. Dual-boot doesn’t help, either. An easy upgrade, but not a big one, IMHO.]
Good:
Bad:
So, file this little experiment under “expensive but worth it”. I do watch DVDs on my laptop, but only at home or in hotels, so the external drive isn’t a daily-carry accessory. The SSD has a Sandforce chipset and 7% over-provisioning, and is less than half full, so there’s no sign of performance degradation, and I don’t expect any. Aperture supports multiple libraries, so I can edit fresh material on the SSD, then move it to the hard drive when I’m done with it. Honestly, unless Apple releases MacBook Pro models that will take more than 8GB of RAM, I really see no need to buy a new one for quite a while.
When you send me a SQL statement that updates a 600,000-record table based on a join to a 900,000-record table, please make sure there are indexes involved. Also, please don’t test on a toy database.
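To make the request concrete, here’s a hypothetical reconstruction in Python and SQLite; the tables and column names are invented, but the shape of the problem is the same:

import sqlite3

# Invented tables standing in for the real 600,000- and 900,000-row ones.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (pk INTEGER PRIMARY KEY, join_key TEXT, val TEXT)")
conn.execute("CREATE TABLE huge (pk INTEGER PRIMARY KEY, join_key TEXT, flag INTEGER)")

# The one line that matters: without an index on the join column, the
# correlated subquery below scans all of 'huge' once per row of 'big'.
conn.execute("CREATE INDEX idx_huge_key ON huge (join_key)")

conn.execute("""
    UPDATE big
       SET val = (SELECT flag FROM huge
                   WHERE huge.join_key = big.join_key)
""")
conn.commit()

A quick EXPLAIN QUERY PLAN (or your database’s equivalent) before you hit send will tell you whether the join is actually using the index.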
Just got a complaint from a user about a Perl script that wasn’t handling regular expressions correctly. Specifically, when he typed:
ourspecial-cat | grep 'foo\|bar'
he got a match on “foo” or “bar”, but when he typed:
ourspecial-grep 'foo\|bar'
he got nothing at all.
My surprise came from the fact that the normal grep worked at all; everyone knows you need egrep for that kind of search, and since the entire regular expression was in single quotes, the backslash shouldn’t have been needed anyway. Removing the backslash made our tool do what he wanted, but broke grep.
Sure enough, if you leave out the backslash, you need to use egrep or grep -E, but if you put it in, you can use plain grep: in a basic regular expression, GNU grep treats \| as alternation, while in an extended one the backslash turns the | into a literal character. What makes it really fun is that they’re the same program, GNU Grep 2.5.1, and running egrep should be precisely the same as running grep -E.
Makes me wonder what other little surprises are hidden away in the tools I use every day…
Three basic rules to keep in mind when trying to index that massive crapload of data you just shoved into MongoDB:
The current generation of the Lenovo S12 with the NVIDIA ION graphics chipset has been discontinued, with all remaining inventory now in the Lenovo Outlet Store for $399. I’ve been quite happy with mine. The ION gives it decent performance for HD video and light gaming, and it has a full-sized keyboard and a bright, crisp screen with decent resolution.
[Update: they also have hundreds of brand-new power supplies for $16. For that price, I can have one at home, one at the office, and one in the trunk of the car, and never carry one around. They also have a hundred or so of the 10-inch netbooks in a major scratch-and-dent sale ($220), and some refurbished 10-inch tablet netbooks]
Suppose you had a big XML file in an odd, complicated structure (such as JMdict_e, a Japanese-English dictionary), and you wanted to load it into a database for searching and editing. You could faithfully replicate the XML schema in a relational database, with carefully-chosen foreign keys and precisely-specified joins, and you might end up with something like this.
Go ahead, look at it. I’ll wait. Seriously, it deserves a look. All praise to Stuart for making it actually work, but damn.
…
Done? Okay, now let’s slurp the whole thing into MongoDB:
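A minimal sketch of the idea in Python, assuming a local mongod and pymongo; the db/collection names are mine, and the element-to-dict conversion is deliberately naive (it ignores attributes, among other sins):

import xml.etree.ElementTree as ET
from pymongo import MongoClient

def to_dict(elem):
    # Flatten an element into a dict, keeping repeated children as lists.
    if len(elem) == 0:
        return elem.text or ""
    d = {}
    for child in elem:
        d.setdefault(child.tag, []).append(to_dict(child))
    return d

entries = MongoClient().jmdict.entries  # hypothetical db/collection names

# Stream the file so the whole dictionary never sits in memory at once.
for event, elem in ET.iterparse("JMdict_e"):
    if elem.tag == "entry":
        entries.insert_one(to_dict(elem))
        elem.clear()

Each entry becomes one self-contained document, so the searches that needed a dozen carefully-specified joins above turn into single lookups against nested fields.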