Tuesday, July 20 2010

Life with MongoDB

Before I ramble, let me sum up my feelings about MongoDB:

Good: easy to install, easy to start using, easy to dump a metric crapload of loosely-structured data into, easy to write basic queries and indexes, rapid high-quality (free!) support from the core development team, phenomenal cosmic power.

Bad: hard to install some drivers (Perl in particular pulls in an amazing number of poorly-specified dependencies), hard to get a metric crapload of data back out quickly, hard to write advanced queries and indexes, easy to reach “here there be dragons” territory outside the scope of documentation and guarantees and available tools, easy to lock up a server process for minutes or hours with simple operations, easy to lose everything unless you replicate to at least one other machine that has independent power, eats RAM like you wouldn’t believe.

It is, if you like, the Perl of databases, immensely powerful and versatile, but extremely dangerous in the hands of amateurs. And, like early releases of Perl, it is a rapidly-moving target that grows new functionality every time you turn your back on it.

The past few weeks have been a heady mix of excitement, frustration, satisfaction, and annoyance, as my first large-scale use of MongoDB has moved from simple prototype, to initial production version, through major bugs and poorly-documented limitations (often “already fixed in next release”), to what I hope is finally “stuff I can just leave running and go work on other projects again”.

So that you have an idea of the scale and scope of the project I’ve been working on, our products report their health and status to us via encrypted syslog, and standard logging tools were not coping with the volume. The goal was simply storing the logs in a way that made it possible to view the most recent entries for a device (tail -f), search through a day’s logs for one or more devices (grep), and view one or more devices’ logs for one or more days (cat).

Easy enough in general, except that we’re talking about over 130,000,000 incoming syslog records per day, from over 100,000 different devices. Just preventing that from being a self-inflicted DDOS attack has kept us entertained on and off for weeks in the past, but storing it turned out to be as painful as receiving it. Syslog-ng leaks memory and filehandles so rapidly that we have to restart it every three hours to keep it from locking up the server, and attempting to rapidly write to 100,000 open files on a single file system does “interesting” things to the EXT3 kernel drivers.

(this, by the way, is why we had switched the importer to running on OpenBSD a while back: BSD with FFS2 doesn’t lose track of the filesystem state the same way that Linux does with EXT3. It has other entertaining problems, including a willingness to panic when network buffers fill up, but it was less of a headache than trying to predict how often to reboot the CentOS machine to keep the kernel from losing track of the data)

MongoDB has two features that looked tailor-made for my needs: GridFS and capped collections.

GridFS is a db-based filesystem, allowing you to store data of any size with arbitrary metadata; I use it for aggregated daily logs (“$host/$date”), and I’m up to 6 million daily logfiles taking up 1.1 terabytes of space. Retrieval of entire files or individual blocks is nearly as fast as reading flat files, and significantly more findable. If this was all you could do with MongoDB, it would be one of my favorite new tools.

Capped collections are designed for logging, and offer two guarantees that standard collections do not: fixed size and preservation of insert ordering. You set the size up front, and when the collection fills up, it wraps around, deleting the oldest records, so that at any given time you have the most recent N bytes worth of data (and, yes, it’s specified in bytes, despite the fact that you can create terabyte-scale capped collections if you want to, which you don’t). Preserving insert order allows you to be sure that a query for host X will return X’s logs in the order that they arrived.

But there’s a catch, and it’s a big one: insert order is only preserved if your query performs a full table scan. Add an index to a capped collection, and retrieval is much faster, but items with identical values for the index key will be returned in disk-allocation order (current through v1.5.5; they may well decide to fix this soon). As soon as the collection wraps back around to the first block, the newest data will now appear before all earlier data, which was quite an unpleasant surprise one day.

Rolling the data over into the GridFS restored the correct order, because I used their mongoexport tool to dump the entire collection, which was a non-indexed query and therefore in insert order. I simply had to collate the JSON-format dump and insert everything into the grid.

Sadly, that wasn’t the end of my troubles, and in fact I’m writing this a bit out of order, because before I discovered the joys of index ordering, I was pummeled with the fact that the indexes were growing without limit. My collection was capped at 90GB, and the server dutifully deleted the oldest records, but it never deleted their index entries. My index was, at the time, growing at 15GB/day for 25GB/day of data, and MongoDB performance is directly proportional to how much of the index (and, indeed, the data) fits in physical memory. On a server with 24GB of RAM, one day was really fast, two days were pretty fast, three days were occasionally a little slow, but by the sixth day it was paging like mad, holding locks for minutes at a time.

And, by the way, read locks were blocking write locks were blocking read locks were blocking authentication requests were driving J to drink. The good news is that both the index cleanup and the insane lock contention problems were fixed in 1.5.4, and will be in the new stable 1.6 version at the end of this month. Don’t start developing anything with the current stable 1.4.4 release; develop with 1.5.5, deploy with 1.6.x.

My current, and hopefully lasting, solution is to use alternating daily capped collections, big enough not to wrap around. The client scripts know to search log_$date for today and, if it hasn’t reached the GridFS yet, yesterday, and as soon as the data is exported from yesterday’s log, it is dropped and recreated as tomorrow’s, which frees up the index space as well. This keeps the current day’s index small enough to fit in memory, and decreases paging for the data as well. The full export of yesterday’s log takes a little over an hour, so there’s only a fairly brief period when I have more than one day’s worth of indexes in the database at all.

The reason that it might not be a lasting solution is that it’s not guaranteed that a new capped collection made from released space will have the logical order of its blocks match the on-disk order. They honestly weren’t sure what I’m going to see after a few drop/create cycles.

Which means I need a Plan B (er, “G” by now, I think), and that’s a compound index on host and timestamp. But that not only increases the size of the index by a few gigabytes/day (ow, my aching RAM!), it increases the size of the data as well, unless I can use a timestamp that’s already present in the collection, like the automatically-generated _id field.

Unfortunately, the standard ObjectID generated by all supported drivers is 12 bytes containing (big-endian) a 4-byte timestamp in seconds, a 3-byte hostid, a 2-byte process id, and a 3-byte counter. With over 2,000 inserts/second, it is guaranteed to be out-of-order for me.

Fortunately, the ObjectID is generated by the drivers, not the server, so it was easy for me to replace theirs with one based on gettimeofday(), which includes microseconds. My IDs are now 4-byte seconds, 4-byte microseconds, 2-byte counter, and 2-byte process id. On a fast 64-bit Linux box, 10,000,000 such IDs generated in a very tight loop were all in increasing order. A very few managed to hit the same microsecond, but the counter is large enough that the chance they’d be out-of-order is very small.

Once I switch to the compound index, the only remaining feature of the capped collection would be its fixed size, and that’s actually a drawback, because I have to keep track of how fast we sell products, and how often developers whimsically increase logging. I may well switch to standard collections at that point, using the daily swap to keep the size under control, but first I have to learn one very important thing: how long it takes to export all of the data using the new compound index.

I know that I can’t use mongoexport to do that, and I know that individual queries by host would take more than twelve hours, so I have to test the results of an empty query with an index hint (that is, in their awkward query “language”, db.log.find({}).hint({host:1,_id:1})). I know that it will return neatly-collated data in (host,_id) order; I just don’t know how long it will take.

In the end, despite the redesigns and workarounds I’ve had to go through to get this finished, the bottom line is that our logs were in better shape after a week of development with MongoDB than they’d ever been, and are now significantly more useful for finding and resolving customer problems. Improvements planned for 1.6 and beyond will give us even more benefit. Other groups in the company are planning to implement their projects with MongoDB, and I think it will work out great. They just need to make sure their budget has room for at least two servers with a lot of RAM.

Most of my personal projects either have been, or soon will be, moved to MongoDB, because once you manage to wrap your head around it, it offers ridiculous power for working with messy data. Just like Perl.