Effective index use in MongoDB

Three basic rules to keep in mind when trying to index that massive crapload of data you just shoved into MongoDB:

  1. All indexes must fit in physical memory for decent performance.
  2. A query can only use one index.
  3. A single compound index on (x,y,z) can be used for queries on (x), (x,y), (x,z), or (x,y,z). However, prior to 1.6, all but the last field used from the index had to be an exact match, or you might get a full table scan instead.

Indexes are big, and RAM is small, and it’s very easy to run yourself out of memory and watch performance degrade. If the working set is small, you can get by for quite a while, but any operation that does a full table scan will kill you, and any operation that has to page in a substantial number of index pages will at least wound you. The reason is simple: MongoDB memory-maps all of its data into VM and lets the OS figure out how to manage it, which means that paging blocks the server process completely.

To make this concrete, in my logging app, one of the most common operations is the equivalent of the Unix “tail -f” command: continuously show the most recent records for a specific host. As stored in the DB, a full day’s logs take up about 30 GB of disk space, plus another 24 GB for two indexes. The index I actually use is on (hostname, object-id), and master-slave replication adds another index on just object-id. The standard MongoDB object-id is not exactly monotonically increasing, so I’ve replaced it with one that is. This means that a query on hostname returns records in the order they arrived, which is what logging is all about.

The naive way to find the last X of Y records matching a query is to do a find(), a count(), and then skip() to the end. Never do this. Don’t use skip(). Don’t even use count(). Why not? If X is 10, and Y is 2.5 million, and the total size of the collection is 150 million, the count() will page in a bunch of index pages, and the skip() will page in a bunch of data pages (apparently; this is inferred from the runtimes…). Say goodbye to your server. I’ve seen this take over five minutes, and meanwhile everything else is backing up, and users are aborting and retrying, which adds even more queries to the end of the request queue.

The next step up is to do what you’d do in SQL: sort by descending object-id and limit the results. This is better, but still not good, because my compound index was built on ascending hostname and object-id, and is, as the docs say, “less effective” when used backwards.

The lightning-fast way, which avoids all the locking, paging, sorting, counting, and just plain fiddling, is to exploit the fact that the object-id starts with a timestamp, and do a range query on it. In both the standard and my custom object-ids, the first four bytes contain the time in seconds since the Unix epoch, so instead of trying to locate the last X matching records, I can trivially locate the last X seconds worth of matching records. Just get the current time, subtract X, generate a phony object-id from the result, and ask for all records for the host with object-id greater than that.

This is a multiple-orders-of-magnitude improvement over the naive solution, and considerably faster than the “ORDER BY object-id DESC LIMIT 10” SQL-style solution. Most importantly, it’s guaranteed to only hit the most recent index and data pages, which are also the ones most likely to already be in memory. It’s actually faster for hosts that log millions of lines per day than the ones that only log a few hundred, because I can start out by asking for the last 5 seconds, then 15, 60, etc, and finally 86400 for the really light loggers.

And once I retrieve the records, I can save the object-id of the last one and use it in my next query, for an equally fast “give me what arrived since the last time I asked”. I was already using that for fast follow-up queries, it just hadn’t occurred to me to exploit it for the initial query. Thought of it while driving home last night and had it in Production half an hour later.