Thursday, July 13 2006

“Why is that server ticking?”

[this is the full story behind my previous entry on the trouble with tar pipelines]

One morning, I logged in from home to check the status of our automated nightly builds. The earliest builds had worked, but most of the builds that started after 4am had failed, in a disturbing way: the Perforce (source control) server was reporting data corruption in several files needed for the build.

This was server-side corruption, but with no obvious cause. Nothing in the Perforce logs, nothing in dmesg output, nothing in the RAID status output. Nothing. Since the problem started at 4am, I used find to see what had changed, and found that 66 of the 600,000+ versioned data files managed by our Perforce server had changed between 4:01am and 4:11am, and the list included the files our nightly builds had failed on. There were no checkins in this period, so there should have been no changed files at all.

A quick look at the contents of the files revealed the problem: they were all truncated. Not to 0 bytes, but to some random multiple of 512 bytes. None of them contained any garbage, they just ended early. A 24-hour-old backup confirmed what they should have looked like, but I couldn’t just restore from it; all of those files had changed the day before, and Perforce uses RCS-style diffs to store versions.

[side note: my runs-every-hour backup was useless, because it kicked off at 4:10am, and cheerfully picked up the truncated files; I have since added a separate runs-every-three-hours backup to the system]

I was stumped. If it was a server, file system, RAID, disk, or controller error, I’d expect to see some garbage somewhere in a file, and truncation at some other size, perhaps 1K or 4K blocks. Then one of the other guys in our group noticed that those 66 files, and only those 66 files, were now owned by root.

Hmm, is there a root cron job that kicks off at 4am? Why, yes, there is! And it’s… a backup of the Perforce data! Several years ago, someone wrote a script that does an incremental backup of the versioned data to another server mounted via NFS. My hourly backups use rsync, but this one uses tar.

Badly:

cd ~perforce/really_important_source_code
find . -mtime -1 -print > $INTFILES
tar cpf - -T $INTFILES | (cd /mountpoint/subdir; tar xpf -)

Guess what happens when you can’t cd to /mountpoint/subdir, for any reason…

Useful information for getting yourself out of this mess: Perforce proxy servers store their cache in the exact same format as the main server, and even if they don’t contain every version, as long as someone has requested the current tip-of-tree revision through that proxy, the diffs will all match. Also, binary files are stored as separate numbered files compressed with the default gzip options, so you can have the user who checked it in send you a fresh copy. Working carefully, you can quickly (read: “less than a day”) get all of the data to match the MD5 checksums that Perforce maintains.

And then you replace that backup script with a new one…