Saturday, March 6 2004

Latchkey Zombies in Solaris

A funny thing happened when we upgraded our servers from Solaris 2.5.1 several years ago: when we killed a process, frequently its parent wouldn’t notice. This was annoying, since a lot of our Operations processes were built around killing and restarting services so they’d notice changes in a controlled fashion.

We filed a bug with our developers, suspecting that the somewhat baroque signal-handling in our service framework was now broken. They were never able to reproduce the problem outside of the Production service, but they made some changes, and it became less frequent. Still common enough to be annoying, but harder to reproduce.

On and off since then, we’ve had someone poke at the problem and try to solve it. About a year ago, we ran out of motivated developers and it was handed to someone in our group, who spent several months completely rewriting that subsystem because its design offended him. After half a dozen tests of his code, I was able to verify that it was a complete replacement for the old one. Bug-compatible, even, which meant we still hadn’t made any progress.

This led us to suspect a kernel bug, which was great, except that our test case was “run this proprietary code for at least 48 hours on a machine that’s serving 30,000 paying customers, and there’s a 50% chance that a killed process won’t be reported to its parent”. This is not the sort of minimal repeat-by that vendors want to see.

The programmer was ready to give up, and suggested we just scrap our code and modify daemontools to do what we wanted. My screams of horror were met with stunned disbelief; apparently he hadn’t realized that we weren’t kidding the last time we shot down a suggestion to use one of DJB’s tools. He was gently persuaded that this was not a viable alternative.

Suddenly, I was struck with an idea: “Why don’t we just ignore the zombies?” We knew the service was dead, we knew that the zombies weren’t listening on a port or consuming lots of memory, and we knew that the kill/restart procedures were only run occasionally, so why worry about them? Maybe they’d go away on their own, maybe they’d hang around until the next full service restart, but it’s not like they were eating brains or anything.
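
For the curious, the idea boils down to something like the sketch below. It is Python rather than anything resembling our service framework, and every name in it (stop_service, demo, the timeout, the "reaped" and "abandoned" return values) is made up for illustration. The parent kills the child, gives the kernel a few seconds to report the exit, and if the report never arrives it shrugs and carries on with the restart anyway.

```python
import os
import signal
import subprocess
import sys
import time


def stop_service(pid, timeout=5.0, poll_interval=0.1):
    """Kill a child and wait briefly for its exit to be reported.

    Returns "reaped" if waitpid() reported the exit, or "abandoned" if
    no report arrived within `timeout` seconds and we stopped waiting.
    (Hypothetical helper; names and values are illustrative only.)
    """
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # WNOHANG: return immediately instead of blocking forever
            # on an exit notification that may never show up.
            reaped_pid, _status = os.waitpid(pid, os.WNOHANG)
        except ChildProcessError:
            # Nothing left to wait for: someone else already reaped it.
            return "reaped"
        if reaped_pid == pid:
            return "reaped"
        time.sleep(poll_interval)
    # The exit was never reported. Stop caring and let the caller
    # restart the service; any zombie stays in the process table.
    return "abandoned"


def demo():
    """Start a throwaway child that just sleeps, then stop it."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    return stop_service(child.pid)
```

On a healthy system demo() should take the normal path, since the child dies and is reaped almost immediately. The "abandoned" branch is the whole workaround: the caller treats both results as "the service is down" and gets on with the restart.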

The programmer wanted to try this workaround in his version (it was, after all, obviously superior), but I was able to persuade him to make it work in the original code. He didn’t understand why until we tried to get the patch rolled out, and a small army of managers and project managers started haggling over code reviews, testing, release schedules, etc.

Estimated time to roll out the patch to Production: four weeks. Estimated time to roll out his completely new version: four months. Care to guess which one we’re going with?