Saturday, August 9 2003

The Perl Script From Hell

I’ve been working with Perl since about two weeks before version 2.0 was released. Over those fifteen years, I’ve seen a lot of hairy Perl scripts, many of them mine.

None of them can compare to the monster that lurks in the depths of our service, though. Over 8,000 lines of Perl plus an 8,000-line C++ module, written in a style that’s allegedly Object Oriented, but which I would describe as Obscenely Obfuscated (“Hi, Andrew!”).

We have five large servers devoted to running it. Each contributes three CPUs, three gigabytes of memory, and 25 hours of runtime to the task (independently; we need the redundancy if one of them crashes). Five years ago, I swore a mighty oath to never, ever get involved with the damned thing.

Then it broke. In a way that involved tens of thousands of unhappy customers.

Part of my determination to steer clear of this turkey was that no one could ever figure out why it chewed up so many resources and ran for so long. Several of us have a pretty decent grasp of its input and output, and the process by which one becomes the other, but that doesn’t really help.

For more fun, it keeps internal state in an odd way, so that if it crashes somewhere in the middle, you can restart it and have it pick up where it left off. In fact, you must restart it this way, to keep it from newfs’ing the wrong file system.

Yes, it runs as root and automatically formats a file system every time it starts up. Did I forget to mention that part? “Good Heavens, Why?!?!?!,” you might ask. The short version is that there used to be this OS bug where deleting and recreating tens of thousands of files every day gradually degraded file system performance to the point where the 25-hour job took nearly 40 hours. The developer came up with this extremely clever workaround one day, and didn’t tell anyone in Operations until the code had to be in Production the next day.

But I digress.

Despite my determination, I have in fact learned a few things about it over the years, enough to be able to stop the occasional ambitious new developer from modifying it in disastrous ways. It’s been risky to admit this knowledge, but I got away with it for a long time.

As you might guess, Operations tends to treat this script like black magic, and has developed a variety of superstitions and incantations to coax it back to life whenever it fails in a non-trivial way.

There was one failure in particular that we’d never found an explanation for: every few months, instead of generating seven days’ worth of data in a run, it would generate six. Then five, four, and sometimes down to three before it would suddenly be back to normal. Even though it generated a log file nearly the size of the script, it never contained a clue to the problem. In fact, the only clue we ever got wasn’t in the logs at all. It only appeared in email sent by the cron job that kicked off the script:

Oh shit: fState != kParseError
Oh dung of a thousand camels - somebody called Debugger()!

(For obscure reasons, this email didn’t go to anyone who cared about the script. We discovered it by accident during the events that follow.)

Finally, the day came when the script didn’t recover on its own. It was only generating one day’s worth of data, and customers were complaining. Worse, these were our satellite customers, so it was our partner’s help desk that was getting all the complaints. The word came down from on high: This Must Be Fixed Today.

They called me at home on a Saturday afternoon and offered the assistance of Matt, an experienced C++ developer who had been working with our service code for a while. We had until around 11pm to get seven days of good data uploaded to the satellite. The “or else” was left implicit. Three other people were volunteered as our support crew, forced to hang around until we had a fix they could test. Lucky them.

We had one stroke of luck: a separate server generated the output for the satellite customers. It didn’t have to grovel over the full data set for 25 hours. It could complete a full run in about 20 minutes, and the incomplete and buggy run only took 8 minutes.

Well, 16 minutes, actually, because the internal state-checking code got confused and picked up from after the place where the script was choking. So we had to run it twice after each of our changes; once to clear out the state, once to test the change. I think our first hour was wasted discovering this little fact.

With that out of the way, and the famous “Oh shit” message located and deciphered, some good old-fashioned printf() debugging quickly narrowed the problem down to a particular function in the C++ library, which converts the in-memory data structure into a compressed format and returns it to the Perl script to be written out to a file. Unfortunately, the code was written in a familiar style known as “do a whole bunch of things and don’t check for errors until you reach the very end.” At 16 minutes per test, this was a bit annoying.

Matt’s cheerfully verbose printf() statements finally found the culprit: a statically-allocated buffer (declared in the XS file that glued the library to Perl) that was just too damn small for certain combinations of input data. Writes into the buffer weren’t checked for errors until just before returning to Perl.

The script was processing all of the days that fit into the buffer, and then choking on the one that didn’t fit. In the previous incidents where it had recovered on its own, it had simply downloaded fresh source data that generated slightly smaller output than the preceding runs.

We made the buffer larger, watched the script complete successfully, and at 9pm we met our goal and sent the data off. I composed a lengthy email to all of the managers who’d gotten involved, clearing up exactly what was and wasn’t broken, and assuring the (rightfully suspicious) satellite folks that the new data really was good, and should be sent up to the satellite immediately.

About an hour later, I realized something terrible: it had been crashing on Saturday’s data file. The Sunday run wouldn’t use that data, so if we hadn’t figured it out that night, the problem would have silently vanished again, without explanation.