Friday, July 18 2003

Backups? What backups?

A.J. was worried. For several months, he’d been growing more and more concerned about the reliability of the Unix server backup system that he operated every day. He was just the latest in a long string of junior contractors paid to change tapes, but he actually cared about doing a good job, and something wasn’t right.

He had raised his concerns with the manager of Core Services and the Senior System Administrators who were responsible for the corporate infrastructure, but they assured him that any problems were only temporary, and that he should wait until they had the new system in place. A.J. resigned himself to pretending to do his job, and grudgingly agreed to stall for more time whenever a restore was requested that he couldn’t accomplish.

And then the system just stopped working.

One fine late-February morning, A.J. sat down in his cubicle and fired off the daily backup script. It said, more or less, “For today’s run, we’ve got 40 tapes’ worth of data; please load the following 0 tapes.” Not good. He carefully reviewed the procedure he’d been following for several months and tried again.

“…load the following 0 tapes.”

Deeply disturbed, he headed down to Core Services to seek help from the infrastructure team, but none of them were around, and no one knew where they were or when they’d be back. While searching vainly for them, he found Drew, who knew something about the backup system.

Drew ran the daily script and verified its new behavior. He ran a few other commands to see what was going on, and was surprised to learn that A.J. had never heard of any of them. “They’re not in my instructions anywhere.”

“Well,” Drew said, “let’s forget about that for now and just see how well the system has been working recently. Where are the daily overview graphs?”

“What are those?”

“The printouts that the system generates every day showing how recently every server on the network has been backed up.”

“I’ve never seen anything like that,” said A.J. “I didn’t even know it had reports like that. Nobody told us.”

Beginning to understand the scope of the problem, Drew got worried. He wasn’t really an expert on the backup system, but he’d helped get it set up, and he knew this shouldn’t be happening. There were too many built-in safeguards for the system to just stop this way.

The contractor who had originally installed it and trained the first batch of operators was long gone, but Drew knew that one person in the company knew how it worked, because he’d helped write it: me.

Long ago and far away, I’d worked in a university Computer Science department that had needed a basically bulletproof backup system. Nothing had met our needs, so we rolled our own, and set it up so that our cheap and sometimes unreliable labor force (undergrads) couldn’t screw things up without raising red flags all over the place. Not only would the system draw you pretty pictures showing how current the backups were, individual servers would complain loudly whenever they noticed a lack of recent backups.

What had gone wrong?

Drew and A.J. found me and explained the situation. I ran all the same commands, confirmed all their claims, and got very, very worried. I had initially suspected some minor corruption in the tape database, but when I ran the recovery script, I discovered that the database was hopelessly corrupt. With the daily reports also missing, we had nothing to tell us which file servers were on which tape, as of when.

Nothing online, that is. Remember how the system was designed to be bulletproof? Each tape written by the system had a label file at the beginning, containing full information about the contents, as well as its reference number in the database. I asked A.J. to fetch a few recent tapes for testing, so that once we knew they had good labels, we could slowly, painfully rebuild the database by hand.

Most of the tapes were blank. The few that had valid label files contained backups months older than our limited online records claimed. In a near-panic, I fired off a script that checked the dumpdates files on each and every file server. A few minutes later, we learned the terrible truth: two-thirds of the servers hadn’t been backed up in six months, and we had no idea where to find the backups that did exist. Source code, financial data, customer bug reports, email, contracts: all were at risk.
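For readers who have never had to run a check like that, here is a minimal sketch in Python of the general idea. It is not the script used that day. It assumes the classic dumpdates layout of one line per file system and dump level (device, level, then a ctime-style date), and the function names, device names, and seven-day threshold are all made up for illustration.

```python
from datetime import datetime, timedelta


def parse_dumpdates(text):
    """Return {filesystem: most recent dump time} from dumpdates-style text.

    Assumed line layout (an assumption, not taken from the story):
        /dev/rsd0a 0 Mon Feb 24 03:15:01 2003
    """
    latest = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        filesystem = fields[0]
        try:
            stamp = datetime.strptime(" ".join(fields[2:7]),
                                      "%a %b %d %H:%M:%S %Y")
        except ValueError:
            continue  # skip lines whose date does not parse
        # Keep only the newest dump of any level for each file system.
        if filesystem not in latest or stamp > latest[filesystem]:
            latest[filesystem] = stamp
    return latest


def stale_filesystems(text, now, max_age_days=7):
    """List file systems whose newest dump is older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    newest = parse_dumpdates(text)
    return sorted(fs for fs, stamp in newest.items() if stamp < cutoff)


if __name__ == "__main__":
    sample = (
        "/dev/rsd0a 0 Mon Feb 24 03:15:01 2003\n"
        "/dev/rsd1a 0 Thu Aug 15 02:00:00 2002\n"
        "/dev/rsd1a 5 Thu Aug 22 02:00:00 2002\n"
    )
    for fs in stale_filesystems(sample, now=datetime(2003, 2, 26)):
        print("no recent backup:", fs)
```

In practice the text would be collected from each server's own dumpdates file, and a real check would have to cope with whatever variations in format the local dump program produces.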

Suddenly, we remembered a meeting that had taken place back in September. We were moving a bunch of servers into a new data center, and in the final planning meeting, a sysadmin from Core Services had stood up and told us that we should make backups of all of our servers before moving them, because their organization might not be able to provide restores. This did not go over well. Soon after the meeting ended, she had sent out email clarifying her earlier statement, insisting that she’d only meant that they would have difficulty locating the appropriate tapes quickly enough to meet the tight deadlines for this weekend move.

As we stared, shocked, at the screen, we realized that her first statement had been the simple truth, and the second a lie to cover it up. They knew. They’d known since September. And they’d allowed it to continue.

What did I do? I turned to A.J. and said, “Bring me a box of blank tapes and a list of every working tape drive in the company, and don’t make plans for the weekend.” I quickly knocked together a Perl script to split up the file systems and get them onto tape, and we worked together until it was done. Then we cleaned up the script and started a rotation scheme, making absolutely certain that everything got backed up at least once a week.
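The Perl script itself is not reproduced here. As a rough illustration of the planning problem it had to solve, the following Python sketch packs file systems onto tapes largest-first (a first-fit-decreasing heuristic). The function name, the sizes, and the tape capacity are hypothetical, and the weekend script may well have split things up differently.

```python
def plan_tapes(sizes_gb, tape_capacity_gb):
    """Group file systems onto tapes using first-fit decreasing.

    sizes_gb maps a file system name to its size in GB (hypothetical data).
    A file system larger than one tape gets a group of its own, on the
    assumption that the dump program can span multiple volumes.
    """
    tapes = []  # each entry: [remaining capacity, [file system names]]
    # Largest first; ties broken by name so the plan is deterministic.
    ordered = sorted(sizes_gb.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, size in ordered:
        for tape in tapes:
            if size <= tape[0]:
                tape[0] -= size
                tape[1].append(name)
                break
        else:
            # Nothing had room, so start a new tape.
            tapes.append([max(tape_capacity_gb - size, 0), [name]])
    return [names for _, names in tapes]


if __name__ == "__main__":
    sizes = {"/home": 30, "/src": 25, "/mail": 10, "/var": 5}
    for number, names in enumerate(plan_tapes(sizes, 40), start=1):
        print("tape", number, ":", ", ".join(names))
```

Each group could then be handed to whichever tape drive is free, which is one simple way to keep every working drive busy over a weekend.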

What I didn’t do is something that I regret to this day: I didn’t push to have the Core Services manager fired for putting the entire company at risk with his gross negligence. For months he’d been deceiving the rest of the sysadmins, ordering a junior contractor to make false promises about restores, and pressuring the members of his own team to participate in the cover-up. What had he done in that time to actually fix the problem? Ordered an evaluation license of BudTool, and asked one of his people to try it out in her spare time.

It wasn’t until later in the week that I found the real smoking gun. One of the things that helped prevent the tape database from becoming corrupt was a daily maintenance script that cleaned out garbage data and implemented the tape-recycling policy. It had been deliberately turned off at the beginning of January by one of the Core Services sysadmins, in the mistaken belief that the legal department didn’t want us to reuse tapes any more, and wanted them preserved forever instead. No one was ever able to explain where they’d gotten this idea.

When the story got out, the idiot manager went into ass-covering mode with a vengeance. He Arranged Meetings. He Expressed Concerns. He Supported J’s Efforts To Implement An Interim Solution. He Organized A Project To Solve The Problem For Good. He did everything except take responsibility for creating the situation in the first place, and I was too busy repairing the damage to demand that he be taken out back and shot.

To no one’s surprise, the “new backup system project” was a complete fiasco. The only two people on the team from outside Core had every single one of their objections overruled, and the final report three months later was falsely presented as the team’s unanimous recommendation.

I left the company not long after that. A year later, they were still using the Perl scripts that I had knocked together that weekend. Why? Because the idiot manager had left the Core Services group and misplaced all of the bids for the new system, so they had to go back to the vendors and start over. The resulting system was about half the size it needed to be to actually back up everything, so they eventually had to cough up the cash for even more hardware, and then they had no real leverage with the vendor to get a good price.

About a year and a half later, my new company was looking for a manager for my group, and guess who applied for the job? Go on, take a wild guess.

Ready for the punchline? Among his recent accomplishments, he claimed that he’d managed the successful implementation of a new company-wide Unix backup system. I almost puked on his résumé.