Wednesday, August 13, 2003

A perfectly reasonable panic

Once every three months, we sent the whole company home while we tore the computer room apart and did all sorts of maintenance work. During my first quarterly downtime, the top item on my list was installing a new BOSS controller into the Solbourne that was our primary Oracle database server. Like any good database, it needed an occasional disk infusion to keep it happy, and there was no room on the existing SCSI controllers.

So I had a disk tray, a bunch of shiny new disks, a controller card, and media to upgrade the OS with. The BOSS was only supported in the latest OS release, and this being the server that kept the books, it got upgraded only when necessary.

For those who never got to play with Solbourne computers, the short explanation is that the company had figured out how to molest SunOS 4 to support multiple processors. Not the fancy symmetric multiprocessing that kids can buy on the street today, but still a serious performance gain for things like Oracle servers. The hardware could be a bit twitchy, and swapping parts between two allegedly-identical models might convert two working servers into crashaholics, but they were worth it.

Obviously, I made a full backup of the system before starting the required OS upgrade. By great good fortune (okay, someone planned ahead), there was a spare partition on one of the disks that was big enough to install the new OS onto, so that if something went wrong, I could revert to the old OS without restoring from tape.

Good thing, too, because it didn’t work. The OS upgrade went fine, but the BOSS controller didn’t work at all. We were using them in several other servers, and a merry hour of parts-swapping suggested that the ROMs on the I/O board were too old. I confirmed this by booting it with the I/O board out of another server, but unfortunately I couldn’t keep it, and we didn’t have any more ROMs in stock.

So I disconnected the new hardware, rolled back to the old OS, and logged in to check things out. The first thing I noticed was that all the board-swapping had reset the clock, so it thought the date was December 31, 1969 (the Unix epoch, as seen from a US time zone). I fixed that and continued poking around, and just as I was convinced the machine was fine, it panicked and crashed.

Not good. Had all my parts-swapping destroyed the stability of the most important database server in the company? I scrolled back in the console log, read the panic message, and started to laugh.

It looked something like this: “watchdog reset: CPU timeout…”

As I said, this wasn’t modern symmetric multiprocessing. There was a master CPU and a slave CPU, and they were crudely synchronized. About thirty seconds after I fixed the clock, the slave woke up and realized that it hadn’t heard from the master in more than twenty years. I couldn’t really blame it for panicking.
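The failure mode can be sketched in a few lines. This is a hypothetical reconstruction, not the actual Solbourne kernel logic (which I don't have): the timeout value, function name, and timestamps are all made up for illustration. The point is just that a watchdog comparing wall-clock timestamps goes haywire when someone fixes a clock that had been reset to the epoch.

```python
# Hypothetical sketch of a wall-clock-based watchdog check.
# The real Solbourne master/slave synchronization code is not
# reproduced here; names and values are invented for illustration.

WATCHDOG_TIMEOUT = 30  # assumed: max seconds between master heartbeats

def check_watchdog(last_heartbeat: int, now: int) -> str:
    """Compare two wall-clock timestamps (seconds since the epoch).

    Returns 'ok' if the master has been heard from recently,
    otherwise a panic message like the one in the console log.
    """
    if now - last_heartbeat > WATCHDOG_TIMEOUT:
        return "watchdog reset: CPU timeout"
    return "ok"

# The board swap reset the clock to the epoch, so the slave's record
# of the last heartbeat was stamped with a time near zero...
last_beat = 5

# ...then the admin set the clock to the real date, jumping "now"
# forward by decades (roughly 25 years, in seconds):
now_fixed = 25 * 365 * 24 * 3600

print(check_watchdog(last_beat, now_fixed))
```

From the slave CPU's point of view, the master really had been silent for more than twenty years of wall-clock time, so the panic was, as the title says, perfectly reasonable.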