A few small repairs…


Early yesterday morning, I got email from a server whose RAID 10 array was rebuilding. As far as I could tell, one of the SSDs had briefly gone offline, just long enough to force the controller to resync it.

Mildly disturbing. I told the team to make sure we had a cold spare ready to go, and to be prepared to swap it in if we saw anything else unusual before we had a chance to schedule a maintenance window.

Early this morning, two SSDs failed in that server, neither of them the one that blipped yesterday. That reeks of RAID controller failure, and since we didn’t have an identical one on hand with identical firmware, the best bet was moving the whole damn thing to completely different hardware (more precisely, our shiny new VMware/Tegile cluster).

Fortunately it’s only half a terabyte, backed up at least three different ways, and everything’s on a 10G network, but pretty much all of engineering is twiddling their thumbs until it’s back, so “no pressure”.

Update

Start to finish, it took about 7 hours from the time we pulled the trigger on the move. A good chunk of that was spent checksumming the data and copying back the dozen or so files that were corrupted.
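
For anyone wondering what a checksum pass like that looks like, here is a minimal Python sketch. It is not the tooling actually used for this move; the function names, the choice of SHA-256, and the assumption that both trees are mounted locally are all made up for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(reference_root, copy_root):
    """Return relative paths whose copy is missing or differs from the reference.

    Both arguments are hypothetical directory roots: a known-good backup
    and the freshly migrated copy.
    """
    reference_root = Path(reference_root)
    copy_root = Path(copy_root)
    mismatched = []
    for ref_file in sorted(reference_root.rglob("*")):
        if not ref_file.is_file():
            continue
        relative = ref_file.relative_to(reference_root)
        candidate = copy_root / relative
        if not candidate.is_file() or sha256_of(candidate) != sha256_of(ref_file):
            mismatched.append(relative.as_posix())
    return mismatched
```

Anything the function returns is a candidate for copying back from the known-good side. Note that it only walks the reference tree, so extra files that exist only in the copy are not reported.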

Now I’m just watching the rsync backup run with “-c” to make sure the corrupted data didn’t propagate. Honestly, it would be faster to blow away the destination and do a regular rsync, but then I’d have one less mostly-valid backup for N hours. I don’t really care how long it takes to run, and doing it this way reassures any management types who ask questions later.
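
Why “-c” matters here: by default rsync skips any file whose size and modification time match on both sides, so a file that was corrupted without either of those changing never gets a second look. The Python sketch below mimics the two comparisons to show the difference. It illustrates the idea only and is not rsync’s actual code; the function names are hypothetical.

```python
import hashlib
import os


def quick_check_same(src, dst):
    """Mimic rsync's default test: same size and same modification time."""
    a, b = os.stat(src), os.stat(dst)
    return a.st_size == b.st_size and int(a.st_mtime) == int(b.st_mtime)


def checksum_same(src, dst):
    """Mimic the idea behind -c: compare contents via a hash.

    Reads each file whole for brevity; real tools hash in chunks.
    """
    def digest(path):
        with open(path, "rb") as handle:
            return hashlib.sha256(handle.read()).hexdigest()
    return digest(src) == digest(dst)
```

A file with flipped bits but an unchanged size and timestamp passes the first test and fails the second, which is exactly the case a post-failure backup run needs to catch.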

