Sysadmin

"But I'm trying, Ringo..."


In my many years of interviewing sysadmin candidates, the most important qualification, and the hardest to explain in terms that make sense to HR, has been “do they think right?”. The core of it, I think, is the attitude with which they approach diagnosing and solving problems.

Behavioral interviewing techniques can produce some useful information about problems they’ve dealt with in the past, but not so much on how they really got from X to Y. HR gets very nervous if you do anything that even looks like a direct test of an applicant’s abilities, so the best approach I’ve found is to swap problem-solving stories and pay close attention not only to the ones they choose to tell, but to the things they say about mine.

I don’t care what, if any, degree you’ve got. I don’t care who you’ve worked for, who you know, what certification programs you’ve completed, or how precisely your last job matched our current requirements. If you think right, and you’re not a complete idiot, I can work with you.

The two people who’ve been hired to replace the five of us are not complete idiots. One of them shows signs of thinking right.

"Configured Maximum Connection Time Exceeded"


I was working from home yesterday, and connecting to the office via VPN. In the past, this hasn’t been a big deal. This time, just as I was getting set up to start Something Important, the connection went down. Deliberately.

Secure VPN Connection terminated by Peer.
Reason 430: Configured Maximum Connection Time Exceeded.

Connection terminated on Sep 28, 2006 18:28:40   Duration: 0 day(s), 08:00.12

Good thing they bought a new VPN server to replace this one. Oh, wait; the new one’s still in beta, has been for months, and recently stopped working. Feh.

Pedestrian On Pavement


I got a ticket yesterday. More precisely, I got a fake ticket yesterday, because it was the only way for the cops to get the crazy angry person to shut up and go away.

I had a little work project that was kind of important. Namely, I needed to get over half a million dollars' worth of servers packed up and loaded onto a truck (the same ones that were supposed to be shipped out on Friday). To do that, we needed to park the truck. Unfortunately, just as we were pulling into the commercial loading zone that we'd spent twenty minutes patiently waiting for, some clown in an SUV whipped around the truck and started backing into the space.

I stepped out into the street and waved him off. He kept coming, until his bumper was about three inches from my body. Then he jumped out and furiously accused me of trying to steal his parking space, shouted at everyone within reach (including a completely unrelated moving company that was working across the street), and then ran off claiming he was going to find a cop to take care of me, leaving his car blocking both the parking spot and the street.

We found a cop first. When he returned with his dry-cleaning (he later claimed he really was making a commercial delivery, but that box never left the back of his SUV, and the cop saw him picking up the suit…), she was already writing up his ticket, and informing him that he was two minutes away from being towed.

He shouted at her. He shouted at us. He shouted at her sergeant, when he showed up. He harangued the bums on the sidewalk, telling them what horrible people and criminals we were. He tried to get the cop to give my truck driver a ticket for blocking the road. He tried to get the cop to give me a ticket for illegally attempting to reserve a parking space.

He got several tickets, which he’ll have to pay for. To shut him up, they wrote out a phony ticket for me, which will be dismissed when the cop deliberately fails to appear in court (her exact words: “this is bullshit, don’t pay it”). He tried to get my name so he could go after me personally, and the cop patiently explained that he had no right to that information.

And to think that this was actually better than my day Friday, which involved the world’s most carelessly ambitious contract Unix sysadmin trying to get me to let him work unsupervised as root on a production server that I’m responsible for (“Hi, Mark!”).

How not to move servers


Tip for the day: when you’ve arranged for a professional computer moving company to relocate 30 critical servers from one state to another, and the driver shows up alone in a bare panel truck, without even a blanket to keep the machines from bouncing around on their way to the warehouse, do not let him take them.

The driver was as surprised as I was, perhaps more so. He thought we had a shrink-wrapped pallet of boxes that could be popped onto the truck and dropped off after he made a few more stops. The dispatcher tried to talk him into loading the stuff loose. The dispatcher tried to talk me into letting the driver load the stuff loose, swearing that it would be fine for the short trip to the warehouse.

Things went downhill from there.

They still don't get it...


[last update: the root cause of the Linux loopback device problem described below turns out to be simple: there’s no locking in the code that selects a free loop device. So it doesn’t matter whether you use mount or losetup, and it doesn’t matter how many loop devices you configure; if you try to allocate two at once, one of them will likely fail.]

Panel discussions at LinuxWorld (emphasis mine):

"We need to make compromises to do full multimedia capabilities like running on iPod so that non-technical users don’t dismiss us out of hand."

"We need to pay a lot more attention to the emerging markets; there’s an awful lot happening there."

But to truly popularize Linux, proponents will have to help push word of the operating system to users, panelists said.

... at least one proponent felt the Linux desktop movement needed more evangelism.

Jon “Maddog” Hall, executive director of Linux International, said each LinuxWorld attendee should make it a point to get at least two Windows users to the conference next year...

I’m sorry, but this is all bullshit. These guys are popping stiffies over an alleged opportunity to unseat Windows because of the delays in Vista, and not one of them seems to be interested in sitting down and making Linux work.

Not work if you have a friend help you install it, not work until the next release, not work with three applications and six games, not work because you can fix it yourself, not work if you can find the driver you need and it’s mostly stable, not work if you download the optional packages that allow it to play MP3s and DVDs, and definitely not work if you don’t need documentation. Just work.

[disclaimer: I get paid to run a farm of servers running a mix of RedHat 7.3 and Fedora Core 2/4/5. The machine hosting this blog runs on OpenBSD, but I’m toying with the idea of installing a minimal Ubuntu and a copy of VMware Server to virtualize the different domains I host. The only reason the base OS will be Linux is because that’s what VMware runs on. But that’s servers; my desktop is a Mac.]

Despite all the ways that Windows sucks, it works. Despite all the ways that Linux has improved over the years, and despite the very real ways that it’s better than Windows, it often doesn’t. Because, at the end of the day, somebody gets paid to make Windows work. Paid to write documentation. Paid to fill a room with random crappy hardware and spend thousands of hours installing, upgrading, using, breaking, and repairing Windows installations.

Open Source is the land of low-hanging fruit. Thousands of people are eager to do the easy stuff, for free or for fun. Very few are willing to write real documentation. Very few are willing to sit in a room and follow someone else’s documentation step-by-step, again and again, making sure that it’s clear, correct, and complete. Very few are interested in, or good at, ongoing maintenance. Or debugging thorny problems.

For instance, did you know that loopback mounts aren’t reliable? We have an automated process that creates EXT2 file system images, loopback-mounts them, fills them with data, and unmounts them. This happens approximately 24 times per day on each of 20 build machines, five days a week, every week. About twice a month it fails, with the following error: “ioctl: LOOP_SET_FD: Device or resource busy”.
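
Boiled down, the job does something like this (paths and sizes invented for illustration; the real script has more error handling):

dd if=/dev/zero of=build.img bs=1M count=256
mke2fs -F -q build.img
mount -o loop build.img /mnt/image    # the step that occasionally fails
cp -a build_output/. /mnt/image/
umount /mnt/image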

Want to know why? Because mount -o loop is an unsupported method of setting up loop devices. It’s the only one you’ll ever see anyone use in their documentation, books, and shell scripts, but it doesn’t actually work. You’re supposed to do this:

LOOP=`losetup -f`          # ask for the first unused loop device
losetup $LOOP myimage      # bind it to the image file yourself
mount -t ext2 $LOOP /mnt
...
umount /mnt
losetup -d $LOOP           # and explicitly release it when you're done

If you’re foolish enough to follow the documentation, eventually you’ll simply run out of free loop devices, no matter how many you have. When that happens, the mount point you tried to use will never work again with a loopback mount; you have to delete the directory and recreate it. Or reboot. Or sacrifice a chicken to the kernel gods.

Why support the mount interface if it isn’t reliable? Why not get rid of it, fix it, or at least document the problems somewhere other than, well, here?

[update: the root of our problem with letting the Linux mount command auto-allocate loopback devices may be that the umount command isn’t reliably freeing them without the -d option; it usually does so, but may be failing under load. I can’t test that right now, with everything covered in bubble-wrap in another state, but it’s worth a shot.]

[update: no, the -d option has nothing to do with it; I knocked together a quick test script, ran it in parallel N-1 times (where N was the total number of available loop devices), and about one run in three, I got the dreaded “ioctl: LOOP_SET_FD: Device or resource busy” error on the mount, even if losetup -a showed plenty of free loop devices.]
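
If you want to reproduce it, something along these lines will do; this is a reconstruction rather than the original script, and the paths are made up:

#!/bin/sh
# looptest.sh: grab a loop device via "mount -o loop", hold it briefly, release it
N=$1                                   # instance number, so parallel runs don't collide
dd if=/dev/zero of=/tmp/img.$N bs=1k count=1024 2>/dev/null
mke2fs -F -q /tmp/img.$N
mkdir -p /tmp/mnt.$N
mount -o loop /tmp/img.$N /tmp/mnt.$N || echo "instance $N: mount failed"
sleep 5
umount /tmp/mnt.$N 2>/dev/null
rmdir /tmp/mnt.$N
rm -f /tmp/img.$N

Run a handful of copies in parallel as root (for i in 1 2 3 4 5; do sh looptest.sh $i & done) and sooner or later one of the mounts dies with the error above, even with loop devices to spare.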

The Time Machine paradox


[disclaimer: developers who didn’t attend WWDC don’t have copies of the Leopard beta yet, and when they do send me one, I won’t be able to discuss how things actually work, so this is based solely on what Apple has stated on their web site]

When I heard the initial description of Apple’s upcoming Time Machine feature, it sounded like it was similar to NetApp Filer snapshots, or possibly the Windows volume shadow copy feature that’s been announced for Vista (and is already in use in Server 2003). The answers, respectively, are “not really” and “yes and no”.

Quoting:

The first time you attach an external drive to a Mac running Mac OS X Leopard, Time Machine asks if you’d like to back up to that drive.

Right from the start, Time Machine in Mac OS X Leopard makes a complete backup of all the files on your system.

As you make changes, Time Machine only backs up what changes, all the while maintaining a comprehensive layout of your system. That way, Time Machine minimizes the space required on your backup device.

Time Machine will back up every night at midnight, unless you select a different time from this menu.

With Time Machine, you can restore your whole system from any past backups and peruse the past with ease.

The key thing that they all have in common is the creation of copy-on-write snapshots of the data on a volume, at a set schedule. The key feature of NetApp’s version that isn’t available in the other two is that the backup is transparently stored on the same media as the original data. Volume Shadow Copy and Time Machine both require a separate volume to store the full copy and subsequent snapshots, and it must be at least as large as the original (preferably much larger).

NetApp snapshots and VSC have more versatile scheduling; for instance, NetApps have the concept of hourly, daily, and weekly snapshot pools that are managed separately, and both can create snapshots on demand that are managed manually. TM only supports daily snapshots, and they haven’t shown a management interface (“yet”, of course; this is all early-beta stuff).

VSC in its current incarnation is very enterprise-oriented, and it looks like the UI for gaining access to your backups is “less than user-friendly”. I’ve never seen a GUI method of accessing NetApp snapshots, and the direct method is not something I’d like to explain to a typical Windows or Mac user. TM, by contrast, is all about the UI, and may actually succeed in getting the point across about what it does. At the very least, when the family tech-support person gets a call about restoring a deleted file, there’s a chance that he can explain it over the phone.

One thing VSC is designed to do that TM might also do is allow valid backups of databases that never close their files. Apple is providing a TM API, but that may just be for presenting the data in context, not for directly hooking into the system to ensure correct backups.

What does this mean for Mac users? Buy a backup disk that’s much larger than your boot disk, and either explicitly exclude scratch areas from being backed up, or store them on Yet Another External Drive. What does it mean for laptop users? Dunno; besides the obvious need to plug it into something to make backups, they haven’t mentioned an “on-demand” snapshot mechanism, simply the ability to change the time of day when the backup runs. Will you be able to say “whenever I plug in drive X” or “whenever I connect to the corporate network”? I hope so. What does it mean for people who have more than one volume to back up? Not a clue.

Now for the fun. Brian complained that Time Machine is missing something, namely the ability to go into the future, and retrieve copies of files you haven’t made yet. Well, the UI might not make it explicit, but you will be able to do something just as cool: create alternate timelines.

Let’s say that on day X, I created a file named Grandfather, on X+20 a file named Father, and on X+40 a file named Me. On X+55, I delete Grandfather and Father. On X+65, I find myself missing Grandfather and bring him back. On X+70, I find myself longing for a simpler time before Father, and restore my entire system to the state it was in on X+19. There is no Father, there is no Me, only Grandfather. On X+80, I find myself missing Me, and reach back to X+69 to retrieve it.

We’re now living in Grandfather’s time (X+29, effectively) with no trace of Father anywhere on the system. Just Me.

Now for the terror: what happens if you set your clock back?

"Why is that server ticking?"


[this is the full story behind my previous entry on the trouble with tar pipelines]

One morning, I logged in from home to check the status of our automated nightly builds. The earliest builds had worked, but most of the builds that started after 4am had failed, in a disturbing way: the Perforce (source control) server was reporting data corruption in several files needed for the build.

This was server-side corruption, but with no obvious cause. Nothing in the Perforce logs, nothing in dmesg output, nothing in the RAID status output. Nothing. Since the problem started at 4am, I used find to see what had changed, and found that 66 of the 600,000+ versioned data files managed by our Perforce server had changed between 4:01am and 4:11am, and the list included the files our nightly builds had failed on. There were no checkins in this period, so there should have been no changed files at all.
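
For the record, bracketing a time window like that with find is easy enough with a pair of marker files (the timestamp below is a placeholder, not the actual date):

touch -t 200609280401 /tmp/mark.0401
touch -t 200609280411 /tmp/mark.0411
cd ~perforce/really_important_source_code
find . -type f -newer /tmp/mark.0401 ! -newer /tmp/mark.0411 -print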

A quick look at the contents of the files revealed the problem: they were all truncated. Not to 0 bytes, but to some random multiple of 512 bytes. None of them contained any garbage, they just ended early. A 24-hour-old backup confirmed what they should have looked like, but I couldn’t just restore from it; all of those files had changed the day before, and Perforce uses RCS-style diffs to store versions.

[side note: my runs-every-hour backup was useless, because it kicked off at 4:10am, and cheerfully picked up the truncated files; I have since added a separate runs-every-three-hours backup to the system]

I was stumped. If it was a server, file system, RAID, disk, or controller error, I’d expect to see some garbage somewhere in a file, and truncation at some other size, perhaps 1K or 4K blocks. Then one of the other guys in our group noticed that those 66 files, and only those 66 files, were now owned by root.

Hmm, is there a root cron job that kicks off at 4am? Why, yes, there is! And it’s… a backup of the Perforce data! Several years ago, someone wrote a script that does an incremental backup of the versioned data to another server mounted via NFS. My hourly backups use rsync, but this one uses tar.

Badly:

cd ~perforce/really_important_source_code
find . -mtime -1 -print > $INTFILES
tar cpf - -T $INTFILES | (cd /mountpoint/subdir; tar xpf -)

Guess what happens when you can't cd to /mountpoint/subdir, for any reason: the cd fails, but the subshell's tar runs anyway, in the directory you just came from. The backup extracts right on top of the files the first tar is still reading, which is how you end up with root-owned files cut off at 512-byte tar-block boundaries.

Useful information for getting yourself out of this mess: Perforce proxy servers store their cache in the exact same format as the main server, and even if they don’t contain every version, as long as someone has requested the current tip-of-tree revision through that proxy, the diffs will all match. Also, binary files are stored as separate numbered files compressed with the default gzip options, so you can have the user who checked it in send you a fresh copy. Working carefully, you can quickly (read: “less than a day”) get all of the data to match the MD5 checksums that Perforce maintains.
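
Perforce will also tell you when you've finally got it right: p4 verify recomputes the checksum of every archive file and compares it to the stored digest, and with -q it only reports the ones that don't match (it needs to run as a Perforce superuser):

p4 verify -q //...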

And then you replace that backup script with a new one…
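
The new one doesn't have to be clever; the important parts are refusing to run when the destination isn't really there, and letting tar do its own cd so a failure aborts the copy instead of landing in the source tree. A sketch, not the exact script that went in (rsync, which the hourly backups already use, would arguably be better):

#!/bin/sh
# safer incremental copy of the versioned files to the NFS-mounted backup host
SRC=~perforce/really_important_source_code
DST=/mountpoint/subdir
INTFILES=/tmp/p4backup.filelist.$$

[ -d "$DST" ] || { echo "backup target $DST is missing, aborting" >&2; exit 1; }
cd "$SRC" || exit 1
find . -mtime -1 -print > "$INTFILES"
# -C makes the extracting tar chdir itself and fail loudly if it can't,
# instead of silently unpacking into the current directory
tar cpf - -T "$INTFILES" | tar xpf - -C "$DST"
rm -f "$INTFILES"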

Bad network! No donut!


The good news is that the folks at via.net rebooted something and got the ping times down from 500ms to 80ms. The bad news is that I’m still seeing 30% packet loss to every machine in their data center, so there’s more work to be done.

[update: ah, 17ms, no packet loss. much better]

[update: apparently there was a DDOS attack on one of the other servers they host.]
