Friday, January 17 2003

Spindles and Platters and Heads, oh my!

There’s a story I love to tell, a cautionary tale about an incompetent manager, his ass-covering sysadmins, and the company that they could have destroyed together. At some point I’ll write it up here, but the short version goes like this: “two-thirds of the file servers hadn’t been backed up in six months, and they knew this.”

(Continued on Page 36)

Sunday, July 6 2003

How not to write a job description…

…1995 Edition. This is what happens when your senior sysadmin leaves, and there’s no one left with even a tiny grasp of what the job involves. It happened to OSU-CIS; don’t let it happen to you!

(Continued on Page 16)

Friday, July 18 2003

Backups? What backups?

A.J. was worried. For several months, he’d been growing more and more concerned about the reliability of the Unix server backup system that he operated every day. He was just the latest in a long string of junior contractors paid to change tapes, but he actually cared about doing a good job, and something wasn’t right.

He had raised his concerns with the manager of Core Services and the Senior System Administrators who were responsible for the corporate infrastructure, but they assured him that any problems were only temporary, and that he should wait until they had the new system in place. A.J. resigned himself to pretending to do his job, and grudgingly agreed to stall for more time whenever a restore was requested that he couldn’t accomplish.

And then the system just stopped working.

(Continued on Page 142)

Sometimes it’s not the network

My job was Unix support for Corporate Services, which basically referred to everything in the company that wasn’t related to developing, selling, or training customers how to use our products. In practice, though, it usually just meant MIS, because HR and Legal were composed entirely of Mac people, who had their own support team.

The oddest exception started one day when an HR manager asked me to help him set up a beta-test of a Lotus Notes-based applicant tracking system. The application was being developed on OS/2 servers and PC clients, but we wanted to test it with a SunOS server and Mac clients, since that’s what we had.

(Continued on Page 145)

Unrepentant sinner

Undergrads love free Internet porn. This is not news. Undergrads will go to great lengths to hide their porn collections from the sysadmins. This also is not news. Sometimes they outsmart themselves. This is just plain fun.

(Continued on Page 147)

Monday, July 21 2003

Sanitizing Apache with PF

About 45 minutes elapsed between the moment that I first turned this server on and the arrival of the first virus/worm/hacker probes. It was obvious that most of them were looking for Windows-based web servers, so they were harmless to me.

Still, I like to review the logs occasionally, and the sheer volume of this crap was getting annoying. Later, when I raised munitions.com from the dead, I discovered that it was getting more than 30,000 hits a day for a file containing the word “ok”. Worst of all, as I prepare to restore my photo archives, I know that I can’t afford to pay for the bandwidth while they’re slurped up by every search engine, cache site, obsessive collector, Usenet reposter, and eBay scammer on the planet.

Enter PF, the OpenBSD packet filter.

(Continued on Page 149)

Thursday, July 31 2003

Tales From The Help Desk

E-mail exchange between user and sysadmin at OSU-CIS, long ago and far away…

User: I was wondering how to send mail to someone on the VAX systems.

Sysadmin: Which ones?

U: It’s the VAX 386 systems. I know the three unique letters to identify this person. Thanks.

S: That doesn’t help. Perhaps I should instead have asked whose VAX systems.

U: It is the VAX at BF Goodrich in Avon Lake, OH. Hope this helps.

S: (crycrycry)

Why I Love Users, a parable of hacking

(based on a true story from my OSU-CIS days…)

User A notices that the department has installed a new sprinkler system. He immediately proceeds to find out everything about how it works, what it can do, and how reliable it is. People are astonished at how much he knows about it, and he basks in the warm glow of praise. One day, he uncovers a serious implementation flaw that no one knows about, and makes veiled references to it for several months, never to the people who are in a position to fix it. Finally, he decides to show people how bad the system is, and sets fire to the building. He’s careful to make sure that no one gets hurt, and that the damage is minor. When the fire-fighters approach him with blood in their eyes and axes in their hands, he smiles quietly and says, “I told you so; you should have listened.”

This being just a story, I feel compelled to permit the fire-fighters to hack the little toad to pieces, shouting “LIKE HELL YOU DID!”

The moral of this story is a variation on the Golden Rule:

“Do unto others as you would have them do unto you, because they can do unto you a lot harder.”

Saturday, August 9 2003

The Perl Script From Hell

I’ve been working with Perl since about two weeks before version 2.0 was released. Over those fifteen years, I’ve seen a lot of hairy Perl scripts, many of them mine.

None of them can compare to the monster that lurks in the depths of our service, though. Over 8,000 lines of Perl plus an 8,000-line C++ module, written in a style that’s allegedly Object Oriented, but which I would describe as Obscenely Obfuscated (“Hi, Andrew!”).

We have five large servers devoted to running it. Each contributes three CPUs, three gigabytes of memory, and 25 hours of runtime to the task (independently; we need the redundancy if one of them crashes). Five years ago, I swore a mighty oath to never, ever get involved with the damned thing.

Then it broke. In a way that involved tens of thousands of unhappy customers.

(Continued on Page 1511)

Wednesday, August 13 2003

A perfectly reasonable panic

Once every three months, we sent the whole company home while we tore the computer room apart and did all sorts of maintenance work. During my first quarterly downtime, the top item on my list was installing a new BOSS controller into the Solbourne that was our primary Oracle database server. Like any good database, it needed an occasional disk infusion to keep it happy, and there was no room on the existing SCSI controllers.

So I had a disk tray, a bunch of shiny new disks, a controller card, and media to upgrade the OS with. The BOSS was only supported in the latest version, and this being the server that kept the books, it was upgraded only when necessary.

(Continued on Page 1518)

Tuesday, August 26 2003

“If I could get in there…”

While I was watching yet another Slashdot thread dissolve into a poor imitation of a Usenet flame-war, the smug arrogance of people who think that running Linux means they’re smarter than Windows users reminded me of something that happened when I was at Synopsys.

A widely-used Unix server had crashed, and the engineers were hanging out near the data center, waiting for us to bring it back up.

“What’s taking them so long? We’ve got work to do! Dammit, if I could get in there, I’d fix it myself!”
“I’m pretty sure that’s why you can’t get in there.”

Friday, September 5 2003

I love this kind of bug…

People often wonder what sysadmins do for a living. It’s a mostly-invisible profession, where you’re only noticed when things aren’t working. Mostly we solve problems, but often we first have to figure out what the problem really is.

I don’t want to know how long it took someone to get from “my password doesn’t work” to this:

If you used Open Firmware Password utility to create a password that contains the capital letter “U”, your password will not be recognized during the startup process (when you try to access Startup Manager, for example).

Note that it applies to Mac models going back several years, but wasn’t posted on the support site until this week. No doubt there’s a small pile of bug reports that have been sitting around for all this time, with their status field set to “WTF?”.

Tuesday, September 16 2003

Today is a good day

New PowerBooks are out. Must wet pants with joy. They all look good, but I’m leaning slightly toward a 15” model with an 80GB disk and 1GB of RAM; not sure I’m ready for a 17” boat anchor.

Yesterday, on the other hand, was definitely not a good day. For some time now, I’ve been installing Panther betas on my iBook with the Archive & Install option, which preserves almost all of my applications and customizations while completely replacing the OS. I’ve always backed up my home directory first, but haven’t bothered with an extra full backup. Cuts the total upgrade time down to about an hour, most of which is spent watching the disks spin.

On another day, I’d consider including a comparison to my last Windows upgrade horror story. Unfortunately, things went terribly wrong this time. Twelve hours later, my iBook is almost back to normal.

(Continued on Page 1585)

Saturday, March 6 2004

Latchkey Zombies in Solaris

A funny thing happened when we upgraded our servers from Solaris 2.5.1 several years ago: when we killed a process, frequently its parent wouldn’t notice. This was annoying, since a lot of our Operations processes were built around killing and restarting services so they’d notice changes in a controlled fashion.

(Continued on Page 1807)

Friday, April 16 2004

It smells like… victory

Last July, I knocked together a small Perl script to monitor my Apache logs for virus probes, rude robots, and other annoyances, and automatically add their IP addresses to my firewall’s block list.
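The script itself isn’t shown here. As a rough illustration of the idea only, in shell rather than Perl: scan the access log for requests that only worms and rude robots make, and feed the client addresses to a PF table. The probe patterns, the function names, and the table name "badhosts" are all assumptions made up for this sketch, and it assumes pf.conf already declares that table and blocks traffic from it.

```shell
# Hypothetical sketch, not the original script.
# Requests matching this pattern are treated as probes (illustrative list).
PROBE_PATTERN='cmd\.exe|root\.exe|default\.ida|_vti_bin'

# Print each offending client address once.
# Assumes common/combined log format, where the client IP is the first field.
find_probers() {
    grep -E "$PROBE_PATTERN" "$1" | awk '{ print $1 }' | sort -u
}

# Add each address to a PF table (needs root). Assumes pf.conf contains
# something like: table <badhosts> persist
#                 block in quick from <badhosts>
block_probers() {
    find_probers "$1" | while read -r ip; do
        pfctl -t badhosts -T add "$ip"
    done
}
```

Something like `block_probers /var/www/logs/access_log` from cron would approximate the behavior described, though a real monitor would follow the log continuously and expire old entries.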

Today I spotted a very unusual entry at the bottom of my referrer report. I was morbidly curious what someone at a commercial web site devoted to she-males would be linking to, but it turns out the answer is “nothing”. Someone in China was running a robot that pretended to be a Windows 98 box while recursively downloading my site, no doubt to encourage My Loyal Readers (all six of them) to visit this fascinating site.

Unfortunately for my hopeful new friend, his robot tripped my log monitor and triggered a block, preventing him from getting more than a few hits. Even more unfortunately, I don’t display recent referrers anywhere on this site, so I’m the only person who knows what site he’s being paid to direct traffic to.

And I’m not going to tell. But it’s registered to someone named Dmitri Kukushkin in Delaware, who owns at least one other fetish domain.

Thursday, January 20 2005

Connect:Direct for Dummies

I’ve been roped into supporting a project that requires the use of Connect:Direct to transfer data to an external partner. This product is vastly overcomplicated for the use we’re putting it to, and the documentation feels like it was written as an ad for the vendor’s training courses.

I have no interest in becoming an expert Connect:Direct administrator. I want to do two things: configure the Unix command-line client to connect to our partner’s server, so that we can send a file to them, and configure the Unix server so that the partner can connect to us and send the processed data back.

This is turning out to be surprisingly difficult to do. A lot of it is the documentation, but a disturbing percentage of the problem is the near-total lack of information available from our partner. You’d think that a large company that required their customers to purchase and set up a specific software package (that they had no other use for) would supply a one-page cheat-sheet, but these folks haven’t even managed to cough up the userid and password we’re supposed to connect with. For more fun, they say there’s a guy in their security department who knows all about Connect:Direct, but he’s not allowed to talk to external customers.

So, anybody have a friend who knows something about this stuff? Bonus points if you can guess the name of the company we’re trying to connect to. :-)

Update: After giving up on their documentation and our partner’s knowledge pool, I opened a support case with the vendor. Their tech support called my office at 6am this morning, not realizing what time zone I was in. Fortunately, my office phone forwards to my cell, and I was actually awake at the time. Five minutes later, I not only had the original error message deciphered (XSMG242I, which can mean any of “bad permissions”, “config-file syntax error”, “missing remote record for local user”, and others), but had an understanding of their security and connection models that could not be obtained from their documentation. Thank you, Moniram. When our contact at the partner woke up a few hours later, we were able to successfully test file transfers in both directions.

I mentioned my intention to clean up my notes into a “Connect:Direct for Dummies” guide that could be used to rebuild our servers if I were unavailable, and our partner has expressed an interest in acquiring a copy. They’d like to help other customers cut down the setup time from weeks to minutes…

Friday, February 18 2005

I always knew they were real

Everywhere I’ve worked, people believe in them. They’re the ones who clear jams, change toner cartridges, reload the paper trays, and clean up the messy pile of abandoned printouts, and finally they’ve been captured on film. I give you…

(Continued on Page 2247)

Saturday, July 9 2005

Minolta Maxxum 7D glitch

[Update 7/23/05: okay, the rule of thumb seems to be, “if you can’t handhold a 50mm f/1.4 at ISO 100-400 and get the shot, spot-meter off a gray card and check the histogram before trusting the exposure meter”. This suggests some peculiarities in the low-light metering algorithm, which is supported by the fact that flash exposures are always dead-on, even in extremely dim light.]

[Update 7/22/05: after fiddling around with assorted settings, resetting the camera, and testing various lenses with a gray card, the camera’s behavior has changed. Now all the lenses are consistently underexposing by 2/3 of a stop. This is progress of a sort, since I can freely swap lenses and get excellent exposures… as long as I set +2/3 exposure compensation. I think my next step is going to be reapplying the firmware update. Sigh.]

The only flaw I’ve noticed in my 7D was what looked at first like a random failure in the white-balancing system. Sometimes, as I shot pictures around the house, the colors just came out wrong, and no adjustment seemed to fix it in-camera.

Tonight, I started seeing it consistently. I took a series of test shots (starting with the sake bottle, moving on to the stack of Pocky boxes…) at various white balance settings, loaded them into Photoshop, and tried to figure out what was going on. Somewhere in there, I hit the Auto Levels function, and suddenly realized that the damn thing was simply underexposing by 2/3 to 1 full stop.

Minolta has always been ahead of the curve at ambient-light exposure metering, which is probably why I didn’t think of that first. It just seemed more reasonable to blame a digital-specific feature than one that they’ve been refining for so many years.

With that figured out, I started writing up a bug report, going back over every step to provide a precise repeat-by. Firmware revision, lens, camera settings, test conditions, etc. I dug out my Maxxum 9 and Maxxum 7 and mounted the same lens, added a gray card to the scene, and even pulled out my Flash Meter V to record the guaranteed-correct exposure. All Minolta gear, all known to produce correct exposures.

Turns out it’s the lens. More precisely, my two variable-aperture zoom lenses exhibited the problem (24-105/3.5-4.5 D, 100-400/4.5-6.7 APO). The fixed focal-length lenses (50/1.4, 85/1.4, 200/2.8) and fixed-aperture “pro” zoom lenses (28-70/2.8, 80-200/2.8) worked just fine with the 7D, on the exact same scene. Manually selecting the correct exposure with the variable-aperture zooms worked as well.

These are the sorts of details that make a customer service request useful to tech support. I know I’m always happier when I get them.

Wednesday, July 27 2005

Adobe Version Cue 2: here we go again…

Apparently the folks at Adobe haven’t learned anything about computer security since I looked at the first release of Version Cue. After I installed the CS2 suite last night, I was annoyed at what I found.

Listens on all network interfaces by default? Check. Exposes configuration information on its web-administration page? Check. Defaults to trivial password on the web-admin page? Check. Actually prints the trivial default password on the web-admin page? Check. Defaults to sharing your documents with anyone who can connect to your machine? Check. I could go on, but it’s too depressing.

The only nice thing I can say about it is that it doesn’t add a new rule to the built-in Mac OS X firewall to open up the ports it uses. As a result, most people will be protected from this default stupidity.

Thursday, November 3 2005

My need for fluff and fan-service

On Tuesday, a server we rely on that’s located in another state, under someone else’s control, went *poof*. They have another machine we can upload to, though, so I changed all references to point to it.

All the ones I knew about, that is. A little-used script in a particular branch of our software had a hardcoded reference to the dead host, which it used to download previous uploads to produce a small delta release. The result, of course, was a failure Wednesday that left the QA group twiddling their thumbs until I could fix things. In the end, other failures turned up that prevented them from getting the delta release, but they could live with a full release, and that’s what they got.

That was my day from about 7am to 2pm, not counting the repeated interruptions as I explained to people that the backup server we were uploading to had about half the bandwidth of the usual connection, so data was arriving more slowly.

Things proceeded normally for a few hours, until the next fire at 4:30pm. A server responsible for half a dozen test builds and two release builds had a sudden attack of amnesia, forgetting that a 200GB RAID volume was supposed to be full of data. A disk swap brought it back to life as a working empty volume, but by that time I’d moved all the builds to other machines. I’ll test it today before putting it back in service.

Just as I was finishing up with that mess and verifying that the builds would work in their new homes, our primary internal DNS/NIS server went down. The poor soul who’d just finished rebuilding my RAID volume had barely gotten back to his desk when he had to walk three blocks back to the data center. Once that machine was healthy again, I cleaned up some lock files so that test builds would resume, and waited for the email telling me what was supposed to be on the custom production CD-ROM they’re shipping overseas today.

That, of course, was IT’s cue to take down the mail server for maintenance. Planned and announced, of course, but also open-ended, so I had no idea when it would be back. Didn’t matter, though, because then my DSL line went down. I’d never made it out of the house, you see, and was doing all of this remotely.

The email I was waiting for went out at 9:30pm, I got it at 10:45pm, and kicked the custom build off at 11pm. It finished building at 12:30am and started the imaging process, which makes a quick query to the Perforce server.

Guess what went down for backups at 12am, blocking all requests until it completed at around 3am? Nap time for J!

At 4:45am I woke up, checked the image, mailed off a signing request so it could actually be used to boot a production box, set the alarm clock for 6:45am, and went back to sleep.

This was not a day for deep, thought-provoking anime. It was a day for Grenadier disc 2 and Maburaho disc 4 (which arrived from Anime Corner Store just about the time the mail server went down). I considered getting started on DearS disc 2 and Girls Bravo disc 3, which also showed up, but decided instead to make a badly-needed grocery run.

Friday, March 24 2006

Server dog slow today

I’m getting consistent 190ms pings to my server, despite 10ms pings to the router it’s connected to. It’s not server load, it’s not the bandwidth throttling rules in my firewall config, and I’m not seeing any errors in netstat or dmesg output. My best guess right now is a duplex mismatch on the switch. I’m waiting to hear back from the network guys.

Update: supporting evidence for my switch theory: Scott’s machine in the same rack, recycledbits.org, has the same problem, and I get 360ms pings from mine to his, without ever touching a router.

On the “damn nuisance” front, however, email to ViaNet tech support comes back with one of those stupid challenge/response verification schemes. This is precisely the wrong approach for your primary tech-support contact method. Maybe if you’d actually answered the phone when I called, I wouldn’t mind so much, but come on, grab a clue, eh?

Update: oh, that’s much better.

Sunday, April 16 2006

Dear netpbm maintainers,

I hope I am not the first to point out just how pompous and wrong-headed the following statement is:

In Netpbm, we believe that man pages, and the Nroff/Troff formats, are obsolete; that HTML and web browsers and the world wide web long ago replaced them as the best way to deliver documentation. However, documentation is useless when people don’t know where it is. People are very accustomed to typing “man” to get information on a Unix program or library or file type, so in the standard Netpbm installation, we install a conventional man page for every command, library, and file type, but all it says is to use your web browser to look at the real documentation.

Translation: We maintain a suite of tools used by shell programmers, and we think that being able to read documentation offline or from the shell is stupid, so rather than maintain our documentation in a machine-readable format, we just wrote HTML and installed a bunch of “go fuck yourself” manpages.

On the bright side, they wrote their own replacement for the “man” command that uses Lynx to render their oh-so-spiffy documentation (assuming you’ve installed Lynx, of course), but they don’t even mention it in their fuck-you manpages. Oh, and the folks at darwinports didn’t know about this super-special tool, so they didn’t configure it in their netpbm install.

A-baka: “Hey, I know what we’ll do with our spare time! We can reinvent the wheel!”

B-baka: “Good idea, Dick! No one’s ever done that before, and everyone will praise us for its elegance and ideological purity, even though it’s incompatible with every other wheel-using device!”

A-baka: “We’re so cool!”

Update!: it keeps getting better. Many shell tools have some kind of help option that gives a brief usage summary. What do the Enlightened Beings responsible for netpbm put in theirs?

%  pnmcut --help
pnmcut: Use 'man pnmcut' for help.

Assholes.

Wednesday, April 26 2006

You’ve been a sysadmin too long when…

…while walking to the restroom in search of relief, you:

  1. spot a printer with a paper jam,
  2. fix the jam,
  3. wait to see if it’s fixed,
  4. clear the second jam,
  5. diagnose the problem,
  6. solve the problem,
  7. reload the paper tray,
  8. verify that it’s printing correctly.

Then you resume your trip to the restroom.

Thursday, June 15 2006

“This – is wrong tool. Never use this.”

Today, my life is devoted to cleaning up after an old automated daily backup script that included a line like this:

tar cpf - . | (cd /mount/subdir; tar xpf -)

Guess what happens when /mount/subdir doesn’t exist? “Hey, why are all these files truncated to some multiple of 512 bytes in size? And why are they now owned by root?”

Monday, June 26 2006

Bad network! No donut!

The good news is that the folks at via.net rebooted something and got the ping times down from 500ms to 80ms. The bad news is that I’m still seeing 30% packet loss to every machine in their data center, so there’s more work to be done.

[update: ah, 17ms, no packet loss. much better]

[update: apparently there was a DDoS attack on one of the other servers they host.]

Thursday, July 13 2006

“Why is that server ticking?”

[this is the full story behind my previous entry on the trouble with tar pipelines]

One morning, I logged in from home to check the status of our automated nightly builds. The earliest builds had worked, but most of the builds that started after 4am had failed, in a disturbing way: the Perforce (source control) server was reporting data corruption in several files needed for the build.

This was server-side corruption, but with no obvious cause. Nothing in the Perforce logs, nothing in dmesg output, nothing in the RAID status output. Nothing. Since the problem started at 4am, I used find to see what had changed, and found that 66 of the 600,000+ versioned data files managed by our Perforce server had changed between 4:01am and 4:11am, and the list included the files our nightly builds had failed on. There were no checkins in this period, so there should have been no changed files at all.
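The exact find invocation isn’t given. One portable way to ask "what changed in this window?" is to create two reference files with touch -t and compare against them with -newer. The helper name below is made up for illustration.

```shell
# Hypothetical helper: list regular files under a directory whose modification
# time falls in a window. Timestamps use touch -t format (CCYYMMDDhhmm).
changed_between() {
    dir=$1 start=$2 end=$3
    ref=$(mktemp -d) || return 1
    touch -t "$start" "$ref/start"
    touch -t "$end" "$ref/end"
    # -newer means strictly newer than the reference file, so this selects
    # files modified after "start" and at or before "end".
    find "$dir" -type f -newer "$ref/start" ! -newer "$ref/end" -print
    rm -rf "$ref"
}
```

Called as `changed_between ~perforce/really_important_source_code CCYYMMDD0401 CCYYMMDD0411`, with the real date filled in, this would produce the kind of list described above.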

A quick look at the contents of the files revealed the problem: they were all truncated. Not to 0 bytes, but to some random multiple of 512 bytes. None of them contained any garbage, they just ended early. A 24-hour-old backup confirmed what they should have looked like, but I couldn’t just restore from it; all of those files had changed the day before, and Perforce uses RCS-style diffs to store versions.

[side note: my runs-every-hour backup was useless, because it kicked off at 4:10am, and cheerfully picked up the truncated files; I have since added a separate runs-every-three-hours backup to the system]

I was stumped. If it was a server, file system, RAID, disk, or controller error, I’d expect to see some garbage somewhere in a file, and truncation at some other size, perhaps 1K or 4K blocks. Then one of the other guys in our group noticed that those 66 files, and only those 66 files, were now owned by root.
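Both symptoms can be checked from the shell. The filter below is a sketch under assumptions (the function name is made up, and it reads one path per line, so file names containing newlines would confuse it): it passes through only the non-empty files whose size is an exact multiple of 512 bytes.

```shell
# Hypothetical filter: read file names on stdin, print those whose size is a
# non-zero exact multiple of 512 bytes.
multiple_of_512() {
    while read -r f; do
        # tr strips the padding some wc implementations put around the count
        size=$(wc -c < "$f" | tr -d ' ')
        [ "$size" -gt 0 ] && [ $((size % 512)) -eq 0 ] && printf '%s\n' "$f"
    done
    return 0
}
```

Combined with an ownership test, as in `find . -type f -user root | multiple_of_512`, it narrows the search to files showing both symptoms. A healthy file can land on a 512-byte boundary by coincidence, so the output is a list of suspects and not proof.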

Hmm, is there a root cron job that kicks off at 4am? Why, yes, there is! And it’s… a backup of the Perforce data! Several years ago, someone wrote a script that does an incremental backup of the versioned data to another server mounted via NFS. My hourly backups use rsync, but this one uses tar.

Badly:

cd ~perforce/really_important_source_code
find . -mtime -1 -print > $INTFILES
tar cpf - -T $INTFILES | (cd /mountpoint/subdir; tar xpf -)

Guess what happens when you can’t cd to /mountpoint/subdir, for any reason…

Useful information for getting yourself out of this mess: Perforce proxy servers store their cache in the exact same format as the main server, and even if they don’t contain every version, as long as someone has requested the current tip-of-tree revision through that proxy, the diffs will all match. Also, binary files are stored as separate numbered files compressed with the default gzip options, so you can have the user who checked it in send you a fresh copy. Working carefully, you can quickly (read: “less than a day”) get all of the data to match the MD5 checksums that Perforce maintains.

And then you replace that backup script with a new one…
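The replacement script isn’t shown, so what follows is only a sketch of the minimum fix. The flaw in the original is the semicolon: if the cd fails, the subshell carries on, and the second tar unpacks into whatever the current directory already is, which here is the source tree being read. The function name and arguments below are made up for illustration.

```shell
# Hypothetical replacement for the copy step. The point is only that tar
# never runs unless the cd in front of it succeeded.
copy_tree() {
    src=$1 dest=$2
    # Refuse to start if the destination is missing (NFS mount down, typo...).
    [ -d "$dest" ] || { echo "copy_tree: $dest is not a directory" >&2; return 1; }
    # "&&" instead of ";": if either cd fails, the tar after it is skipped.
    (cd "$src" && tar cpf - .) | (cd "$dest" && tar xpf -)
}
```

The function returns non-zero when the destination is unusable, so a cron wrapper can complain instead of quietly rewriting its own source files. It copies the whole tree; the incremental file list from the original would still need to be passed to the first tar.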

Wednesday, August 9 2006

The Time Machine paradox

[disclaimer: developers who didn’t attend WWDC don’t have copies of the Leopard beta yet, and when they do send me one, I won’t be able to discuss how things actually work, so this is based solely on what Apple has stated on their web site]

When I heard the initial description of Apple’s upcoming Time Machine feature, it sounded like it was similar to NetApp Filer snapshots, or possibly the Windows volume shadow copy feature that’s been announced for Vista (and is already in use in Server 2003). The answers, respectively, are “not really” and “yes and no”.

Quoting:

The first time you attach an external drive to a Mac running Mac OS X Leopard, Time Machine asks if you’d like to back up to that drive.

Right from the start, Time Machine in Mac OS X Leopard makes a complete backup of all the files on your system.

As you make changes, Time Machine only backs up what changes, all the while maintaining a comprehensive layout of your system. That way, Time Machine minimizes the space required on your backup device.

Time Machine will back up every night at midnight, unless you select a different time from this menu.

With Time Machine, you can restore your whole system from any past backups and peruse the past with ease.

The key thing that they all have in common is the creation of copy-on-write snapshots of the data on a volume, on a set schedule. The key feature of NetApp’s version that isn’t available in the other two is that the backup is transparently stored on the same media as the original data. Volume Shadow Copy and Time Machine both require a separate volume to store the full copy and subsequent snapshots, and it must be at least as large as the original (preferably much larger).

NetApp snapshots and VSC have more versatile scheduling; for instance, NetApps have the concept of hourly, daily, and weekly snapshot pools that are managed separately, and both can create snapshots on demand that are managed manually. TM only supports daily snapshots, and they haven’t shown a management interface (“yet”, of course; this is all early-beta stuff).

VSC in its current incarnation is very enterprise-oriented, and it looks like the UI for gaining access to your backups is “less than user-friendly”. I’ve never seen a GUI method of accessing NetApp snapshots, and the direct method is not something I’d like to explain to a typical Windows or Mac user. TM, by contrast, is all about the UI, and may actually succeed in getting the point across about what it does. At the very least, when the family tech-support person gets a call about restoring a deleted file, there’s a chance that he can explain it over the phone.

One thing VSC is designed to do that TM might also do is allow valid backups of databases that never close their files. Apple is providing a TM API, but that may just be for presenting the data in context, not for directly hooking into the system to ensure correct backups.

What does this mean for Mac users? Buy a backup disk that’s much larger than your boot disk, and either explicitly exclude scratch areas from being backed up, or store them on Yet Another External Drive. What does it mean for laptop users? Dunno; besides the obvious need to plug it into something to make backups, they haven’t mentioned an “on-demand” snapshot mechanism, simply the ability to change the time of day when the backup runs. Will you be able to say “whenever I plug in drive X” or “whenever I connect to the corporate network”? I hope so. What does it mean for people who have more than one volume to back up? Not a clue.

Now for the fun. Brian complained that Time Machine is missing something, namely the ability to go into the future, and retrieve copies of files you haven’t made yet. Well, the UI might not make it explicit, but you will be able to do something just as cool: create alternate timelines.

Let’s say that on day X, I created a file named Grandfather, on X+20 a file named Father, and on X+40 a file named Me. On X+55, I delete Grandfather and Father. On X+65, I find myself missing Grandfather and bring him back. On X+70, I find myself longing for a simpler time before Father, and restore my entire system to the state it was in on X+19. There is no Father, there is no Me, only Grandfather. On X+80, I find myself missing Me, and reach back to X+69 to retrieve it.

We’re now living in Grandfather’s time (X+29, effectively) with no trace of Father anywhere on the system. Just Me.

Now for the terror: what happens if you set your clock back?

Thursday, August 17 2006

They still don’t get it…

[last update: the root cause of the Linux loopback device problem described below turns out to be simple: there’s no locking in the code that selects a free loop device. So it doesn’t matter whether you use mount or losetup, and it doesn’t matter how many loop devices you configure; if you try to allocate two at once, one of them will likely fail.]
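[If the kernel won’t serialize loop device allocation, callers can do it themselves. The wrapper below is a sketch under assumptions, not a tested fix for that bug: the lock path and function name are made up, it only helps if every caller on the machine goes through it, and a caller that dies while holding the lock leaves it behind.]

```shell
# Hypothetical wrapper: run a command while holding a simple lock, so only one
# caller at a time is picking a free loop device. mkdir is used because it
# either creates the directory or fails, atomically.
LOOP_LOCK=${LOOP_LOCK:-/tmp/loop-alloc.lock}

with_loop_lock() {
    # Spin until we are the one who created the lock directory.
    until mkdir "$LOOP_LOCK" 2>/dev/null; do
        sleep 1
    done
    "$@"
    status=$?
    rmdir "$LOOP_LOCK"
    return $status
}

# Example (needs root): with_loop_lock mount -o loop /path/to/image /mnt/point
```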

Panel discussions at LinuxWorld (emphasis mine):

“We need to make compromises to do full multimedia capabilities like running on iPod so that non-technical users don’t dismiss us out of hand.”

“We need to pay a lot more attention to the emerging markets; there’s an awful lot happening there.”

But to truly popularize Linux, proponents will have to help push word of the operating system to users, panelists said.

… at least one proponent felt the Linux desktop movement needed more evangelism.

Jon “Maddog” Hall, executive director of Linux International, said each LinuxWorld attendee should make it a point to get at least two Windows users to the conference next year

I’m sorry, but this is all bullshit. These guys are popping stiffies over an alleged opportunity to unseat Windows because of the delays in Vista, and not one of them seems to be interested in sitting down and making Linux work.

Not work if you have a friend help you install it, not work until the next release, not work with three applications and six games, not work because you can fix it yourself, not work if you can find the driver you need and it’s mostly stable, not work if you download the optional packages that allow it to play MP3s and DVDs, and definitely not work if you don’t need documentation. Just work.

[disclaimer: I get paid to run a farm of servers running a mix of RedHat 7.3 and Fedora Core 2/4/5. The machine hosting this blog runs on OpenBSD, but I’m toying with the idea of installing a minimal Ubuntu and a copy of VMware Server to virtualize the different domains I host. The only reason the base OS will be Linux is because that’s what VMware runs on. But that’s servers; my desktop is a Mac.]

Despite all the ways that Windows sucks, it works. Despite all the ways that Linux has improved over the years, and despite the very real ways that it’s better than Windows, it often doesn’t. Because, at the end of the day, somebody gets paid to make Windows work. Paid to write documentation. Paid to fill a room with random crappy hardware and spend thousands of hours installing, upgrading, using, breaking, and repairing Windows installations.

Open Source is the land of low-hanging fruit. Thousands of people are eager to do the easy stuff, for free or for fun. Very few are willing to write real documentation. Very few are willing to sit in a room and follow someone else’s documentation step-by-step, again and again, making sure that it’s clear, correct, and complete. Very few are interested in, or good at, ongoing maintenance. Or debugging thorny problems.

For instance, did you know that loopback mounts aren’t reliable? We have an automated process that creates EXT2 file system images, loopback-mounts them, fills them with data, and unmounts them. This happens approximately 24 times per day on each of 20 build machines, five days a week, every week. About twice a month it fails, with the following error: “ioctl: LOOP_SET_FD: Device or resource busy”.

Want to know why? Because mount -o loop is an unsupported method of setting up loop devices. It’s the only one you’ll ever see anyone use in their documentation, books, and shell scripts, but it doesn’t actually work. You’re supposed to do this:

LOOP=`losetup -f`
losetup $LOOP myimage
mount -t ext2 $LOOP /mnt
...
umount /mnt
losetup -d $LOOP

If you’re foolish enough to follow the documentation, eventually you’ll simply run out of free loop devices, no matter how many you have. When that happens, the mount point you tried to use will never work again with a loopback mount; you have to delete the directory and recreate it. Or reboot. Or sacrifice a chicken to the kernel gods.

Why support the mount interface if it isn’t reliable? Why not get rid of it, fix it, or at least document the problems somewhere other than, well, here?

[update: the root of our problem with letting the Linux mount command auto-allocate loopback devices may be that the umount command isn’t reliably freeing them without the -d option; it usually does so, but may be failing under load. I can’t test that right now, with everything covered in bubble-wrap in another state, but it’s worth a shot.]

[update: no, the -d option has nothing to do with it; I knocked together a quick test script, ran it in parallel N-1 times (where N was the total number of available loop devices), and about one run in three, I got the dreaded “ioctl: LOOP_SET_FD: Device or resource busy” error on the mount, even if losetup -a showed plenty of free loop devices.]
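Given the root cause noted at the top of this post (no locking around the selection of a free loop device), one workaround is to serialize the allocation yourself. What follows is only a sketch: the names with_loop_lock and LOOP_LOCK_DIR are made up for illustration, and it only protects you from other processes that go through the same wrapper.

```shell
#!/bin/bash
# Hypothetical wrapper: serialize loop-device setup across our own build jobs.
# LOOP_LOCK_DIR and with_loop_lock are illustrative names, not part of any tool.
LOOP_LOCK_DIR=${LOOP_LOCK_DIR:-/tmp/loop-setup.lock}

with_loop_lock() {
    # mkdir is atomic: exactly one caller succeeds, the rest wait and retry.
    until mkdir "$LOOP_LOCK_DIR" 2>/dev/null; do
        sleep 1
    done
    "$@"
    local status=$?
    rmdir "$LOOP_LOCK_DIR"
    return $status
}

# Usage (as root), keeping both the allocation and the attach inside the lock:
#   with_loop_lock sh -c 'LOOP=$(losetup -f) && losetup "$LOOP" myimage && echo "$LOOP"'
```

A job that gets killed while holding the lock will leave the directory behind, so a real version would want a timeout or a cleanup trap.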

Friday, August 25 2006

How not to move servers

Tip for the day: when you’ve arranged for a professional computer moving company to relocate 30 critical servers from one state to another, and the driver shows up alone in a bare panel truck, without even a blanket to keep the machines from bouncing around on their way to the warehouse, do not let him take them.

The driver was as surprised as I was, perhaps more so. He thought we had a shrink-wrapped pallet of boxes that could be popped onto the truck and dropped off after he made a few more stops. The dispatcher tried to talk him into loading the stuff loose. The dispatcher tried to talk me into letting the driver load the stuff loose, swearing that it would be fine for the short trip to the warehouse.

Things went downhill from there.

Tuesday, August 29 2006

Pedestrian On Pavement

I got a ticket yesterday. More precisely, I got a fake ticket yesterday, because it was the only way for the cops to get the crazy angry person to shut up and go away.

I had a little work project that was kind of important. Namely, I needed to get over half a million dollars' worth of servers packed up and loaded onto a truck (the same ones that were supposed to be shipped out on Friday). To do that, we needed to park the truck. Unfortunately, just as we were pulling into the commercial loading zone that we'd spent twenty minutes patiently waiting for, some clown in an SUV whips around the truck and starts backing into it.

I stepped out into the street and waved him off. He kept coming, until his bumper was about three inches from my body. Then he jumped out and furiously accused me of trying to steal his parking space, shouted at everyone within reach (including a completely unrelated moving company that was working across the street), and then ran off claiming he was going to find a cop to take care of me, leaving his car blocking both the parking spot and the street.

We found a cop first. When he returned with his dry-cleaning (he later claimed he really was making a commercial delivery, but that box never left the back of his SUV, and the cop saw him picking up the suit…), she was already writing up his ticket, and informing him that he was two minutes away from being towed.

He shouted at her. He shouted at us. He shouted at her sergeant, when he showed up. He harangued the bums on the sidewalk, telling them what horrible people and criminals we were. He tried to get the cop to give my truck driver a ticket for blocking the road. He tried to get the cop to give me a ticket for illegally attempting to reserve a parking space.

He got several tickets, which he’ll have to pay for. To shut him up, they wrote out a phony ticket for me, which will be dismissed when the cop deliberately fails to appear in court (her exact words: “this is bullshit, don’t pay it”). He tried to get my name so he could go after me personally, and the cop patiently explained that he had no right to that information.

And to think that this was actually better than my day Friday, which involved the world’s most carelessly ambitious contract Unix sysadmin trying to get me to let him work unsupervised as root on a production server that I’m responsible for (“Hi, Mark!”).

Thursday, September 28 2006

“Configured Maximum Connection Time Exceeded”

I was working from home yesterday, and connecting to the office via VPN. In the past, this hasn’t been a big deal. This time, just as I was getting set up to start Something Important, the connection went down. Deliberately.

Secure VPN Connection terminated by Peer.
Reason 430: Configured Maximum Connection Time Exceeded.

Connection terminated on Sep 28, 2006 18:28:40   Duration: 0 day(s), 08:00.12

Good thing they bought a new VPN server to replace this one. Oh, wait; the new one’s still in beta, has been for months, and recently stopped working. Feh.

Wednesday, October 25 2006

“But I’m trying, Ringo…”

In my many years of interviewing sysadmin candidates, the most important qualification, and the hardest to explain in terms that make sense to HR, has been “do they think right?”. The core of it, I think, is the attitude with which they approach diagnosing and solving problems.

Behavioral interviewing techniques can produce some useful information about problems they’ve dealt with in the past, but not so much on how they really got from X to Y. HR gets very nervous if you do anything that even looks like a direct test of an applicant’s abilities, so the best approach I’ve found is to swap problem-solving stories and pay close attention not only to the ones they choose to tell, but to the things they say about mine.

I don’t care what, if any, degree you’ve got. I don’t care who you’ve worked for, who you know, what certification programs you’ve completed, or how precisely your last job matched our current requirements. If you think right, and you’re not a complete idiot, I can work with you.

The two people who’ve been hired to replace the five of us are not complete idiots. One of them shows signs of thinking right.

Thursday, February 22 2007

Says it all, really…

I think this is the single finest example of premature optimization in existence today. From “Life with djbdns”:

The format of this datafile is documented at http://cr.yp.to/djbdns/tinydns-data.html. It looks a bit strange at first because it is not optimized to be readable by humans, but rather is optimized for parsing.

Following the link turns up this quote, which had me on the floor:

The data format is very easy for programs to edit, and reasonably easy for humans to edit, unlike the traditional zone-file format.

Yes, I think we can all agree that a colon-separated data file where whitespace is illegal and record type is indicated by a single ASCII character (+%.&=@-'^CZ:) is easy for a simple-minded program to edit. I don’t think anyone with two brain cells to rub together can agree that it’s “reasonably easy for humans to edit”.

I must confess that, between his famous Usenet debut and my first look at the daemontools package he inflicts on all users of his software, I have never been particularly open-minded as to the merits of “the DJB way”. I’ve never heard a compelling technical reason for a site to abandon Bind and Postfix, and his advocates tend to have the glassy-eyed stare of veteran kool-aid drinkers, so until recently I hadn’t even bothered to look at his data formats.

The djbdns data file format? Fucking stupid.

Thursday, March 1 2007

IPsuck

I do not like IPSec. I do not understand IPSec. Sadly, cheap VPN routers purchased by external partners to whom we must give some access pretty much speak nothing else. [don’t get me started on packaged SSL VPN servers…]

Fortunately, our firewall runs a recent release of OpenBSD. Even more fortunately, there’s an excellent site on configuring OpenBSD as an IPSec server, including sample PF firewall rules.

I used a recent build of Parallels to set up a private, non-routed network with three virtual servers on it, put one of them on the real network as well, set it up as a firewall and router, and tinkered with a pair of Netgear VPN routers until they both could connect to one of the private servers without seeing the other.

Then I worked on the PF ruleset until I knew I could cut off either Netgear without affecting anything else, and transferred my configuration to our real-world firewall. Works like a charm.

It appears that the best way to use IPSec is to completely ignore all of its management features, set up a generic tunnel config, and handle all the access controls in your firewall. One less convoluted config-file syntax to learn, one less place to screw up and allow the wrong people to get at the wrong stuff.

Thursday, March 15 2007

In a Parallel Universe…

[Peem] whispers: wiped in TIB, can u tank 4 us

[Ferendo] whispers: Maybe, what’s up?

[Peem] whispers: need 2 clear 4 1st boss, u tank 6 whelps we dps

[Ferendo] whispers: That sounds easy. Go ahead and summon me.

Ferendo joins the party.

[Ferendo] says: Okay, where are the whelps?

[Peem] says: dead ahead, dood

[Ferendo] says: What, past that nest of elite dragonkin?

Peem points at Enraged Harbinger Whelp.

[Ferendo] says: Ah, right in the middle of the nest of elite dragonkin. That’s a problem.

[Peem] says: u said easy

[Ferendo] says: I said tanking six whelps would be easy. Nobody told me about the 18 elite dragonkin fireballing me and healing each other in the middle of the fight. All my fire resist gear put together won’t keep me alive for fifteen seconds in that, and there’s no way you can DPS them down before I die. And when I die, you die.

[Ferendo] says: Look, send a tell to my friend Akamai; he’s got a full set of Molten Core gear and some fire pots, and he could clean this room out with his eyes closed.

[Peem] says: tried, he said no pugs

[Ferendo] says: What’s your repair budget?

[Peem] says: ??

[Ferendo] says: Without the right gear, we’re going to wipe half a dozen times before they’re all down, and that’s going to cost me at least six gold.

[Peem] says: no cash, just got [Gaudy Shiv of the Poser] at AH

[Ferendo] says: Then you’re fucked. Sorry, guys, I’m out of here.

Ferendo leaves the party.

[Peem] whispers: u suck

Ferendo is now ignoring Peem.

Friday, March 23 2007

Buying Windows laptops for work…

Rory has ranted a bit about our recent laptop troubles. After giving up on those two companies, and not being able to fit ThinkPads into the budget, we looked for an alternative. These days, we’re also constrained by the desire to avoid becoming a mixed XP/Vista shop, so I went to the vendor who likes us the most, PC Connection, and sorted through their offerings.

The first “fix me now” user really, really wanted a lightweight machine, and had a strong affection for Bluetooth, so we bought him a Sony VAIO SZ340P and bumped the memory to 1.5GB. He loves it, and I was pretty pleased with the out-of-the-box experience as well (including their new packaging). There are only three real problems: it takes half an hour and three reboots to delete all of the crapware that’s preinstalled, you have to spend an hour burning recovery DVDs because they don’t ship media, and the default screensaver plays obnoxious music on a short loop.

The second user liked the SZ340P, but wanted something even lighter, so we bought her the SZ360P. It’s a quarter-pound lighter, uses the same docking station (which ships without its own power supply, but uses the same one as the laptop), and is also a really nice machine.

The downside of 4-pound laptops is they’re not as sturdy, so for the next four new-hires in line, I looked for something a little bigger, and ended up choosing the BX640P, with RAM bumped to 2GB. Different docking station (nicer, actually, with room for an optical drive and a spare battery to keep charged), different set of crapware, and not a widescreen display, but a better keyboard and a sturdier feel, and I’m equally pleased with its performance.

The only serious negative: it looks like the BX series will be discontinued, so when they run out and I need to start buying Vista machines, I’ll have to switch series. At the moment, I’m leaning a little toward the FE890 series, but PC Connection doesn’t stock the full range yet, so I can’t get the CPU/RAM/disk combination I want. With luck I can put that off for a few months, though.

With the previous brand, two of five had video and wireless problems. The five VAIOs I’ve set up so far have been rock-solid, and I expect the same from the other three that just arrived.

Sadly, while we’ll be able to put off the Vista migration for a little while (hopefully until Juniper gets their VPN client working…), Microsoft Office 2003 is a dead product, and starting Monday we’ll have users running 2007. On each user’s machine, I have to open up each Office application as that user, click the unobtrusive button that looks like a window decoration, click on the “{Word,Excel,PowerPoint} Options” button, select the Save tab, and set the “Save files in this format” option to use the Office 97-2003 format. Or else.

[Update: Actually, if you like ThinkPads and you’re willing to buy them right now, PC Connection has some nice clearance deals. If we were a bigger company, I might find the “buy 15, get one free” deal attractive…]

[4/17/2007 update: okay, one of the Sony BX laptops just lost its motherboard, after locking up at random intervals over a week or two. That still leaves 9/10 good ones, which is better than we got with Dell and Alienware.]

Saturday, April 7 2007

Vista doesn’t suck

Mossberg is right about how much crapware infects a brand-new Sony laptop running Vista, and that’s sad. Because Vista sucks a lot less than Windows XP, and it deserves better.

Despite all of the stuff that got cut from the release, Vista isn’t just another minor update to the ancient NT codebase. There are serious architectural changes that make it an honest-to-gosh 21st Century operating system that will produce a better user experience on newer hardware, once every vendor updates their software to use the new APIs.

Microsoft has done a lot of good, solid work to improve not just the use, but also the installation. Rory and I both did fresh installs of Vista Ultimate (onto MacBooks…), and felt that the install process was on par with Mac OS X, if not a little better in places. I’m still installing XP at least once a week, and I can’t tell you just how significant an improvement this is. [don’t ask about Linux installs, please; I just ate]

Is the Aero UI gaudy and gratuitous? Yes. Are the menus and control panels different from previous Windows in ways that aren’t obviously functional? Yes. Does any of that really matter after about twenty minutes of familiarization? No, not really. I expect the adjustment period to be pretty short for most users, and none of them will ever want to use an XP machine again, just like most Mac users were delighted to abandon the limits of Mac OS 9 once they settled into Mac OS X 10.0.

Because that’s what Vista is: Microsoft’s OS X 10.0, with all that that implies. The XP compatibility is a subtler version of Apple’s Classic environment, and they really, really want everyone to rewrite their software to use the modern APIs that, for instance, use “fonts” instead of “bitmaps”. There’s going to be a few years of mostly-compatible legacy apps, service packs that break random things as a side effect of improving performance and reliability, and general chaos and confusion. And because it’s Microsoft, they’re going to try to solve the problems faster by throwing more engineers at them, which never works out well.

In the end, though, Vista will have 90% of the desktop market, Mac OS X will have 9.99% of it, and the rest will be evenly divided between fourteen different Linux distributions that don’t ship with all of the drivers you need, but they’re free and you control everything and you can fix it yourself and it even has Ogg Vorbis support.

Office 2007, on the other hand, is a major upgrade hassle, and it has nothing to do with functionality or cost. Microsoft’s grand release plan failed to cope with one very significant fact: experienced Office users know where everything is, and spend far more time navigating Office menus than they do Windows menus.

We’ve been forced to start slowly rolling it out at work, and it’s painful. Everyone who gets it hates it, because they need to get their work done right now, and they don’t have time to go to a retraining class and learn the joys of the ribbon and the “obviously superior” new arrangement of commands and menus. They don’t care about Vista; they just find the Start button, select “Word” or “Excel”, and they’re happy. But when the Word and Excel interfaces change in fundamental ways (and, worse, ignore the settings that are supposed to make sure files are saved in Office 2003 formats…), they’re angry and frustrated.

[Side note to Gerry: read the preceding paragraph carefully three times, and then shut the fuck up about OpenOffice.org as an alternative. It’s not better, it’s not a complete suite, it’s not as compatible, it adds to my support load, it requires just as much retraining effort, and I can’t hand the users Dummies books and send them off to training classes, which we don’t have time for anyway because we’re a startup in the middle of a major product launch. Got it?]

If we’d had the money a few months ago, we could have picked up the volume license agreements that would let us avoid Office 2007 for another year. And we’d have been a lot happier, because “launching your first product” is not the time to cut into everyone’s productivity by changing their tools.

Thursday, May 3 2007

Best $25 we’ve spent recently

This is a remarkably useful gadget that’s paid for itself several times in the past month. What it does: connect an IDE or SATA drive via USB2 without putting it in an enclosure. It’s faster to work with the bare drive when you just need to grab some data from a failed machine or scrub a disk before reuse or service.

Thursday, May 10 2007

Adobe fucks up again

When Adobe released the CS suite, they added a revision control system called Version Cue. I had mixed feelings about it, but at least it was off by default.

When they released the CS2 suite, they turned it on by default, without any regard for security. I was less than thrilled:

The only nice thing I can say about it is that it doesn’t add a new rule to the built-in Mac OS X firewall to open up the ports it uses.

Care to guess what CS3 does? If you guessed “adds a new firewall rule”, you’d only be half right. It adds a new firewall rule, and then turns off the firewall. That part’s a mistake, obviously, but silently modifying your firewall settings to turn on an unsecured file server is unforgivable.

[Update: Adobe acknowledges their mistake in turning off the firewall, but does not apologize for silently turning your machine into a server and sharing your documents]

Tuesday, May 29 2007

We, uh, “fixed the glitch”

I hate it when fixing one problem breaks something else, especially when it’s subtle.

A few weeks ago while testing our new IPSec VPN connections to external partners, we discovered that I could ssh/scp through the VPN from my Macs, but none of our Linux boxes could, and another Mac running allegedly-identical software had horrible performance issues.

The fix was a change in the OpenBSD firewall that also served as the IPSec endpoint: scrub reassemble tcp. The problem went away like magic.

Today, we found out that there’s a single external partner we have to post some data to via an HTTPS connection, and it worked fine from machines outside of our firewall, but failed about 50% of the time from all the machines inside our firewall.

…except for my Macs, which worked 100% of the time. I fired up a CentOS 5 Parallels session on one of them, and it failed 50% of the time. Surely it couldn’t be…

It was. Remove the scrub line, and the HTTPS post worked from everywhere, but now my IPSec VPNs were hosed again.

So:

scrub from any to $IPSEC1_INT reassemble tcp
scrub from any to $IPSEC2_INT reassemble tcp
scrub in

The root cause appears to be the partner’s IIS server failing to properly implement RFC 1323, causing some of the fragmented packets to be rejected during reassembly.

Monday, June 11 2007

Sigh…

A very active spammer decided to use a phony return address on munitions.com yesterday. The rejection messages from spam filters (“gosh, thanks, assholes”) were coming in in batches of around a thousand, which was not healthy; the machine was even rejecting SSH connections.

Fortunately, I have two virtual IP addresses with separate CBQ bandwidth queues, and ssh still worked on those. Once in, I was able to shut down the Postfix listener for munitions.com. I’ll leave it down for a few days, and hope that this clown switches phony addresses soon.

And maybe I’ll see about adding that SPF record I haven’t gotten around to…

Saturday, July 21 2007

Rule #1…

Once there was enough caffeine in my system, I remembered the first rule of system administration, and carefully reread the twice-forwarded email. Thanks, Walt; if you hadn’t passed on that key detail, we’d still be looking in the wrong place.

Oh, the rule? “Never let the user diagnose the problem.”

Thursday, August 9 2007

Two packets enter, one packet leaves!

Okay, I’m stumped. We have a ReadyNAS NV+ that holds Important Data, accessed primarily from Windows machines. Generally, it works really well, and we’ve been pretty happy with it for the last few months.

Monday, the Windows application that reads and writes the Important Data locked up on the primary user’s machine. Cryptic error messages that decrypted to “contact service for recovering your corrupted database” were seen.

Nightly backups of the device via the CIFS protocol worked fine. Reading and writing to the NAS from a Mac via CIFS worked fine. A second Windows machine equipped with the application worked fine, without any errors about corrupted data. I left the user working on that machine for the day, and did some after-hours testing that night.

The obvious conclusion was that the crufty old HP on the user’s desk was the problem (it had been moved on Friday), so I yanked it out of the way and temporarily replaced it with the other, working Windows box.

It didn’t work. I checked all the network connections, and everything looked fine. I took the working machine back to its original location, and it didn’t work any more. I took it down to the same switch as the NAS, and it didn’t work. My Mac still worked fine, though, so I used it to copy all of the Important Data from the ReadyNAS to our NetApp.

Mounting the NetApp worked fine on all machines in all locations. I can’t leave the data there long-term (in addition to being Important, it’s also Confidential), but at least we’re back in business.

I’m stumped. Right now, I’ve got a Mac and a Windows machine plugged into the same desktop gigabit switch (gigabit NICs everywhere), and the Mac copies a 50MB folder from the NAS in a few seconds, while the Windows machine gives up after a few minutes with a timeout error. The NAS reports:

smbd.log: write_data: write failure in writing to client 10.66.0.151. Error Connection reset by peer
smbd.log: write_data: write failure in writing to client 10.66.0.151. Error Broken pipe

The only actual hardware problem I ever found was a loose cable in the office where the working Windows box was located.

[Update: It’s being caused by an as-yet-unidentified device on the network. Consider the results of my latest test: if I run XP under Parallels on my Mac in shared (NAT) networking mode, it works fine; in bridged mode, it fails exactly like a real Windows box. Something on the subnet is passing out bad data that Samba clients ignore but real Windows machines obey. The NetApp works because it uses licensed Microsoft networking code instead of Samba.]

[8/23 Update: A number of recommended fixes have failed to either track down the offending machine or resolve the problem. The fact that it comes and goes is more support for the “single bad host” theory, but it’s hard to diagnose when you can’t run your tools directly on the NAS.

So I reached for a bigger hammer: I grabbed one of my old Shuttles that I’ve been testing OpenBSD configurations on, threw in a second NIC, configured it as an ethernet bridge, and stuck it in front of the NAS. That gave me an invisible network tap that could see all of the traffic going to the NAS, and also the ability to filter any traffic I didn’t like.

Just for fun, the first thing I did was turn on the bridge’s “blocknonip” option, to force Windows to use TCP to connect. And the problem went away. I still need to find the naughty host, but now I can do it without angry users breathing down my neck.]

Tuesday, October 2 2007

Network Autofellatio

How did I spend the last two days? Discovering that a machine that was powered off was sending a two megabit/second stream of SMTP traffic out through our firewall to another machine that had been powered off four days earlier, and that would have been on the far side of a VPN even if it had been turned on. And the VPN configuration had been removed from the firewall, which by this point was a completely different machine (hardware and OS) from the one that had been there four days earlier.

Sunday, November 11 2007

“Wow, the new network’s fast!”

Shame we have to let the users move into the building tomorrow morning.

Monday, December 17 2007

Powered by Amazon S3

I’ve been thinking of redoing my domains to cut down on hosting costs and bandwidth, and my back-of-the-envelope calculations for Amazon’s S3 storage service look pretty good. So, I’ve just moved my Japan vacation pictures and thumbnails over, and I’ll see what sort of bill it produces this month.

This has the side-effect of making my currently-photo-heavy site load a lot faster for everyone.

Tuesday, January 22 2008

Back on the air!

Gosh, I wonder if the power outage at my co-lo had anything to do with the guy who was wiring up a dozen new PDUs…

Friday, March 21 2008

When is data not data?

When it follows a commented-out line that ends in a backslash.

[this tip brought to you by OpenBSD’s ipsec.conf file, which considers the remaining partial record syntactically valid, triggering no warnings]
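Since the parser won't warn you, the quickest defense I can think of is to grep for the pattern before loading the file. A minimal sketch; the function name is made up, and it only catches comment lines whose very last character is the backslash:

```shell
# Hypothetical helper: list commented-out lines that end in a backslash,
# i.e. comments that will silently swallow or orphan the line after them.
find_comment_continuations() {
    # $1 = config file to check; prints "lineno:line" for each offender
    grep -nE '^[[:space:]]*#.*\\$' "$1"
}
```

Run it as "find_comment_continuations /etc/ipsec.conf"; no output (and a non-zero exit status from grep) means nothing matched.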

Friday, January 9 2009

Wimps

Over on slashdot, someone quotes:

“The study also found that over a third have suffered from sleepless nights or headaches as a result of IT problems at work, while 59 per cent spend between one and 10 hours a week working on IT systems outside normal hours….”

In the modern vernacular, I say “It’s fine, learn to play, noob”.

“Why, when I was a boy, we had to walk to the server room through three feet of snow, uphill both ways! Now get off my lawn!”

Wednesday, April 1 2009

NFSing a Leopard user

Saved for future reference, since I don’t do it very often…

  1. From another admin account, open System Preferences, click Accounts.
  2. Control-click the username, select Advanced Options.
  3. Change the UID and optionally the GID.
  4. From a root shell, chown -R the user’s files.
  5. If you changed the GID, run:
    dseditgroup -o edit -a $USER -t user $OLDGROUP
  6. If you want to add more local groups to match your NFS server, run:
    dseditgroup -o create -r "$DESC" -i $GID $GROUP
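
For step 4, a chown -R of the home directory won't catch anything the user owns elsewhere on the disk. One way to find the stragglers is to search by the old numeric UID. This is only a sketch, and the function name is made up:

```shell
# Hypothetical helper: list everything under a directory still owned by a UID.
files_owned_by() {
    # $1 = numeric UID, $2 = directory to search; -xdev stays on one file system
    find "$2" -xdev -user "$1" -print
}
```

Something like "files_owned_by 501 /" from a root shell (501 being whatever the old UID was) gives you a list to review before you chown anything.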

Tuesday, April 14 2009

“This. Is. My. BROOMSTICK!”

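#!/bin/bash
# Dump queue file, From, To, and Subject for each deferred message, tab-separated.
cd deferred || exit 1
find . -type f | split    # chunk the file list into xaa, xab, ...
for i in xa*; do
    for j in $(cat "$i"); do
        printf '%s\t' "$j"    # queue file name, then a tab
        postcat "$j" | awk '
            /^From: / {f=$0}
            /^To: / {t=$0}
            /^Subject: / {s=$0}
            END {printf("%s\t%s\t%s\n",f,t,s)}'
    done >> /tmp/jfoo
done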

Open in Excel, sort to taste, cleanup as needed, save for later use…

Friday, April 24 2009

Proof my email works again

In the past seven hours, I have received 490 pieces of spam. One made it to my inbox. One almost made it to my inbox. The rest were caught by SpamSieve, with no false positives.

So, yes, I’m pretty sure that the catchall mailbox at jgreely.com is working again. :-)

I moved one of my parked domains to a new account at Pair, the hosting side of domain registrar PairNIC. They offer clean multiple-domain support with catchall mailboxes and sophisticated filtering, secure IMAP and SMTP, and a full range of scripting languages and libraries under FreeBSD. Once I’ve tested everything out with that domain, I’ll move jgreely.com over, as well as the J-E dictionary I’m currently hosting on jgreely.net.

It will be a while before I can resume the blog upgrade work I started a while back, so dotclue.org won’t move to the Pair account any time soon, and neither will my old high-volume picture site (which survives because of bandwidth-throttling firewall rules). All of the non-blog CGI will probably end up on jgreely.net once that domain is migrated off of the flaky old Shuttle sitting in my closet.

Monday, July 27 2009

“Show me on this doll where the bad SQL touched you”

I don’t want a database guru. I want a database ogre, who lives in a dank, dark cave lined with the bones of developers who think they can write their own queries and release them to Production.

During my latest round of load-testing, I discovered that one particular client-driven query degrades rather seriously under load. As in, fifteen minutes to use a unique device ID to look up the matching unique customer ID and a single string related to RMA status. Part of the problem was that the dev was looking in the wrong place, but the main problem was that he didn’t understand the data, so the query was written in a way guaranteed to maximize search time. (rant about poor schema design saved for another day…)

I am a SQL caveman. My formal training in database technology began and ended with a single COBOL class in the mid-Eighties. I rewrote the query and dropped the time to 0.062 seconds under the same heavy load.

Four orders of magnitude? Time to feed another dev to the ogre!

(and, yes, the checkin comment attached to this query begins “optimized the sql query for …”. The sad thing is that, relatively speaking, this is a true statement; his previous code was worse)

My new Monday t-shirt

Monday, August 17 2009

It’s sad that this makes me happy…

SQL interface to Perforce.

It’s been around for quite a while, but I’d never noticed it; most of my data-mining has been at levels that can be satisfied with the usual command-line interface. It will come in handy for my branch-to-branch bugfix-integration report, though.

Sunday, October 11 2009

Microsoft, Sidekick, Danger

Reading the emerging story about the T-Mobile/Sidekick data loss, I was surprised to discover that this guy isn’t working there. In fact, he’s not even on the West Coast any more, which makes me feel better about all my data.

I have some friends at what used to be Danger, and I know they’ve been working frantically at damage control, but I can only see blood on the walls in this one. Some people screwed up bigtime, years ago, both procedurally and technically, and if the original culprits are gone, their replacements will get axed for not spotting, and removing, the vulnerabilities.

Microsoft can afford the financial hit it’s going to take from this, but the PR hit is devastating. Any product line that says “trust us with your data” is in big trouble.

[why, yes, I did just update and verify my offsite backups; why do you ask?]

Friday, November 13 2009

HexBash

I had a perfectly good reason for doing it this way…

declare id
function hexhex () {
  printf -v id %06X 0x$1
}
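
For future me, a quick usage sketch. It assumes bash 3.1 or later (where printf -v exists), and the sample values are made up. The function pads a hex string to at least six digits and uppercases it, leaving the result in $id rather than printing it:

```bash
#!/bin/bash
# Same function as above: normalize a hex string to six uppercase digits.
declare id
function hexhex () {
  # printf -v stores the result in $id instead of writing to stdout;
  # the 0x prefix makes printf parse the argument as hexadecimal.
  printf -v id %06X "0x$1"
}

hexhex beef
echo "$id"    # prints 00BEEF
hexhex 1a2b3c
echo "$id"    # prints 1A2B3C
```

Note that %06X is a minimum width, so values longer than six hex digits come through untruncated.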

Wednesday, March 10 2010

Arbitrary limits

As a general rule, office firewalls do not have to be configured to cope with simultaneous incoming syslog traffic from 80,000+ hosts. Mine did. Sadly, the default limit for a particular element was only capable of handling about 3/4 of that, leaving our outgoing connections somewhere between unstable and “not” when things got busy.

Fixed now.

PS: syslog can be scary efficient at sending packets when a box is unhappy. Enough unhappy boxes make for a quite impressive DDoS attack, if you haven’t previously discovered that using “no state” in a firewall rule does not, in fact, avoid filling your state table with crap, thus accelerating your approach toward that arbitrary limit.

Tuesday, March 16 2010

Dear VMware,

Come on, really?

“We’d like to keep you informed via email about product updates, upgrades, special offers and pricing. If you do not wish to be contacted via email, please ensure that the box is not checked.”

At least the box is not not unchecked by default, but this is stupid.

Tuesday, May 18 2010

Tuning OpenBSD…

Hear my words, and know them to be true:

“Adjust the unlabeled knob to the unmarked setting, then press the unspecified button an undocumented number of times. Fixes everything.”

I’m only half-kidding. Sadly, it always seems to be worth the effort. This time, it was to replace the CentOS server that kept locking up in less than 24 hours under the stress of incoming syslog data from 80,000+ hosts. Admittedly, syslog-ng is partially at fault, given that it leaks about 20 file handles per minute, but you wouldn’t expect that to cause scary ext3 errors that require a reboot. The BSD ffs seems to be more mature in that regard, although its performance goes to hell fast unless you turn on soft dependencies.

[Update: Oh, and to be fair, I should mention the downside of this, which is simply that adjusting the right knob to the wrong setting (or vice-versa) will kill everyone within thirty yards of the server.]

Wednesday, September 15 2010

Nothing says “script kiddie”…

… like using a single MAC address to repeatedly attack nearby wireless networks for several days.

Unsuccessfully.

Wednesday, February 2 2011

One from the trenches…

“I’m going to write out a log entry every time I see this sort of packet, and put it at WARNING level. This will help me solve a serious problem!”
    – Anonymous Developer

1,000 packets/second later on N devices…

Thursday, May 26 2011

First one that cackles…

For future reference, when someone comes by complaining about a rogue DHCP server on their network, check under their desk first.

Sunday, July 29 2012

When you buy a 10 Terabyte desktop NAS…

…it really sucks to run out of inodes when the device is barely 40% full. I mean, it’s not like I’m adding 250,000 files a day or something goofy like that.

Oh, wait, that’s exactly what I was doing. Drat.

Friday, September 7 2012

MySQL surprise!

So some code that made it through QA without a hitch blew chunks in Production. Fingers were pointed in various directions, but fundamentally, the SQL queries were incredibly simple, doing a simple retrieval based on matching a unique index, like so:

select field1 from MYTABLE where phone = "12223334444"

Production insisted that the moment the new software was deployed, the dedicated slave DB was being pounded into the ground, and the culprit was large numbers of full table scans. But all of the queries we knew about were exactly like the one above: retrieve part of a single record based on an exact match of a primary key.

Their DB was bigger than ours, so I loaded up a few hundred thousand phony records and tried again. As I expected, my thrash script barely raised the load. Four copies of it spewing these queries as fast as they could barely raised the load. Then I turned on log_queries_not_using_indexes to see if I was getting the volume of full table scans they were, and of course I wasn’t.

Then a real query came in from the new software, and went right into the slow-query log as a full table scan. Why? Because it looked like this:

select field1 from MYTABLE where phone = 12223334444

Fuck, fuck, fuck, fuck, fuck. When you do this, MySQL silently compares the string column to the number by converting the stored strings to numbers. It returns the right answer, but it can’t use the index.
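
If you want to see the difference for yourself, here is a minimal sketch that prints an EXPLAIN for each form of the lookup. The function name explain_both and the database name testdb are mine, not anything from the real system; the table and column names are the ones from the example queries above. Against an indexed string column, I'd expect the quoted form to show the index in the key column and the unquoted form to show a full scan (type ALL), but check it on your own server and version.

```bash
#!/bin/bash
# Print EXPLAIN statements for the quoted and unquoted forms of the lookup.
# explain_both is a hypothetical helper; MYTABLE, field1, and phone come
# from the example queries in the post.
explain_both () {
  local value=$1
  # String literal: the comparison stays string-to-string.
  printf "EXPLAIN SELECT field1 FROM MYTABLE WHERE phone = '%s';\n" "$value"
  # Bare number: the string column is compared against a numeric constant.
  printf "EXPLAIN SELECT field1 FROM MYTABLE WHERE phone = %s;\n" "$value"
}

explain_both 12223334444
# To run against a real server ("testdb" is a placeholder database name):
#   explain_both 12223334444 | mysql testdb
```

This only generates the statements; nothing touches a database unless you pipe the output into the mysql client yourself.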

Friday, October 19 2012

Dear APC,

When someone plugs a serial cable into one of your commercial-grade UPS units, the correct response is not to shut the unit off, interrupting power to the expensive device that’s being protected by it.

Wednesday, November 21 2012

EnGenius ENH202 Wireless Bridge

Good news: the building we’re moving into has never been occupied by another company. Bad news: it’s never been occupied by another company. In other words, there isn’t a single incoming network cable of any kind, and the few people willing to wire the place up are all running a bit behind schedule. If we had something better than a 3G modem, we could at least move a few people over there early, but so far nobody’s delivered. (…and a firmly-extended middle finger to Comcast, who offered us a great deal and then tried to get us to pay more than $10,000 to extend their network so it could reach the building)

Fortunately, the new place isn’t that far from the old place, and even more fortunately, the EnGenius ENH202 is trivial to configure and costs less than $100. And unlike the $300+ wireless bridge we tried, it actually powers up when you plug it in!

And it works quite nicely so far. No serious environmental sealing, so in a long-term installation you’d want to cover it in some fashion, but we’ll be happy if it lasts through Christmas.

[Update: damn this thing worked out nicely, making the move a lot less painful.]

Monday, January 14 2013

Followup questions are important…

User: “Help! I can’t find some of the files I need on the server for this morning’s meeting.”

Sysadmin: “Okay, that server looks fine, and we have good backups. What folders are missing files?”

User: “Well, I was looking in the agendas folder, and then it was gone, and there was a porn folder, and a sexy pictures folder, and…”

Sysadmin: “That sounds a little more serious than missing files. We’re on our way.”

Tuesday, February 19 2013

Dear users,

When you detect that an incoming email contains a virus-infected attachment, please do not forward the virus-infected message to other people saying “hey, if you got this, don’t open it”.

Monday, April 22 2013

Important NTP safety tip

When the clocks on internal hosts are drifting out of sync despite the fact that everything runs NTP, make sure that the server everything is pointed at isn’t pointed at itself.

(unless it has an attached GPS or other source of correct time, of course, which this one didn’t)

Monday, June 24 2013

Three hours of my life I want back…

“No, we just moved our office, we didn’t change anything except the external IP address. The VPN problem must be on your end. Did you set the new IP address?”

“Okay, we did install a new NAT router. But the problem must be on your end. Did you set the new IP address?”

“Oh, yes, it’s running a newer version of the OS. But the problem must be on your end. Did you set the new IP address?”

“Here are screenshots of our config. But the problem must be on your end. Did you set the new IP address?”

“Yes, we set it up with IKEv2 instead of v1. But the problem must be on your end. Did you set the new IP address?”

It’s actually been more than eight hours, and they still haven’t fixed their problem, but I at least got some sleep in the middle. We’d still be arguing about what the problem actually is if they hadn’t sent me the screenshots.

Oh, and it was urgent for me to make the change on my end Friday night (which they told me about on Friday afternoon…), but no one at their end actually checked their router for connectivity until this morning. And it’s been nearly an hour since they responded to the message that they’re using the wrong IKE version, but they still haven’t fixed it.

[Update: to add insult to injury, I just got a recruiting email from WalmartLabs. Perhaps the fact that it’s raining in Northern California in late June should have been a clue that the week was going to be a little odd.]

Wednesday, July 10 2013

ESXi 5.1 on Dell

We bought a Dell R620 to run VMware ESXi 5.1U1. It was pre-configured to correctly boot the supplied ESXi image from an SD card. Bringing it up on the network was trivial. Downloading the Windows vSphere Client software was trivial. Configuring a datastore so that you could actually use the product was annoying.

Y’see, they shipped it with a Windows GPT partition table, and attempting to use the disk produced a lengthy timeout and disconnect, every time. Occasionally, I’d get a pop-up error message, but couldn’t select it to cut and paste, and enabling ssh on the server showed that no errors were being logged.

Typing the error message in by hand (“… HostDatastoreSystem.QueryVmfsDatastoreCreateOptions … failed”) and googling it turned up detailed solutions for the problem, with obsolete commands. So, for the benefit of anyone else who gets into this state on ESXi 5.1:

  1. Enable ssh, and log in as root.
  2. Run esxcli storage core path list, locate your disk by the display name that showed up in the vSphere Client, and save the contents of the Device: field (mine was naa.6b8ca3a0e8405800195f77a21641467c).
  3. Run partedUtil mklabel /vmfs/devices/disks/device msdos, replacing device with the value you saved in step 2.

Now you can use it as a datastore.

Tuesday, March 4 2014

435,265

That’s the number of emails sent out this morning by a test service that was getting pummeled by an automated QA script.

Mood: Cranky.

[Update: after many eyes explored the logs, the QA test script was found to have done exactly the right thing, and the bug was in the actual service. So, a big huzzah for catching a truly crippling bug before it reached Production, but damn that was a mess.]

Friday, March 21 2014

Aero considered harmful

Outlook 2013 started breaking for our users last week. Only some of them, and not all at the same time, but the symptom was that the application would no longer start, hanging at the “loading profile…” screen.

The solution is to switch to the “Windows 7 Basic” graphical theme, turning off all the 3D UI decorations.

No, seriously.

And that’s about four days of sysadmin time that we’d like back, please.