Sysadmin

When is data not data?


When it follows a commented-out line that ends in a backslash.

[this tip brought to you by OpenBSD’s ipsec.conf file, which considers the remaining partial record syntactically valid, triggering no warnings]
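
A cheap guard against this is to lint the file before loading it. The sketch below is only that: it assumes `#` starts a comment and a trailing backslash continues a line, as described above, and the function name and sample lines are made up for illustration rather than taken from any real ipsec.conf.

```python
def find_commented_continuations(text):
    """Return (line_number, line) pairs for comment lines that end in a backslash."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Assumption: '#' opens a comment, and a backslash as the very last
        # character of the line continues it onto the next line.
        if line.lstrip().startswith("#") and line.endswith("\\"):
            hits.append((number, line))
    return hits


# Hypothetical config text, not real ipsec.conf syntax.
sample = "\n".join([
    "rule one \\",
    "# option a \\",
    "option b",
])

for number, line in find_commented_continuations(sample):
    print(f"line {number}: comment ends in a backslash: {line}")
```

Anything it reports deserves a second look, since the line after it may be getting swallowed along with the comment.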

Back on the air!


Gosh, I wonder if the power outage at my co-lo had anything to do with the guy who was wiring up a dozen new PDUs…

Powered by Amazon S3


I’ve been thinking of redoing my domains to cut down on hosting costs and bandwidth, and my back-of-the-envelope calculations for Amazon’s S3 storage service look pretty good. So, I’ve just moved my Japan vacation pictures and thumbnails over, and I’ll see what sort of bill it produces this month.

This has the side-effect of making my currently-photo-heavy site load a lot faster for everyone.

"Wow, the new network's fast!"


Shame we have to let the users move into the building tomorrow morning.

Network Autofellatio


How did I spend the last two days? Discovering that a machine that was powered off was sending a two megabit/second stream of SMTP traffic out through our firewall to another machine that had been powered off four days earlier, and that would have been on the far side of a VPN even if it had been turned on. And the VPN configuration had been removed from the firewall, which by this point was a completely different machine (hardware and OS) from the one that had been there four days earlier.

Two packets enter, one packet leaves!


Okay, I’m stumped. We have a ReadyNAS NV+ that holds Important Data, accessed primarily from Windows machines. Generally, it works really well, and we’ve been pretty happy with it for the last few months.

Monday, the Windows application that reads and writes the Important Data locked up on the primary user’s machine, throwing cryptic error messages that decrypted to “contact service for recovering your corrupted database”.

Nightly backups of the device via the CIFS protocol worked fine. Reading and writing to the NAS from a Mac via CIFS worked fine. A second Windows machine equipped with the application worked fine, without any errors about corrupted data. I left the user working on that machine for the day, and did some after-hours testing that night.

The obvious conclusion was that the crufty old HP on the user’s desk was the problem (it had been moved on Friday), so I yanked it out of the way and temporarily replaced it with the other, working Windows box.

It didn’t work. I checked all the network connections, and everything looked fine. I took the working machine back to its original location, and it didn’t work any more. I took it down to the same switch as the NAS, and it didn’t work. My Mac still worked fine, though, so I used it to copy all of the Important Data from the ReadyNAS to our NetApp.

Mounting the NetApp worked fine on all machines in all locations. I can’t leave the data there long-term (in addition to being Important, it’s also Confidential), but at least we’re back in business.

I’m stumped. Right now, I’ve got a Mac and a Windows machine plugged into the same desktop gigabit switch (gigabit NICs everywhere), and the Mac copies a 50MB folder from the NAS in a few seconds, while the Windows machine gives up after a few minutes with a timeout error. The NAS reports:

smbd.log: write_data: write failure in writing to client 10.66.0.151. Error Connection reset by peer
smbd.log: write_data: write failure in writing to client 10.66.0.151. Error Broken pipe

The only actual hardware problem I ever found was a loose cable in the office where the working Windows box was located.

[Update: It’s being caused by an as-yet-unidentified device on the network. Consider the results of my latest test: if I run XP under Parallels on my Mac in shared (NAT) networking mode, it works fine; in bridged mode, it fails exactly like a real Windows box. Something on the subnet is passing out bad data that Samba clients ignore but real Windows machines obey. The NetApp works because it uses licensed Microsoft networking code instead of Samba.]

[8/23 Update: A number of recommended fixes have failed to either track down the offending machine or resolve the problem. The fact that it comes and goes is more support for the “single bad host” theory, but it’s hard to diagnose when you can’t run your tools directly on the NAS.

So I reached for a bigger hammer: I grabbed one of my old Shuttles that I’ve been testing OpenBSD configurations on, threw in a second NIC, configured it as an ethernet bridge, and stuck it in front of the NAS. That gave me an invisible network tap that could see all of the traffic going to the NAS, and also the ability to filter any traffic I didn’t like.

Just for fun, the first thing I did was turn on the bridge’s “blocknonip” option, to force Windows to use TCP to connect. And the problem went away. I still need to find the naughty host, but now I can do it without angry users breathing down my neck.]
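
For anyone wondering what that option actually throws away: as I understand the OpenBSD documentation, “blocknonip” drops everything on a bridge member that isn’t IPv4, IPv6, ARP, or reverse ARP. Here is a rough sketch of that test in Python, assuming untagged Ethernet frames; the function names and the sample frames are invented for illustration, and this is not how the kernel implements it.

```python
import struct

# EtherTypes that survive the filter: IPv4, IPv6, ARP, reverse ARP.
ALLOWED_ETHERTYPES = {0x0800, 0x86DD, 0x0806, 0x8035}


def passes_blocknonip(frame: bytes) -> bool:
    """Return True if an untagged Ethernet frame carries IP or ARP traffic."""
    if len(frame) < 14:
        return False  # too short to hold an Ethernet header
    # Bytes 12-13 hold the type/length field, in network byte order.
    (ethertype,) = struct.unpack("!H", frame[12:14])
    # Values of 1500 or less are 802.3 length fields (LLC traffic), which
    # are never in the allowed set, so those frames are dropped as well.
    return ethertype in ALLOWED_ETHERTYPES


def make_frame(ethertype: int) -> bytes:
    """Build a hypothetical frame: zeroed MAC addresses plus the type field."""
    return bytes(12) + struct.pack("!H", ethertype) + b"payload"


print(passes_blocknonip(make_frame(0x0800)))  # IPv4 frame
print(passes_blocknonip(make_frame(0x8137)))  # IPX frame
```

Whatever the naughty host is sending, it evidently doesn’t make it through a filter like this one.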

Rule #1...


Once there was enough caffeine in my system, I remembered the first rule of system administration, and carefully reread the twice-forwarded email. Thanks, Walt; if you hadn’t passed on that key detail, we’d still be looking in the wrong place.

Oh, the rule? “Never let the user diagnose the problem.”

Sigh...


A very active spammer decided to use a phony return address on munitions.com yesterday. The rejection messages from spam filters (“gosh, thanks, assholes”) were arriving in batches of around a thousand, which was not healthy; the machine was even rejecting SSH connections.

Fortunately, I have two virtual IP addresses with separate CBQ bandwidth queues, and SSH still worked on those. Once in, I was able to shut down the Postfix listener for munitions.com. I’ll leave it down for a few days, and hope that this clown switches phony addresses soon.

And maybe I’ll see about adding that SPF record I haven’t gotten around to…

“Need a clue, take a clue,
 got a clue, leave a clue”