Thursday, August 9 2007

Two packets enter, one packet leaves!

Okay, I’m stumped. We have a ReadyNAS NV+ that holds Important Data, accessed primarily from Windows machines. Generally, it works really well, and we’ve been pretty happy with it for the last few months.

Monday, the Windows application that reads and writes the Important Data locked up on the primary user’s machine. Cryptic error messages that decrypted to “contact service for recovering your corrupted database” were seen.

Nightly backups of the device via the CIFS protocol worked fine. Reading and writing to the NAS from a Mac via CIFS worked fine. A second Windows machine equipped with the application worked fine, without any errors about corrupted data. I left the user working on that machine for the day, and did some after-hours testing that night.

The obvious conclusion was that the crufty old HP on the user’s desk was the problem (it had been moved on Friday), so I yanked it out of the way and temporarily replaced it with the other, working Windows box.

It didn’t work. I checked all the network connections, and everything looked fine. I took the working machine back to its original location, and it didn’t work any more. I took it down to the same switch as the NAS, and it didn’t work. My Mac still worked fine, though, so I used it to copy all of the Important Data from the ReadyNAS to our NetApp.

Mounting the NetApp worked fine on all machines in all locations. I can’t leave the data there long-term (in addition to being Important, it’s also Confidential), but at least we’re back in business.

I’m stumped. Right now, I’ve got a Mac and a Windows machine plugged into the same desktop gigabit switch (gigabit NICs everywhere), and the Mac copies a 50MB folder from the NAS in a few seconds, while the Windows machine gives up after a few minutes with a timeout error. The NAS reports:

smbd.log: write_data: write failure in writing to client Error Connection reset by peer
smbd.log: write_data: write failure in writing to client Error Broken pipe

The only actual hardware problem I ever found was a loose cable in the office where the working Windows box was located.

[Update: It’s being caused by an as-yet-unidentified device on the network. Consider the results of my latest test: if I run XP under Parallels on my Mac in shared (NAT) networking mode, it works fine; in bridged mode, it fails exactly like a real Windows box. Something on the subnet is passing out bad data that Samba clients ignore but real Windows machines obey. The NetApp works because it uses licensed Microsoft networking code instead of Samba.]

[8/23 Update: A number of recommended fixes have failed to either track down the offending machine or resolve the problem. The fact that it comes and goes is more support for the “single bad host” theory, but it’s hard to diagnose when you can’t run your tools directly on the NAS.

So I reached for a bigger hammer: I grabbed one of my old Shuttles that I’ve been testing OpenBSD configurations on, threw in a second NIC, configured it as an ethernet bridge, and stuck it in front of the NAS. That gave me an invisible network tap that could see all of the traffic going to the NAS, and also the ability to filter any traffic I didn’t like.

Just for fun, the first thing I did was turn on the bridge’s “blocknonip” option, to force Windows to use TCP to connect. And the problem went away. I still need to find the naughty host, but now I can do it without angry users breathing down my neck.]