We, uh, "fixed the glitch"


I hate it when fixing one problem breaks something else, especially when it’s subtle.

A few weeks ago while testing our new IPSec VPN connections to external partners, we discovered that I could ssh/scp through the VPN from my Macs, but none of our Linux boxes could, and another Mac running allegedly-identical software had horrible performance issues.

The fix was a change in the OpenBSD firewall that also served as the IPSec endpoint: scrub reassemble tcp. The problem went away like magic.

Today, we found out that there’s a single external partner we have to post some data to via an HTTPS connection, and it worked fine from machines outside of our firewall, but failed about 50% of the time from all the machines inside our firewall.

…except for my Macs, which worked 100% of the time. I fired up a CentOS 5 Parallels session on one of them, and it failed 50% of the time. Surely it couldn’t be…

It was. Remove the scrub line, and the HTTPS post worked from everywhere, but now my IPSec VPNs were hosed again.

So:

scrub from any to $IPSEC1_INT reassemble tcp
scrub from any to $IPSEC2_INT reassemble tcp
scrub in

The root cause appears to be the partner’s IIS server failing to properly implement RFC 1323, causing some of the fragmented packets to be rejected during reassembly.