Technically, it was handling it...


Woke up this morning, looked at my phone, and saw that I hadn’t received any work email since about 1:15 AM. Since I’m guaranteed to get at least one hourly cron-job result, that’s bad.

Log in to the mail server (good! that means the VPN is up and the servers still have power!), check the queue, and it eventually returns a number in excess of 500,000.
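(For anyone playing along at home, that number comes straight off the Postfix queue listing, something like the line below; nothing exotic.)

    # Quick queue-size check: the last line of the queue listing is a
    # summary along the lines of "-- 12345 Kbytes in 500000 Requests."
    postqueue -p | tail -n 1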

Almost all of them going to the qanotify alias. Sent from a single server.

The good news is that this made it very easy to remove them all from the queue. The bad news is that I can’t just kill it at the source; QA is furiously testing stuff for CES, and I don’t know which pieces are related to that. And, no, no one in QA actually checks for email from the test services, so they won’t know until I — wait for it — email them directly.
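(The cleanup itself is more or less the standard Postfix recipe sketched below; qanotify@example.com stands in for the real alias, and the delete needs root.)

    # List the queue in paragraph-sized records (one per queued message),
    # grab the queue ID of everything addressed to the runaway alias,
    # strip the active/hold markers, and hand the IDs to postsuper.
    postqueue -p | tail -n +2 \
      | awk 'BEGIN { RS = "" } /qanotify@example\.com/ { print $1 }' \
      | tr -d '*!' \
      | postsuper -d -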

For more fun, the specific process that’s doing it is not sending through the local server’s Postfix service, so I can’t shut it down there, either. It’s making direct SMTP connections to the central IT mail relay server.

Well, that I can fix. plonk
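(The plonk amounts to a firewall rule, roughly the one below; the address is a stand-in, and on a box sitting between the service node and the relay it would be the FORWARD chain instead of INPUT.)

    # Drop SMTP connections from the misbehaving service node before
    # they reach the central mail relay.
    iptables -I INPUT -p tcp -s 10.0.12.34 --dport 25 -j DROP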

(this didn’t delay incoming email from outside the company, just things like cron jobs and trouble tickets and the hourly reports that customer service needs to do their jobs; so, no pressure, y’know)

First update

QA: “I see in the logs that the SMTP server isn’t responding.”

J: “Correct. And it will stay that way until this is fixed.”

(I find myself doing this a lot these days; User: “X doesn’t do Y!”, J: “Correct”)

Second update

Dev Manager: “Could you send us an example of the kind of emails that you’re seeing?”

J: “You mean the one that’s in the message you’re replying to?”

Third update

DM: “Can you give my team access to all of the actual emails?”

J: “No, I deleted the 500,000+ that were in the spool. But it looks like at least 25,000 got through to this list of people on your team, who would have known about this before I did if they hadn’t had filters set up to ignore email coming from the service nodes.”

J: “And, what the hell, here’s thirty seconds of work from the shell isolating the most-reported CustomerPKs from the 25,000 emails that got through, so you can grep the logs in a useful way.”
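(That thirty seconds looked roughly like the pipeline below; the “CustomerPK: 12345” format and the directory of saved messages are stand-ins for the real thing.)

    # Pull the CustomerPK values out of the saved alert emails and rank
    # them by how often they were reported.
    grep -rhoE 'CustomerPK[=: ]+[0-9]+' /path/to/saved/alerts \
      | grep -oE '[0-9]+' \
      | sort | uniq -c | sort -rn | head -20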

Fourth update

John: “Ticket opened, assigned to devs, and escalated.”

(John used to work for me…)

Fifth update

Senior Dev: “Ooh, my bad; when I refactored SpocSkulker, I had it return ERROR instead of WARNING when processing an upgrade/downgrade for a customer that didn’t currently have active services. Once a minute. Each.”

SD: “Oh, and you can hand-edit the Tomcat config to point SMTP to an invalid server while you’re waiting for the new release.”

J: “Yeah, no, I’ll just keep blocking the traffic until the release is rolled out and I’ve confirmed with tcpdump.”
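(Confirming is just watching the wire go quiet, along these lines; the interface and address are placeholders.)

    # Watch for any SMTP traffic still coming from the offending node;
    # silence after the release rolls out means the fix actually took.
    tcpdump -nn -i eth0 'src host 10.0.12.34 and tcp port 25'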

(one of my many hats here used to be server-side QA for the services involved, so I immediately knew it was coming from SpocSkulker, and could have shut it off myself; but then it wouldn’t have gotten fixed until January)

Anticipated update

J receives massive fruit basket from Production team for catching this before it rolled out to them and took out their email servers.

