Fun fact…

When guaranteed-non-disruptive data-center power maintenance takes down your Confluence database server, application servers that survive and reconnect may end up speaking in tongues. Specifically, the “Other Macros” screen in edit mode ended up in a mix of English and Polish. A rolling restart fixed it.


How many words in this headline are not stupid?

Introducing Crowdsec: A Modernized, Collaborative Massively Multiplayer Firewall for Linux

Yeah, get back to me when you have data…

User: “Hey, my SQL queries are timing out on the replica!”

J: “Hmmm, that error says they’re so slow that the master is trying to clean up all the rows that have changed since it started running.”

U: “Can we try it on the primary?”

J: “No, because 1) Production, and 2) error is specific to replicas.”

U: (CCs Partner)

Partner: “Our service that queries your DB, sends the results over a VPN tunnel, and ingests them into our system is working fine, and doesn’t show any delays or network issues.”

J: “That’s what you said last time, and the problem ‘just went away’ the next night.”

P: “Try running this specific query locally and tell me how it works.”

J: (examines 450-line SQL query, shrugs, runs it) “2.5 minutes.”

P: “Hmm, works here with ‘limit 10000’ and fails with ‘limit 50000’. Well, nothing we can do on our end! Shall I set up a call with our Engineering team?”

J: “Wait, what was the runtime for that query on your end before it started timing out this week? For that matter, what was the runtime when it worked with ‘limit 10000’?”


P: “Here’s a chart showing it bang-on at almost that exact same runtime for weeks, until it started timing out every single time on Friday night.”

J: “Okay, let us know when you’ve fixed that. Just for fun, try changing the query to just return the count of rows instead of the ~36 MB of data. I, um, have a hunch.”


U: “9.6 seconds.”

J: “So, you can successfully submit obnoxious queries through Partner’s interface, as long as they don’t return any significant quantity of data. Hmmm, what does that sound like?”


(long meeting full of finger-pointing, with no indication of how it started failing 100% of the time, like throwing a switch)

J: “I have finally managed to reproduce the failure locally, which means I’m willing to try a small config-file change to work around the problem. Reminder: we still have no hint as to the actual cause.”

[you are here]

Random Thoughts, Time Machine edition

Dear Amazon,

You know the drill:

Hmmm, have they done an isekai series about ending up in another world as a monster-girl samurai psychologist yet?

Fun Mac Fact

If a Time Machine backup is interrupted for any reason, it may leave behind an unkillable backupd process. If this happens, even automatic local snapshots will stop working until you reboot. And by “reboot” I mean power-cycle, because macOS doesn’t know what to do about an unkillable system process; it kills off everything it can and then just sits there, helpless.

Part of the problem is that the menubar indicator that’s supposed to show when a backup is active does not include the “preparing” or “stopping” stages, so if you were to, say, close your laptop lid during those stages, or change your network configuration by starting a VPN connection or switching from wired to wireless, you could trigger the problem.

For more fun, if your Time Machine backups are on a NAS, they’re stored in a disk image, which needs to be fscked periodically (part of the lengthy “verifying” stage), and must be fscked after any error. And that can take hours. And if it fails, the only solution Apple offers is to destroy your entire backup history and start over, potentially leaving you with no backups at all until the first new one completes, which, again, takes hours, especially with the default “run really slow in the background” setting enabled.

Pro tip:

sudo sysctl debug.lowpri_throttle_enabled=0

There are instructions (1, 2, but none from Apple) for how to manually fsck a TM image (possibly multiple times) and correctly mark it as usable again, a process that has the potential to take days.

And that’s why I keep two separate SuperDuper backups of my laptop in addition to the two separate TM backup drives (the “belt, suspenders, bungee cords, and super-glue” approach). Time Machine is far too fragile to rely on for anything but quick single-file restores, although it can be useful for migrating to replacement hardware that won’t boot a cloned disk.

In the standard “you’re holding it wrong” Apple way, you can’t just turn on automatic local snapshots; you have to have at least one external volume configured for automatic TM backups. In fact, the manpage seems to claim that you can’t make local snapshots at all unless you’ve got at least one external TM backup. This suggests that the optimum strategy is to use SuperDuper every day to have bootable full backups, set up TM without automatic backups, and then set up a cron job to create and manage local snapshots. And manually kick off TM backups every week or so when you’re sure you won’t need to use your computer for a few hours.
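For the "create and manage local snapshots" half of that plan, a minimal sketch in Python, not a tested tool. It assumes `tmutil listlocalsnapshots /` prints names containing a `YYYY-MM-DD-HHMMSS` stamp after `com.apple.TimeMachine.` (the format may differ between macOS releases), and the function names and the default of keeping 24 snapshots are made up for illustration.

```python
import re
import subprocess

# Assumed name format: com.apple.TimeMachine.2020-05-20-123456[.local]
SNAPSHOT_RE = re.compile(r"com\.apple\.TimeMachine\.(\d{4}-\d{2}-\d{2}-\d{6})")

def snapshot_dates(listing):
    """Pull the date stamps out of `tmutil listlocalsnapshots /` output, oldest first."""
    # The stamps are zero-padded, so a plain string sort is chronological.
    return sorted(set(SNAPSHOT_RE.findall(listing)))

def dates_to_delete(dates, keep):
    """Return every stamp except the newest `keep` of them."""
    if keep <= 0:
        return list(dates)
    return dates[:-keep] if len(dates) > keep else []

def main(keep=24):
    """Take a new local snapshot, then prune the oldest ones (macOS only)."""
    subprocess.run(["tmutil", "localsnapshot"], check=True)
    listing = subprocess.run(
        ["tmutil", "listlocalsnapshots", "/"],
        check=True, capture_output=True, text=True,
    ).stdout
    for stamp in dates_to_delete(snapshot_dates(listing), keep):
        subprocess.run(["tmutil", "deletelocalsnapshots", stamp], check=True)
```

Calling `main()` from a script that cron runs hourly would be the obvious way to use it. The parsing and pruning are kept in separate functions so they can be checked against a saved listing before anything gets deleted.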

Dear Synology,

When installing a system update on one of your NAS products, having it fail with the following message is “less than encouraging”:

Re-downloading the DSM update (6.2.3-25426) and trying again didn’t help.

On the bright side, it’s still up and running.


Somehow I missed an episode of Good Eats: Reloaded, so I got to watch two back-to-back yesterday: pot roast and oatmeal. I never tried the much-reviled pot roast recipe from the original episode, and Alton would be horrified to discover that I actually like single-serving instant oatmeal.

What stuck out for me was that the reloaded oatmeal recipes (1, 2) are tarted up with those trendiest of grains, quinoa and chia. The reloaded granola recipe just sounds unpleasant. The new pot roast looks decent, but I’m not going to run out and try it when it’s hot out.

Perhaps when the rainy season starts again in the fall (although I’m actually still getting some light rain occasionally, mostly at night).

Metaphor Alert!

I went to check on the status of Corona-chan restrictions in Monterey County, only to discover that none of the authoritative DNS servers for the site are responding, and the records have timed out everywhere. Sounds like a virus to me!

Stage 2.2 update

Yesterday, the county agreed to beg the governor for permission to enter phase 2 of stage 2. The state has “acknowledged receipt of the Form”.

Should they approve it, dine-in restaurants, full-service car washes, shopping malls, and pet grooming will once again be legal. Not sure about haircuts, since there’s industry-specific advice that suggests yes, but they fall into a category that’s still listed as “phase 3”.

On the bright side, residential cleaning services will be permitted to reopen, which means I can have my already-pretty-damn-clean house thoroughly scrubbed.

Hopefully my dentist had the financial resources to ride this out and can reopen. Soon.

By golly, haircuts are in stage 2.2 now

…assuming salons and barbershops meet the detailed requirements to reopen, some of which assume the owner/operator has plenty of extra money to upgrade the facilities and purchase a large stock of disposable everythings. And that they can get stylists to come back to work despite making more on unemployment thanks to the extra-goodies laws.

(in many cases, stylists are basically independent operators who rent their stations, making them ineligible for unemployment benefits, but California mostly outlawed freelancers this year, so I’m not actually sure what their status is any more)

Creeping featurism, Jira plugin edition…

So there’s a third-party plugin for Jira called “Git Integration for Jira”, whose description claims that all it does is query your GitLab server and display links to the commits that reference the current Jira issue.

Nowhere does it mention that it also defaults to sending warning emails to addresses mentioned in the commits, even if they don’t map to any of your Jira users.

Like, say, the FreeSWITCH mailing list.

The initial helpdesk ticket wasn’t terribly useful in figuring this out, since it consisted of a screenshot of someone’s inbox that didn’t include their email address or even a full subject line, much less something useful like, say, complete email headers. It took three rounds with one of our local devs to elicit the keywords “git integration plugin”, “project key FS”, and “smart commits”.

That last bit is what let me know I was looking at the correct configuration screen, since it’s right above the feature to randomly send email to addresses scraped from the git commit.

Technically, it was handling it…

Woke up this morning, looked at my phone, and saw that I hadn’t received any work email since about 1:15 AM. Since I’m guaranteed to get at least one hourly cron-job result, that’s bad.

Log in to the mail server (good! that means the VPN is up and the servers still have power!), check the queue, and it eventually returns a number in excess of 500,000.

Almost all of them going to the qanotify alias. Sent from a single server.

The good news is that this made it very easy to remove them all from the queue. The bad news is that I can’t just kill it at the source; QA is furiously testing stuff for CES, and I don’t know which pieces are related to that. And, no, no one in QA actually checks for email from the test services, so they won’t know until I — wait for it — email them directly.

For more fun, the specific process that’s doing it is not sending through the local server’s Postfix service, so I can’t shut it down there, either. It’s making direct SMTP connections to the central IT mail relay server.

Well, that I can fix. plonk

(this didn’t delay incoming email from outside the company, just things like cron jobs and trouble tickets and the hourly reports that customer service needs to do their jobs; so, no pressure, y’know)

First update

QA: “I see in the logs that the SMTP server isn’t responding.”

J: “Correct. And it will stay that way until this is fixed.”

(I find myself doing this a lot these days; User: “X doesn’t do Y!”, J: “Correct”)

Second update

Dev Manager: “Could you send us an example of the kind of emails that you’re seeing?”

J: “You mean the one that’s in the message you’re replying to?”

Third update

DM: “Can you give my team access to all of the actual emails?”

J: “No, I deleted the 500,000+ that were in the spool. But it looks like at least 25,000 got through to this list of people on your team, who would have known about this before I did if they didn’t have filters set up to ignore email coming from the service nodes.”

J: “And, what the hell, here’s thirty seconds of work from the shell isolating the most-reported CustomerPKs from the 25,000 emails that got through, so you can grep the logs in a useful way.”
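The original was a shell one-liner that isn't shown; the same idea as a rough Python sketch looks something like this. The message format is a guess: it assumes each email body mentions the key as `CustomerPK=<digits>` or `CustomerPK: <digits>`, and the function name is hypothetical.

```python
import re
from collections import Counter

# Hypothetical pattern: the real message format isn't shown, so this assumes
# the key appears as "CustomerPK=<digits>" or "CustomerPK: <digits>".
CUSTOMER_PK_RE = re.compile(r"CustomerPK\s*[=:]\s*(\d+)")

def most_reported(messages, top=10):
    """Count CustomerPK mentions across message bodies, most frequent first."""
    counts = Counter()
    for body in messages:
        counts.update(CUSTOMER_PK_RE.findall(body))
    return counts.most_common(top)
```

Feed it the bodies of the emails that got through, and the top few entries are the keys worth grepping the service logs for.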

Fourth update

John: “Ticket opened, assigned to devs, and escalated.”

(John used to work for me…)

Fifth update

Senior Dev: “Ooh, my bad; when I refactored SpocSkulker, I had it return ERROR instead of WARNING when processing an upgrade/downgrade for a customer that didn’t currently have active services. Once a minute. Each.”

SD: “Oh, and you can hand-edit the Tomcat config to point SMTP to an invalid server while you’re waiting for the new release.”

J: “Yeah, no, I’ll just keep blocking the traffic until the release is rolled out and I’ve confirmed with tcpdump.”

(one of my many hats here used to be server-side QA for the services involved, so I immediately knew it was coming from SpocSkulker, and could have shut it off myself; but then it wouldn’t have gotten fixed until January)

Anticipated update

J receives massive fruit basket from Production team for catching this before it rolled out to them and took out their email servers.

‘Fun’ with HSTS

There is nothing wrong with using good old-fashioned HTTP without encryption. There are situations where it is a perfectly reasonable thing to do, and the protocol shouldn’t be blindly tagged with dire warnings about people kidnapping your dog, stealing your credit cards, and secretly replacing your spouse with Folger’s Crystals.

Browser vendors disagree, for reasons not-entirely-wholesome, so it’s been an ongoing struggle at the office to deal with people who file helpdesk tickets about broken SSL on sites that never had SSL to begin with, and don’t understand that their browser is silently rewriting URLs and hiding the evidence.
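One way the silent rewriting can happen is a cached Strict-Transport-Security policy. A rough Python sketch of how such a policy gets read and applied, following the header syntax in RFC 6797; it ignores expiry times and preload lists, the function names and the example hostnames are made up, and nothing here says this is the exact mechanism behind any particular ticket.

```python
def parse_hsts(header_value):
    """Parse a Strict-Transport-Security value into (max_age, include_subdomains).

    Returns None if there is no usable max-age directive.
    """
    max_age = None
    include_subdomains = False
    for part in header_value.split(";"):
        name, _, value = part.strip().partition("=")
        name = name.strip().lower()          # directive names are case-insensitive
        value = value.strip().strip('"')     # the value may be a quoted string
        if name == "max-age" and value.isdigit():
            max_age = int(value)
        elif name == "includesubdomains":
            include_subdomains = True
    if max_age is None:
        return None
    return max_age, include_subdomains

def browser_would_upgrade(policy_host, policy, request_host):
    """Rough check: would a stored policy turn http:// into https:// for this host?"""
    if policy is None:
        return False
    max_age, include_subdomains = policy
    if max_age == 0:                         # max-age=0 means "forget this policy"
        return False
    if request_host == policy_host:
        return True
    return include_subdomains and request_host.endswith("." + policy_host)
```

For example, a policy of `max-age=31536000; includeSubDomains` stored for the hypothetical host `example.internal` would also cover `wiki.example.internal`, even if that site never had a certificate.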

With recent browser releases, it got to the point where we had to put SSL reverse proxies in front of a bunch of internal web sites just to shut up the whining. This was non-trivial, and left a number of sites only partially functional for most of a day (because of course this was so important that it couldn’t be tested, QA’d, or released on a weekend). Because once a site gets “upgraded” to HTTPS, the browser responds to any HTTP links like CNN covering a Trump rally.

That was Tuesday. Wednesday night, I was wandering through the desert on a sand-seal with no name, and out of the corner of my eye I saw my phone sync up about a dozen emails, all complaining about this new HSTS thingie (aka “SSL Bondage”).

Someone urgently needed access to a site that was rejecting SSL connections, so he CC’d a half-dozen people along with the helpdesk email address. Several of them responded to all, creating additional tickets. Several people responded to the responses.

When I’d finished merging the 10 duplicate tickets, my one-line response was “correct. we haven’t added HTTPS to that site yet”.

“Need a clue, take a clue,
 got a clue, leave a clue”