Sysadmin

Random Thoughts, Time Machine edition


Dear Amazon,

You know the drill:

Hmmm, have they done an isekai series about ending up in another world as a monster-girl samurai psychologist yet?

Fun Mac Fact

If a Time Machine backup is interrupted for any reason, it may leave behind an unkillable backupd process. If this happens, even automatic local snapshots will stop working until you reboot. And by “reboot” I mean power-cycle, because macOS doesn’t know what to do about an unkillable system process; it kills off everything it can and then just sits there, helpless.

Part of the problem is that the menubar indicator that’s supposed to show when a backup is active does not include the “preparing” or “stopping” stages, so if you were to, say, close your laptop lid during those stages, or change your network configuration by starting a VPN connection or switching from wired to wireless, you could trigger the problem.

For more fun, if your Time Machine backups are on a NAS, they’re stored in a disk image, which needs to be fscked periodically (part of the lengthy “verifying” stage), and must be fscked after any error. And that can take hours. And if it fails, the only solution Apple offers is to destroy your entire backup history and start over, potentially leaving you with no backups at all until the first new one completes, which, again, takes hours, especially with the default “run really slow in the background” setting enabled.

Pro tip:

sudo sysctl debug.lowpri_throttle_enabled=0
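That turns off the low-priority I/O throttling that keeps backupd crawling along in the background. It doesn’t survive a reboot, and if you’d rather not leave it off, the same knob turns it back on:

sudo sysctl debug.lowpri_throttle_enabled=1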

There are instructions (1, 2, but none from Apple) for how to manually fsck a TM image (possibly multiple times) and correctly mark it as usable again, a process that has the potential to take days.

And that’s why I keep two separate SuperDuper backups of my laptop in addition to the two separate TM backup drives (the “belt, suspenders, bungee cords, and super-glue” approach). Time Machine is far too fragile to rely on for anything but quick single-file restores, although it can be useful for migrating to replacement hardware that won’t boot a cloned disk.

In the standard “you’re holding it wrong” Apple way, you can’t just turn on automatic local snapshots; you have to have at least one external volume configured for automatic TM backups. In fact, the manpage seems to claim that you can’t make local snapshots at all unless you’ve got at least one external TM backup. This suggests that the optimum strategy is to use SuperDuper every day to have bootable full backups, set up TM without automatic backups, and then set up a cron job to create and manage local snapshots. And manually kick off TM backups every week or so when you’re sure you won’t need to use your computer for a few hours.
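The snapshot half of that cron job might look something like this (a sketch only; tmutil wants root, APFS is assumed, and the purge amount is arbitrary):

#!/bin/sh
# take a fresh local APFS snapshot of the boot volume
/usr/bin/tmutil localsnapshot
# then ask it to reclaim up to ~20GB of space from older local snapshots
/usr/bin/tmutil thinlocalsnapshots / 20000000000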

Dear Synology,


When installing a system update on one of your NAS products, having it fail with the following message is “less than encouraging”:

Re-downloading the DSM update (6.2.3-25426) and trying again didn’t help.

On the bright side, it’s still up and running.

Unrelated,

Somehow I missed an episode of Good Eats: Reloaded, so I got to watch two back-to-back yesterday: pot roast and oatmeal. I never tried the much-reviled pot roast recipe from the original episode, and Alton would be horrified to discover that I actually like single-serving instant oatmeal.

What stuck out for me was that the reloaded oatmeal recipes (1, 2) are tarted up with those trendiest of grains, quinoa and chia. The reloaded granola recipe just sounds unpleasant. The new pot roast looks decent, but I’m not going to run out and try it when it’s hot out.

Perhaps when the rainy season starts again in the fall. (although I’m actually still getting some light rain occasionally, mostly at night)

Metaphor Alert!

I went to check on the status of Corona-chan restrictions in Monterey County, only to discover that none of the authoritative DNS servers for www.co.monterey.ca.us are responding, and the records have timed out everywhere. Sounds like a virus to me!
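If you want to check for yourself, something along these lines does it: ask for the zone’s NS records, then query each server directly with short timeouts so the dead ones fail fast.

for ns in $(dig +short NS co.monterey.ca.us); do
  echo "== $ns"
  dig +norecurse +time=2 +tries=1 @"$ns" www.co.monterey.ca.us A
done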

Stage 2.2 update

Yesterday, the county agreed to beg the governor for permission to enter phase 2 of stage 2. The state has “acknowledged receipt of the Form”.

Should they approve it, dine-in restaurants, full-service car washes, shopping malls, and pet grooming will once again be legal. Haircuts are less clear: there’s industry-specific guidance that suggests yes, but they fall into a category that’s still listed as “phase 3”.

On the bright side, residential cleaning services will be permitted to reopen, which means I can have my already-pretty-damn-clean house thoroughly scrubbed.

Hopefully my dentist had the financial resources to ride this out and can reopen. Soon.

By golly, haircuts are in stage 2.2 now

…assuming salons and barbershops meet the detailed requirements to reopen, some of which assume the owner/operator has plenty of extra money to upgrade the facilities and purchase a large stock of disposable everythings. And that they can get stylists to come back to work despite making more on unemployment thanks to the extra-goodies laws.

(in many cases, stylists are basically independent operators who rent their stations, making them ineligible for unemployment benefits, but California mostly outlawed freelancers this year, so I’m not actually sure what their status is any more)

Creeping featurism, Jira plugin edition…


So there’s a third-party plugin for Jira called “Git Integration for Jira”, whose description claims that all it does is query your Gitlab server and display links to the commits that reference the current Jira issue.

Nowhere does it mention that it also defaults to sending warning emails to addresses mentioned in the commits, even if they don’t map to any of your Jira users.

Like, say, the Freeswitch mailing list.

The initial helpdesk ticket wasn’t terribly useful in figuring this out, since it consisted of a screenshot of someone’s inbox that didn’t include their email address or even a full subject line, much less something useful like, say, complete email headers. It took three rounds with one of our local devs to elicit the keywords “git integration plugin”, “project key FS”, and “smart commits”.

That last bit is what let me know I was looking at the correct configuration screen, since it’s right above the feature to randomly send email to addresses scraped from the git commit.

Technically, it was handling it…


Woke up this morning, looked at my phone, and saw that I hadn’t received any work email since about 1:15 AM. Since I’m guaranteed to get at least one hourly cron-job result, that’s bad.

Login to mail server (good! that means the VPN is up and the servers still have power!), check the queue, and it eventually returns a number in excess of 500,000.

Almost all of them going to the qanotify alias. Sent from a single server.

The good news is that this made it very easy to remove them all from the queue. The bad news is that I can’t just kill it at the source; QA is furiously testing stuff for CES, and I don’t know which pieces are related to that. And, no, no one in QA actually checks for email from the test services, so they won’t know until I — wait for it — email them directly.
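Flushing half a million messages aimed at one alias is at least the easy part; a sketch, assuming Postfix (recipient pattern abbreviated, since the full alias address isn’t shown here):

# dump the queue, grab the queue IDs of everything addressed to the qanotify alias, feed them to postsuper
mailq | tail -n +2 | awk 'BEGIN{RS=""} /qanotify@/ {print $1}' | tr -d '*!' | postsuper -d -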

For more fun, the specific process that’s doing it is not sending through the local server’s Postfix service, so I can’t shut it down there, either. It’s making direct SMTP connections to the central IT mail relay server.

Well, that I can fix. plonk
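Blocking that kind of traffic is a one-liner wherever you happen to control the packet path; a sketch with iptables and a made-up address standing in for the real thing:

# made-up source address; refuse the offending server's direct SMTP connections
iptables -I INPUT -p tcp -s 10.0.0.42 --dport 25 -j REJECT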

(this didn’t delay incoming email from outside the company, just things like cron jobs and trouble tickets and the hourly reports that customer service needs to do their jobs; so, no pressure, y’know)

First update

QA: “I see in the logs that the SMTP server isn’t responding.”

J: “Correct. And it will stay that way until this is fixed.”

(I find myself doing this a lot these days; User: “X doesn’t do Y!”, J: “Correct”)

Second update

Dev Manager: “Could you send us an example of the kind of emails that you’re seeing?”

J: “You mean the one that’s in the message you’re replying to?”

Third update

DM: “Can you give my team access to all of the actual emails?”

J: “No, I deleted the 500,000+ that were in the spool. But it looks like at least 25,000 got through to this list of people on your team, who would have known about this before I did if they didn’t have filters set up to ignore email coming from the service nodes.”

J: “And, what the hell, here’s thirty seconds of work from the shell isolating the most-reported CustomerPKs from the 25,000 emails that got through, so you can grep the logs in a useful way.”
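(That thirty seconds being roughly the following, with the mail location and exact field format made up for illustration:)

# pull CustomerPK values out of the delivered messages, then count and rank them
grep -hoE 'CustomerPK[=: ]*[0-9]+' /var/mail/qanotify | sort | uniq -c | sort -rn | head -20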

Fourth update

John: “Ticket opened, assigned to devs, and escalated.”

(John used to work for me…)

Fifth update

Senior Dev: “Ooh, my bad; when I refactored SpocSkulker, I had it return ERROR instead of WARNING when processing an upgrade/downgrade for a customer that didn’t currently have active services. Once a minute. Each.”

SD: “Oh, and you can hand-edit the Tomcat config to point SMTP to an invalid server while you’re waiting for the new release.”

J: “Yeah, no, I’ll just keep blocking the traffic until the release is rolled out and I’ve confirmed with tcpdump.”

(one of my many hats here used to be server-side QA for the services involved, so I immediately knew it was coming from SpocSkulker, and could have shut it off myself; but then it wouldn’t have gotten fixed until January)
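(and “confirmed with tcpdump” is nothing fancier than leaving something like this running and making sure it stays quiet after the release rolls out — interface and hostname made up:)

tcpdump -nni eth0 'tcp port 25 and host mailrelay.example.com'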

Anticipated update

J receives massive fruit basket from Production team for catching this before it rolled out to them and took out their email servers.

‘Fun’ with HSTS


There is nothing wrong with using good old-fashioned HTTP without encryption. There are situations where it is a perfectly reasonable thing to do, and the protocol shouldn’t be blindly tagged with dire warnings about people kidnapping your dog, stealing your credit cards, and secretly replacing your spouse with Folger’s Crystals.

Browser vendors disagree, for reasons not-entirely-wholesome, so it’s been an ongoing struggle at the office to deal with people who file helpdesk tickets about broken SSL on sites that never had SSL to begin with, and don’t understand that their browser is silently rewriting URLs and hiding the evidence.

With recent browser releases, it got to the point where we had to put SSL reverse proxies in front of a bunch of internal web sites just to shut up the whining. This was non-trivial, and left a number of sites only partially functional for most of a day (because of course this was so important that it couldn’t be tested, QA’d, or released on a weekend). Because once a site gets “upgraded” to HTTPS, the browser responds to any HTTP links like CNN covering a Trump rally.

That was Tuesday. Wednesday night, I was wandering through the desert on a sand-seal with no name, and out of the corner of my eye I saw my phone sync up about a dozen emails, all complaining about this new HSTS thingie (aka “SSL Bondage”).

Someone urgently needed access to a site that was rejecting SSL connections, so he CC’d a half-dozen people along with the helpdesk email address. Several of them responded to all, creating additional tickets. Several people responded to the responses.

When I’d finished merging the 10 duplicate tickets, my one-line response was “correct. we haven’t added HTTPS to that site yet”.
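(For future ticket-filers: checking whether a site answers on HTTPS at all, and whether it’s sending the HSTS header that makes browsers refuse to fall back to HTTP, is one line of curl — hostname made up:)

# does the site answer on 443, and is it sending the HSTS header?
curl -sI https://wiki.internal.example.com/ | grep -i strict-transport-security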

Configuring a static IP in %pre


When building out a new kickstart server for CentOS 7.x/8, I vigorously ignored the cruft that had built up in the old one and started over. One useful improvement is handling the basic network configuration properly:

%pre --log=/tmp/pre.log --interpreter=/usr/bin/bash
# nameservers to bake into the installed system
DNS=10.201.0.2,10.201.0.3
# interface and CIDR address the installer is currently using
IFACE=$(ip -4 -o a | awk '/scope global/{print $2}')
IPADDR=$(ip -4 -o a | awk '/scope global/{print $4}')
# ipcalc emits HOSTNAME=, NETMASK=, and NETWORK= lines; declare turns them into variables
declare $(ipcalc $IPADDR -h -m -n)
# write a kickstart 'network' line that hardcodes the same configuration
# (the gateway is assumed to be .1 on the subnet)
cat << EOF > /tmp/network.cfg
network --device $IFACE --bootproto static --noipv6 --onboot=on --activate --nameserver=$DNS --hostname=$HOSTNAME --ip=${IPADDR%%/*} --netmask=$NETMASK --gateway=${NETWORK%%.0}.1
EOF
%end

%include /tmp/network.cfg

We use TXT records in DNS to automagically set static DHCP leases and PXE config files, so that all we need for an unattended install of a new server is the MAC address of the first NIC. This little snippet tells Anaconda to hardcode that IP config so the server doesn’t depend on DHCP after the install is done. This has saved our bacon many times after power outages, because if your DHCP server doesn’t come back quickly (or at all), dhclient eventually gives up, and then you have to touch everything by hand to get them back online.
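Purely to illustrate the shape of the TXT-record trick (the record format, names, and addresses here are all invented, not ours):

# hypothetical TXT record:  build01 IN TXT "mac=00:25:90:ab:cd:ef ip=10.201.0.50"
# sketch: turn one of those into an ISC dhcpd host entry
HOST=build01
read MAC IP < <(dig +short TXT $HOST.example.com | tr -d '"' | sed 's/mac=//; s/ip=//')
echo "host $HOST { hardware ethernet $MAC; fixed-address $IP; }"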

To get the base install down to something simple, quick, and easy to secure (~380 RPMs, minimal network services), I dusted off the Perl script I wrote back in 2009 that does dependency analysis based on the repodata files comps.xml and primary.sqlite. Still works pretty well, actually.
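Most of the heavy lifting in that kind of analysis is just walking the requires/provides tables in primary.sqlite; for a feel of the data, something like this works (usual createrepo schema assumed, package name picked arbitrarily):

# list everything a given package declares as a dependency
sqlite3 primary.sqlite "SELECT DISTINCT r.name
  FROM packages p JOIN requires r ON r.pkgKey = p.pkgKey
  WHERE p.name = 'postfix';"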

I deeply regret that we had to bite the bullet and start using a systemd-based release. CentOS 6.x was just getting too long in the tooth for functionality, and then one day we suddenly needed to ship a new DNS/DHCP/NTP/mail server to a new office, and the only available hardware was a NUC too new to run 6.10. My feelings about systemd can be summed up with this Ace Ventura clip.

“War. War Never Changes…”


And by that I mean the war on users filling up file systems.

When I got into this business, it was whack-a-meg, as undergrads discovered Usenet binary newsgroups and the “zurich” ftp site at MIT, and tried to be clever about hiding their stash.

Over time, it became whack-a-gig, and largely moved from porn to work-related stupidity, such as tarball-based version control (bonus for having multiple sets of tarball, compressed tarball, and unpacked source tree), two-year-old copies of Production logs (often with an accompanying set of tarballs…), laptop backups including music and video libraries, etc, etc.

This week it’s whack-a-terabyte, as I try to figure out whether there’s actually a legitimate reason for the accounting share to have multiple folders that are each larger than the entire company’s 13-year source-control history…
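(The hunt itself hasn’t changed since the whack-a-meg days, just the units; share path made up here:)

# biggest offenders, two directory levels deep, without crossing filesystems
du -xh --max-depth=2 /srv/shares/accounting | sort -rh | head -25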

A few small repairs…


Early yesterday morning, I got email from a server whose RAID 10 array was rebuilding. As far as I could tell, one of the SSDs had briefly gone offline, just long enough to force the controller to resync it.

Mildly disturbing. I told the team to make sure we had a cold spare ready to go, and we should prep to swap it in if we saw anything else unusual before we had a chance to schedule a maintenance window.

Early this morning, two SSDs failed in that server, neither of them the one that blipped yesterday. That reeks of RAID controller failure, and since we didn’t have an identical one on hand with identical firmware, the best bet was moving the whole damn thing to completely different hardware (more precisely, our shiny new VMware/Tegile cluster).

Fortunately it’s only half a terabyte, backed up at least three different ways, and everything’s on a 10G network, but pretty much all of engineering is twiddling their thumbs until it’s back, so “no pressure”.

Update

Start to finish, it took about 7 hours from the time we pulled the trigger on the move. A good chunk of that was spent checksumming the data and copying back the dozen or so files that were corrupted.

Now I’m just watching the rsync backup run with “-c” to make sure the corrupted data didn’t propagate. Honestly, it would be faster to blow away the destination and do a regular rsync, but then I’d have one less mostly-valid backup for N hours. I don’t really care how long it takes to run, and doing it this way reassures any management types who ask questions later.
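(Roughly what that run looks like, with the paths, host, and exact flag set invented here:)

# re-run the backup with full checksums so any corrupted files already on the destination get rewritten
rsync -aHAXc --delete /srv/engshare/ backuphost:/backups/engshare/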
