So some code that made it through QA without a hitch blew chunks in Production. Fingers were pointed in various directions, but fundamentally, the SQL queries were incredibly simple: single-row lookups against a unique index, like so:
select field1 from MYTABLE where phone = "12223334444"
Production insisted that the moment the new software was deployed, the dedicated slave DB was being pounded into the ground, and the culprit was large numbers of full table scans. But all of the queries we knew about were exactly like the one above: retrieve part of a single record based on an exact match of a primary key.
Their DB was bigger than ours, so I loaded up a few hundred thousand phony records and tried again. As I expected, my thrash script barely raised the load. Even four copies of it spewing these queries as fast as they could barely raised the load. Then I turned on log_queries_not_using_indexes to see if I was getting the volume of full table scans they were, and of course I wasn’t.
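For anyone playing along at home, flipping that on looks roughly like this (a minimal sketch, assuming MySQL 5.1 or later, where these are dynamic variables; older servers want the equivalent options in my.cnf and a restart):

SET GLOBAL slow_query_log = ON;                 -- send slow queries to the slow query log
SET GLOBAL log_queries_not_using_indexes = ON;  -- also log anything that does a full scan
SHOW GLOBAL VARIABLES LIKE 'log_queries%';      -- confirm the setting took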
Then a real query came in from the new software, and went right into the slow-query log as a full table scan. Why? Because it looked like this:
select field1 from MYTABLE where phone = 12223334444
Fuck, fuck, fuck, fuck, fuck. MySQL silently compares the strings in the column as numbers and returns the right answer when you do this, but it can’t use the index to do it.
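If you want to watch it happen, EXPLAIN tells the whole story. A minimal sketch, assuming the phone column is a VARCHAR with a unique index on it (the table definition here is hypothetical and just mirrors the queries above):

CREATE TABLE MYTABLE (
    id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    phone  VARCHAR(15)  NOT NULL,
    field1 VARCHAR(64),
    UNIQUE KEY idx_phone (phone)
);

-- String literal: the unique index gets used (type: const, rows: 1).
EXPLAIN SELECT field1 FROM MYTABLE WHERE phone = "12223334444";

-- Numeric literal: the column gets compared as a number, the index is
-- useless, and you get a full table scan (type: ALL).
EXPLAIN SELECT field1 FROM MYTABLE WHERE phone = 12223334444;

The fix is as boring as the bug: quote the value, or have the application bind it as a string, so the comparison stays string-to-string.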
…it really sucks to run out of inodes when the device is barely 40% full. I mean, it’s not like I’m adding 250,000 files a day or something goofy like that.
Oh, wait, that’s exactly what I was doing. Drat.
For future reference, when someone comes by complaining about a rogue DHCP server on their network, check under their desk first.
"I'm going to write out a log entry every time I see this sort of packet, and put it at WARNING level. This will help me solve a serious problem!"
-- Anonymous Developer
1,000 packets/second later on N devices…
… like using a single MAC address to repeatedly attack nearby wireless networks for several days.
Unsuccessfully.
Hear my words, and know them to be true:
"Adjust the unlabeled knob to the unmarked setting, then press the unspecified button an undocumented number of times. Fixes everything."
I’m only half-kidding. Sadly, it always seems to be worth the effort. This time, it was to replace the CentOS server that kept locking up in less than 24 hours under the stress of incoming syslog data from 80,000+ hosts. Admittedly, syslog-ng is partially at fault, given that it leaks about 20 file handles per minute, but you wouldn’t expect that to cause scary ext3 errors that require a reboot. The BSD ffs seems to be more mature in that regard, although its performance goes to hell fast unless you turn on soft dependencies.
[Update: Oh, and to be fair, I should mention the downside of this, which is simply that adjusting the right knob to the wrong setting (or vice-versa) will kill everyone within thirty yards of the server.]
Come on, really?
"We'd like to keep you informed via email about product updates, upgrades, special offers and pricing. If you do not wish to be contacted via email, please ensure that the box is not checked."
At least the box is not checked by default, but this is stupid.
As a general rule, office firewalls do not have to be configured to cope with simultaneous incoming syslog traffic from 80,000+ hosts. Mine did. Sadly, the default limit for a particular element was only capable of handling about 3/4 of that, leaving our outgoing connections somewhere between unstable and “not” when things got busy.
Fixed now.
PS: syslog can be scary efficient at sending packets when a box is unhappy. Enough unhappy boxes makes for a quite impressive DDOS attack, if you haven’t previously discovered that using “no state” in a firewall rule does not, in fact, avoid filling your state table with crap, thus accelerating your approach toward that arbitrary limit.