Monday, July 21 2003

Sanitizing Apache with PF

About 45 minutes elapsed between the moment that I first turned this server on and the arrival of the first virus/worm/hacker probes. It was obvious that most of them were looking for Windows-based web servers, so they were harmless to me.

Still, I like to review the logs occasionally, and the sheer volume of this crap was getting annoying. Later, when I raised an old domain from the dead, I discovered that it was getting more than 30,000 hits a day for a file containing the word “ok”. Worst of all, as I prepare to restore my photo archives, I know that I can’t afford to pay for the bandwidth while they’re slurped up by every search engine, cache site, obsessive collector, Usenet reposter, and eBay scammer on the planet.

Enter PF, the OpenBSD packet filter.

A number of forum threads discuss using PF’s table support to dynamically block undesired traffic. All of the ones I found use cron jobs to periodically search through log files for suspicious connections and add the offending IP addresses to a table that is referenced in a rule like this:

block drop in quick on $ext_if inet from <badlife> to any

The only problem with this implementation is that it’s a tradeoff with no good setting: run the cron job rarely and it isn’t responsive enough, run it often and it chews up too much CPU rescanning the log for new entries. The solution is to do what the Perl-based pop-before-smtp script does: tail the logfile in real time and handle addresses as soon as they show up.

With the File::Tail module already installed, it only took a few minutes to build a script that examined Apache logs for suspicious URLs and added the requesters to the <badlife> table.
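
As a rough illustration of the approach (in Python, not the Perl and File::Tail script described above), the loop looks something like the sketch below. The URL patterns, the log path, and all function names are assumptions made up for the example; the only PF-specific piece is the real pfctl invocation `pfctl -t badlife -T add <address>`, which adds an address to a table.

```python
import re
import subprocess
import time

# Hypothetical examples of "suspicious" requests; tune these to your own logs.
BAD_URL_PATTERNS = [
    re.compile(r"/default\.ida"),      # Code Red probes
    re.compile(r"/(cmd|root)\.exe"),   # IIS worm probes
    re.compile(r"^https?://", re.I),   # open-proxy tests
]

# Common/combined log format: client address first, request line in quotes.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:\S+) (\S+)')


def offender(line):
    """Return the client address if the request looks hostile, else None."""
    m = LOG_LINE.match(line)
    if not m:
        return None
    ip, url = m.groups()
    if any(p.search(url) for p in BAD_URL_PATTERNS):
        return ip
    return None


def pfctl_add(table, ip):
    """Build the pfctl command that adds one address to a PF table."""
    return ["pfctl", "-t", table, "-T", "add", ip]


def follow(path, poll=1.0):
    """Yield lines appended to path, like tail -f (no log-rotation handling)."""
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(poll)


def main(path="/var/www/logs/access_log"):  # assumed log location
    seen = set()
    for line in follow(path):
        ip = offender(line)
        if ip and ip not in seen:
            seen.add(ip)
            subprocess.run(pfctl_add("badlife", ip), check=False)
```

Calling `main()` as root starts the loop. Because each line is handled as it is written, an address lands in <badlife> within about a second of its first bad request, and the script never rereads old log entries.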

All the Cricket servers requesting that “ok” file hundreds of times per day? Gone. Code Red? Gone. The seventeen different site-building and accounting system CGI scripts that I’d never heard of before? Gone. The probes to see what web server I’m running and if it allows proxy connections? Gone. Search engines that ignore robots.txt? Gone.

What’s left? Well, I haven’t gotten rid of the bandwidth-killers. Yet. In the old days it wasn’t a problem, because I wasn’t being billed for bandwidth, so the 1 Mb/sec average rate (with occasional bursts up to 7 Mb/sec) didn’t cost me a fortune. In fact, if it hadn’t been for the sleazy bastards printing out my photos and fraudulently selling them on eBay as “real prints from the negative” (dozens of auctions every month), I’d still be willing to allow them to be freely downloaded in bulk.

Elsewhere on this site, I describe a cookie-based scheme that would enforce fairness and remove the ludicrous claims of innocent infringement, but I just don’t have the patience to implement and test it right now. I want to put the pictures back up sooner rather than later, and the combination of PF and a log-scanning script offers a simple interim solution.

The Perl script is a trivial modification of the one that looks for bad URLs: tail the Apache log looking for JPG files, and for each one found that’s over a certain size, increment a counter for that IP address. If the counter reaches X in Y seconds, add that address to the <slowdown> table, which is used to define a PF queue throttled to a much lower rate. Say, 5% of the available bandwidth, which is already limited based on my monthly billing.
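
The "X requests in Y seconds" test can be sketched as a sliding-window counter. This is an illustrative Python version, not the Perl script itself; the class name, the regular expression, and the example thresholds are all assumptions, and real values for X, Y, and the size cutoff would come from the logs.

```python
import re
from collections import defaultdict, deque

# Combined log format: address, requested JPG path, status, size in bytes.
JPG_HIT = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] "GET (\S+\.jpe?g)(?:\?\S*)? [^"]*" \d{3} (\d+)',
    re.IGNORECASE,
)


def parse_jpg_hit(line):
    """Return (address, bytes_sent) for a JPG request, else None."""
    m = JPG_HIT.match(line)
    if not m:
        return None
    return m.group(1), int(m.group(3))


class DownloadCounter:
    """Flag an address once it makes max_hits large requests within window seconds."""

    def __init__(self, max_hits, window, min_bytes):
        self.max_hits = max_hits     # X
        self.window = window         # Y, in seconds
        self.min_bytes = min_bytes   # only responses over this size count
        self.hits = defaultdict(deque)

    def record(self, ip, size, now):
        """Record one JPG request; return True if ip should be throttled."""
        if size <= self.min_bytes:
            return False
        q = self.hits[ip]
        q.append(now)
        # Forget requests that have fallen out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) >= self.max_hits
```

When `record()` returns True, the address would be added to the table with `pfctl -t slowdown -T add <address>`. Keeping only the timestamps inside the window means memory per address is bounded by X.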

The entry can be removed if no new requests arrive in Z minutes, and I can reset the table occasionally to avoid punishing people who share a proxy server with the abuser. Values for X, Y, and Z will be chosen based on the first few days’ worth of logs, to avoid false positives.
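
The expiry side might look like the following Python sketch. The class and function names are hypothetical; `pfctl -T delete` and `pfctl -T flush` are the real table commands for removing one address and for emptying a table.

```python
class IdleExpiry:
    """Track last-seen times and report addresses idle longer than a limit."""

    def __init__(self, idle_seconds):
        self.idle_seconds = idle_seconds  # Z minutes, expressed in seconds
        self.last_seen = {}

    def touch(self, ip, now):
        """Note that a throttled address made another request at time now."""
        self.last_seen[ip] = now

    def expired(self, now):
        """Remove and return the addresses with no request in idle_seconds."""
        stale = sorted(
            ip for ip, t in self.last_seen.items()
            if now - t > self.idle_seconds
        )
        for ip in stale:
            del self.last_seen[ip]
        return stale


def pfctl_delete(table, ip):
    """Build the pfctl command that removes one address from a table."""
    return ["pfctl", "-t", table, "-T", "delete", ip]


def pfctl_flush(table):
    """Build the pfctl command that empties a table (the occasional reset)."""
    return ["pfctl", "-t", table, "-T", "flush"]
```

A periodic pass would call `expired()` and run `pfctl_delete("slowdown", ip)` for each address returned, with `pfctl_flush("slowdown")` reserved for the occasional full reset.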

All the scripts whitelist my static IP address block at home, and I’ve already set up another table that uses the pop-before-smtp data to grant unlimited bandwidth to authenticated users. No sense punishing the people I give accounts to.
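
The whitelist check is the simplest part. A minimal Python sketch, assuming a placeholder address block (192.0.2.0/28 is a documentation range, not my real one) and a hypothetical function name:

```python
import ipaddress

# Placeholder block standing in for the real static addresses.
WHITELIST = [ipaddress.ip_network("192.0.2.0/28")]


def is_whitelisted(ip, networks=WHITELIST):
    """Return True if ip falls inside any whitelisted network."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Log entry held a hostname or garbage rather than an address.
        return False
    return any(addr in net for net in networks)
```

Each script would run this test before touching a table, so a whitelisted address is never counted, throttled, or blocked.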

I still recognize a difference between human beings who want copies of all my pictures and automated tools that try to grab everything. The former get slowed down; the latter get banned. The difference between the two is pretty easy to detect, and is left as an exercise for the reader.