Inevitably, there are duplicate images in my cheesecake archives. Sometimes it’s the exact same file with a different name, which I can detect with a simple MD5 checksum, but often they’re different sizes, or some site has added a watermark, or a magazine overlaid it with text, or someone cropped off the text that someone else added, etc, etc.
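For the easy case (same bytes, different name), something like the following is enough. It's a minimal sketch, not part of the PDQ recipe: `exact_dupes` is a made-up helper name, and it assumes either GNU `md5sum` or the macOS `md5 -r` is on your path and that no file name contains a newline.

```shell
# Print "hash  filename" lines for every file under a directory whose
# MD5 checksum is shared with at least one other file there.
# exact_dupes is a hypothetical helper name, not a standard tool.
exact_dupes() {
    dir=${1:-.}
    # Pick whichever checksum tool exists; both print the hash first.
    if command -v md5sum >/dev/null 2>&1; then
        hasher="md5sum"
    else
        hasher="md5 -r"
    fi
    # $hasher is deliberately unquoted so "md5 -r" splits into two words.
    find "$dir" -type f -exec $hasher {} + |
        sort |
        awk '
            # Same hash as the previous line: emit the held first member
            # of the group (once), then this line.
            $1 == prev { if (held != "") { print held; held = "" } print; next }
            # New hash: remember the line in case a twin follows.
            { prev = $1; held = $0 }
        '
}
```

Run it as `exact_dupes ~/Pictures`; anything it prints is a byte-for-byte copy of its neighbors with the same hash. It says nothing about resized, cropped, or watermarked copies.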
Enter PDQ, an image-similarity hashing system that works pretty darn well. Despite coming from the evil facebook empire (usable for detecting kiddie-pr0n and wrongthink memes), the code is pretty decent, compiles cleanly, and only blows up if you feed it a file that doesn’t contain a single image convertible with ImageMagick (pro tip: do not run it on a directory that contains a video file; your swapfile will thank me). A quick review of the images it clustered together confirmed that fully 11% of my images were duplicates.
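For a rough idea of what "similar" means here: PDQ reduces each image to a 256-bit hash (64 hex digits), and two images count as near-duplicates when their hashes differ in only a small number of bits (the Hamming distance). The sketch below just counts differing bits between two hex strings; `hamming_hex` is a made-up name for illustration, not one of the PDQ tools, and the short hashes in the usage note are toy values.

```shell
# Count the differing bits between two equal-length hex strings.
# hamming_hex is a hypothetical helper, not part of the PDQ distribution.
hamming_hex() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        # bits[n+1] is the 4-bit binary spelling of hex digit n.
        split("0000 0001 0010 0011 0100 0101 0110 0111 " \
              "1000 1001 1010 1011 1100 1101 1110 1111", bits, " ")
        hex = "0123456789abcdef"
        a = tolower(a); b = tolower(b)
        if (length(a) != length(b)) exit 1   # refuse mismatched lengths
        d = 0
        for (i = 1; i <= length(a); i++) {
            x = bits[index(hex, substr(a, i, 1))]
            y = bits[index(hex, substr(b, i, 1))]
            # Compare the two nibbles bit by bit.
            for (j = 1; j <= 4; j++)
                if (substr(x, j, 1) != substr(y, j, 1)) d++
        }
        print d
    }'
}
```

`hamming_hex ff 00` prints 8 and `hamming_hex a5 a4` prints 1; with real 64-digit PDQ hashes the answer ranges from 0 (identical) to 256.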
So what better for a cheesecake theme than images I liked so much I managed to download them at least four times? (not counting any copies I’ve already posted and deleted from the archive, of course; I’ll have to go through my S3 backups sometime to find those)
The following de-duplication recipe uses Miller to process the output; I’d somehow overlooked this tool for years, and I can think of at least one project at work that I wouldn’t be stuck maintaining any more if it were a directory full of mlr recipes instead of Perl.
# gather up all your image files
#
find . -type f -name '[0-9a-zA-Z]*.[pjPJ]*' | sort > /tmp/images

# edit the list to remove anything that's not an image (text, video,
# etc); also sanity-check for annoying file names (containing things
# like commas(!), whitespace, quotes, parentheses, etc)

# generate the hashes; this is the tedious part
# (~13/sec on my 12-inch MacBook with images stored on an external SSD)
#
pdq-photo-hasher -d -i < /tmp/images > /tmp/hashes

# cluster similar images, then strip out all images with
# cluster-size=1 (unique)
#
clusterize256 /tmp/hashes | mlr filter '$clusz > 1' > /tmp/alldupes

# extract their filenames
#
mlr --onidx cut -f filename /tmp/alldupes > /tmp/files

# create file containing (filename, height, size) for all images
#
xargs identify -format 'filename=%i,height=%h,size=%B\n' \
    < /tmp/files > /tmp/meta

# join it to the original, for consolidated output
#
mlr join -j filename -f /tmp/meta /tmp/alldupes > /tmp/alldupes2

# for each cluster, keep the file with the largest (height, size)
#
mlr sort -nr height,size then \
    head -n 1 -g clidx then \
    sort -n clidx then \
    cut -f filename /tmp/alldupes2 > /tmp/keep

# create the complementary set of images to delete
#
fgrep -v -f /tmp/keep /tmp/alldupes2 |
    mlr --onidx cut -f filename > /tmp/nuke

# move the dupes to another directory
# (rather than deleting them immediately...)
#
mkdir -p DUPES
mv $(</tmp/nuke) DUPES
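The final `mv $(</tmp/nuke) DUPES` relies on the earlier sanity check: any name with whitespace or a glob character gets split or expanded by the shell. If you'd rather not trust the list, a line-at-a-time mover is a possible alternative. This is a sketch, with `move_listed` being a made-up name; it only protects this last step (the `xargs identify` step has the same whitespace problem), and it assumes the list holds one path per line.

```shell
# Move every path named in a newline-separated list into a holding
# directory, quoting each one so odd characters survive intact.
# move_listed is a hypothetical helper, not part of PDQ or Miller.
move_listed() {
    list=$1
    dest=$2
    mkdir -p "$dest" || return 1
    while IFS= read -r f; do
        [ -n "$f" ] || continue              # skip blank lines
        if [ -e "$f" ]; then
            mv -- "$f" "$dest"/              # "--" guards names starting with "-"
        else
            printf 'skipping missing file: %s\n' "$f" >&2
        fi
    done < "$list"
}
```

Usage would be `move_listed /tmp/nuke DUPES`. Like the original one-liner, it flattens everything into one directory, so two dupes with the same base name in different subdirectories will collide.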
When you add additional images to your collection, you can generate their hashes and compare them to the existing data (amusingly, you have to use the tool backwards…):
# hash the new images
#
pdq-photo-hasher [0-9a-zA-Z]*.[pjPJ]* > /tmp/newstuff

# print the filenames of new dupes
# (note that mih-query is a bit twitchy about formatting; the
# hash field must be first, and non-pdq fields need to be at
# the end)
#
mih-query /tmp/hashes /tmp/newstuff | grep match= | mlr --onidx cut -f 4

# add remaining hashes to your DB of unique images
Bonus for correctly guessing which image I had eight copies of. 😄