3D Cheesecake: De-Dupes


Inevitably, there are duplicate images in my cheesecake archives. Sometimes it’s the exact same file with a different name, which I can detect with a simple MD5 checksum, but often they’re different sizes, or some site has added a watermark, or a magazine has overlaid it with text, or someone cropped off the text that someone else added, etc, etc.
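
The exact-copy pass is the easy part; a sketch of it, assuming GNU md5sum and uniq (on a Mac you’d want the coreutils versions, or md5 -r plus a different grouping step):

# group files by checksum and print only the checksums that occur more than once
#
find . -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate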

Enter PDQ, an image-similarity hashing system that works pretty darn well. Despite coming from the evil facebook empire (usable for detecting kiddie-pr0n and wrongthink memes), the code is pretty decent, compiles cleanly, and only blows up if you feed it a file that doesn’t contain a single image convertible with ImageMagick (pro tip: do not run it on a directory that contains a video file; your swapfile will thank me). A quick review of the images it clustered together confirmed that fully 11% of my images were duplicates.

So what better for a cheesecake theme than images I liked so much I managed to download them at least four times? (not counting any copies I’ve already posted and deleted from the archive, of course; I’ll have to go through my S3 backups sometime to find those)

The following de-duplication recipe uses Miller to process the output; I’d somehow overlooked this tool for years, and I can think of at least one project at work that I wouldn’t be stuck maintaining any more if it were a directory full of mlr recipes instead of Perl modules.
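
For what it’s worth, the clustering output that mlr chews on below is in Miller’s default DKVP format: one record per line, comma-separated key=value pairs, which is why the recipe can refer to fields like clusz and filename by name. A made-up record shows the idea:

# illustrative only; the real records carry more fields
#
echo 'clidx=3,clusz=4,filename=./foo.jpg' | mlr filter '$clusz > 1'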

# gather up all your image files
#
find . -type f -name '[0-9a-zA-Z]*.[pjPJ]*' | sort > /tmp/images

# edit the list to remove anything that's not an image (text, video,
# etc); also sanity-check for annoying file names (containing things
# like commas(!), whitespace, quotes, parentheses, etc)
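
# (a sketch of those checks: list non-image MIME types, and flag names
# containing whitespace, commas, quotes, or parentheses)
#
xargs file --mime-type < /tmp/images | grep -v ': image/'
grep -E "[[:space:],'\"()]" /tmp/images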

# generate the hashes; this is the tedious part
# (~13/sec on my 12-inch MacBook with images stored on an external SSD)
#
pdq-photo-hasher -d -i < /tmp/images > /tmp/hashes

# cluster similar images, then strip out all images with
# cluster-size=1 (unique)
#
clusterize256 /tmp/hashes | mlr filter '$clusz > 1' > /tmp/alldupes

# extract their filenames
#
mlr --onidx cut -f filename /tmp/alldupes > /tmp/files

# create file containing (filename, height, size) for all images
#
xargs identify -format 'filename=%i,height=%h,size=%B\n' \
    < /tmp/files > /tmp/meta

# join it to the original, for consolidated output
#
mlr join -j filename -f /tmp/meta /tmp/alldupes > /tmp/alldupes2

# for each cluster, keep the file with the largest (height, size)
#
mlr sort -nr height,size then \
    head -n 1 -g clidx then \
    sort -n clidx then \
    cut -f filename /tmp/alldupes2 > /tmp/keep

# create the complementary set of images to delete
#
fgrep -v -f /tmp/keep /tmp/alldupes2 |
    mlr --onidx cut -f filename > /tmp/nuke

# move the dupes to another directory
# (rather than deleting them immediately...)
#
mkdir -p DUPES
mv $(</tmp/nuke) DUPES
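
If the list of dupes is long enough to blow past the shell’s argument-length limit, the same move can go through xargs instead (one mv per file, so slower, but it copes with big lists):

# move them one at a time, reading paths from the list
#
xargs -I '{}' mv '{}' DUPES < /tmp/nuke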

When you add additional images to your collection, you can generate their hashes and compare them to the existing data (amusingly, you have to use the tool backwards…):

# hash the new images
#
pdq-photo-hasher [0-9a-zA-Z]*.[pjPJ]* > /tmp/newstuff

# print the filenames of new dupes
# (note that mih-query is a bit twitchy about formatting; the
# hash field must be first, and non-pdq fields need to be at
# the end)
#
mih-query /tmp/hashes /tmp/newstuff | grep match= | mlr --onidx cut -f 4

# add remaining hashes to your DB of unique images
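#
# e.g. (a sketch, assuming the dupe filenames printed above were saved
# to /tmp/newdupes): drop those records and append everything else
#
fgrep -v -f /tmp/newdupes /tmp/newstuff >> /tmp/hashes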

Bonus for correctly guessing which image I had eight copies of. 😁


The cheesecake comes after the spell

