Inevitably, there are duplicate images in my cheesecake archives. Sometimes it’s the exact same file with a different name, which I can detect with a simple MD5 checksum, but often they’re different sizes, or some site has added a watermark, or a magazine overlaid it with text, or someone cropped off the text that someone else added, etc, etc.
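For the easy case (byte-identical files under different names), something like the following sketch would do. It assumes `md5sum` is available and falls back to `md5 -r` on macOS; `exact_dupes` is a made-up helper name, not part of any tool mentioned here.

```shell
# Sketch: print "hash filename" lines for every file whose MD5 matches
# at least one other file under the given directory (default: .).
# exact_dupes is a hypothetical helper name.
exact_dupes() {
    sum_cmd="md5sum"
    command -v md5sum >/dev/null 2>&1 || sum_cmd="md5 -r"
    # $sum_cmd is deliberately unquoted so "md5 -r" splits into two words
    find "${1:-.}" -type f -exec $sum_cmd {} + |
        sort |
        awk '{
            hash = $1
            if (hash == prev) {
                # first repeat of a hash: also print the line we held back
                if (!shown) print prevline
                print $0
                shown = 1
            } else {
                shown = 0
            }
            prev = hash
            prevline = $0
        }'
}
```

Running `exact_dupes .` lists the duplicate groups together, since the output is sorted by hash. It says nothing about resized, cropped, or watermarked copies, which is where the checksum approach stops being useful.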
Enter PDQ, an image-similarity hashing system that works pretty darn well. Despite coming from the evil Facebook empire (usable for detecting kiddie-pr0n and wrongthink memes), the code is pretty decent, compiles cleanly, and only blows up if you feed it a file that doesn’t contain a single image convertible with ImageMagick (pro tip: do not run it on a directory that contains a video file; your swapfile will thank me). A quick review of the images it clustered together confirmed that fully 11% of my images were duplicates.
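For a rough idea of what "similar" means here: a PDQ hash is 256 bits (64 hex digits), and two images are compared by counting the bits that differ between their hashes (the Hamming distance). The sketch below does that count in plain shell. The function names are hypothetical, and the default threshold of 31 bits is my assumption about PDQ's suggested cutoff, so check it against the tool's own documentation before relying on it.

```shell
# Sketch: Hamming distance between two equal-length hex strings.
# The body runs in a subshell (note the parentheses) so its variables
# do not leak into the caller. hamming_hex is a hypothetical name.
hamming_hex() (
    a=$1
    b=$2
    dist=0
    [ "${#a}" -eq "${#b}" ] || exit 1
    while [ -n "$a" ]; do
        ha=${a%"${a#?}"}            # first hex digit of a
        hb=${b%"${b#?}"}            # first hex digit of b
        x=$(( 0x$ha ^ 0x$hb ))      # bits that differ in this nibble
        while [ "$x" -gt 0 ]; do
            dist=$(( dist + (x & 1) ))
            x=$(( x >> 1 ))
        done
        a=${a#?}
        b=${b#?}
    done
    echo "$dist"
)

# Succeeds if the two hashes are within the threshold (assumed default: 31).
pdq_match() {
    [ "$(hamming_hex "$1" "$2")" -le "${3:-31}" ]
}
```

This is far too slow for comparing a whole archive pairwise; it is only meant to show what the clustering tools are measuring.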
So what better for a cheesecake theme than images I liked so much I managed to download them at least four times? (not counting any copies I’ve already posted and deleted from the archive, of course; I’ll have to go through my S3 backups sometime to find those)
The following de-duplication recipe uses Miller to process the output; I’d somehow overlooked this tool for years, and I can think of at least one project at work that I wouldn’t be stuck maintaining any more if it were a directory full of mlr recipes instead of Perl modules.
# gather up all your image files
#
find . -type f -name '[0-9a-zA-Z]*.[pjPJ]*' | sort > /tmp/images
# edit the list to remove anything that's not an image (text, video,
# etc); also sanity-check for annoying file names (containing things
# like commas(!), whitespace, quotes, parentheses, etc)
# generate the hashes; this is the tedious part
# (~13/sec on my 12-inch MacBook with images stored on an external SSD)
#
pdq-photo-hasher -d -i < /tmp/images > /tmp/hashes
# cluster similar images, then strip out all images with
# cluster-size=1 (unique)
#
clusterize256 /tmp/hashes | mlr filter '$clusz > 1' > /tmp/alldupes
# extract their filenames
#
mlr --onidx cut -f filename /tmp/alldupes > /tmp/files
# create file containing (filename, height, size) for all images
#
xargs identify -format 'filename=%i,height=%h,size=%B\n' \
    < /tmp/files > /tmp/meta
# join it to the original, for consolidated output
#
mlr join -j filename -f /tmp/meta /tmp/alldupes > /tmp/alldupes2
# for each cluster, keep the file with the largest (height, size)
#
mlr sort -nr height,size then \
    head -n 1 -g clidx then \
    sort -n clidx then \
    cut -f filename /tmp/alldupes2 > /tmp/keep
# create the complementary set of images to delete
#
grep -F -v -f /tmp/keep /tmp/alldupes2 |
    mlr --onidx cut -f filename > /tmp/nuke
# move the dupes to another directory
# (rather than deleting them immediately...)
#
mkdir -p DUPES
mv $(</tmp/nuke) DUPES
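The one-line `mv` above relies on the file names having been sanity-checked earlier: it splits on whitespace, and two dupes with the same base name in different directories would overwrite each other in DUPES. A more defensive version might look like the sketch below; `move_dupes` is a hypothetical helper, and it assumes the list has one path per line.

```shell
# Sketch: move each file named in LIST (one path per line) into DEST,
# tolerating spaces in names and refusing to overwrite on name collisions.
# move_dupes is a hypothetical helper name.
move_dupes() {
    list=$1
    dest=$2
    mkdir -p "$dest" || return 1
    while IFS= read -r f; do
        [ -f "$f" ] || continue          # skip blank lines and missing files
        target=$dest/$(basename "$f")
        if [ -e "$target" ]; then
            echo "skipping $f: $target already exists" >&2
            continue
        fi
        mv -- "$f" "$target"
    done < "$list"
}
```

With the recipe's file names this would be called as `move_dupes /tmp/nuke DUPES`. Anything it skips stays where it was, so the skipped names on stderr are the ones to look at by hand.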
When you add more images to your collection, you can generate their hashes and compare them to the existing data (amusingly, you have to use the tool backwards…):
# hash the new images
#
pdq-photo-hasher [0-9a-zA-Z]*.[pjPJ]* > /tmp/newstuff
# print the filenames of new dupes
# (note that mih-query is a bit twitchy about formatting; the
# hash field must be first, and non-pdq fields need to be at
# the end)
#
mih-query /tmp/hashes /tmp/newstuff | grep match= | mlr --onidx cut -f 4
# add remaining hashes to your DB of unique images
Bonus for correctly guessing which image I had eight copies of. 😁
As popular and photogenic as she is, it’s a bit surprising that they just keep recycling the same handful of photoshoots of Asuka Kishi (岸明日香). Then again, she’s only had 3 photobooks in six years compared to 15 DVDs, so perhaps her fans just want to see them bounce.
The Pixiv tag “R-18” means different things to different artists. For some, it’s equivalent to “NSFW”, but for many, it’s closer to “includes censored penetration”, and quite often “my mother would kick me out of the basement if she knew I fantasized about this”, with anything not quite as raw left open for all to see.
Gratuitous Mahoro to make it clear that this set has work-safety issues:
[amusing note: Google image search thinks this Mahoro image is related to the Mongolian Society of Interventional Radiology]
The Naming Of Cheesecake’s a bit cloak-and-dagger,
It isn’t just one of your image search games.
You may think that I’m just a lazy-ass blogger,
When I tell you, these models have actual names.
Every once in a while artists delete their illustrations from Pixiv or mark them private. Unfortunately, I wasn’t recording the artist’s ID in my offline database until recently, so for about 2% of my archive, I can’t easily track them back to their creator. In most cases, I suspect they’ve been replaced with updated versions, but I’d have to search by hand, like our primitive ancestors once did.
Instead, I’ll just clean them out, and use my updated scripts to track artist info from now on. If you want to try to search for them, I’ve added my offline copy of their tags (if any) to the title tooltip, and links for the ones marked private.
Well, technically, 着衣巨乳 means “clothed huge breasts”, but that counts as undercover, right? As long as they don’t have private dicks, I’m okay with it…