Cheesecake

3D Cheesecake: De Dupes

Sun 8/4/19 12:30am

Comments: 0

Cheesecake

Inevitably, there are duplicate images in my cheesecake archives. Sometimes it’s the exact same file with a different name, which I can detect with a simple MD5 checksum, but often they’re different sizes, or some site has added a watermark, or a magazine overlaid it with text, or someone cropped off the text that someone else added, etc, etc.
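For the easy case (byte-identical files under different names), something like the following sketch would do. The function names `hash_file` and `exact_dupes` are made up for illustration, and it assumes file names contain no tabs or newlines.

```shell
#!/bin/sh
# Group byte-identical files under a directory by MD5 checksum.
# Hypothetical helper names; assumes no tabs/newlines in file names.

# Print just the MD5 of one file. GNU systems have md5sum;
# macOS/BSD ship md5, where -q prints only the digest.
hash_file() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum < "$1" | awk '{print $1}'
    else
        md5 -q "$1"
    fi
}

# Print each checksum that occurs more than once, followed by
# an indented list of the files that share it.
exact_dupes() {
    find "$1" -type f | sort | while IFS= read -r f; do
        printf '%s\t%s\n' "$(hash_file "$f")" "$f"
    done | awk -F'\t' '
        { count[$1]++; files[$1] = files[$1] "\n  " $2 }
        END { for (h in count) if (count[h] > 1) print h files[h] }'
}

# usage: exact_dupes .
```

This only catches exact copies; a resized or watermarked version gets a completely different checksum, which is the whole reason for the next tool.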

Enter PDQ, an image-similarity hashing system that works pretty darn well. Despite coming from the evil Facebook empire (usable for detecting kiddie-pr0n and wrongthink memes), the code is pretty decent, compiles cleanly, and only blows up if you feed it a file that doesn’t contain a single image convertible with ImageMagick (pro tip: do not run it on a directory that contains a video file; your swapfile will thank me). A quick review of the images it clustered together confirmed that fully 11% of my images were duplicates.
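The idea behind this kind of hash is that similar images produce hashes that differ in only a few bits, so "how similar" becomes "how many bits differ" (the Hamming distance). PDQ hashes are 256 bits, usually printed as 64 hex digits. A minimal sketch of that comparison, using a made-up helper name `hamming_hex` and nothing beyond POSIX awk:

```shell
#!/bin/sh
# Count differing bits between two equal-length hex strings.
# hamming_hex is an illustrative helper, not part of the PDQ tools.
hamming_hex() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (length(a) != length(b)) exit 1
        digits = "0123456789abcdef"
        a = tolower(a); b = tolower(b)
        for (i = 1; i <= length(a); i++) {
            # numeric value (0..15) of each hex digit
            p = index(digits, substr(a, i, 1)) - 1
            q = index(digits, substr(b, i, 1)) - 1
            if (p < 0 || q < 0) exit 1
            # compare the four bits of the two digits one at a time
            for (bit = 1; bit <= 8; bit *= 2)
                if (int(p / bit) % 2 != int(q / bit) % 2) d++
        }
        print d + 0
    }'
}

# usage: hamming_hex "$hash1" "$hash2"
```

The PDQ tools do this comparison themselves (and far faster); the cutoff I've seen suggested is somewhere around 31 bits out of 256, but treat that number as an assumption and check it against your own images.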

So what better for a cheesecake theme than images I liked so much I managed to download them at least four times? (not counting any copies I’ve already posted and deleted from the archive, of course; I’ll have to go through my S3 backups sometime to find those)

The following de-duplication recipe uses Miller to process the output; I’d somehow overlooked this tool for years, and I can think of at least one project at work that I wouldn’t be stuck maintaining any more if it were a directory full of mlr recipes instead of Perl modules.


# gather up all your image files
#
find . -type f -name '[0-9a-zA-Z]*.[pjPJ]*' | sort > /tmp/images

# edit the list to remove anything that's not an image (text, video,
# etc); also sanity-check for annoying file names (containing things
# like commas(!), whitespace, quotes, parentheses, etc)

# generate the hashes; this is the tedious part
# (~13/sec on my 12-inch MacBook with images stored on an external SSD)
#
pdq-photo-hasher -d -i < /tmp/images > /tmp/hashes

# cluster similar images, then strip out all images with
# cluster-size=1 (unique)
#
clusterize256 /tmp/hashes | mlr filter '$clusz > 1' > /tmp/alldupes

# extract their filenames
#
mlr --onidx cut -f filename /tmp/alldupes > /tmp/files

# create file containing (filename, height, size) for all images
#
xargs identify -format 'filename=%i,height=%h,size=%B\n' \
    < /tmp/files > /tmp/meta

# join it to the original, for consolidated output
#
mlr join -j filename -f /tmp/meta /tmp/alldupes > /tmp/alldupes2

# for each cluster, keep the file with the largest (height, size)
#
mlr sort -nr height,size then \
    head -n 1 -g clidx then \
    sort -n clidx then \
    cut -f filename /tmp/alldupes2 > /tmp/keep

# create the complementary set of images to delete
#
fgrep -v -f /tmp/keep /tmp/alldupes2 |
    mlr --onidx cut -f filename > /tmp/nuke

# move the dupes to another directory
# (rather than deleting them immediately...)
#
mkdir -p DUPES
mv $(</tmp/nuke) DUPES
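
The one-line `mv` above relies on the earlier sanity check: any file name with whitespace gets split into pieces, and a long enough list can overflow the command line. If you'd rather not trust your own editing, a line-at-a-time loop is more forgiving. This is a sketch with a made-up function name `move_listed`; it still assumes no newlines in file names.

```shell
#!/bin/sh
# Move every file named in a list (one path per line) into a
# directory. move_listed is an illustrative helper name.
move_listed() {
    list=$1
    dest=$2
    mkdir -p "$dest"
    while IFS= read -r f; do
        # skip entries that have already been moved or deleted
        [ -e "$f" ] || continue
        # -n: never overwrite a same-named file already in $dest
        mv -n -- "$f" "$dest"/
    done < "$list"
}

# usage: move_listed /tmp/nuke DUPES
```

Since `mv -n` quietly leaves a file in place when its name is already taken in the destination, anything still listed in /tmp/nuke but not moved deserves a second look before you delete it.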

When you add additional images to your collection, you can generate their hashes and compare them to the existing data (amusingly, you have to use the tool backwards…):

# hash the new images
#
pdq-photo-hasher [0-9a-zA-Z]*.[pjPJ]* > /tmp/newstuff

# print the filenames of new dupes
# (note that mih-query is a bit twitchy about formatting; the
# hash field must be first, and non-pdq fields need to be at
# the end)
#
mih-query /tmp/hashes /tmp/newstuff | grep match= | mlr --onidx cut -f 4

# add remaining hashes to your DB of unique images

Bonus for correctly guessing which image I had eight copies of. 😁


OMC: Asuka Kishi

Mon 7/29/19 7:16am

Comments: 0

Cheesecake

As popular and photogenic as she is, it’s a bit surprising that they just keep recycling the same handful of photoshoots of Asuka Kishi (岸明日香). Then again, she’s only had three photobooks in six years, compared to 15 DVDs, so perhaps her fans just want to see them bounce.


Pixiv: R-18

Wed 7/24/19 6:21am

Comments: 0

Cheesecake

The Pixiv tag “R-18” means different things to different artists. For some, it’s equivalent to “NSFW”, but for many, it’s closer to “includes censored penetration”, and quite often “my mother would kick me out of the basement if she knew I fantasized about this”, with anything not quite as raw left open for all to see.

Gratuitous Mahoro to make it clear that this set has work-safety issues:

[amusing note: Google image search thinks this Mahoro image is related to the Mongolian Society of Interventional Radiology]


3D Cheesecake 20

Fri 7/19/19 10:17am

Comments: 0

Cheesecake

[ Series: Doggerel ]

The Naming Of Cheesecake’s a bit cloak-and-dagger,
It isn’t just one of your image search games.
You may think that I’m just a lazy-ass blogger,
When I tell you, these models have actual names.


A Certain Scientific Waitress

Sat 7/13/19 10:17am

Comments: 0

Cheesecake

Drinks free of charge.

Pixiv: (not available)

Thu 7/11/19 7:50am

Comments: 0

Cheesecake

Every once in a while, artists delete their illustrations from Pixiv or mark them private. Unfortunately, I wasn’t recording the artist’s ID in my offline database until recently, so for about 2% of my archive, I can’t easily track the images back to their creators. In most cases, I suspect they’ve been replaced with updated versions, but I’d have to search by hand, like our primitive ancestors once did.

Instead, I’ll just clean them out, and use my updated scripts to track artist info from now on. If you want to try to search for them, I’ve added my offline copy of their tags (if any) to the title tooltip, and links for the ones marked private.

[image: Pixiv ID 69736749, not found on Pixiv; tags: Fate/GrandOrder, ジャンヌ・ダルク, モードレッド, FGO, Fate/GO1000users入り, 着衣巨乳]


3D Cheesecake 19

Tue 7/9/19 9:38am

Comments: 0

Cheesecake

I’m a big fan.


Pixiv: undercover big boobs

Mon 7/8/19 2:24am

Comments: 0

Cheesecake

Well, technically, 着衣巨乳 means “clothed huge breasts”, but that counts as undercover, right? As long as they don’t have private dicks, I’m okay with it…

