With apologies to Genesis…
🎶 🎶 🎶 🎶
Well I’ve been surfing, surfing porn so long,
But thinking bondage, bondage was just wrong.
Well now I know,
It increases visibility
Of prime femininity,
And redefines the term “restraining order”.
She seems to be in invisible cuffs, damn!
I could reach out and grab right hold of her parts.
When I see girls in invisible cuffs, damn!
I lose control, my heart rate goes off the charts.
Well I don’t even know her, or how to spell her name,
Just that she shows all that skin, without a hint of shame.
And now I know,
She’s got something that keeps her trussed,
And it shows off the goods to us,
And in my dreams, I’m touching her all over.
Well I just keep losing, because I have no game,
And yes I have messed up my life, and women call me lame.
And now I know,
Restraints produce great pornography,
Expose lingerie to me,
And fill my screen with babes to slobber over.
🎶 🎶 🎶 🎶
In related news, giga-Kojimblr has been released from Tumblr jail. It got flagged a few days back, which forces you to log in before viewing “sensitive” content. Since I have no intention of creating a Tumblr account, I had to cross my fingers and hope it’d eventually be released. Honestly, I’m surprised it hasn’t been permabanned by now; that’d be, what, the fifth time?
[Sunday update: …and, back into Tumblr jail it goes, sigh.]
[Sunday afternoon update: to my immense surprise, it seems I do have a Tumblr account; they just sent me email saying it’s been a long time since I used my account, and do I still want to keep the username? Apparently I created it a few years ago to add a comment to someone’s “where is this place in Kyoto” request. Don’t bother looking for my empty blog there; it’s marked private, and I wouldn’t use it anyway, given how hostile they are to their users.]
[updated with almost all their names]
Before I deleted the 2,000+ duplicate images I found with PDQ, I did a lot of sampling to make sure there were no false positives. The default distance within which clusterize256 considers two images to be the same is 31, which looks like the sort of number you’d come up with after testing your code against a large set of known data.
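For a feel of what that distance measures: PDQ hashes are 256 bits (64 hex digits), and the distance between two of them is the number of bits that differ. Here is a minimal sketch of that bit count in plain shell; `hamming_hex` is a made-up helper name for illustration, not part of the PDQ tools, and it accepts hex strings of any equal length so it can be tried on short inputs.

```shell
# Hypothetical helper (not part of PDQ): count the differing bits
# between two equal-length hex strings.
# Note: plain sh has no local variables, so a, b, ca, cb, x and dist
# are globals; call it inside $(...) to keep them contained.
hamming_hex() {
    a=$1
    b=$2
    dist=0
    # refuse mismatched lengths rather than guessing
    [ "${#a}" -eq "${#b}" ] || return 1
    while [ -n "$a" ]; do
        ca=${a%"${a#?}"}          # first hex digit of a
        cb=${b%"${b#?}"}          # first hex digit of b
        x=$(( 0x$ca ^ 0x$cb ))    # bits that differ in this digit
        while [ "$x" -ne 0 ]; do
            dist=$(( dist + (x & 1) ))
            x=$(( x >> 1 ))
        done
        a=${a#?}                  # drop the digit just compared
        b=${b#?}
    done
    echo "$dist"
}

hamming_hex ff00 0f00    # prints 4
```

With two full 64-digit hashes, the result is the number that gets compared against the clustering distance (31 by default, per the above).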
Now that I have ~16,000 de-duped images, I decided to see what would happen if I bumped that up.
First I tried 50, which found a number of real dupes where the differences consisted of minor changes in cropping, focus, and exposure (at the level of Photoshop’s auto-level function), as well as significant text additions. It also had a fair number of false positives, however, mostly photos of the same model with slightly different expressions or head positions (eyes open/closed, smile/not, face turned a few degrees, etc); if they should be considered dupes at all, the resolution process has to be manual (we used to call it “editing your damn photoshoot”…).
Then I tried 40, which reduced but didn’t eliminate the false face positives. Dropping to 35 left me with only one false match (below), but it also missed some of the real near-duplicates.
Then there was this pair…
Rakugaki (落書き) can refer to scribbles or graffiti, but on Pixiv, it’s usually more of a quick sketch. And some artists’ “quick sketches” are quite impressive.
Let’s lead off with the extremely rare triple half-rims!
Inevitably, there are duplicate images in my cheesecake archives. Sometimes it’s the exact same file with a different name, which I can detect with a simple MD5 checksum, but often they’re different sizes, or some site has added a watermark, or a magazine overlaid it with text, or someone cropped off the text that someone else added, etc, etc.
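The exact-match case is the easy one. A minimal sketch of a checksum pass over a single directory follows; `md5_of` and `exact_dupes` are hypothetical helper names, and the sketch assumes either `md5sum` (Linux) or `md5` (macOS) is installed and that file names contain no tabs or newlines.

```shell
# Hypothetical helper: print just the MD5 checksum of one file,
# using md5sum where available and falling back to macOS md5.
md5_of() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum "$1" | awk '{ print $1 }'
    else
        md5 -q "$1"
    fi
}

# Hypothetical helper: for every checksum seen more than once in the
# given directory, print the second and later files that share it.
exact_dupes() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        printf '%s\t%s\n' "$(md5_of "$f")" "$f"
    done | LC_ALL=C sort | awk -F'\t' 'seen[$1]++ { print $2 }'
}
```

Running `exact_dupes .` lists the redundant copies and leaves the first file (in sorted order) of each group unlisted. It only catches byte-identical files; anything resized, watermarked, or cropped will have a different checksum.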
Enter PDQ, an image-similarity hashing system that works pretty darn well. Despite coming from the evil Facebook empire (usable for detecting kiddie-pr0n and wrongthink memes), the code is pretty decent, compiles cleanly, and only blows up if you feed it a file that doesn’t contain a single image convertible with ImageMagick (pro tip: do not run it on a directory that contains a video file; your swapfile will thank me). A quick review of the images it clustered together confirmed that fully 11% of my images were duplicates.
So what better for a cheesecake theme than images I liked so much I managed to download them at least four times? (not counting any copies I’ve already posted and deleted from the archive, of course; I’ll have to go through my S3 backups sometime to find those)
The following de-duplication recipe uses Miller to process the output; I’d somehow overlooked this tool for years, and I can think of at least one project at work that I wouldn’t be stuck maintaining any more if it were a directory full of mlr recipes instead of Perl.
# gather up all your image files
#
find . -type f -name '[0-9a-zA-Z]*.[pjPJ]*' | sort > /tmp/images

# edit the list to remove anything that's not an image (text, video,
# etc); also sanity-check for annoying file names (containing things
# like commas(!), whitespace, quotes, parentheses, etc)

# generate the hashes; this is the tedious part
# (~13/sec on my 12-inch MacBook with images stored on an external SSD)
#
pdq-photo-hasher -d -i < /tmp/images > /tmp/hashes

# cluster similar images, then strip out all images with
# cluster-size=1 (unique)
#
clusterize256 /tmp/hashes | mlr filter '$clusz > 1' > /tmp/alldupes

# extract their filenames
#
mlr --onidx cut -f filename /tmp/alldupes > /tmp/files

# create file containing (filename, height, size) for all images
#
xargs identify -format 'filename=%i,height=%h,size=%B\n' \
    < /tmp/files > /tmp/meta

# join it to the original, for consolidated output
#
mlr join -j filename -f /tmp/meta /tmp/alldupes > /tmp/alldupes2

# for each cluster, keep the file with the largest (height, size)
#
mlr sort -nr height,size then \
    head -n 1 -g clidx then \
    sort -n clidx then \
    cut -f filename /tmp/alldupes2 > /tmp/keep

# create the complementary set of images to delete
#
fgrep -v -f /tmp/keep /tmp/alldupes2 |
    mlr --onidx cut -f filename > /tmp/nuke

# move the dupes to another directory
# (rather than deleting them immediately...)
#
mkdir -p DUPES
mv $(</tmp/nuke) DUPES
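The final `mv $(</tmp/nuke) DUPES` relies on the earlier sanity check for awkward file names, and on the list being short enough to fit on one command line. If either is in doubt, a line-at-a-time loop is a safer way to do the same move. This is only a sketch of that alternative, not the recipe above; `move_listed` is a hypothetical helper name, and it assumes one file name per line with no embedded newlines.

```shell
# Hypothetical helper: move every file named in a list (one per line)
# into a destination directory, tolerating spaces and other odd
# characters in the names.
move_listed() {
    list=$1
    dest=$2
    mkdir -p "$dest"
    while IFS= read -r f; do
        # skip entries that no longer exist instead of aborting
        if [ -f "$f" ]; then
            mv -- "$f" "$dest"/
        fi
    done < "$list"
}

# usage, matching the recipe's file names:
#   move_listed /tmp/nuke DUPES
```

As with the original `mv`, files from different subdirectories that share a base name would overwrite each other in the destination, so this is best suited to a flat directory or a list checked for name collisions first.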
When you add additional images to your collection, you can generate their hashes and compare them to the existing data (amusingly, you have to use the tool backwards…):
# hash the new images
#
pdq-photo-hasher [0-9a-zA-Z]*.[pjPJ]* > /tmp/newstuff

# print the filenames of new dupes
# (note that mih-query is a bit twitchy about formatting; the
# hash field must be first, and non-pdq fields need to be at
# the end)
#
mih-query /tmp/hashes /tmp/newstuff | grep match= |
    mlr --onidx cut -f 4

# add remaining hashes to your DB of unique images
Bonus for correctly guessing which image I had eight copies of. 😁
As popular and photogenic as she is, it’s a bit surprising that they just keep recycling the same handful of photoshoots of Asuka Kishi (岸明日香). Then again, she’s only had 3 photobooks in six years compared to 15 DVDs, so perhaps her fans just want to see them bounce.
The Pixiv tag “R-18” means different things to different artists. For some, it’s equivalent to “NSFW”, but for many, it’s closer to “includes censored penetration”, and quite often “my mother would kick me out of the basement if she knew I fantasized about this”, with anything not quite as raw left open for all to see.
Gratuitous Mahoro to make it clear that this set has work-safety issues:
[amusing note: Google image search thinks this Mahoro image is related to the Mongolian Society of Interventional Radiology]
The Naming Of Cheesecake’s a bit cloak-and-dagger,
It isn’t just one of your image search games.
You may think that I’m just a lazy-ass blogger,
When I tell you, these models have actual names.