More fun with duplicate cheesecake


Before I deleted the 2,000+ duplicate images I found with PDQ, I did a lot of sampling to make sure there were no false positives. The default distance within which clusterize256 considers two images to be the same is 31, which looks like the sort of number you’d come up with after testing your code against a large set of known data.
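For anyone wondering what that "distance" measures: PDQ hashes are 256 bits long, and two hashes are compared by counting the bits that differ (the Hamming distance). Here's a minimal sketch of that comparison, assuming the hashes are stored as 64-character hex strings and that the cutoff is inclusive. The function names are mine, not part of PDQ or clusterize256.

```python
DEFAULT_MAX_DISTANCE = 31  # the clusterize256 default mentioned above


def pdq_distance(hash_a: str, hash_b: str) -> int:
    """Count differing bits between two 256-bit hashes given as hex strings."""
    if len(hash_a) != 64 or len(hash_b) != 64:
        raise ValueError("expected 64 hex characters (256 bits) per hash")
    # XOR leaves a 1 wherever the two hashes disagree; count those bits.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def is_match(hash_a: str, hash_b: str, max_distance: int = DEFAULT_MAX_DISTANCE) -> bool:
    """Treat two images as the same if their hashes differ in few enough bits.

    Assumption: the threshold is inclusive (distance <= max_distance).
    """
    return pdq_distance(hash_a, hash_b) <= max_distance
```

With 256 bits in play, 31 means the two hashes may disagree in roughly 12% of their bits, and 50 allows about 20%.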

Now that I’m down to ~16,000 de-duped images, I decided to see what would happen if I bumped that distance up.

First I tried 50, which found a number of real dupes where the differences consisted of minor changes in cropping, focus, and exposure (at the level of Photoshop’s auto-level function), as well as significant text additions. It also produced a fair number of false positives, however, mostly photos of the same model with slightly different expressions or head positions (eyes open/closed, smile/not, face turned a few degrees, etc.). If those should be considered dupes at all, the resolution process has to be manual (we used to call it “editing your damn photoshoot”…).

Then I tried 40, which reduced but didn’t eliminate the false face positives. Dropping to 35 left me with only one false match (below), but it also missed some of the real near-duplicates.
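One way to make this kind of threshold comparison easier to review is to sort candidate pairs by the lowest threshold that lets them in, so you can see exactly what each bump adds. This is a rough sketch, not what clusterize256 does internally: it brute-forces every pair, assumes the hashes are 64-character hex strings keyed by filename, and reports pairs rather than clusters. All of the names are made up for illustration.

```python
from itertools import combinations

THRESHOLDS = (31, 35, 40, 50)  # the distances tried in this post


def hamming(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two hex-encoded hashes."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def pairs_by_threshold(hashes: dict, thresholds=THRESHOLDS) -> dict:
    """Map each threshold to the pairs first admitted at that threshold.

    `hashes` maps a filename to its hex hash. Each bucket holds
    (distance, name_a, name_b) tuples, closest pairs first.
    """
    buckets = {t: [] for t in sorted(thresholds)}
    for (name_a, hash_a), (name_b, hash_b) in combinations(sorted(hashes.items()), 2):
        distance = hamming(hash_a, hash_b)
        for t in buckets:  # ascending, so the first hit is the lowest threshold
            if distance <= t:
                buckets[t].append((distance, name_a, name_b))
                break
    for pairs in buckets.values():
        pairs.sort()
    return buckets
```

Everything in the 31 bucket would already have matched at the default; the 35, 40, and 50 buckets hold only what each increase newly admits, which is where the eyeballing needs to happen. Comparing every pair is fine for a sample but slow for a full collection.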

Then there was this pair…


The distortion of her left arm tells you how this effect was achieved. 😄

Increasing the distance parameter also increases the runtime, but even at 50 it only took about a minute, so it’s worth doing. A human eye will still be needed to decide which one to keep.

