panthro's avatar

Allow duplicate image?

I'm building a site that allows user image uploads.

I'm considering whether I should allow duplicate images to be uploaded. By duplicate I mean exactly the same image, not one that has been cropped differently, etc.

To do this I have considered hashing the image, and comparing the hash with a database of previous image hashes.
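For illustration, a minimal sketch of that idea in Python, assuming SHA-256 over the raw file bytes and an in-memory dict standing in for the database of previous hashes. The names `content_hash` and `HashIndex` are hypothetical and not part of any framework.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a hex digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


class HashIndex:
    """Stand-in for a database table mapping hash -> image id."""

    def __init__(self):
        self._by_hash = {}

    def register(self, image_id, data):
        """Return (image_id_to_use, is_duplicate) for an upload."""
        digest = content_hash(data)
        existing = self._by_hash.get(digest)
        if existing is not None:
            # Same bytes were uploaded before: reuse the earlier image.
            return existing, True
        self._by_hash[digest] = image_id
        return image_id, False
```

Only byte-identical files are caught this way; a re-encoded or resized copy produces a different hash.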

My question is, is this a waste of time? The main reason to do this would be to save storage space. But would the hashing process just end up costing me in computing costs anyway?

Should I not be concerned about duplicate image files?

0 likes
6 replies
MohamedTammam's avatar

For me, yes, it's a waste of time. Two images can look identical and still have different hashes.
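A toy sketch of why that happens, assuming a made-up file layout (a text header, then pixel bytes) rather than a real image format: changing only the metadata changes the hash of the whole file, even though the pixel payload is identical.

```python
import hashlib

# Hypothetical layout: "HEADER:...;" followed by the pixel bytes.
pixels = b"\x10\x20\x30" * 1000
file_a = b"HEADER:camera=A;" + pixels
file_b = b"HEADER:camera=B;" + pixels

# Hashing the whole file: the one-byte metadata difference changes the digest.
hash_a = hashlib.sha256(file_a).hexdigest()
hash_b = hashlib.sha256(file_b).hexdigest()

# Hashing only the payload after the header: the digests match.
payload_hash_a = hashlib.sha256(file_a.split(b";", 1)[1]).hexdigest()
payload_hash_b = hashlib.sha256(file_b.split(b";", 1)[1]).hexdigest()

print(hash_a == hash_b)
print(payload_hash_a == payload_hash_b)
```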

And the computing time to hash each image and check it against the database will be huge if the database is big.

If you're worried about the space, let the users pay for it. Then they will not waste money on duplicated images.

Snapey's avatar
Snapey
Best Answer
Level 122

Consider the complexity:

  • create a job to iterate over all existing images and create a hash for each
  • add somewhere to store a reference to the image and its hash
  • when an image is uploaded, create a hash of the image after you have manipulated it in any way
  • search the database for an instance of the same hash
  • if found, associate the existing image id with the post (or whatever)
  • if a model is deleted, check if its image is being used by any other model
  • if not, it's safe to delete
  • same for deleting a user: you cannot just cascade delete post records since you have to check images
  • if a user wants to replace an image, check if the current image is used elsewhere before replacing
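
To make the bookkeeping in the list above concrete, here is a rough in-memory sketch in Python. It assumes each model (post, profile, etc.) has at most one image and that SHA-256 of the stored bytes is the key; `DedupedImageStore` and its methods are hypothetical names, not a real library.

```python
import hashlib


class DedupedImageStore:
    """Tracks which models share each stored file."""

    def __init__(self):
        self.files = {}        # hash -> file bytes (stand-in for disk storage)
        self.owners = {}       # hash -> set of model ids using that file
        self.model_image = {}  # model id -> hash of its current image

    def attach(self, model_id, data):
        """Attach an image to a model, reusing an identical stored file."""
        digest = hashlib.sha256(data).hexdigest()
        if model_id in self.model_image:
            # Replacing an image: release the old one first.
            self.detach(model_id)
        if digest not in self.files:
            self.files[digest] = data
        self.owners.setdefault(digest, set()).add(model_id)
        self.model_image[model_id] = digest
        return digest

    def detach(self, model_id):
        """Release a model's image; return True if the file was deleted."""
        digest = self.model_image.pop(model_id)
        users = self.owners[digest]
        users.discard(model_id)
        if users:
            # Another model still uses this file, so it must stay.
            return False
        del self.owners[digest]
        del self.files[digest]
        return True
```

Every delete and replace path has to go through `detach`, which is the extra work the list is pointing at.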

Seems like a lot of work to me, whereas you could run the site for a year and write a task to count genuinely duplicate images. Then you will know if it is worthwhile.

1 like
mike_isp's avatar

I don't think the compute power will be a problem. I just tried getting an MD5 hash of the file contents of 2.5 MB JPEGs and that typically took about 0.006 seconds, or 6 ms (on an M1 Pro, so maybe your server isn't quite that fast, but we're not talking hundreds of ms here).
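
A quick way to repeat that measurement on your own hardware, sketched in Python. It hashes 2.5 MB of random bytes rather than a real JPEG (an assumption; hashing time depends on size, not on content), and `time_md5` is a hypothetical helper name.

```python
import hashlib
import os
import time


def time_md5(size_bytes, repeats=5):
    """Return the best-of-N time in seconds to MD5 a buffer of this size."""
    data = os.urandom(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hashlib.md5(data).hexdigest()
        best = min(best, time.perf_counter() - start)
    return best


elapsed = time_md5(2_500_000)
print(f"MD5 of 2.5 MB: {elapsed * 1000:.2f} ms")
```

The number will vary by machine, so treat it as a rough guide only.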

You could possibly use a unique index on the hash column in the DB, then when inserting, check whether the failure was due to a unique-constraint violation and, if so, create a pointer reference to the existing image.
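
A minimal sketch of that approach, using Python's built-in `sqlite3` as a stand-in for whatever database the site uses. The `images` table, its columns, and `store_image` are assumptions made up for the example.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images ("
    " id INTEGER PRIMARY KEY,"
    " hash TEXT NOT NULL UNIQUE,"  # the unique index does the duplicate check
    " path TEXT NOT NULL)"
)


def store_image(conn, data, path):
    """Insert an image row; return (image_id, was_inserted)."""
    digest = hashlib.md5(data).hexdigest()
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO images (hash, path) VALUES (?, ?)", (digest, path)
            )
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        row = conn.execute(
            "SELECT id FROM images WHERE hash = ?", (digest,)
        ).fetchone()
        if row is None:
            # The failure was not a hash collision, so do not hide it.
            raise
        # Duplicate upload: point at the existing row instead.
        return row[0], False
```

Letting the database enforce uniqueness avoids a separate lookup before every insert and stays correct if two uploads arrive at the same time.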

I wouldn't bother if this was going to create lots of extra work, but if it's reasonable it could save storage. Like Snapey said, it may be worth measuring if it's a problem first. You could start collecting the hashes now without a unique index and without any de-duping, then after a while just run a query with COUNT(*) and GROUP BY hash to see how many duplicates there are in practice.
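
The measuring query could look something like this, again sketched with `sqlite3`. The `image_hashes` table and the `duplicate_report` helper are hypothetical; the SQL itself is plain `COUNT(*)` with `GROUP BY`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No unique index here: hashes are only being collected, not enforced.
conn.execute(
    "CREATE TABLE image_hashes (image_id INTEGER PRIMARY KEY, hash TEXT NOT NULL)"
)


def duplicate_report(conn):
    """Return ([(hash, copies), ...], number_of_redundant_files)."""
    rows = conn.execute(
        "SELECT hash, COUNT(*) AS copies"
        " FROM image_hashes"
        " GROUP BY hash"
        " HAVING COUNT(*) > 1"
        " ORDER BY copies DESC, hash"
    ).fetchall()
    # Each group keeps one file; the rest are the storage you could save.
    redundant = sum(copies - 1 for _, copies in rows)
    return rows, redundant
```

Comparing the redundant count (or the summed file sizes, if you store them) against the total tells you whether de-duping is worth building.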

If you did retro-fit the feature I would just prevent new dups being uploaded. You don't have to go back and fix all the old ones.

1 like
panthro's avatar

Thanks for your thoughts - it has given me a lot to think about.

haheap's avatar

I see what you're trying to do. I guess it depends on how the images are to be used. It could be worth it on a per-user basis, so the same user cannot upload multiple copies of the same image, especially if it's for an image database of some sort. I wouldn't recommend any checks across users for the reasons @snapey mentioned.
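
One possible way to scope the check to a single user is a composite unique constraint on the user and the hash, sketched here with `sqlite3`. The `uploads` table and `add_upload` are assumed names for illustration only.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uploads ("
    " id INTEGER PRIMARY KEY,"
    " user_id INTEGER NOT NULL,"
    " hash TEXT NOT NULL,"
    " UNIQUE (user_id, hash))"  # uniqueness applies per user, not site-wide
)


def add_upload(conn, user_id, data):
    """Return True if the row was added, False if this user already has it."""
    digest = hashlib.sha256(data).hexdigest()
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO uploads (user_id, hash) VALUES (?, ?)",
            (user_id, digest),
        )
    # rowcount is 0 when the insert was skipped by the unique constraint.
    return cur.rowcount == 1
```

Because no file is shared between users, deleting a user or a post needs no cross-user reference checks.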

1 like
panthro's avatar

@haheap thanks for the additional idea. I guess this entire thing is not a problem, until it's a problem! I am inclined to allow duplicates and check in 12 months to see how many duplicates there actually are!

1 like
