How does comparing images through md5 work?

Solution 1

An MD5 hash is of the actual binary data, so different formats will have completely different binary data.

So for MD5 hashes to match, the files must be byte-for-byte identical. (There are exceptions in fringe cases, such as hash collisions.)

This is actually one way forensic law enforcement finds data it deems contraband (in reference to images).

Solution 2

It is an MD5 checksum, the same thing you often see when downloading a file: if the MD5 of the downloaded file matches the MD5 given by the provider, then the file transfer was successful. http://en.wikipedia.org/wiki/Checksum If there is even 1 bit of difference between the 2 files, then the resulting hash will be completely different.

Due to the difference in encoding between a JPG and GIF, the 2 will not have the same MD5 hash.
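The one-bit claim can be checked with a minimal sketch using Python's standard `hashlib` module. The sample bytes and the helper name `md5_hex` are made up for illustration; any file contents would behave the same way.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of raw bytes as a 32-character hex string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical file contents, used only for illustration.
original = b"example file contents"

# Flip the lowest bit of the first byte to simulate a one-bit difference.
flipped = bytes([original[0] ^ 0x01]) + original[1:]

print(md5_hex(original))
print(md5_hex(flipped))
print(md5_hex(original) == md5_hex(flipped))  # False
```

The two digests share no obvious resemblance, which is why a matching checksum is good evidence that the bytes are the same.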

Solution 3

MD5 is a hash algorithm, so it does not compare images; it compares data. The data you put in can be nearly anything, like the contents of a file. It then outputs a hash string based on the contents, which is the raw data of the file.

So when you feed an image into MD5, you are not comparing images but the raw data of the image files. The hash algorithm knows nothing beyond that raw data, so a jpg and a gif (or any other image format) of the same screenshot will never produce the same hash.

Even if you compare the decoded images, a lossy format will not give the same hash, because the decoded pixels have small differences the human eye cannot see (depending on the amount of compression used). This might be different when comparing the decoded data of losslessly encoded images, but I'm not sure about that.
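The lossless case can be illustrated with a rough sketch. The standard library has no image decoder, so this uses `zlib` as a stand-in for a lossless image codec; that substitution, and the made-up pixel buffer, are assumptions of the example rather than anything from the answers. The same raw data is encoded two ways: the encoded bytes hash differently, while the decoded bytes hash the same.

```python
import hashlib
import zlib

# Hypothetical raw pixel data: 1000 solid red RGB pixels.
pixels = bytes([255, 0, 0]) * 1000

# Two different lossless encodings of the same data
# (level 0 stores the bytes uncompressed, level 9 compresses hard).
stored = zlib.compress(pixels, 0)
packed = zlib.compress(pixels, 9)

# Hashing the encoded "files": the bytes differ, so the hashes differ.
file_hashes_match = (
    hashlib.md5(stored).hexdigest() == hashlib.md5(packed).hexdigest()
)

# Hashing the decoded data: lossless decoding restores identical bytes.
decoded_hashes_match = (
    hashlib.md5(zlib.decompress(stored)).hexdigest()
    == hashlib.md5(zlib.decompress(packed)).hexdigest()
)

print(file_hashes_match)     # False
print(decoded_hashes_match)  # True
```

A lossy codec such as JPEG would not pass the second check, since decoding does not restore the original pixels exactly.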

Take a look at the Wikipedia article for a more detailed explanation and technical background about hash functions.

Solution 4

When you look at the raw bytes, a .jpg file has 'JFIF' near its start (just after the FF D8 start-of-image marker), while a .gif starts with 'GIF'. In other words, comparing the on-disk bytes of the "same image" in two different formats is pretty much guaranteed to produce two different MD5 hashes, since the files' contents differ, even if the actual image is the "same picture".
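As a rough sketch of how different the leading bytes are, the function below guesses a format from a file's first bytes. The name `sniff_format` is made up for this example. Note that a JPEG begins with the bytes FF D8 FF, with the 'JFIF' identifier a few bytes further in, and a GIF begins with 'GIF87a' or 'GIF89a'.

```python
def sniff_format(header: bytes) -> str:
    """Guess an image format from the first bytes of a file."""
    if header[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    if header[:3] == b"\xff\xd8\xff":
        return "jpeg"
    if header[:8] == b"\x89PNG\r\n\x1a\n":
        return "png"
    return "unknown"


# Typical leading bytes (truncated) of a GIF and a JFIF-style JPEG.
print(sniff_format(b"GIF89a\x01\x00\x01\x00"))             # gif
print(sniff_format(b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"))   # jpeg
```

Since the files already differ in their first few bytes, their MD5 hashes differ regardless of what the picture shows.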

To do a hash-based image comparison, you have to compare two images using the same format. It would be very, very difficult to produce a .jpg and a .gif of the same image that would compare equal if you converted them to (say) a .bmp. They'd be the same file format, but the internal requirements of .gif (8-bit palette, LZW lossless compression) vs. the internal requirements of .jpg (24-bit, lossy discrete cosine transform compression) mean it's nigh-on impossible to get the same .bmp from both source images.

Solution 5

If you're comparing hashes then every single byte of the two images will have to match - they can't use different compression formats, or "look the same". They have to be identical.


TreeTree

Updated on July 11, 2021

Comments

TreeTree almost 3 years

Does this method compare the pixel values of the images? I'm guessing it won't work because they are different sizes from each other but what if they are identical, but in different formats? For example, I took a screenshot and saved as a .jpg and another and saved as a .gif.
Your Common Sense about 13 years

A hash lets you compare short 128-bit values (32 hex characters) only, which is significantly faster than comparing the images themselves.
GolezTrol about 13 years

I'd like to add that I wouldn't trust this method completely. Any duplicate images found should still be checked by their exact content.
Piskvor left the building about 13 years

That's unnecessary, paranoid even. See this for the math: stackoverflow.com/questions/537989/…
GolezTrol about 13 years

Yeah, probably. But I learned you can never be sure. It's probably 3 lines of code to do a binary compare, and you only need to do that if the two md5s do indeed match and other simple checks (like file size, pixel size or meta information) match too. This will keep you from needing to hash every file. Only files with the same size and meta info need to be hashed. I think that might give you a better optimization (especially when comparing larger images, like photos) than just hashing every image you have and comparing those hashes. But it depends on the situation of course.
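The staged check described in this comment could be sketched roughly as follows: group files by size, hash only the groups with more than one file, then confirm hash matches with a byte-by-byte compare. The function names `file_md5` and `find_duplicates` are made up for this sketch, and it skips the pixel-size and meta-information checks the comment also mentions.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=65536):
    """Hash a file in chunks so large images need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Return (first, other) pairs of files with identical bytes."""
    # Stage 1: cheap check, group by file size.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate; skip hashing

        # Stage 2: hash only the files that share a size.
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[file_md5(path)].append(path)

        # Stage 3: confirm each hash match with a full binary compare.
        for group in by_hash.values():
            first = group[0]
            for other in group[1:]:
                if filecmp.cmp(first, other, shallow=False):
                    duplicates.append((first, other))
    return duplicates
```

Calling `find_duplicates` with a list of image paths returns only the pairs that survive all three stages, so the final binary compare runs on very few files.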
GolezTrol about 13 years

I see now that it's not about files, but about screenshots. In that case this will be of little use. GIF has a limited color depth and JPEG suffers loss of quality due to the compression algorithm. It's improbable that a screenshot saved as a jpg and a screenshot saved as a gif will result in exactly the same image.
GolezTrol about 13 years

In that case it would be better to match pixel by pixel with a given tolerance. Pixels outside that tolerance could be counted as well. That will result in a certain 'distance' between two images. Images of the same size and with only a small distance are likely to display the same image.
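A minimal sketch of that idea, assuming each image has already been decoded into a flat list of `(r, g, b)` tuples of the same length. The function name `pixel_distance` and the default tolerance of 8 are arbitrary choices for illustration, not values from the comment.

```python
def pixel_distance(image_a, image_b, tolerance=8):
    """Return the fraction of pixels where any channel differs by more
    than `tolerance`. 0.0 means every pixel is within tolerance."""
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same number of pixels")
    if not image_a:
        return 0.0
    outside = sum(
        1
        for pixel_a, pixel_b in zip(image_a, image_b)
        if any(abs(ca - cb) > tolerance for ca, cb in zip(pixel_a, pixel_b))
    )
    return outside / len(image_a)


# Hypothetical 4-pixel images: one pixel differs well beyond the tolerance.
shot_a = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (10, 10, 10)]
shot_b = [(250, 3, 0), (0, 255, 0), (0, 0, 200), (10, 10, 10)]

print(pixel_distance(shot_a, shot_b))  # 0.25
```

Two images could then be treated as the same picture when the distance falls below some threshold chosen for the use case.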
David over 12 years

Thanks for the tip on libpuzzle. We use PHP, but the library currently seems to be more for *nix systems, not Windows, so it's less useful for us. I also wanted to point out, for those who want a more complete solution to image comparison that's both a library and an end-user tool, check out Sikuli.org. The tool is also cross-platform.
Mehdi almost 5 years

So if I have an image and compress it with JPEG, for example, and do that 10 times on 10 different machines, will the compressed image always have the same MD5? Is JPEG compression deterministic?