Is MD5 still good enough to uniquely identify files?

Solution 1

Yes. MD5 has been completely broken from a security perspective, but the probability of an accidental collision is still vanishingly small. Just be sure that the files aren't being created by someone you don't trust who might have malicious intent.
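
For illustration, a minimal Python sketch of hashing a file this way with the standard hashlib module (the chunked read is just one reasonable way to handle large files):

    import hashlib

    def md5_of_file(path, chunk_size=8192):
        """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()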

Solution 2

For practical purposes, the hash created might be suitably random, but theoretically there is always a probability of a collision, due to the pigeonhole principle. Having different hashes certainly means that the files are different, but getting the same hash doesn't necessarily mean that the files are identical.

Using a hash function for that purpose, whether or not security is a concern, should therefore only ever be the first step of a check, especially if the hash algorithm is known to easily produce collisions. To reliably find out whether two files with the same hash are different, you would have to compare them byte by byte.
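
A rough sketch of that two-step check in Python, assuming both files are available locally (filecmp does the byte-by-byte confirmation):

    import filecmp
    import hashlib

    def same_content(path_a, path_b):
        """Cheap hash comparison first; byte-by-byte check only if the hashes match."""
        def md5(path):
            h = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(8192), b""):
                    h.update(chunk)
            return h.digest()

        if md5(path_a) != md5(path_b):
            return False  # different hashes: the files are definitely different
        # same hash: confirm by comparing the actual bytes
        return filecmp.cmp(path_a, path_b, shallow=False)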

Solution 3

MD5 will be good enough if you have no adversary. However, someone can (purposely) create two distinct files which hash to the same value (that's called a collision), and this may or may not be a problem, depending on your exact situation.

Since knowing whether known MD5 weaknesses apply to a given context is a subtle matter, it is recommended not to use MD5. Using a collision-resistant hash function (SHA-256 or SHA-512) is the safe answer. Also, using MD5 is bad public relations (if you use MD5, be prepared to justify yourself; nobody will question your using SHA-256).
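
Switching is typically a one-line change; a minimal sketch with Python's hashlib, assuming the file is read locally:

    import hashlib

    def sha256_of_file(path, chunk_size=8192):
        """Return the hex SHA-256 digest of a file; a drop-in replacement for MD5."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()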

Solution 4

MD5 can produce collisions. In theory, although it is highly unlikely, even a million files in a row could produce the same hash. Don't test your luck: check for MD5 collisions before storing the value.

I personally like to create an MD5 hash of a random string instead, which avoids the overhead of hashing large files. When a collision is found, I iterate and re-hash with a loop counter appended.
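
A rough sketch of that idea in Python; the identifier string and the existing_keys set are hypothetical stand-ins for whatever the application actually stores:

    import hashlib

    def unique_md5_key(identifier, existing_keys):
        """Hash a short identifier rather than the whole file; on a collision,
        append a counter and re-hash until the key is unused."""
        key = hashlib.md5(identifier.encode("utf-8")).hexdigest()
        counter = 0
        while key in existing_keys:
            counter += 1
            key = hashlib.md5(f"{identifier}{counter}".encode("utf-8")).hexdigest()
        return key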

You may want to read up on the pigeonhole principle.

Solution 5

I wouldn't recommend it. If the application will run on a multi-user system, there might be a user who has two files with the same MD5 hash (they might be an engineer who plays with such files, or just curious; colliding samples are easily downloadable from http://www2.mat.dtu.dk/people/S.Thomsen/wangmd5/samples.html, and I downloaded two of them myself while writing this answer). Another thing is that some applications might store such duplicates for whatever reason (I'm not sure whether any such applications exist, but the possibility does).

If you are uniquely identifying files generated by your own program, I would say it is OK to use MD5. Otherwise, I would recommend any other hash function for which no collisions are known yet.

Comments

  • Ranhiru Jude Cooray
    Ranhiru Jude Cooray over 4 years

    Is MD5 hashing a file still considered a good enough method to uniquely identify it given all the breaking of MD5 algorithm and security issues etc? Security is not my primary concern here, but uniquely identifying each file is.

    Any thoughts?

    • Not Available
      Not Available over 13 years
      I am actually currently using it myself in one of my applications, and as far as I'm aware it's good enough to uniquely identify files.
    • sharptooth
      sharptooth over 13 years
      You will likely find this question: stackoverflow.com/questions/862346/… useful.
    • Marcin
      Marcin over 13 years
      How many files do you need to identify? It outputs 128 bits, so if you're trying to identify a few thousand files, it's fine. But if you're trying to ID a lot more than that, you might be bumping into collisions/the birthday paradox.
    • Ranhiru Jude Cooray
      Ranhiru Jude Cooray over 13 years
      They are going to be image files, jpg's, png's and gif's. And yes, I think the limit would be a few thousand... But roughly how many files do you think will start causing me trouble?
    • NeDark
      NeDark about 10 years
    • Keith Thompson
      Keith Thompson over 8 years
      @Marcin: You're not going to run into the birthday paradox unless you have quintillions of files. You don't.
    • ADTC
      ADTC over 6 years
      Just in case you're trying to find duplicate photos, you're better off using software built specifically for that purpose than comparing file hashes. Photos that look exactly the same or very similar can have different file hashes depending on differences in the photo's metadata (like resolution, EXIF data, etc.).
  • none
    none over 13 years
    Could you elaborate on what is broken from a security perspective? What can no longer be achieved with MD5 hashing, and why is it not achievable?
  • Alex S
    Alex S over 13 years
    @none: For your first question, see here. I'm afraid I don't understand the other questions.
  • Alex S
    Alex S over 13 years
    @0xA3: Neither you nor I have any idea what files the OP is referring to, or how much damage a compromise would cause. It could be their kid's baby photo collection for all we know. My goal is to provide the facts; what someone else does with them is their business. Also consider that Bruce Schneier recommends writing down your password; not everything needs to be stored at Fort Knox. Some things will keep just fine under the flower pot.
  • Alex S
    Alex S over 13 years
    Add to all that the fact that MD5 hashes work much better than, say, SHA1 hashes as database keys since they fit neatly into a UUID column.
  • hpavc
    hpavc over 13 years
    @Marcelo Cantos, I think what is lacking here is a differentiation or unpacking of the term 'security'. Obviously people are assuming 'security' for any use of checksum work, but the nomenclature Marcelo likely means is 'in a laboratory'.
  • Ranhiru Jude Cooray
    Ranhiru Jude Cooray over 13 years
    But a hashing algorithm creates the hash by going through all the bytes, right? So don't you think the same purpose is achieved? :)
  • PaulG
    PaulG over 13 years
    @Ranhiru. No. The hash gives you a 'summary' value which (for MD5) is only 16 bytes long. To guarantee the files are identical you would need to make a byte by byte check. This is true no matter what hash algorithm you choose, there is always the possibility of a collision.
  • Ranhiru Jude Cooray
    Ranhiru Jude Cooray over 13 years
    But MD5 can be used, and is used, to uniquely identify files, isn't it? Is it safe to use MD5? Will I generate collisions eventually?
  • PaulG
    PaulG over 13 years
    This answer might be a bit misleading if the reader isn't too familiar with hashing. There is nothing magical about SHA that prevents hash collisions, they are just more resistant to hash collision attacks. If you wanted to be more than 99.999^e% certain that files are identical, you would still need a byte by byte check.
  • PaulG
    PaulG over 13 years
    @Ranhiru. Reread this answer, it's imho the most comprehensive one here. Hashing could be used as a first step, which gets you to 99.99^e% certainty that the files are identical, but if you want to be absolutely 100% certain, then you'll need to make a byte by byte check.
  • Thomas Pornin
    Thomas Pornin over 13 years
    Actually a byte-to-byte comparison may fail due to a cosmic ray flipping a bit (e.g. transforming a return 0; into a return 1;). This is highly unlikely, but the risk of a collision with SHA-256 is even smaller than that. Mathematically, you cannot be sure that two files which hash to the same value are identical, but you cannot be sure of that either by comparing the files themselves, as long as you use a computer for the comparison. What I mean is that it is meaningless to go beyond some 99.999....9% certainty, and SHA-256 already provides more than that.
  • PaulG
    PaulG over 13 years
    What, you don't use ECC memory? ;). Good comment, very interesting thoughts.
  • Alex S
    Alex S over 13 years
    @hpavc: No, I don't mean in a laboratory. MD5 has been broken badly enough that a hacker can use a notebook computer to generate pairs of documents with matching hashes (a collision) in under a minute. They could use this to, e.g., present an honest program for audit purposes, and slip in a trojan once the audit is complete. Any system relying on MD5 hashes to prevent tampering would not notice this. I don't know if it is computationally feasible, yet, to forge a preexisting document (a 2nd pre-image attack). But no one with a shred of sense is going to bet the bank on that.
  • Alex S
    Alex S over 13 years
    This answer is wrong. Prevention of tampering and verifying uniqueness are the same thing. Also, while hashing doesn't guarantee uniqueness, neither does actual comparison. In fact, the likelihood of a hash accidentally colliding is actually lower than the probability of the comparison failing due to glitches in the CPU generated by normal solar gamma ray emissions. And don't forget that often the only source of the file is sitting on the other side of the world inside a web server, and the only independent piece of information you have for comparison purposes is the hash.
  • Alex S
    Alex S over 13 years
    @hpavc: Also, by 'security', I don't mean any checksum work. A compiler cache might use MD5 (or even MD4!) to check whether input files match those of a previous build.
  • Gumbo
    Gumbo over 13 years
    @Marcelo Cantos: You can use CHAR(20) for SHA-1 if you want to.
  • Alexander H
    Alexander H over 13 years
    Agree with Marcelo: telling that version A is (probably) the same as version B is exactly the same problem as telling that file X is (probably) the same file as file Y. A hash designed to be good (but not perfect) for one is a hash designed for the other.
  • PaulG
    PaulG over 13 years
    @Marcelo. It doesn't stand to logical reasoning that accidental collision is less likely than accidental bit flips (whilst making a byte by byte comparison). You still have the same chance of bit flips when building the hash (and arguably more since more processing time is involved). @Thomas raised the point originally to suggest that there is no guaranteed way of identifying uniqueness, though the impact of bit flips is highly debatable. The most pessimistic estimate is 1 flip per GB/hour, and ECC RAM would remove even that.
  • Alex S
    Alex S over 13 years
    @PaulG: That's not the point. The probability of an accidental collision at the mathematical level is much lower than an error due to a random bit flip (and ECC can't prevent bit-flips in the bus circuitry or the CPU core, btw). Thus, a byte-for-byte comparison would have almost no impact on the chances of getting it right. Besides, the answer is wrong even in principle, since the main purpose of a hash is to confirm the identity of a file when there is no trusted copy to check against, or it is intractable to do so (e.g., comparing a 100 GB file with a copy on the other side of the world).
  • Alex S
    Alex S over 13 years
    @Gumbo: You could, but that's 2.5 times the size of a UUID column.
  • Gumbo
    Gumbo over 13 years
    @Marcelo Cantos: Sorry, I meant CHAR(20) for 160 bit.
  • Alex S
    Alex S over 13 years
    @Gumbo: I wouldn't be game to use CHAR(20), since that would entail storing binary data in a column intended for text. BINARY(20) would suffice, but that comes back in C# as a byte[], which is awkward to work with (e.g., no comparison operators). Similar problems probably occur in other languages. A UNIQUEIDENTIFIER, OTOH, maps onto a Guid, which is easily compared and is a value type, so less GC pressure.
  • President James K. Polk
    President James K. Polk over 13 years
    This last point is worth reiterating. If you already had two full alleged copies of the file, A and A', then you might as well do a byte-for-byte comparison because it is at least as good as hashing and comparing but much faster.
  • sillyMunky
    sillyMunky almost 12 years
    You don't have to compare the files byte for byte to be 'certain'. Using two hashing algorithms will suffice since the probability of a mutual collision vanishes to be smaller than the probability of bit being flipped in your hard drive by cosmic rays.
  • James P.
    James P. almost 11 years
    Don't forget the tin foil hat! More seriously, how do you know these factoids about collisions, and have you verified this in some way?
  • endolith
    endolith almost 9 years
    "the likelihood of a hash accidentally colliding is actually lower that the probability of the comparison failing due to glitches in the CPU generated by normal solar gamma ray emissions" [citation needed]
  • stapeluberlauf
    stapeluberlauf over 8 years
    @endolith: That probability totally depends on which hash function you are using and how collision resistant it is. And as Thomas Pornin says, if you have an adversary you shouldn't be using MD5.
  • endolith
    endolith over 8 years
    @sillymunky Using two hashing algorithms is a waste of time and energy since you have to read the files byte for byte to hash them anyway.
  • Alex S
    Alex S over 8 years
    @endolith en.wikipedia.org/wiki/Cosmic_ray#Effect_on_electronics cites one flip per 256 MB-month. That's 1 flip per second in every 2^52 bits. A spurious pair-wise MD5 collision would occur once in every 2^128 comparisons, assuming no malice. I will concede that "gamma rays" is probably wrong. It's more likely to be beta particles.
  • Alex S
    Alex S over 8 years
    @endolith Wrt to the waste of time and energy, you clearly haven't bothered to read the comment trail, in which it was pointed out that the usual purpose of hash-comparisons is to verify copies that can't be easily brought together for direct comparison.
  • Alex S
    Alex S over 8 years
    @endolith In fact, it appears you didn't even bother to read the second half of the post for which you demanded a citation.
  • endolith
    endolith over 8 years
    @MarceloCantos Random bit flips affect both methods equally, so the hashing method is definitely worse in every way and is only useful when both files are not local.
  • endolith
    endolith over 8 years
    @ThomasPornin Cosmic ray bit flips would affect the MD5 method too, so it is still worse.
  • Alex S
    Alex S over 8 years
    @endolith Where did I claim otherwise?
  • Olivier Dulac
    Olivier Dulac about 8 years
    I strongly disagree. A different hash value tells you the files are different. But for an equal hash value: you can't say "it is highly likely both are the same" just because the hashes match; you can only compare byte-for-byte. A hash is many orders of magnitude smaller than the number of different values for the whole file, so there are many, many, many possible collisions for each hash value. Only if you are in the case of copying a known file (with a known hash) does an identical hash value "probably mean" the 2nd was copied correctly (even then, it's not 100% sure, but highly likely).
  • Olivier Dulac
    Olivier Dulac about 8 years
    So to the OP's question, asking "can 2 files with identical hash values be considered the same", the answer is a clear "no". If the 2nd is not supposed to be a copy of the first, they could just be 2 files which happen to be reduced to the same hash value. That is not unlikely if you have many files.
  • Alex S
    Alex S about 8 years
    @OlivierDulac your analysis is flawed. It doesn't matter how small the hash is relative to a file. What matters is what proportion of all possible files have the same hash as a given file. For MD5, a hash occupies one in 2^128 of the hash space, so, of all possible files, only one in 2^128 will collide with it. That's a very small needle in a very large haystack. Of course, this presupposes MD5 is a good hash, which we now know it isn't. But assuming no malice, the statement "it is highly likely both are the same" still holds.
  • Alex S
    Alex S about 8 years
    @OlivierDulac to your "many files" point, you are strictly correct, but the numbers involved make the point irrelevant for most practical scenarios. For a 2^128 search space, the probability of a collision among 2^64 (aka 16 quintillion) files is about 1/2, and for a billion files it's roughly one in 2^110 (aka bugger all). So, if you happen to have 16 quintillion files on hand, then MD5 is probably not for you, but for an every-day interpretation of "many", an accidental collision is still highly unlikely.
  • Olivier Dulac
    Olivier Dulac about 8 years
    @MarceloCantos: I am adressing the answer based on the question : "considered a good enough method to uniquely identify it given all the breaking of MD5 algorithm and security issues etc?". Even with the first part (considered a good enough method to uniquely identify it) : Good enough if you have not too many files, and if a collision is not a big problem (they will/can happen). and with the whole question: it's really easy for a hacker to put a file with the same hash as another, and your program could then take the hacked file instead of the right one. So no, it is not good enough, at all.
  • Alex S
    Alex S about 8 years
    @OlivierDulac malicious collisions were addressed in my answer. So, what exactly is it that you strongly disagree with?
  • Alex S
    Alex S about 8 years
    OK, my math sucks. GUIDs have about 122 bits of entropy, and so the probability of a collision anywhere in a billion files is about 2^(2*30 - 122) = 2^-62. While this is much higher than my original calculation, it's still minuscule at roughly one in 4-quintillion.
  • NullUserException
    NullUserException over 7 years
    @MarceloCantos Actually, due to the birthday paradox a spurious MD5 collision is likely to occur (p>0.5) once you have 2^64 values. Granted, this number is still infinitesimally small and your point still stands.
  • Alex S
    Alex S over 7 years
    @NullUserException The birthday paradox says the probability of a pairwise collision being found anywhere in a set of 2^64 elements is about 0.5. It says nothing about 2^64 pairwise comparisons.
  • NullUserException
    NullUserException over 7 years
    @MarceloCantos And that's different... how?
  • Alex S
    Alex S over 7 years
    @NullUserException a set of 2^62 elements produces about 2^128 pairings, which are far more likely to yield a collision than 2^64 independent pairwise comparisons.
  • Wolf5
    Wolf5 over 7 years
    Smart idea. If an "attacker" makes a fake file with the same MD5 hash, it will not help unless he knows your "salting", and reversing the contents would create a different hash. Using 2 MD5 keys like that would reduce the odds a lot. If it's just to prevent an "attack", using a salt before calculating locally will be enough.
  • in70x
    in70x over 7 years
    Fellas, I think you are getting a bit too caught up in mathematical/philosophical possibilities here. In the name of pragmatism, both MD5/SHA-256 and other well-known hashing algorithms are more than strong enough for use in determining file equivalency. If you want to be super sure (aka paranoia), perhaps run two different hash algorithms on the file pair. I would assume given two files I would first check whether the sizes are equivalent, and only then proceed to hash the files.
  • Bert Sinnema
    Bert Sinnema almost 7 years
    This is a perfectly good answer. MD5 is unacceptable for use cases in Law and Accounting in Europe from May 2018 on.
  • berezovskyi
    berezovskyi over 6 years
    @BertSinnema could you point me to the source that defines which hash functions are acceptable etc., please?
  • berezovskyi
    berezovskyi over 6 years
    @GregSchmit maybe because OP did not care about cryptographic strength per se. I understood the question as "I already use MD5 in non-security context, do I need to spend time to update the code?" kind of thing. And in this context the answer was likely wrong and SHA1 has been broken since too.