How accurate is `md5sum`?
Solution 1
MD5 is broken for this purpose against an intelligent adversary. It is possible to maliciously construct two different blocks of data that produce the same MD5 hash.
However, it is entirely suitable (though there are almost certainly better ways) to use MD5 to protect against inadvertent data corruption in transit or in storage. While it is conceivable that such an event could cause the MD5 hash to be the same, the probability is so low that it's almost unimaginable that it would be a probability worth worrying about. Failures caused by background radiation, tunneling, static, and dozens of other sources would be orders of magnitude more probable.
Even if you had a quadrillion units of data, the probability that a mismatched MD5 would produce an MD5 hash belonging to one of those quadrillion units is much less than one in a quadrillion.
Solution 2
MD5 is a hash. It basically maps the entire content of a file into a small string which is 16 bytes long IIRC.
There will obviously be multiple files which hash to the same MD5 sum. Therefore, a matching MD5 sum is no guarantee of an exact match between files.
There is no threshold as such because the of the way hashes work. So an MD5 sum can detect even a single bit change. However, lots of single bit changes together may cause the MD5 hash to be the same. It is therefore quite reasonable to use MD5 to validate file integrity against random corruption but no if malicious intent is possible as someone could modify a file while making sure the MD5 hash is the same.
Solution 3
An MD5-Hash consists of 128bits. A single flipped bit in the source flips (on average) 64 bits in the hash.
Probability of two hashes accidentally colliding is 1/2^128 which is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456.
However if you keep all hashes then thanks to birthday paradox probability is a bit higher. To have 50% chance of any hash colliding you need 2^64 hashes. This means that to get a collision, on average, you'll need to hash 6 billion files per second for 100 years.
Source: porneL, https://stackoverflow.com/questions/201705/how-many-random-elements-before-md5-produces-collisions
Related videos on Youtube
Konner Rasmussen
Updated on September 18, 2022Comments
-
Konner Rasmussen almost 2 years
When using
md5sum
to verify the integrity of a file, how accurate is the process?Does a verified MD5 mean that EVERY bit is exactly the same, or is there a threshold that must be broken before binary alteration is reflected in the MD5?
Any documentation on how an md5 is generated would also be appreciated.
-
choroba about 10 years
-
Konner Rasmussen about 10 years@choroba if i am reading all of this correctly, then the probability of undetected alteration increases with the size of the file being verified. However it is still a probability and 100% certainty can only be had with a 16 byte file. Am i correct?
-
Doktoro Reichard about 10 years@Konner I read your question that dealt with file transfer. For that purpose, it is safe. The chances of a bit being corrupted on a transfer are small and it is more likely that you lose some part of the copied file on the transfer, assuming I understood your previous situation. MD5 is in effort equivalent to comparing both files, with some advantage in the fact that you don't need to access both files at the same time.
-
Michael Martinez about 10 yearsIf I really want to make sure two files are identical, I will do a "cmp -l".... Just lettin' you know
-
Konner Rasmussen about 10 years@MichaelMartinez that would require two copies of the file, which could prove to be unreasonable if the file is too large. furthermore i am aware of the
cmp
command. i appreciate the input though... =) -
Plutor about 10 years"Does a verified MD5 mean that EVERY bit is exactly the same." Note that this is impossible, thanks to the Pigeonhole Principle. en.wikipedia.org/wiki/Pigeonhole_principle
-
Mooing Duck about 10 yearsI don't think you get a guarantee for 16 byte files either.
-
-
Itai about 10 years@KonnerRasmussen - Correct. You can verify a backup with MD5 for example but it may happen that an error goes through undetected.
-
Michael Lorton about 10 years@KonnerRasmussen -- it's not a matter of the degree of repercussions, it's the nature of the threat. If you are worried that two documents might accidentally have the same MD5, stop worrying: the odds are considerably higher that your computers will spontaneously burst into flames; if you are worried an intelligent attacker might produce a document that matches one you already have, that's a serious concern and you should get a better hash; if you are worried an intelligent attacker might produce a two documents that match each other, don't "worry": it will certainly happen.
-
Shadur about 10 yearsOf course, while generating an MD5 hash collision is theoretically possible, generating a useful collision (as in, the colliding file is the same type of file and its contents are at least plausibly authentic) is a lot harder...
-
MSalters about 10 years@Shadur: That used to be the case, but ongoing security research has discovered new ways to generate MD5 collisions which make that easier. In particular, if your file format allows chunks of free-format "comment" data it's possible to match any MD5 hash by inserting a suitable comment.
-
Holloway about 10 yearsSo you're saying there's still a chance? :p
-
deroby about 10 yearsThe trouble is that although you NEED to hash 6 billion files per second for 100 years to be certain that you find a collision; it could well happen within the first second.
-
Zsolt Szilagy about 10 yearsTrue, it could happen the first second. But as always it´s about balancing reasons. Ther might be military applications where that risk is not acceptable, but I would rent a car without second thought where the airbag sensors use md5. Remember, chances are that you get hit by lightning multiple times while waiting for a md5 collision.
-
fischi about 10 yearsI am sorry, but half of 2^128 is 2^127, not 2^64.
-
obecalp about 10 years@fischi, read the link on the birthday paradox, and also en.wikipedia.org/wiki/Birthday_attack - it is not a matter of simply halving the total number of hashes. Given search space H, the number of hashes you have to generate before getting a 50% chance of a collision is approximately sqrt((pi/2) * H). If you do that math with 2^128, you'll get a number approximately 2^64
-
Barmar about 10 yearsWhile it may be theoretically possible to generate two files with the same hash, it may be practically infeasible. Especially if the replacement file should also make sense. If the original file was English text, for instance, there may not be any other match that's also English. Or if it's an Excel spreadsheet, none of the other files with the same hash would be valid spreadsheets.
-
philfr about 10 years@Barmar : win.tue.nl/hashclash/Nostradamus these guys created a number of pdf files with the same MD5 hash to prove it was practically feasible.