How accurate is `md5sum`?

5,724

Solution 1

MD5 is broken for this purpose against an intelligent adversary. It is possible to maliciously construct two different blocks of data that produce the same MD5 hash.

However, it is entirely suitable (though there are almost certainly better ways) to use MD5 to protect against inadvertent data corruption in transit or in storage. While it is conceivable that such an event could cause the MD5 hash to be the same, the probability is so low that it's almost unimaginable that it would be a probability worth worrying about. Failures caused by background radiation, tunneling, static, and dozens of other sources would be orders of magnitude more probable.

Even if you had a quadrillion units of data, the probability that a mismatched MD5 would produce an MD5 hash belonging to one of those quadrillion units is much less than one in a quadrillion.

Solution 2

MD5 is a hash. It basically maps the entire content of a file into a small string which is 16 bytes long IIRC.

There will obviously be multiple files which hash to the same MD5 sum. Therefore, a matching MD5 sum is no guarantee of an exact match between files.

There is no threshold as such because the of the way hashes work. So an MD5 sum can detect even a single bit change. However, lots of single bit changes together may cause the MD5 hash to be the same. It is therefore quite reasonable to use MD5 to validate file integrity against random corruption but no if malicious intent is possible as someone could modify a file while making sure the MD5 hash is the same.

Solution 3

An MD5-Hash consists of 128bits. A single flipped bit in the source flips (on average) 64 bits in the hash.

Probability of two hashes accidentally colliding is 1/2^128 which is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456.

However if you keep all hashes then thanks to birthday paradox probability is a bit higher. To have 50% chance of any hash colliding you need 2^64 hashes. This means that to get a collision, on average, you'll need to hash 6 billion files per second for 100 years.

Source: porneL, https://stackoverflow.com/questions/201705/how-many-random-elements-before-md5-produces-collisions

Share:
5,724

Related videos on Youtube

Konner Rasmussen
Author by

Konner Rasmussen

Updated on September 18, 2022

Comments

  • Konner Rasmussen
    Konner Rasmussen almost 2 years

    When using md5sum to verify the integrity of a file, how accurate is the process?

    Does a verified MD5 mean that EVERY bit is exactly the same, or is there a threshold that must be broken before binary alteration is reflected in the MD5?

    Any documentation on how an md5 is generated would also be appreciated.

    • choroba
      choroba about 10 years
    • Konner Rasmussen
      Konner Rasmussen about 10 years
      @choroba if i am reading all of this correctly, then the probability of undetected alteration increases with the size of the file being verified. However it is still a probability and 100% certainty can only be had with a 16 byte file. Am i correct?
    • Doktoro Reichard
      Doktoro Reichard about 10 years
      @Konner I read your question that dealt with file transfer. For that purpose, it is safe. The chances of a bit being corrupted on a transfer are small and it is more likely that you lose some part of the copied file on the transfer, assuming I understood your previous situation. MD5 is in effort equivalent to comparing both files, with some advantage in the fact that you don't need to access both files at the same time.
    • Michael Martinez
      Michael Martinez about 10 years
      If I really want to make sure two files are identical, I will do a "cmp -l".... Just lettin' you know
    • Konner Rasmussen
      Konner Rasmussen about 10 years
      @MichaelMartinez that would require two copies of the file, which could prove to be unreasonable if the file is too large. furthermore i am aware of the cmp command. i appreciate the input though... =)
    • Plutor
      Plutor about 10 years
      "Does a verified MD5 mean that EVERY bit is exactly the same." Note that this is impossible, thanks to the Pigeonhole Principle. en.wikipedia.org/wiki/Pigeonhole_principle
    • Mooing Duck
      Mooing Duck about 10 years
      I don't think you get a guarantee for 16 byte files either.
  • Itai
    Itai about 10 years
    @KonnerRasmussen - Correct. You can verify a backup with MD5 for example but it may happen that an error goes through undetected.
  • Michael Lorton
    Michael Lorton about 10 years
    @KonnerRasmussen -- it's not a matter of the degree of repercussions, it's the nature of the threat. If you are worried that two documents might accidentally have the same MD5, stop worrying: the odds are considerably higher that your computers will spontaneously burst into flames; if you are worried an intelligent attacker might produce a document that matches one you already have, that's a serious concern and you should get a better hash; if you are worried an intelligent attacker might produce a two documents that match each other, don't "worry": it will certainly happen.
  • Shadur
    Shadur about 10 years
    Of course, while generating an MD5 hash collision is theoretically possible, generating a useful collision (as in, the colliding file is the same type of file and its contents are at least plausibly authentic) is a lot harder...
  • MSalters
    MSalters about 10 years
    @Shadur: That used to be the case, but ongoing security research has discovered new ways to generate MD5 collisions which make that easier. In particular, if your file format allows chunks of free-format "comment" data it's possible to match any MD5 hash by inserting a suitable comment.
  • Holloway
    Holloway about 10 years
    So you're saying there's still a chance? :p
  • deroby
    deroby about 10 years
    The trouble is that although you NEED to hash 6 billion files per second for 100 years to be certain that you find a collision; it could well happen within the first second.
  • Zsolt Szilagy
    Zsolt Szilagy about 10 years
    True, it could happen the first second. But as always it´s about balancing reasons. Ther might be military applications where that risk is not acceptable, but I would rent a car without second thought where the airbag sensors use md5. Remember, chances are that you get hit by lightning multiple times while waiting for a md5 collision.
  • fischi
    fischi about 10 years
    I am sorry, but half of 2^128 is 2^127, not 2^64.
  • obecalp
    obecalp about 10 years
    @fischi, read the link on the birthday paradox, and also en.wikipedia.org/wiki/Birthday_attack - it is not a matter of simply halving the total number of hashes. Given search space H, the number of hashes you have to generate before getting a 50% chance of a collision is approximately sqrt((pi/2) * H). If you do that math with 2^128, you'll get a number approximately 2^64
  • Barmar
    Barmar about 10 years
    While it may be theoretically possible to generate two files with the same hash, it may be practically infeasible. Especially if the replacement file should also make sense. If the original file was English text, for instance, there may not be any other match that's also English. Or if it's an Excel spreadsheet, none of the other files with the same hash would be valid spreadsheets.
  • philfr
    philfr about 10 years
    @Barmar : win.tue.nl/hashclash/Nostradamus these guys created a number of pdf files with the same MD5 hash to prove it was practically feasible.