How many random elements before MD5 produces collisions?

93,917

Solution 1

The probability of just two hashes accidentally colliding is 1/2^128, which is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456.

However, if you keep all the hashes, the probability is a bit higher thanks to the birthday paradox. To have a 50% chance of any hash colliding with any other hash you need roughly 2^64 hashes. This means that to get a collision, on average, you'll need to hash 6 billion files per second for 100 years.
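
The figures above can be checked with a short calculation. This is a minimal sketch that assumes MD5 output behaves like a uniformly random 128-bit value and uses the standard birthday approximation 1 - exp(-n(n-1)/(2N)); the function names are made up for illustration.

```python
import math

HASH_BITS = 128            # MD5 digest size in bits
SPACE = 2 ** HASH_BITS     # number of possible digests

def collision_probability(n: int, space: int = SPACE) -> float:
    """Approximate chance of at least one collision among n random hashes."""
    # 1 - exp(-n(n-1) / (2 * space)); expm1 keeps precision for tiny results.
    return -math.expm1(-n * (n - 1) / (2 * space))

def hashes_for_probability(p: float, space: int = SPACE) -> float:
    """Approximate number of hashes needed to reach collision probability p."""
    return math.sqrt(2 * space * math.log(1 / (1 - p)))

SECONDS_PER_YEAR = 365.25 * 24 * 3600

print(collision_probability(2 ** 64))       # about 0.39
print(hashes_for_probability(0.5))          # about 2.2e19
print(2 ** 64 / (100 * SECONDS_PER_YEAR))   # about 5.8e9 hashes per second
```

Under this approximation 2^64 hashes give a little under a 40% chance, and an even chance needs slightly more, so "2^64" is best read as an order of magnitude.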

Solution 2

S3 can have subdirectories. Just put a "/" in the key name, and you can access the files as if they were in separate directories. I use this to store user files in separate folders based on their user ID in S3.

For example: "mybucket/users/1234/somefile.jpg". It's not exactly the same as a directory in a file system, but the S3 API has some features that let it work almost the same. I can ask it to list all files that begin with "users/1234/" and it will show me all the files in that "directory".
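
To illustrate the idea without touching AWS, here is a small in-memory sketch of prefix listing over a flat key space. It is only a stand-in: `list_prefix` and the sample keys are hypothetical, and the real S3 listing call takes the prefix as a request parameter instead.

```python
def list_prefix(keys, prefix):
    """Return the keys that start with prefix, sorted, like a prefix-filtered listing."""
    return sorted(k for k in keys if k.startswith(prefix))

# Hypothetical flat key space: there are no real folders, only key names.
keys = [
    "users/1234/somefile.jpg",
    "users/1234/other.png",
    "users/12345/photo.jpg",
    "users/99/avatar.gif",
]

# The trailing "/" matters: without it, "users/12345/..." would match as well.
print(list_prefix(keys, "users/1234/"))
```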

Solution 3

So wait, is it:

md5(filename) + timestamp

or:

md5(filename + timestamp)

If the former, you are most of the way to a GUID, and I wouldn't worry about it. If the latter, then see Karg's post about how you will run into collisions eventually.
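
The two readings can be written out as follows. This is a sketch with made-up function names, and the "-" separator and integer timestamp are assumptions; the question does not say how the pieces are joined.

```python
import hashlib

def name_hash_plus_timestamp(filename: str, timestamp: int) -> str:
    """md5(filename) + timestamp: the timestamp stays visible in the key."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return f"{digest}-{timestamp}"

def name_hash_of_both(filename: str, timestamp: int) -> str:
    """md5(filename + timestamp): everything is folded into one 128-bit digest."""
    return hashlib.md5(f"{filename}{timestamp}".encode("utf-8")).hexdigest()

# Hypothetical source URL and Unix timestamp.
url = "http://example.com/images/cat.jpg"
print(name_hash_plus_timestamp(url, 1230768000))
print(name_hash_of_both(url, 1230768000))
```

With the first form, two keys clash only if both the digest and the timestamp match. With the second form, the key is a single digest, so the usual birthday bound applies across all keys.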

Solution 4

A rough rule of thumb for collisions is the square root of the range of values. Your MD5 signature is presumably 128 bits long, so you are likely to start seeing collisions once you reach around 2^64 images.
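
The square-root rule is easy to observe on a scaled-down hash. The sketch below truncates MD5 to 16 bits (an arbitrary choice so that it runs quickly) and counts how many distinct inputs are hashed before the first repeat; the helper names are made up.

```python
import hashlib

def truncated_md5(data: bytes, bits: int) -> int:
    """Keep only the top `bits` bits of the MD5 digest."""
    full = int.from_bytes(hashlib.md5(data).digest(), "big")
    return full >> (128 - bits)

def items_until_collision(bits: int, trial: int) -> int:
    """Hash distinct inputs until a truncated digest repeats; return the count."""
    seen = set()
    count = 0
    while True:
        h = truncated_md5(f"{trial}:{count}".encode(), bits)
        count += 1
        if h in seen:
            return count
        seen.add(h)

BITS = 16
TRIALS = 200
average = sum(items_until_collision(BITS, t) for t in range(TRIALS)) / TRIALS
print(average, 2 ** (BITS / 2))  # average lands near the square root of 2^16
```

The average should come out at the same order of magnitude as 2^8 = 256, which is the square root of the 2^16 possible values. The same reasoning scaled up to 128 bits gives the 2^64 figure.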

Solution 5

Although random MD5 collisions are exceedingly rare, if your users can provide files (that will be stored verbatim) then they can engineer collisions to occur. That is, they can deliberately create two files with the same MD5sum but different data. Make sure your application can handle this case in a sensible way, or perhaps use a stronger hash like SHA-256.
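
A sketch of the stronger-hash option, using only Python's standard library. The function names are hypothetical. The truncated and keyed (HMAC) variants follow suggestions made in the comments and are not something the asker has described using.

```python
import hashlib
import hmac

def content_key(data: bytes) -> str:
    """Name a file after the SHA-256 of its contents (64 hex characters)."""
    return hashlib.sha256(data).hexdigest()

def truncated_content_key(data: bytes, bits: int = 128) -> str:
    """Same digest cut down to `bits` bits, trading key length for collision margin."""
    return hashlib.sha256(data).hexdigest()[: bits // 4]

def keyed_content_key(data: bytes, secret: bytes) -> str:
    """HMAC-SHA-256: users who do not know `secret` cannot precompute colliding inputs."""
    return hmac.new(secret, data, hashlib.sha256).hexdigest()

print(content_key(b"abc"))
print(truncated_content_key(b"abc"))
print(keyed_content_key(b"abc", b"server-side secret"))  # placeholder secret
```

Whichever variant is used, the application should still decide what happens when a key already exists, for example by refusing to overwrite.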

Author by

Ben Throop

Updated on March 02, 2020

Comments

  • Ben Throop
    Ben Throop over 4 years

    I've got an image library on Amazon S3. For each image, I md5 the source URL on my server plus a timestamp to get a unique filename. Since S3 can't have subdirectories, I need to store all of these images in a single flat folder.

    Do I need to worry about collisions in the MD5 hash value that gets produced?

    Bonus: How many files could I have before I'd start seeing collisions in the hash value that MD5 produces?

  • Will Dean
    Will Dean over 15 years
    There may of course be many other bad things which can happen with a probability of 1/2^128. You might not want to single out this one to worry about.
  • JesperE
    JesperE over 15 years
    You probably mean 128 bits, not 2^128. :-)
  • Jim C
    Jim C over 15 years
    The worst thing that can happen here is you can get a photo. For a relatively small number I would not worry. Now if your software is controlling an autopilot landing an aircraft, that's another story.
  • Stefan
    Stefan over 15 years
    en.wikipedia.org/wiki/Birthday_Problem Some more information about the problem.
  • Kornel
    Kornel over 15 years
    You can't be serious. You'll need to hash 6 billion files per second, every second for 100 years to get a good chance of collision. Even if you're very, very unlucky, it would probably take more than the entire capacity of S3 used for longer than a human lifetime.
  • Sam Saffron
    Sam Saffron about 15 years
    The only problem I have with Taylor's example is that if someone gets a copy of your database they could probably figure out the credit card numbers using a rainbow table ...
  • acrosman
    acrosman about 15 years
    While I wouldn't choose to use MD5 for credit cards, a Rainbow table of all valid credit card numbers between 10,000,000 (8 digits being the smallest length credit card I've seen) and 9,999,999,999,999,999 (largest 16 digit number) is still a big table to generate. There are probably easier ways to steal those numbers.
  • Artelius
    Artelius almost 15 years
    It's billions of times more likely that your database and its backups will all fail. Collisions are not worth worrying about.
  • Mathias Bynens
    Mathias Bynens over 14 years
    +1 for adding the calculation. This is slightly more accurate: http://www.google.com/search?q=2^64%2F100*(seconds+per+year)
  • ConcernedOfTunbridgeWells
    ConcernedOfTunbridgeWells about 13 years
    Not strictly true. The probability of a collision is much higher than this as a new URL could potentially collide with any existing item in the table. See This posting (disclaimer, I wrote it) for a run-down on the maths, and a small python script that can be adapted to compute the probability for a particular number of URLs.
  • Kornel
    Kornel about 13 years
    @ConcernedOfTunbridgeWells: I did take correction for birthday paradox, which is why answer is in billions, not quintillions. I was unable to verify probability with your script PV=2**128; SS=2**64: OverflowError: long int too large to convert to int
  • BlueRaja - Danny Pflughoeft
    BlueRaja - Danny Pflughoeft about 11 years
    "probability of collision is 1/2^64" - what? The probability of collision is dependent on the number of items already hashed, it's not a fixed number. In fact, it's equal to exactly 1 - sPn/s^n, where s is the size of the search space (2^128 in this case), and n is the number of items hashed. What you are probably thinking of is 2^64, which is the approximate number of items you'd need to MD5 hash to have a 50% chance of collision.
  • Kornel
    Kornel about 11 years
    @BlueRaja-DannyPflughoeft that's what I had in mind indeed. Thanks for the correction.
  • Kmeixner
    Kmeixner over 10 years
    +1 because I've always wanted to know how to count past a 999 trillion lol (and oh yeah your answer was informative)
  • Bradley Thomas
    Bradley Thomas almost 10 years
    Please elaborate on how including the timestamp increases the chance of collision
  • StackOverflowed
    StackOverflowed over 9 years
    using a salt would take care of the user engineering problem, no?
  • Vincent Hubert
    Vincent Hubert over 9 years
    @BradThomas: It does not. The MD5 risk of collision is the same whether it is on the filename or the combination of filename+timestamp. But in the first scenario, you would need to have both an MD5 collision and a timestamp collision.
  • bdonlan
    bdonlan over 9 years
    It depends on how the salt is applied. It would need to be a prefix of the user-supplied data, or better yet the key for an HMAC. It's still probably a good idea to practice defense in depth though.
  • Ian Clark
    Ian Clark over 9 years
    This should be a comment I think, as it doesn't actually answer the question about the likelihood of a collision.
  • Jørgen Fogh
    Jørgen Fogh over 9 years
    Unfortunately, you are still not correct. You are assuming that the hash function is truly random. It is not. This means that the collision probability is higher.
  • Kornel
    Kornel over 9 years
    @JørgenFogh: And all laws of physics are "not correct" either. Such a level of pedantry is unnecessary because it doesn't change the answer in any meaningful way.
  • Berry M.
    Berry M. over 7 years
    This still leaves a 2^(128^60) chance of a collision with two users per minute. Literally unusable.
  • Rick James
    Rick James almost 7 years
    Many of the other Answers talk about the probability of a collision when adding one more item. I think my Answer is more useful because it talks about the probability of the entire table having a dup.
  • Ry Biesemeyer
    Ry Biesemeyer over 6 years
    (This means that to get a collision, on average, you'll need to hash 6 billion files per second for 100 years.); incorrect. this means that by the time you've been hashing 6 billion files per second for 100 years, 50% of the hashes you are generating would collide with previously-generated hashes.
  • Kornel
    Kornel over 6 years
    @yaauie No, that's ridiculously impossible. I'm talking about generating 2^64 hashes out of 2^128 possible ones. That's one quadrillionth of a percent of all possible hashes generated.
  • robocat
    robocat over 6 years
    @BradThomas To be clearer: md5(filename) + timestamp reduces the collision risk massively because you would need to have an md5 collision for exactly the same timestamp to have a collision overall. md5(filename + timestamp) is the same as md5(filename), assuming that filename is random to start with (because adding more randomness to something random only changes the individual md5 result and the birthday problem still exists across all the md5 hashes).
  • robocat
    robocat over 6 years
    Intuitively if we ignore the birthday paradox and just look at an approximate solution: Add 2^64 hashes into a list. Now add one more hash to that list. That one more hash has 1 / 2^128 times 2^64 chance of a collision, i.e. that one more hash has a 1 / 2^64 chance of a collision. Now add another 2^64 hashes to the list and you should get a collision. Do the same calculation for 2^63 (and note 2^63 + 2^63 = 2^64).
  • robocat
    robocat over 6 years
    Note although SHA256 is 256 bits long, you can trade off the risk of collisions with the length of the key you are storing by truncating the SHA256 to fewer bits e.g. use SHA256 but truncate it to 128 bits (which is more secure than using MD5 even though they have the same number of bits).
  • vargonian
    vargonian about 6 years
    So you’re saying there’s a chance!
  • polvoazul
    polvoazul about 6 years
    Use the collision prevention time to build a bunker for your server! Those pesky meteors can hit you (very unlikely, but possible), so you'll need to support a meteor shelter from the beginning.
  • user327961
    user327961 about 6 years
    It would take 100 years to get a 50% chance of collision at 6G files / sec. You have a good chance of collision decades earlier.
  • Joonas Alhonen
    Joonas Alhonen over 5 years
    This has nothing to do with MD5 and is not correct. It's like saying that if you have 9 trillion cats there is a 1 in 9 trillion chance that someone else has an identical cat. The key problem here is that you can get the same hash with more than one value.
  • Rick James
    Rick James over 5 years
    @JoonasAlhonen - Yes, that is true. And a lot of poor people use that as an excuse to buy yet another Lottery ticket they cannot afford.
  • Amirhosein Al
    Amirhosein Al almost 4 years
    Can I use this hash algorithm for filenames? Like hash the contents of files, set the name of those files to their respective hashes and store them in a directory? Maximum number of files in the directory at the same time is around 3000.
  • Kornel
    Kornel almost 4 years
    @AmirhoseinAl yes, for all practical purposes it will be as unique as the filenames.
  • rumata28
    rumata28 almost 4 years
    The bad thing is that someone could upload colliding files ON PURPOSE, which may lead to bugs or, even worse, a security breach; for example, it could allow one file to be overwritten with another. avira.com/en/blog/md5-the-broken-algorithm
  • Anurag Vohra
    Anurag Vohra almost 4 years
    Does this mean "Don't worry"? My DB primary keys are MD5 hashes!
  • Kornel
    Kornel almost 4 years
    @AnuragVohra Yes, you don't have to worry. The most probable collision there is an asteroid hitting earth.
  • bartolo-otrit
    bartolo-otrit over 3 years
    If we take 2^64 random hashes out of 2^128, then according to the approximate formula from the Birthday attack article we have a 0.39 chance that at least one value is chosen more than once, whereas with 2.2 * 10^19 hashes we have a 50% chance of at least one collision (see the table in the article).
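
The exact expression quoted in the comments, 1 - sPn/s^n, can be compared with the exponential approximation on a space small enough to compute directly. This is a sketch with made-up function names. The classic case of 365 birthdays and 23 people stands in for the hash space, because looping over n items is not feasible when n is near 2^64.

```python
import math

def exact_collision_probability(n: int, space: int) -> float:
    """1 - sPn / s^n, computed as a running product to avoid huge integers."""
    if n > space:
        return 1.0
    no_collision = 1.0
    for i in range(n):
        no_collision *= (space - i) / space
    return 1.0 - no_collision

def approx_collision_probability(n: int, space: int) -> float:
    """Birthday approximation 1 - exp(-n(n-1) / (2 * space))."""
    return -math.expm1(-n * (n - 1) / (2 * space))

print(exact_collision_probability(23, 365))    # about 0.507
print(approx_collision_probability(23, 365))   # about 0.500
```

The two values agree to within about one percentage point even for this tiny space, and the gap shrinks as the space grows, which is why the approximation is normally used for 128-bit hashes.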