How many random elements before MD5 produces collisions?
Solution 1
The probability of just two hashes accidentally colliding is 1/2^128, which is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456.
However, if you keep all the hashes then the probability is a bit higher thanks to the birthday paradox. To have a 50% chance of any hash colliding with any other hash you need roughly 2^64 hashes. This means that to get a collision, on average, you'll need to hash 6 billion files per second for 100 years.
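The arithmetic behind those figures can be checked with a short script. This is only a sketch: it assumes MD5 output behaves like a uniformly random 128-bit value, uses the standard birthday approximation 1 - exp(-n^2 / (2 * space)), and the function names are made up for illustration.

```python
import math

HASH_BITS = 128                         # MD5 digest size in bits
SPACE = 2 ** HASH_BITS                  # number of possible digests
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: average year length


def collision_probability(n, space=SPACE):
    """Approximate P(at least one collision) among n uniformly random hashes.

    Birthday approximation: 1 - exp(-n^2 / (2 * space)).
    expm1 keeps precision when the probability is tiny.
    """
    return -math.expm1(-(n * n) / (2 * space))


def hashes_per_second(total, years):
    """Rate needed to produce `total` hashes within `years` years."""
    return total / (years * SECONDS_PER_YEAR)


print(f"{hashes_per_second(2 ** 64, 100):.3g} hashes/second for 100 years")
print(f"P(collision) with 2^64 hashes: {collision_probability(2 ** 64):.3f}")
print(f"P(collision) with 10^12 hashes: {collision_probability(10 ** 12):.3g}")
```

Under these assumptions the rate comes out a little under 6 billion hashes per second, and a trillion stored hashes still leaves the collision probability vanishingly small.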
Solution 2
S3 can have subdirectories. Just put a "/" in the key name, and you can access the files as if they were in separate directories. I use this to store user files in separate folders based on their user ID in S3.
For example: "mybucket/users/1234/somefile.jpg". It's not exactly the same as a directory in a file system, but the S3 API has some features that let it work almost the same. I can ask it to list all files that begin with "users/1234/" and it will show me all the files in that "directory".
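As a rough illustration of the idea, the sketch below builds keys of that shape and filters them by prefix locally. The helper names are hypothetical and the filter is only a stand-in for the real listing call; with boto3, for instance, `list_objects_v2` accepts a `Prefix` parameter that plays this role.

```python
def user_key(user_id, filename):
    """Build an S3 object key that looks like a path under the user's folder."""
    return f"users/{user_id}/{filename}"


def list_prefix(keys, prefix):
    """Local stand-in for a prefix listing: keep keys that start with prefix."""
    return sorted(k for k in keys if k.startswith(prefix))


keys = [
    user_key(1234, "somefile.jpg"),
    user_key(1234, "avatar.png"),
    user_key(5678, "somefile.jpg"),
]

# Only the two keys under "users/1234/" are returned.
print(list_prefix(keys, "users/1234/"))
```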
Solution 3
So wait, is it:
md5(filename) + timestamp
or:
md5(filename + timestamp)
If the former, you are most of the way to a GUID, and I wouldn't worry about it. If the latter, then see Karg's post about how you will run into collisions eventually.
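To make the two readings concrete, here is a small sketch of both. The function names and the "-" separator are assumptions for illustration, not something the question specifies.

```python
import hashlib


def name_hash_then_timestamp(filename, timestamp):
    """md5(filename) + timestamp: the timestamp stays visible in the name."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return f"{digest}-{timestamp}"


def name_hash_of_both(filename, timestamp):
    """md5(filename + timestamp): everything is folded into one 128-bit digest."""
    return hashlib.md5(f"{filename}{timestamp}".encode("utf-8")).hexdigest()


url = "http://example.com/a.jpg"   # hypothetical source URL
print(name_hash_then_timestamp(url, 1583107200))
print(name_hash_of_both(url, 1583107200))
```

In the first form two names can only clash if both the digest and the timestamp match; in the second form the name is a single 128-bit digest, so the usual birthday bound applies.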
Solution 4
A rough rule of thumb for collisions is the square root of the range of values. Your MD5 sig is presumably 128 bits long, so you're likely to start seeing collisions once you have on the order of 2^64 images.
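The square-root rule is easy to observe on a deliberately shrunken hash. The sketch below truncates MD5 to a few bits (an assumption made purely so the experiment finishes quickly) and counts how many inputs are hashed before the first repeat.

```python
import hashlib


def first_collision(bits):
    """Hash "0", "1", "2", ... with MD5 truncated to `bits` bits.

    Returns how many inputs had been hashed when the first repeat appeared.
    """
    seen = set()
    mask = (1 << bits) - 1
    i = 0
    while True:
        digest = hashlib.md5(str(i).encode("ascii")).digest()
        value = int.from_bytes(digest, "big") & mask
        i += 1
        if value in seen:
            return i
        seen.add(value)


for bits in (16, 20, 24):
    print(bits, "bits: first repeat after", first_collision(bits),
          "inputs; square root of range is", 2 ** (bits // 2))
```

The counts should land within a small multiple of the square root of the range, which is the behaviour the rule of thumb predicts for the full 128-bit digest.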
Solution 5
Although random MD5 collisions are exceedingly rare, if your users can provide files (that will be stored verbatim) then they can engineer collisions to occur. That is, they can deliberately create two files with the same MD5sum but different data. Make sure your application can handle this case in a sensible way, or perhaps use a stronger hash like SHA-256.
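One way to handle this case sensibly is sketched below: name each file after a SHA-256 digest and refuse to overwrite an existing entry whose bytes differ. The `storage` dict is a stand-in for the real backend and the function names are hypothetical.

```python
import hashlib


def content_key(data):
    """Name a file after the SHA-256 digest of its bytes."""
    return hashlib.sha256(data).hexdigest()


def store(storage, data):
    """Store data under its digest; refuse to overwrite different bytes.

    `storage` is a plain dict standing in for the real backend.
    """
    key = content_key(data)
    existing = storage.get(key)
    if existing is not None and existing != data:
        raise ValueError(f"hash collision on key {key}")
    storage[key] = data
    return key


files = {}
print(store(files, b"first upload"))
print(store(files, b"first upload"))   # same bytes, same key, no error
```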
Ben Throop
Updated on March 02, 2020
Comments
-
Ben Throop over 4 years
I've got an image library on Amazon S3. For each image, I md5 the source URL on my server plus a timestamp to get a unique filename. Since S3 can't have subdirectories, I need to store all of these images in a single flat folder.
Do I need to worry about collisions in the MD5 hash value that gets produced?
Bonus: How many files could I have before I'd start seeing collisions in the hash value that MD5 produces?
-
glerYbo over 6 years
-
Rick James over 6 years
The literal answer is that the second file could have the same MD5 as the first. However, the odds are extremely small.
-
Will Dean over 15 years
There may of course be many other bad things which can happen with a probability of 1/2^128. You might not want to single out this one to worry about.
-
JesperE over 15 years
You probably mean 128 bits, not 2^128. :-)
-
Jim C over 15 years
The worst thing that can happen here is you can get a photo. For a relatively small number I would not worry. Now if your software is controlling an autopilot landing an aircraft, that's another story.
-
Stefan over 15 years
en.wikipedia.org/wiki/Birthday_Problem Some more information about the problem.
-
Kornel over 15 years
You can't be serious. You'll need to hash 6 billion files per second, every second for 100 years to get a good chance of collision. Even if you're very, very unlucky, it would probably take more than the entire capacity of S3 used for longer than a human lifetime.
-
Sam Saffron about 15 years
The only problem I have with taylor's example is that if someone gets a copy of your database they could probably figure out the credit card numbers using a rainbow table ...
-
acrosman about 15 years
While I wouldn't choose to use MD5 for credit cards, a rainbow table of all valid credit card numbers between 10,000,000 (8 digits being the smallest length credit card I've seen) and 9,999,999,999,999,999 (largest 16 digit number) is still a big table to generate. There are probably easier ways to steal those numbers.
-
Artelius almost 15 years
It's billions of times more likely that your database and its backups will all fail. Collisions are not worth worrying about.
-
Mathias Bynens over 14 years
+1 for adding the calculation. This is slightly more accurate:
http://www.google.com/search?q=2^64%2F100*(seconds+per+year)
-
ConcernedOfTunbridgeWells about 13 years
Not strictly true. The probability of a collision is much higher than this, as a new URL could potentially collide with any existing item in the table. See this posting (disclaimer, I wrote it) for a run-down on the maths, and a small Python script that can be adapted to compute the probability for a particular number of URLs.
-
Kornel about 13 years
@ConcernedOfTunbridgeWells: I did correct for the birthday paradox, which is why the answer is in billions, not quintillions. I was unable to verify the probability with your script: with PV=2**128; SS=2**64 it fails with OverflowError: long int too large to convert to int
-
BlueRaja - Danny Pflughoeft about 11 years
"probability of collision is 1/2^64" - what? The probability of collision is dependent on the number of items already hashed, it's not a fixed number. In fact, it's equal to exactly 1 - sPn/s^n, where s is the size of the search space (2^128 in this case), and n is the number of items hashed. What you are probably thinking of is 2^64, which is the approximate number of items you'd need to MD5 hash to have a 50% chance of collision.
-
Kornel about 11 years
@BlueRaja-DannyPflughoeft that's what I had in mind indeed. Thanks for the correction.
-
Kmeixner over 10 years
+1 because I've always wanted to know how to count past 999 trillion lol (and oh yeah, your answer was informative)
-
Bradley Thomas almost 10 years
Please elaborate on how including the timestamp increases the chance of collision.
-
StackOverflowed over 9 years
Using a salt would take care of the user engineering problem, no?
-
Vincent Hubert over 9 years
@BradThomas: It does not. The MD5 risk of collision is the same whether it is on the filename or the combination of filename+timestamp. But in the first scenario, you would need to have both an MD5 collision and a timestamp collision.
-
bdonlan over 9 years
It depends on how the salt is applied. It would need to be a prefix of the user-supplied data, or better yet the key for an HMAC. It's still probably a good idea to practice defense in depth though.
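A minimal sketch of the HMAC variant, using Python's standard `hmac` module. The key value and function name are placeholders; the point is only that the name depends on a secret the uploader does not know, so two files prepared to collide under plain MD5 no longer map to the same name.

```python
import hashlib
import hmac


def keyed_name(secret_key, data):
    """Derive a name from user data with HMAC-SHA256 under a server-side key."""
    return hmac.new(secret_key, data, hashlib.sha256).hexdigest()


SECRET_KEY = b"replace-with-a-random-server-side-secret"   # placeholder value
print(keyed_name(SECRET_KEY, b"user supplied bytes"))
```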
-
Ian Clark over 9 years
This should be a comment I think, as it doesn't actually answer the question about the likelihood of a collision.
-
Jørgen Fogh over 9 years
Unfortunately, you are still not correct. You are assuming that the hash function is truly random. It is not. This means that the collision probability is higher.
-
Kornel over 9 years
JørgenFogh: And all laws of physics are "not correct" either. Such a level of pedantry is unnecessary because it doesn't change the answer in any meaningful way.
-
Berry M. over 7 years
This still leaves a 2^(128^60) chance of a collision with two users per minute. Literally unusable.
-
Rick James almost 7 years
Many of the other Answers talk about the probability of a collision when adding one more item. I think my Answer is more useful because it talks about the probability of the entire table having a dup.
-
Ry Biesemeyer over 6 years
(This means that to get a collision, on average, you'll need to hash 6 billion files per second for 100 years.); incorrect. This means that by the time you've been hashing 6 billion files per second for 100 years, 50% of the hashes you are generating would collide with previously-generated hashes.
-
Kornel over 6 years
@yaauie No, that's ridiculously impossible. I'm talking about generating 2^64 hashes out of 2^128 possible ones. That's one quadrillionth of a percent of all possible hashes generated.
-
robocat over 6 years
@BradThomas To be clearer: md5(filename) + timestamp reduces the collision risk massively because you would need to have an md5 collision for exactly the same timestamp to have a collision overall. md5(filename + timestamp) is the same as md5(filename), assuming that filename is random to start with (because adding more randomness to something random only changes the individual md5 result and the birthday problem still exists across all the md5 hashes).
-
robocat over 6 years
Intuitively if we ignore the birthday paradox and just look at an approximate solution: Add 2^64 hashes into a list. Now add one more hash to that list. That one more hash has 1 / 2^128 times 2^64 chance of a collision, i.e. that one more hash has a 1 / 2^64 chance of a collision. Now add another 2^64 hashes to the list and you should get a collision. Do the same calculation for 2^63 (and note 2^63 + 2^63 = 2^64).
-
robocat over 6 years
Note although SHA256 is 256 bits long, you can trade off the risk of collisions with the length of the key you are storing by truncating the SHA256 to fewer bits e.g. use SHA256 but truncate it to 128 bits (which is more secure than using MD5 even though they have the same number of bits).
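The truncation described here could look like the following sketch. Keeping the leading 16 bytes is an assumption (any fixed 128-bit slice would serve), and the function name is made up.

```python
import hashlib


def sha256_128(data):
    """Return the first 128 bits of SHA-256 as 32 hex characters."""
    return hashlib.sha256(data).digest()[:16].hex()


# Same length as an MD5 hex digest, so it fits the same key column.
print(sha256_128(b"abc"))
```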
-
vargonian about 6 years
So you’re saying there’s a chance!
-
polvoazul about 6 years
Use the collision prevention time building a bunker to put your server in! Those pesky meteors can hit you (very unlikely, but possible), so you'll need to support meteor shelter from the beginning.
-
user327961 about 6 years
It would take 100 years to get a 50% chance of collision at 6G files / sec. You have a good chance of collision decades earlier.
-
Joonas Alhonen over 5 years
This has nothing to do with MD5 and is not correct. It's like saying that if you have 9 trillion cats there is a 1 in 9 trillion chance that someone else has an identical cat. The key problem here is that you can get the same hash with more than one value.
-
Rick James over 5 years
@JoonasAlhonen - Yes, that is true. And a lot of poor people use that as an excuse to buy yet another Lottery ticket they cannot afford.
-
Amirhosein Al almost 4 years
Can I use this hash algorithm for filenames? Like hash the contents of files, set the name of those files to their respective hashes and store them in a directory? Maximum number of files in the directory at the same time is around 3000.
-
Kornel almost 4 years
@AmirhoseinAl yes, for all practical purposes it will be as unique as the filenames.
-
rumata28 almost 4 years
The bad thing is that someone could upload colliding files ON PURPOSE, which may lead to bugs or, even worse, a security breach; for example, it could allow one file to be overwritten with another. avira.com/en/blog/md5-the-broken-algorithm
-
Anurag Vohra almost 4 years
Does this mean "Don't worry"? My DB primary keys are MD5 hashes!
-
Kornel almost 4 years
@AnuragVohra Yes, you don't have to worry. The most probable collision there is an asteroid hitting earth.
-
bartolo-otrit over 3 years
If we take 2^64 random hashes out of 2^128, then according to the approximate formula from Birthday attack we have a 0.39 chance that at least one value is chosen more than once, whereas we would need to choose 2.2 * 10^19 hashes for a 50% chance of at least one collision (see the table in the article).
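For reference, the two figures quoted here follow from the usual birthday approximation. Writing s for the number of possible hash values and n for the number of hashes drawn, and assuming the hashes are uniformly random, a sketch of the derivation is:

```latex
P(n) = 1 - \frac{{}_{s}P_{n}}{s^{n}}
     = 1 - \prod_{k=0}^{n-1}\left(1 - \frac{k}{s}\right)
     \approx 1 - e^{-n(n-1)/(2s)}
     \approx 1 - e^{-n^{2}/(2s)}

% n = \sqrt{s} = 2^{64}:
P\left(2^{64}\right) \approx 1 - e^{-1/2} \approx 0.39

% Solving P(n) = 1/2 for n:
n_{1/2} \approx \sqrt{2 s \ln 2} \approx 1.18\,\sqrt{s}
        \approx 1.18 \times 2^{64} \approx 2.2 \times 10^{19}
```

So 2^64 is the right order of magnitude for the 50% point, and the exact threshold is about 18% larger.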