Probability of hash collision

12,502

Let's imagine we have a truly random hash function that hashes from strings to n-bit numbers. This means that there are 2n possible hash codes, and each string's hash code is chosen uniformly at random from all of those possibilities.

The birthday paradox specifically says that once you've seen roughly √(2k) items, there's a 50% chance of a collision, where k is the number of distinct possible outputs. In the case where the hash function hashes to an n-bit output, this means that you'll need roughly 2n/2 hashes before you get a collision. This is why we typically pick hashes that output 256 bits; it means that we'd need a staggering 2128 ≈1038 items hashed before there's a "reasonable" chance of a collision. With a 512-bit hash, you'd need about 2256 to get a 50% chance of a collision, and 2256 is approximately the number of protons in the known universe.

The exact formula for the probability of getting a collision with an n-bit hash function and k strings hashed is

1 - 2n! / (2kn (2n - k)!)

This is a fairly tricky quantity to work with directly, but we can get a decent approximation of this quantity using the expression

1 - e-k2/2n+1

So, to get (roughly) a probability p chance of a collision, we can solve to get

p ≈ 1 - e-k2/2n+1

1 - p ≈ e-k2/2n+1

ln(1 - p) ≈ -k2/2n+1

-ln(1 - p) ≈ k2/2n+1

-2n+1 ln(1 - p) ≈ k2

2(n+1)/2 √(-ln(1 - p)) ≈ k

As one last approximation, assume we're dealing with very small choices of p. Then ln(1 - p) ≈ -p, so we can rewrite this as

k ≈ 2(n+1)/2 √p

Notice that there's still a monster 2(n+1)/2 term here, so for a 256-bit hash that leading term is 2128.5, which is just enormous. For example, how many items must we see to get a 2-50 chance of a collision with a 256-bit hash? That would be approximately

2(256+1)/2 √2-50

= 2257/2 2-50/2

= 2207/2

= 2103.5.

So you'd need a staggeringly huge number of hashes to have a vanishingly small chance of getting a collision. Figure that 2103.5 is about 1031, which at one nanosecond per hash computed would take you longer than the length of the universe to compute. And after all that, you'd get a success probability of 2-50, which is about 10-15.

In fact, this precisely why we pick such large numbers of bits for our hashes! It makes it extremely unlikely for a collision to occur by chance.

(Note that the hash functions we have today aren't actually truly random functions, which is why people advise against using MD5, SHA1, and others that have had security weaknesses exposed.)

Hope this helps!

Share:
12,502
Dark Nebula
Author by

Dark Nebula

Updated on June 15, 2022

Comments

  • Dark Nebula
    Dark Nebula about 2 years

    I am looking for some precise math on the likelihood of collisions for MD5, SHA1, and SHA256 based on the birthday paradox.

    I am looking for something like a graph that says "If you have 10^8 keys, this is the probability. If you have 10^13 keys, this is the probability and so on"

    I have looked at tons of articles but I am having a tough time finding something that gives me this data. (Ideal option for me would be a formula or code that calculates this for any provided hash size)

  • Joachim Sauer
    Joachim Sauer almost 4 years
    Awesome answer! Even though the parenthetical note at the end is technically outside the scope of the question, I'd personally highlight it a bit more, because it means that judging a hashing algorithm purely by the size of its output is a bad idea, because they are known to not actually be perfect.
  • Dark Nebula
    Dark Nebula almost 4 years
    This is awesome and gives the exact math I was looking for, in a way that I can work it out easily for any custom values. I also want to call out the wikipedia article mentioned in another comment (en.wikipedia.org/wiki/Birthday_problem#Probability_table) which shows some general values for this.
  • foki
    foki over 3 years
    "The exact formula for the probability of getting a collision with an n-bit hash..." No, it is not the formula from the birthday paradox. The formula that you have here is exactly the opposite; what is the probability not to have any collisions.
  • templatetypedef
    templatetypedef over 3 years
    @foki Thanks for catching that! Turns out I had two separate errors here that cancelled out. First, there’s the one you pointed out. Second, the approximation I’d given wasn’t the approximation for the quantity I’d written, but for the quantity I’d intended to write. So fixing the original formula made all the rest of the logic still consistent and correct - at least, I think that’s now fixed. :-)
  • likern
    likern over 2 years
    You have error in last calculations
  • likern
    likern over 2 years
    207/2 is 103.5, not 153.5.