How to calculate the odds of a collision in hash algorithms?

12,316

Solution 1

This is a generalization of the Birthday problem.

Solution 2

That sounds a lot like the Birthday Paradox to me.

You should be able to just substitute the set of possible birthdays (365) with the possible hashes (50000) and run the same calculations they present there.

If you modify the python script presented in the article for your values:

 def bp(n, d):
    v = 1.0
    for i in range(n):

         v = v * (1 - float(i)/d)
    return 1 - v

 print bp(2, 50000)

You end up with the odds of collision on two numbers of 0.00002. Around 265 samples, you have around a 50% chance of having a collision.

Solution 3

First calculate the probability that there is not a collision:

hashes_picked = 100
single_collision_odds = 50000

# safe_combinations is number of ways to pick hashes that don't overlap
safe_combinations = factorial(single_collision_odds) / factorial(single_collision_odds - hashes_picked)

# all_combinations is total number of ways to pick hashes
all_combinations = single_collision_odds ** hashes_picked   

collision_chance = (all_combinations - safe_combinations) / all_combinations

Solution 4

This is called the Birthday problem. To solve it, think instead about the odds of there being no collisions (call it pnc).

  • pnc(1) = 1
  • pnc(2) = 1 - pc(2)
  • pnc(3) = pnc(2) * pnc(2) * pnc(2)
Share:
12,316
izb
Author by

izb

Twitter: http://twitter.com/izb

Updated on June 18, 2022

Comments

  • izb
    izb about 2 years

    Say I have a hash algorithm, and it's nice and smooth (The odds of any one hash value coming up are the same as any other value).

    Now say that I know that the odds of picking 2 hashes and there being a collision are (For arguments sake) 50000:1.

    Now say I pick 100 hashes. How do I calculate the odds of a collision within that set of 100 values, given the odds of a collision in a set of 2?

    What is the general solution to this, so that I can come up with a number of hash attempts after which the odds fall below some acceptable threshold? E.g. I can say things like "A batch of 49999 hash value creations has a high chance of collision".

    • CiscoIPPhone
      CiscoIPPhone over 15 years
      A perfect hash algorithm is one where there are no collisions. I just thought I'd point it out. Apologies for being picky :-)
    • B Bulfin
      B Bulfin over 15 years
      Assuming the domain of the hash function is larger than the range, that's impossible. If it's not, why bother using a hash?
    • CiscoIPPhone
      CiscoIPPhone over 15 years
      Well, you still get the speed advantage of using a hash function instead of searching. en.wikipedia.org/wiki/Perfect_hash_function
    • B Bulfin
      B Bulfin over 15 years
      Oh, my apologies. I didn't realize "perfect hash" had an actual definition in math. I'm used to the software definition where we hash strings into 32 bit ints and the like.
    • Pete Kirkham
      Pete Kirkham over 15 years
      The wikipedia article is the software definition. The term for a hash which has even likelihood of any value is 'smooth'.
    • Xeoncross
      Xeoncross over 10 years
  • Xeoncross
    Xeoncross over 10 years
    Is this a correct port? I got 5.9 as the probability.
  • B Bulfin
    B Bulfin over 10 years
    It means power or exponent operator. 2 ** 3 == 8.