How to calculate the odds of a collision in hash algorithms?
Solution 1
This is a generalization of the Birthday problem.
Solution 2
That sounds a lot like the Birthday Paradox to me.
You should be able to just substitute the set of possible birthdays (365) with the possible hashes (50000) and run the same calculations they present there.
If you modify the python script presented in the article for your values:
def bp(n, d):
v = 1.0
for i in range(n):
v = v * (1 - float(i)/d)
return 1 - v
print bp(2, 50000)
You end up with the odds of collision on two numbers of 0.00002. Around 265 samples, you have around a 50% chance of having a collision.
Solution 3
First calculate the probability that there is not a collision:
hashes_picked = 100
single_collision_odds = 50000
# safe_combinations is number of ways to pick hashes that don't overlap
safe_combinations = factorial(single_collision_odds) / factorial(single_collision_odds - hashes_picked)
# all_combinations is total number of ways to pick hashes
all_combinations = single_collision_odds ** hashes_picked
collision_chance = (all_combinations - safe_combinations) / all_combinations
Solution 4
This is called the Birthday problem. To solve it, think instead about the odds of there being no collisions (call it pnc).
- pnc(1) = 1
- pnc(2) = 1 - pc(2)
- pnc(3) = pnc(2) * pnc(2) * pnc(2)
Comments
-
izb about 2 years
Say I have a hash algorithm, and it's nice and smooth (The odds of any one hash value coming up are the same as any other value).
Now say that I know that the odds of picking 2 hashes and there being a collision are (For arguments sake) 50000:1.
Now say I pick 100 hashes. How do I calculate the odds of a collision within that set of 100 values, given the odds of a collision in a set of 2?
What is the general solution to this, so that I can come up with a number of hash attempts after which the odds fall below some acceptable threshold? E.g. I can say things like "A batch of 49999 hash value creations has a high chance of collision".
-
CiscoIPPhone over 15 yearsA perfect hash algorithm is one where there are no collisions. I just thought I'd point it out. Apologies for being picky :-)
-
B Bulfin over 15 yearsAssuming the domain of the hash function is larger than the range, that's impossible. If it's not, why bother using a hash?
-
CiscoIPPhone over 15 yearsWell, you still get the speed advantage of using a hash function instead of searching. en.wikipedia.org/wiki/Perfect_hash_function
-
B Bulfin over 15 yearsOh, my apologies. I didn't realize "perfect hash" had an actual definition in math. I'm used to the software definition where we hash strings into 32 bit ints and the like.
-
Pete Kirkham over 15 yearsThe wikipedia article is the software definition. The term for a hash which has even likelihood of any value is 'smooth'.
-
Xeoncross over 10 years
-
-
Xeoncross over 10 yearsIs this a correct port? I got 5.9 as the probability.
-
B Bulfin over 10 yearsIt means power or exponent operator.
2 ** 3 == 8
.