Is it safe to ignore the possibility of SHA collisions in practice?
Solution 1
The usual answer goes thus: what is the probability that a rogue asteroid crashes on Earth within the next second, obliterating civilization-as-we-know-it, and killing off a few billion people? It can be argued that any unlucky event with a probability lower than that is not actually very important.
If we have a "perfect" hash function with output size n, and we have p messages to hash (individual message length is not important), then probability of collision is about p2/2n+1 (this is an approximation which is valid for "small" p, i.e. substantially smaller than 2n/2). For instance, with SHA-256 (n=256) and one billion messages (p=109) then the probability is about 4.3*10-60.
A mass-murderer space rock happens about once every 30 million years on average. This leads to a probability of such an event occurring in the next second to about 10-15. That's 45 orders of magnitude more probable than the SHA-256 collision. Briefly stated, if you find SHA-256 collisions scary then your priorities are wrong.
In a security setup, where an attacker gets to choose the messages which will be hashed, then the attacker may use substantially more than a billion messages; however, you will find that the attacker's success probability will still be vanishingly small. That's the whole point of using a hash function with a 256-bit output: so that risks of collision can be neglected.
Of course, all of the above assumes that SHA-256 is a "perfect" hash function, which is far from being proven. Still, SHA-256 seems quite robust.
Solution 2
The possibility of a collision does not depend on the size of the files, only on their number.
This is an example of the birthday paradox. The Wikipedia page gives an estimate of the likelihood of a collision. If you run the numbers, you'll see that all harddisks ever produced on Earth can't hold enough 1MB files to get a likelihood of a collision of even 0.01% for SHA-256.
Basically, you can simply ignore the possibility.
Solution 3
First of all, it is not zero, but very close to zero.
The key question is what happens if a collision actually occurs? If the answer is "a nuclear power plant will explode" then you likely shouldn't ignore the collision possibility. In most cases the consequences are not that dire and so you can ignore the collision possibility.
Also don't forget that you software (or a tiny part of it) might be deployed and simultaneously used in a gazillion of computers (some tiny embedded microcomputers that are almost everywhere nowadays included). In such case you need to multiply the estimate you've got by the largest possible number of copies.
Related videos on Youtube
Hristo Hristov
Updated on May 31, 2020Comments
-
Hristo Hristov about 4 years
Let's say we have a billion unique images, one megabyte each. We calculate the SHA-256 hash for the contents of each file. The possibility of collision depends on:
- the number of files
the size of the single file
How far can we go ignoring this possibility, assuming it is zero?
-
mojuba over 13 yearsIt depends on what you are using the hash keys for. If it's some kind of file identification, then a collision may as well mean the files are identical and thus you need to compare the files too in cases of collision. I'd say it would be fairly safe to just compare the file sizes.
-
Hristo Hristov over 13 yearsYes, in this case, if you compare file sizes, the possibility drastically decreases. You can also use two hashing algorithms and concatenate the results. Then, the possibility of a collision of both at the same time decreases more. But, the question is, how much is "fairly" safe? Maybe we need a formula and numbers.
-
mojuba over 13 years@Hristo Hristov: if we assume that the hash key is a pseudo random number (which theoretically is correct) then one billion of 128-bit keys gives a collision probability of 2.9 * 10^-30. You can't even call it "miniscule", it's less than that ;)
-
Michael Borgwardt over 13 years@mojuba: even better, he's asking about a 256-bit hash.
-
snemarch over 13 yearsFWIW: the GIT version control system identifies files by their content SHA.
-
Mikko Rantalainen over 11 yearsYou asked about file of one megabyte each. In that case the size of the file does not matter for collisions. I believe that SHA-256 is designed to not do collisions if the files are shorter than the hash so one should not get collisions if the number of files is less than
2^256-1
and the maximum file size is less than 256 bits. -
hanshenrik about 5 yearsif you want an actual number, there is an 0.0000000000000000000000000000000000000000000000000000000004% chance of hitting the birthday paradox for sha256 with 1 billion files (according to the algorithm at stackoverflow.com/a/56498881/1067003 )
-
Ham about 3 yearsAgree with @mojuba , it is always good to consider whether data loss is acceptable in your application , if you don't want to run the risk of losing the file (e.g. from your customers) , then you will need to handle the collision and duplicate files, even the probability of collision is very close to zero, yes it extremely unlikely happenes, but it is NOT exactly zero, nobody knows when would collision happen to your application and when would attacker comes up with new methods that reduce significant amount of time to collide
-
sharptooth over 13 yearsI can't agree with the conclusion. Yes, no hardisks can store that number of files, but you IMO misinterpret the situation. It only takes two files to produce a collision. Although the possibility is very low it still can happen.
-
Michael Borgwardt over 13 years@sharptooth: no, I'm not misrepresenting the situation. The possibility of you and everyone you know dying from a road accident on the same day is very low, but it still can happen (and it is much higher than that of an SHA-256 collision). Yet you are ignoring that possibility.
-
sharptooth over 13 years@Michael Borgwardt: You're right, but I don't ignore that possibility - I take certain steps that make the possibility lower. Also in case of a road accident the head count very rarely exceeds say one hundred people, but in case of a nuclear plant disaster head count can easily exceed hundreds thousands of people.
-
Michael Borgwardt over 13 years@sharptooth: I was talking about separate, simultaneous road accidents of a few hundred specific people. You can't really take any stepts to make that lower. It would be pointless, as it's already bizarrely low. But still so much more likely than an SHA-256 collision that you can't even imagine how much. It's the same argument as Thomas made.
-
Hristo Hristov over 13 yearsThis is a very good answer, thanks! But, if in case of collision a nuclear power plant will explode, and it depends on you, will you take that risk? If you are completely right, then we can take the risk, because it is 45 orders of magnitude more probable the civilization to be destroyed. Right?
-
sharptooth over 13 years@Hristo Hristov: My guess is that all cases of such reasoning imply that if such disaster happens either noone survives or noone will take time to find who is at fault and punish that person.
-
sharptooth over 13 years@Michael Borgwardt: I gave this problem a thought and I still believe you're wrong. You don't take the number of program instances into account. Yes, maybe a chance of the program malfunctioning is very low, but if you have a gazillion copies - say on every plastic bag and on every tiny manufactured thing chances grow significantly. I don't say that every such defect is automatically a showstopper, but the number of deployed copies should be taken into account.
-
Michael Borgwardt over 13 years@sharptooth: No, the chances do not grow significantly, because the number is still absolutely dwarfed by the size of the SHA-256 hash space. This is the one thing you're not taking into account properly - all the factors have to be weighted by their actual magnitude, not equally. If you generated one billion hashes per second for every single person on Earth, and did that for one thousand years, you'd still have less than 1% chance of a collision.
-
sharptooth over 13 years@Michael Borgwardt: Yes, I understand that, but those factors need to be taken into account. Just saying that probability of one hash collision is very low regardless of the number of program copies and computation rate is not enough.
-
Roman Starkov over 13 years@Hristo I think yes, one would take that risk. A nuclear power plant already has a far higher chance of exploding due to other things, like mechanical failure, human error in building it or operator error while running it, and we are already taking those chances. If SHA-256 collisions were the only things causing nuclear incidents we would almost certainly have had exactly zero of them so far.
-
Dustin Oprea over 11 yearsfoxnews.com/science/2013/02/11/… I'd start thinking about SHA512.
-
Andreas Spindler almost 11 years...not by the # of copies, but the # of datasets all copies digest.
-
AaronLS almost 11 yearsI can now rest easy knowing that I'll likely be wiped out by an asteroid long before I live to experience an SHA-256 collision.
-
Waldheinz over 10 yearsThis is written nicely, but it's wrong by (by several orders of magnitude, so speaking). You're missing the implications of the birthday paradox. The Wikipedia article about it has a nice table to check the chances: en.wikipedia.org/wiki/Birthday_attack
-
Thomas Pornin over 10 yearsSorry, you are missing on the so-called "birthday paradox". Have a better look at the "nice table", it does not work the way you think. For the figures I give, in that table, it would be a value "10^9" in a column labelled "4.3*10^-60" and row "128 bits" (but the table does not go below 10^-18).
-
rxantos about 10 yearsWhat if the asteroid is aligned and push toward earth? The same way that a hacker will try any known posibility to increment their probabilities. Would you still trust it?
-
sudo almost 10 yearsAlso, what is the probability of more than one bit in RAM randomly flipping, avoiding ECC correction?
-
Thomas Pornin almost 10 yearsECC correction is avoided if two bits in the same word are flipped (and, depending on the bits, it may still be detected, though not corrected). In practice, "bad RAM" happens, though rarely when it is ECC, but it is not unheard of. An overheated bus controller is more probable, though.
-
Christophe over 8 yearsIf you don't check for the possibility of an uncorrected error on every fetch from memory or read from disk (which have a far higher probability than an SHA-256 collision), you may not fully understand probabilities.
-
robocat almost 8 yearsEven with the birthday paradox, you would only find a collision between two random files. Most usages of hashes require the attacker to craft their own file to replace yours (e.g. an executable) which is altogether a different bowl of atomic nuclei.
-
Dirk Bester almost 8 yearsThis is wrong, the number of copies of software running is irrelevant. The only thing that matters is the number of unique files that are processed and the birthday paradox is the math for the calculation.
-
Chris Middleton over 7 yearsI heard someone else mention that the likelihood of hardware failure - i.e. a bit flipping somewhere due to radiation, etc - is more likely than a hash collision, and hence, worrying about the hash collision is silly. Personally, I'd try to cover both cases, to be safe (the more safety in a nuclear power plant the better), but hash collisions would be probably very low on the list of potential dangers (assuming the hash space is large enough). However, this all assumes that there is not some hidden behavior in the hash function which causes collisions more frequently.
-
Volodymyr Boiko over 7 years
-
sharptooth over 7 years@GreenTree The thing you linked to is about deliberately crafting collisions.
-
hanshenrik about 5 yearsindeed it's not zero, it's 0.0000000000000000000000000000000000000000000000000000000004% chance of hitting the birthday paradox for sha256 with 1 billion files (according to the algorithm at stackoverflow.com/a/56498881/1067003 )
-
Carlo Wood over 4 yearsIf the answer is "a nuclear power plant will explode" then you can still totally ignore the probability that a collision will happen. Even when it kills all of humanity in the next second, it can still be ignored; because that chance is unimaginably much smaller than that that actually happens due to a astroid (see the accepted answer).
-
xdavidliu about 3 yearsthis answer is quite wrong, especially "If the answer is "a nuclear power plant will explode" then you likely shouldn't ignore the collision possibility.", as many others here have explained.
-
ACEGL almost 3 years@Thomas Pornin: So given all SHA-256 hashes that have been produced globally since its inception, by all computer systems worldwide, can we assume that the number of identical hashes that were produced for different messages is still zero ?
-
DexieTheSheep over 2 yearsThis is when programming turns into philosophy haha
-
Jeremy Giaco over 2 yearsbasically, it would take roughly 15 trillion years at the current total btc hash rate (204xh/s) to get to the number of hashes stated in @sharptooth 's example of 10b people doing a billion hashes per second for a thousand years. Hopefully i mathed right, but it's a ridiculous number of operations. Granted, as the hashrate increases exponentially over time (assuming PoW is still a thing and btc sticks around), we will approach a rate at which sha256 will be tested to the extent that the example implied.