Is it safe to ignore the possibility of SHA collisions in practice?

97,521

Solution 1

The usual answer goes thus: what is the probability that a rogue asteroid crashes on Earth within the next second, obliterating civilization-as-we-know-it, and killing off a few billion people? It can be argued that any unlucky event with a probability lower than that is not actually very important.

If we have a "perfect" hash function with output size n, and we have p messages to hash (individual message length is not important), then probability of collision is about p2/2n+1 (this is an approximation which is valid for "small" p, i.e. substantially smaller than 2n/2). For instance, with SHA-256 (n=256) and one billion messages (p=109) then the probability is about 4.3*10-60.

A mass-murderer space rock happens about once every 30 million years on average. This leads to a probability of such an event occurring in the next second to about 10-15. That's 45 orders of magnitude more probable than the SHA-256 collision. Briefly stated, if you find SHA-256 collisions scary then your priorities are wrong.

In a security setup, where an attacker gets to choose the messages which will be hashed, then the attacker may use substantially more than a billion messages; however, you will find that the attacker's success probability will still be vanishingly small. That's the whole point of using a hash function with a 256-bit output: so that risks of collision can be neglected.

Of course, all of the above assumes that SHA-256 is a "perfect" hash function, which is far from being proven. Still, SHA-256 seems quite robust.

Solution 2

The possibility of a collision does not depend on the size of the files, only on their number.

This is an example of the birthday paradox. The Wikipedia page gives an estimate of the likelihood of a collision. If you run the numbers, you'll see that all harddisks ever produced on Earth can't hold enough 1MB files to get a likelihood of a collision of even 0.01% for SHA-256.

Basically, you can simply ignore the possibility.

Solution 3

First of all, it is not zero, but very close to zero.

The key question is what happens if a collision actually occurs? If the answer is "a nuclear power plant will explode" then you likely shouldn't ignore the collision possibility. In most cases the consequences are not that dire and so you can ignore the collision possibility.

Also don't forget that you software (or a tiny part of it) might be deployed and simultaneously used in a gazillion of computers (some tiny embedded microcomputers that are almost everywhere nowadays included). In such case you need to multiply the estimate you've got by the largest possible number of copies.

Share:
97,521

Related videos on Youtube

Hristo Hristov
Author by

Hristo Hristov

Updated on May 31, 2020

Comments

  • Hristo Hristov
    Hristo Hristov about 4 years

    Let's say we have a billion unique images, one megabyte each. We calculate the SHA-256 hash for the contents of each file. The possibility of collision depends on:

    • the number of files
    • the size of the single file

    How far can we go ignoring this possibility, assuming it is zero?

    • mojuba
      mojuba over 13 years
      It depends on what you are using the hash keys for. If it's some kind of file identification, then a collision may as well mean the files are identical and thus you need to compare the files too in cases of collision. I'd say it would be fairly safe to just compare the file sizes.
    • Hristo Hristov
      Hristo Hristov over 13 years
      Yes, in this case, if you compare file sizes, the possibility drastically decreases. You can also use two hashing algorithms and concatenate the results. Then, the possibility of a collision of both at the same time decreases more. But, the question is, how much is "fairly" safe? Maybe we need a formula and numbers.
    • mojuba
      mojuba over 13 years
      @Hristo Hristov: if we assume that the hash key is a pseudo random number (which theoretically is correct) then one billion of 128-bit keys gives a collision probability of 2.9 * 10^-30. You can't even call it "miniscule", it's less than that ;)
    • Michael Borgwardt
      Michael Borgwardt over 13 years
      @mojuba: even better, he's asking about a 256-bit hash.
    • snemarch
      snemarch over 13 years
      FWIW: the GIT version control system identifies files by their content SHA.
    • Mikko Rantalainen
      Mikko Rantalainen over 11 years
      You asked about file of one megabyte each. In that case the size of the file does not matter for collisions. I believe that SHA-256 is designed to not do collisions if the files are shorter than the hash so one should not get collisions if the number of files is less than 2^256-1 and the maximum file size is less than 256 bits.
    • hanshenrik
      hanshenrik about 5 years
      if you want an actual number, there is an 0.0000000000000000000000000000000000000000000000000000000004‌​% chance of hitting the birthday paradox for sha256 with 1 billion files (according to the algorithm at stackoverflow.com/a/56498881/1067003 )
    • Ham
      Ham about 3 years
      Agree with @mojuba , it is always good to consider whether data loss is acceptable in your application , if you don't want to run the risk of losing the file (e.g. from your customers) , then you will need to handle the collision and duplicate files, even the probability of collision is very close to zero, yes it extremely unlikely happenes, but it is NOT exactly zero, nobody knows when would collision happen to your application and when would attacker comes up with new methods that reduce significant amount of time to collide
  • sharptooth
    sharptooth over 13 years
    I can't agree with the conclusion. Yes, no hardisks can store that number of files, but you IMO misinterpret the situation. It only takes two files to produce a collision. Although the possibility is very low it still can happen.
  • Michael Borgwardt
    Michael Borgwardt over 13 years
    @sharptooth: no, I'm not misrepresenting the situation. The possibility of you and everyone you know dying from a road accident on the same day is very low, but it still can happen (and it is much higher than that of an SHA-256 collision). Yet you are ignoring that possibility.
  • sharptooth
    sharptooth over 13 years
    @Michael Borgwardt: You're right, but I don't ignore that possibility - I take certain steps that make the possibility lower. Also in case of a road accident the head count very rarely exceeds say one hundred people, but in case of a nuclear plant disaster head count can easily exceed hundreds thousands of people.
  • Michael Borgwardt
    Michael Borgwardt over 13 years
    @sharptooth: I was talking about separate, simultaneous road accidents of a few hundred specific people. You can't really take any stepts to make that lower. It would be pointless, as it's already bizarrely low. But still so much more likely than an SHA-256 collision that you can't even imagine how much. It's the same argument as Thomas made.
  • Hristo Hristov
    Hristo Hristov over 13 years
    This is a very good answer, thanks! But, if in case of collision a nuclear power plant will explode, and it depends on you, will you take that risk? If you are completely right, then we can take the risk, because it is 45 orders of magnitude more probable the civilization to be destroyed. Right?
  • sharptooth
    sharptooth over 13 years
    @Hristo Hristov: My guess is that all cases of such reasoning imply that if such disaster happens either noone survives or noone will take time to find who is at fault and punish that person.
  • sharptooth
    sharptooth over 13 years
    @Michael Borgwardt: I gave this problem a thought and I still believe you're wrong. You don't take the number of program instances into account. Yes, maybe a chance of the program malfunctioning is very low, but if you have a gazillion copies - say on every plastic bag and on every tiny manufactured thing chances grow significantly. I don't say that every such defect is automatically a showstopper, but the number of deployed copies should be taken into account.
  • Michael Borgwardt
    Michael Borgwardt over 13 years
    @sharptooth: No, the chances do not grow significantly, because the number is still absolutely dwarfed by the size of the SHA-256 hash space. This is the one thing you're not taking into account properly - all the factors have to be weighted by their actual magnitude, not equally. If you generated one billion hashes per second for every single person on Earth, and did that for one thousand years, you'd still have less than 1% chance of a collision.
  • sharptooth
    sharptooth over 13 years
    @Michael Borgwardt: Yes, I understand that, but those factors need to be taken into account. Just saying that probability of one hash collision is very low regardless of the number of program copies and computation rate is not enough.
  • Roman Starkov
    Roman Starkov over 13 years
    @Hristo I think yes, one would take that risk. A nuclear power plant already has a far higher chance of exploding due to other things, like mechanical failure, human error in building it or operator error while running it, and we are already taking those chances. If SHA-256 collisions were the only things causing nuclear incidents we would almost certainly have had exactly zero of them so far.
  • Dustin Oprea
    Dustin Oprea over 11 years
    foxnews.com/science/2013/02/11/… I'd start thinking about SHA512.
  • Andreas Spindler
    Andreas Spindler almost 11 years
    ...not by the # of copies, but the # of datasets all copies digest.
  • AaronLS
    AaronLS almost 11 years
    I can now rest easy knowing that I'll likely be wiped out by an asteroid long before I live to experience an SHA-256 collision.
  • Waldheinz
    Waldheinz over 10 years
    This is written nicely, but it's wrong by (by several orders of magnitude, so speaking). You're missing the implications of the birthday paradox. The Wikipedia article about it has a nice table to check the chances: en.wikipedia.org/wiki/Birthday_attack
  • Thomas Pornin
    Thomas Pornin over 10 years
    Sorry, you are missing on the so-called "birthday paradox". Have a better look at the "nice table", it does not work the way you think. For the figures I give, in that table, it would be a value "10^9" in a column labelled "4.3*10^-60" and row "128 bits" (but the table does not go below 10^-18).
  • rxantos
    rxantos about 10 years
    What if the asteroid is aligned and push toward earth? The same way that a hacker will try any known posibility to increment their probabilities. Would you still trust it?
  • sudo
    sudo almost 10 years
    Also, what is the probability of more than one bit in RAM randomly flipping, avoiding ECC correction?
  • Thomas Pornin
    Thomas Pornin almost 10 years
    ECC correction is avoided if two bits in the same word are flipped (and, depending on the bits, it may still be detected, though not corrected). In practice, "bad RAM" happens, though rarely when it is ECC, but it is not unheard of. An overheated bus controller is more probable, though.
  • Christophe
    Christophe over 8 years
    If you don't check for the possibility of an uncorrected error on every fetch from memory or read from disk (which have a far higher probability than an SHA-256 collision), you may not fully understand probabilities.
  • robocat
    robocat almost 8 years
    Even with the birthday paradox, you would only find a collision between two random files. Most usages of hashes require the attacker to craft their own file to replace yours (e.g. an executable) which is altogether a different bowl of atomic nuclei.
  • Dirk Bester
    Dirk Bester almost 8 years
    This is wrong, the number of copies of software running is irrelevant. The only thing that matters is the number of unique files that are processed and the birthday paradox is the math for the calculation.
  • Chris Middleton
    Chris Middleton over 7 years
    I heard someone else mention that the likelihood of hardware failure - i.e. a bit flipping somewhere due to radiation, etc - is more likely than a hash collision, and hence, worrying about the hash collision is silly. Personally, I'd try to cover both cases, to be safe (the more safety in a nuclear power plant the better), but hash collisions would be probably very low on the list of potential dangers (assuming the hash space is large enough). However, this all assumes that there is not some hidden behavior in the hash function which causes collisions more frequently.
  • Volodymyr Boiko
    Volodymyr Boiko over 7 years
  • sharptooth
    sharptooth over 7 years
    @GreenTree The thing you linked to is about deliberately crafting collisions.
  • hanshenrik
    hanshenrik about 5 years
    indeed it's not zero, it's 0.0000000000000000000000000000000000000000000000000000000004‌​% chance of hitting the birthday paradox for sha256 with 1 billion files (according to the algorithm at stackoverflow.com/a/56498881/1067003 )
  • Carlo Wood
    Carlo Wood over 4 years
    If the answer is "a nuclear power plant will explode" then you can still totally ignore the probability that a collision will happen. Even when it kills all of humanity in the next second, it can still be ignored; because that chance is unimaginably much smaller than that that actually happens due to a astroid (see the accepted answer).
  • xdavidliu
    xdavidliu about 3 years
    this answer is quite wrong, especially "If the answer is "a nuclear power plant will explode" then you likely shouldn't ignore the collision possibility.", as many others here have explained.
  • ACEGL
    ACEGL almost 3 years
    @Thomas Pornin: So given all SHA-256 hashes that have been produced globally since its inception, by all computer systems worldwide, can we assume that the number of identical hashes that were produced for different messages is still zero ?
  • DexieTheSheep
    DexieTheSheep over 2 years
    This is when programming turns into philosophy haha
  • Jeremy Giaco
    Jeremy Giaco over 2 years
    basically, it would take roughly 15 trillion years at the current total btc hash rate (204xh/s) to get to the number of hashes stated in @sharptooth 's example of 10b people doing a billion hashes per second for a thousand years. Hopefully i mathed right, but it's a ridiculous number of operations. Granted, as the hashrate increases exponentially over time (assuming PoW is still a thing and btc sticks around), we will approach a rate at which sha256 will be tested to the extent that the example implied.