Is it possible to get identical SHA1 hash?

57,087

Solution 1

What you describe is called a collision. Collisions necessarily exist, since SHA-1 accepts many more distinct messages as input that it can produce distinct outputs (SHA-1 may eat any string of bits up to 2^64 bits, but outputs only 160 bits; thus, at least one output value must pop up several times). This observation is valid for any function with an output smaller than its input, regardless of whether the function is a "good" hash function or not.

Assuming that SHA-1 behaves like a "random oracle" (a conceptual object which basically returns random values, with the sole restriction that once it has returned output v on input m, it must always thereafter return v on input m), then the probability of collision, for any two distinct strings S1 and S2, should be 2^(-160). Still under the assumption of SHA-1 behaving like a random oracle, if you collect many input strings, then you shall begin to observe collisions after having collected about 2^80 such strings.

(That's 2^80 and not 2^160 because, with 2^80 strings you can make about 2^159 pairs of strings. This is often called the "birthday paradox" because it comes as a surprise to most people when applied to collisions on birthdays. See the Wikipedia page on the subject.)

Now we strongly suspect that SHA-1 does not really behave like a random oracle, because the birthday-paradox approach is the optimal collision searching algorithm for a random oracle. Yet there is a published attack which should find a collision in about 2^63 steps, hence 2^17 = 131072 times faster than the birthday-paradox algorithm. Such an attack should not be doable on a true random oracle. Mind you, this attack has not been actually completed, it remains theoretical (some people tried but apparently could not find enough CPU power)(Update: as of early 2017, somebody did compute a SHA-1 collision with the above-mentioned method, and it worked exactly as predicted). Yet, the theory looks sound and it really seems that SHA-1 is not a random oracle. Correspondingly, as for the probability of collision, well, all bets are off.

As for your third question: for a function with a n-bit output, then there necessarily are collisions if you can input more than 2^n distinct messages, i.e. if the maximum input message length is greater than n. With a bound m lower than n, the answer is not as easy. If the function behaves as a random oracle, then the probability of the existence of a collision lowers with m, and not linearly, rather with a steep cutoff around m=n/2. This is the same analysis than the birthday paradox. With SHA-1, this means that if m < 80 then chances are that there is no collision, while m > 80 makes the existence of at least one collision very probable (with m > 160 this becomes a certainty).

Note that there is a difference between "there exists a collision" and "you find a collision". Even when a collision must exist, you still have your 2^(-160) probability every time you try. What the previous paragraph means is that such a probability is rather meaningless if you cannot (conceptually) try 2^160 pairs of strings, e.g. because you restrict yourself to strings of less than 80 bits.

Solution 2

Yes it is possible because of the pigeon hole principle.

Most hashes (also sha1) have a fixed output length, while the input is of arbitrary size. So if you try long enough, you can find them.

However, cryptographic hash functions (like the sha-family, the md-family, etc) are designed to minimize such collisions. The best attack known takes 2^63 attempts to find a collision, so the chance is 2^(-63) which is 0 in practice.

Solution 3

git uses SHA1 hashes as IDs and there are still no known SHA1 collisions in 2014. Obviously, the SHA1 algorithm is magic. I think it's a good bet that collisions don't exist for strings of your length, as they would have been discovered by now. However, if you don't trust magic and are not a betting man, you could generate random strings and associate them with your IDs in your DB. But if you do use SHA1 hashes and become the first to discover a collision, you can just change your system to use random strings at that time, retaining the SHA1 hashes as the "random" strings for legacy IDs.

Solution 4

A collision is almost always possible in a hashing function. SHA1, to date, has been pretty secure in generating unpredictable collisions. The danger is when collisions can be predicted, it's not necessary to know the original hash input to generate the same hash output.

For example, attacks against MD5 have been made against SSL server certificate signing last year, as exampled on the Security Now podcast episode 179. This allowed sophisticated attackers to generate a fake SSL server cert for a rogue web site and appear to be the reaol thing. For this reason, it is highly recommended to avoid purchasing MD5-signed certs.

Solution 5

What you are talking about is called a collision. Here is an article about SHA1 collisions: http://www.rsa.com/rsalabs/node.asp?id=2927

Edit: So another answerer beat me to mentioning the pigeon hole principle LOL, but to clarify this is why it's called the pigeon hole principle, because if you have some holes cut out for carrier pigeons to nest in, but you have more pigeons than holes, then some of the pigeons(an input value) must share a hole(the output value).

Share:
57,087

Related videos on Youtube

Andriy Drozdyuk
Author by

Andriy Drozdyuk

I like AI and in particular Reinforcement Learning. I used to like Erlang, Scala, python, Akka and Zeromq! Did my undergrad in Computer Science at University of Toronto. Did my masters in Data Mining at University of New Brunswick. Working as a programmer at NRC and doing PhD part time at Carleton University.

Updated on July 05, 2022

Comments

  • Andriy Drozdyuk
    Andriy Drozdyuk almost 2 years

    Given two different strings S1 and S2 (S1 != S2) is it possible that:

    SHA1(S1) == SHA1(S2)
    

    is True?

    1. If yes - with what probability?
    2. If not - why not?
    3. Is there a upper bound on the length of a input string, for which the probability of getting duplicates is 0? OR is the calculation of SHA1 (hence probability of duplicates) independent of the length of the string?

    The goal I am trying to achieve is to hash some sensitive ID string (possibly joined together with some other fields like parent ID), so that I can use the hash value as an ID instead (for example in the database).

    Example:

    Resource ID: X123
    Parent ID: P123
    

    I don't want to expose the nature of my resource identifies to allow client to see "X123-P123".

    Instead I want to create a new column hash("X123-P123"), let's say it's AAAZZZ. Then the client can request resource with id AAAZZZ and not know about my internal id's etc.

    • kennytm
      kennytm over 14 years
      Since the length of SHA-1 is 160 bits the probability is always higher than 2^-160 against a fixed string.
    • uKolka
      uKolka over 14 years
      Assuming a brute force attack, and uniform distribution
    • Sky Sanders
      Sky Sanders over 14 years
      Yes, collisions are possible. But you are more likely to win PowerBall. Much more likely. I cannot see how this could be a concern unless you are using hash as a key, which, while quite an attractive proposition, is not advisable.
    • Andriy Drozdyuk
      Andriy Drozdyuk over 14 years
      Indeed, i am thinking of using hash as a id for my resource.
    • tony9099
      tony9099 over 9 years
      the idea looks nice mate, but why? you can though.
    • Patrick Bassut
      Patrick Bassut over 7 years
      @tony9099 not just an idea now, is it
    • tony9099
      tony9099 over 7 years
      sha1 is officially broken now. Engineers from google did it. have a look security.googleblog.com/2017/02/…
    • 1615903
      1615903 over 7 years
      Nominated for reopening since there now exists a non-bruteforce attack to create identical sha1 hashes.
    • Mikko Rantalainen
      Mikko Rantalainen over 2 years
      For an example of a situation where SHA-1 results idential result for inputs S1 and S2 that are different, see shattered.io – generating that hash collision required combination of 6500 CPU years and 100 GPU years of work so generating additional collisions is not (yet) considered easy. Google paid for that project to demonstrate in practice that SHA-1 must not be used for cryptography purposes anymore. It's still fine for checksumming to e.g. verify if file was not corrupted during transfer.
  • AaronLS
    AaronLS over 14 years
    I was just going to mention the pigeon hole principle when I edited my answer. LOL I am going to take it back out cause it looks like I copied you.
  • Nick Johnson
    Nick Johnson over 14 years
    The fact that the best attack takes 2^63 work doesn't mean that the chance of any two hashes colliding is 1 in 2^63 - all known attacks require constructing two colliding messages, not finding a preimage.
  • Andriy Drozdyuk
    Andriy Drozdyuk over 14 years
    So... what you are saying is that for strings with length more than 80 bits the probably of at least one collision exists? What if I take 40 bit strings, or 20 bit strings? I know it is silly - but is there a function that will ensure that there are NO duplicates? I guess this is going to be just a regular mapping function, instead of a hash (i.e. substitute digit 1 with A etc..) - which makes my use of SHA1 kind of pointless for this purpose.
  • President James K. Polk
    President James K. Polk over 14 years
    What purpose? You haven't stated a purpose. There are cryptographic primitives that are permutations, such as block ciphers in ECB mode and RSA.
  • Andriy Drozdyuk
    Andriy Drozdyuk over 14 years
    Sorry - added goal (purpose) to the question. basically hiding sensitive information.
  • Thomas Pornin
    Thomas Pornin over 14 years
    @drozzy: if you take 40-bit input strings, then there is a very low probability that two of them hash to the same value. But you cannot know for sure without trying them exhaustively (with 40-bit strings, this is doable, with a big PC and a few days or weeks). Of course, if you have no collision, then what you really have is a mapping from all your strings to distinct 160-bit values. For "hiding sensitive information", the keyword is "confidentiality" and you should research encryption instead of hashing.
  • Andriy Drozdyuk
    Andriy Drozdyuk over 14 years
  • andora
    andora about 14 years
    I find it helps to realize that the odds of winning most national public lotteries is astoundingly small, yet the prize is often paid out each week or within a few weeks, so no matter how small the odds, they will occur. 2^-x may be a very small number, but there in no certainty either way, that collision will or won't occur. Or to be exact: 2^-x is neither one nor zero. Similarly, although the chances are slight, there is nothing to stop hash(1) matching hash(2) no matter what you do to constrain it.
  • Tim Seguine
    Tim Seguine over 9 years
    At the time of writing it is still an open theoretical question if one-way functions even exist. If they don't, then every hash function computible in polynomial time is vulnerable to collision attacks.
  • bodgesoc
    bodgesoc over 8 years
    One thing that I don't think has been mentioned here is that whilst there are a truly astonishing number of byte streams that will collide with any given byte stream very, very few of those will be valid names (as in your example) or valid code (as in the git example). So, the probability is 2^(-160) of a collision in an truly random input-space, but the input-space is generally not random. I am not confident enough in my statistical skills to state authoritatively that this makes a collision less likely, but I think it probably does, and to a very significant extent.
  • Dave L.
    Dave L. over 7 years
    As of today there are now published collisions for SHA1: shattered.it
  • tony9099
    tony9099 over 7 years