How come MD5 hash values are not reversible?

38,554

Solution 1

The input material can be an infinite length, where the output is always 128 bits long. This means that an infinite number of input strings will generate the same output.

If you pick a random number and divide it by 2 but only write down the remainder, you'll get either a 0 or 1 -- even or odd, respectively. Is it possible to take that 0 or 1 and get the original number?

Solution 2

If hash functions such as MD5 were reversible then it would have been a watershed event in the history of data compression algorithms! Its easy to see that if MD5 were reversible then arbitrary chunks of data of arbitrary size could be represented by a mere 128 bits without any loss of information. Thus you would have been able to reconstruct the original message from a 128 bit number regardless of the size of the original message.

Solution 3

Contrary to what the most upvoted answers here emphasize, the non-injectivity (i.e. that there are several strings hashing to the same value) of a cryptographic hash function caused by the difference between large (potentially infinite) input size and fixed output size is not the important point – actually, we prefer hash functions where those collisions happen as seldom as possible.

Consider this function (in PHP notation, as the question):

function simple_hash($input) {
     return bin2hex(substr(str_pad($input, 16), 0, 16));
}

This appends some spaces, if the string is too short, and then takes the first 16 bytes of the string, then encodes it as hexadecimal. It has the same output size as an MD5 hash (32 hexadecimal characters, or 16 bytes if we omit the bin2hex part).

print simple_hash("stackoverflow.com");

This will output:

737461636b6f766572666c6f772e636f6d

This function also has the same non-injectivity property as highlighted by Cody's answer for MD5: We can pass in strings of any size (as long as they fit into our computer), and it will output only 32 hex-digits. Of course it can't be injective.

But in this case, it is trivial to find a string which maps to the same hash (just apply hex2bin on your hash, and you have it). If your original string had the length 16 (as our example), you even will get this original string. Nothing of this kind should be possible for MD5, even if you know the length of the input was quite short (other than by trying all possible inputs until we find one that matches, e.g. a brute-force attack).

The important assumptions for a cryptographic hash function are:

  • it is hard to find any string producing a given hash (preimage resistance)
  • it is hard to find any different string producing the same hash as a given string (second preimage resistance)
  • it is hard to find any pair of strings with the same hash (collision resistance)

Obviously my simple_hash function fulfills neither of these conditions. (Actually, if we restrict the input space to "16-byte strings", then my function becomes injective, and thus is even provable second-preimage resistant and collision resistant.)

There now exist collision attacks against MD5 (e.g. it is possible to produce a pair of strings, even with a given same prefix, which have the same hash, with quite some work, but not impossible much work), so you shouldn't use MD5 for anything critical. There is not yet a preimage attack, but attacks will get better.

To answer the actual question:

What is it about these functions that makes the resulting strings impossible to retrace?

What MD5 (and other hash functions build on the Merkle-Damgard construction) effectively do is applying an encryption algorithm with the message as the key and some fixed value as the "plain text", using the resulting ciphertext as the hash. (Before that, the input is padded and split in blocks, each of this blocks is used to encrypt the output of the previous block, XORed with its input to prevent reverse calculations.)

Modern encryption algorithms (including the ones used in hash functions) are made in a way to make it hard to recover the key, even given both plaintext and ciphertext (or even when the adversary chooses one of them). They do this generally by doing lots of bit-shuffling operations in a way that each output bit is determined by each key bit (several times) and also each input bit. That way you can only easily retrace what happens inside if you know the full key and either input or output.

For MD5-like hash functions and a preimage attack (with a single-block hashed string, to make things easier), you only have input and output of your encryption function, but not the key (this is what you are looking for).

Solution 4

Cody Brocious's answer is the right one. Strictly speaking, you cannot "invert" a hash function because many strings are mapped to the same hash. Notice, however, that either finding one string that gets mapped to a given hash, or finding two strings that get mapped to the same hash (i.e. a collision), would be major breakthroughs for a cryptanalyst. The great difficulty of both these problems is the reason why good hash functions are useful in cryptography.

Solution 5

MD5 does not create a unique hash value; the goal of MD5 is to quickly produce a value that changes significantly based on a minor change to the source.

E.g.,

"hello" -> "1ab53"
"Hello" -> "993LB"
"ZR#!RELSIEKF" -> "1ab53"

(Obviously that's not actual MD5 encryption)

Most hashes (if not all) are also non-unique; rather, they're unique enough, so a collision is highly improbable, but still possible.

Share:
38,554

Related videos on Youtube

barfoon
Author by

barfoon

PHP & iOS developer based out of Ottawa.

Updated on February 14, 2021

Comments

  • barfoon
    barfoon over 3 years

    One concept I've always wondered about is the use of cryptographic hash functions and values. I understand that these functions can generate a hash value that is unique and virtually impossible to reverse, but here's what I've always wondered:

    If on my server, in PHP I produce:

    md5("stackoverflow.com") = "d0cc85b26f2ceb8714b978e07def4f6e"
    

    When you run that same string through an MD5 function, you get the same result on your PHP installation. A process is being used to produce some value, from some starting value.

    Doesn't this mean that there is some way to deconstruct what is happening and reverse the hash value?

    What is it about these functions that makes the resulting strings impossible to retrace?

    • Gab Royer
      Gab Royer almost 15 years
      A simple example of non reversible value for example is modulo. For example 10 % 3 = 1, but you can't reverse the 1 to 10 as it could also be 4
    • Dan Diplo
      Dan Diplo almost 15 years
      If you could reconstruct the data you'd have the most efficient lossless compression algorithm ever :)
  • Federico A. Ramponi
    Federico A. Ramponi over 15 years
    That is to say, neither number --> remainder nor string --> md5 are "injective functions".
  • barfoon
    barfoon over 15 years
    Ahh yes. I enjoyed reading Jeff's post on Hash Tables (codinghorror.com/blog/archives/000949.html), and this thread has helped in the understanding of the concept.
  • Mihai Limbășan
    Mihai Limbășan over 15 years
    Federico, surely you mean that neither are bijective functions? They are both injective.
  • biozinc
    biozinc over 15 years
    moocha: Injective means 1 to 1. The MD5 is certainly not 1 to 1, as the domain is larger than the range. Another point worth noting is that given a MD5 checksum, it's very hard to find even one string that hashes to it. Might be worth adding to the answer for clarification.
  • Colin Pickard
    Colin Pickard about 15 years
    think how quick it would be to download linux distros if you could just get the md5 instead :)
  • Ishbir
    Ishbir almost 15 years
    @Colin Pickard: we wouldn't be downloading linux distros any more, we would be writing them down. :)
  • Rob Sobers
    Rob Sobers almost 15 years
    You state that "an infinite number of strings will generate the same output." Isn't that contradictory to the statement in the question that these types of functions "generate a hash value that is unique." In your random number example produces a value that is almost never unique. I'm confused.
  • Serafina Brocious
    Serafina Brocious almost 15 years
    It's impossible to have a hash function that generates unique values. You're mapping an infinite number of values into a finite number of values, which guarantees collisions.
  • surfer01
    surfer01 about 14 years
    There are 365 possible birthday values, which is between 2^8 and 2^9. A 128-bit hash has 2^128 possible values -- 2^120 times as many. Yes, collisions are more likely than you might intuit, but they are still astronomically unlikely.
  • surfer01
    surfer01 about 14 years
    Ah, yes, the hash of a commonplace 7-letter word. Now use it to figure out this 11-word song lyric with whitespace and punctuation: 9f2c08d4e6158bd4854b15be50c8daa8. See you in several millenia.
  • scherand
    scherand about 14 years
    6fba2bbab8a8366309bf67c7df12c622? Hint: it might be the OEM version of a specific version of Mac OS X!
  • Babar
    Babar about 14 years
    @Tim Keating, @scherand: Just pointing out the weakness of hash algorithms, because hash of a string is always same we don't necessarily need to crack the algorithm to figure out the actual string.
  • surfer01
    surfer01 about 14 years
    But that's not what you said. You said that hashes are "precomputed for all possible strings and stored for easy access" which is patently false (the set of "all possible strings" is infinite... and even the set of "all plausible strings" is really, really large). IMHO this misrepresents how easy it is to do a dictionary attack against a reasonable passphrase.
  • Mike Pelley
    Mike Pelley almost 14 years
    I'd suggest that your answer does not address the key point. As biozinc mentioned, what's important for a secure password hash is that you can't find any input that creates the output, not that you can't find the original input. On that note, MD5 is not necessarily as secure as it could be (en.wikipedia.org/wiki/MD5#Collision_vulnerabilities).
  • Paŭlo Ebermann
    Paŭlo Ebermann over 12 years
    The same non-injectivity property would be valid for substr($input." ", 0, 16) (with 16 spaces), but this certainly is not a good hash function (here the preimage attacks are trivial).
  • Paŭlo Ebermann
    Paŭlo Ebermann over 12 years
    You'll need about 2^64 different values to have a good chance at a hash collision. Still quite some.
  • Paŭlo Ebermann
    Paŭlo Ebermann over 12 years
    While image-resizing is a lossy process, it is still quite easy to produce an image in the original 5000×5000 size which will (when applying the shrinking function again) reduce to the same 32×32 image. Finding such a preimage should be hard for a good hash function.
  • Paŭlo Ebermann
    Paŭlo Ebermann over 12 years
    Yes, I know that this is a quite late answer, but the accepted answer should not be let standing this way.
  • Dave L.
    Dave L. about 12 years
    Does MD5 truly operate on an infinitely long input value? Or is it just that there are infinitely many possible input values (each of finite length)?
  • wwaawaw
    wwaawaw over 11 years
    Right, but based on the information in your answer, it would seem that hash functions are inherently vulnerable to collisions. I realize that MD6 is notoriously so, but ones generally understood to be secure supposedly aren't. How do you make the leap from a mod 2 operation to constructing an actual hash function that doesn't have a 50% chance of two randomly given inputs colliding?
  • wwaawaw
    wwaawaw over 11 years
    @DaveL., I think that once you cap the length, you automatically cap the number of permutations that could conceivably fit within that length constraint. Think about it: if you have X bits as a maximum length, that represents a number expressed in base 2. If you forget that its binary for a second and just think about decimal ("normal") numbers, and say that you have 8 digits in which to express a number, the highest number that can be expressed is going to be 99999999 (youtu.be/0Ba9HAKxw9M), and the lowest would be 00000000...
  • wwaawaw
    wwaawaw over 11 years
    @DaveL. There are then only 100000000 numbers that can ever be expressed in that many places, and the same holds true when you're dealing with binary.
  • Dave L.
    Dave L. over 11 years
    @adlwalrus I don't think you understood this distinction. For example, there are an infinite number of integers, but every integer is of finite length, or takes a finite number of bits to represent. Similarly, I would expect that MD5 can operate on an infinite number of possible input values, but not on an input value of infinite length (it would never terminate or produce output).
  • wwaawaw
    wwaawaw over 11 years
    Beiing a total cryptonoob and all, I'm not really a qualified person to answer the question, and I don't know the answer for sure, but I would be willing to bet that there is no cap on input value length, and I'll tell you why. There's a cap on input length for RSA asymmetric encryption, but it's overcome with a hack by using symmetric and then asymmetrically encrypting the symmetric key, which is of finite length. If there were any such limitation, I'm sure that the developers of any hash algorithm would find a similarly clever coping mechanism for such cases, like breaking it up into...
  • wwaawaw
    wwaawaw over 11 years
    ...blocks, and then concatenating and hashing the hashes of those blocks until it's within the input length limit. Given that I'm not a cryptographer and I could think of that on the spot, I'm sure the hash devs could have thought of something better and I can't really see any reason why they might not use it, so like I said I'd be willing to bet that they did use something to the effect of what I came up with.
  • Serafina Brocious
    Serafina Brocious over 11 years
    Hash functions are inherently vulnerable to collisions; good hash functions are simply more evenly distributed and thus their collision potential is closer to optimal. No hash is invulnerable to collisions.
  • David Schwartz
    David Schwartz over 11 years
    @DaveL.: He's using "infinite" in an informal way to mean "unlimited".
  • Olathe
    Olathe about 11 years
    This answer is incorrect and goes very far in conflating a function that nearly is a truly-irreversible function like f(x) = 1 with a hash function used for, say, passwords. If you have several candidates for passwords like password and 8ZDEyCJ!WM and tXhN7TW&zq, it's not exactly hard to figure out which is the original. This isn't just a theoretical thing: such statistical regularities in human language are used to crack all kinds of cryptographic things.
  • Serafina Brocious
    Serafina Brocious about 11 years
    Olathe: By that regard, f(x) = x % 5 is reversible. What you're saying is that, given f(x), y=f(x'), and a set of possible x's, you can find the x' that matches y; that's completely true. But it doesn't make the function reversible in any way. Hashes -- even broken hashes -- are irreversible. That doesn't mean it isn't possible to bruteforce f() or find a preimage attack against it.
  • Paŭlo Ebermann
    Paŭlo Ebermann about 11 years
    @CodyBrocious The "goodness" of a hash function is not mainly measured by its evenness of distribution (a simple remainder function might do better), but by the hardness (in terms of computing time) of finding collisions, preimages, or other attacks.
  • Paŭlo Ebermann
    Paŭlo Ebermann about 11 years
    Where do you get your Most ciphers simply XOR the data with the keystream knowledge? This is true for stream ciphers, but there are block ciphers, too, and they don't work this way.
  • Sandeep Datta
    Sandeep Datta almost 9 years
    I think your criticisms have some merit but you have failed to answer the actual question "What is it about these functions that makes the resulting strings impossible to retrace?" Your answer focuses on the qualities a cryptographic hash should have but has zero explanation of how they are implemented by md5. You could state the exact algorithm for computing MD5 sums here to show how it is not reversible but the other answers do provide a simpler explanation without going into the nitty-gritties.
  • Sandeep Datta
    Sandeep Datta almost 9 years
    (cont...) 2. These explanations use "Math" to show a fundamental problem due to which such operations loose information and become irreversible.
  • Paŭlo Ebermann
    Paŭlo Ebermann over 8 years
    @SandeepDatta I added some paragraphs about this.
  • Shane
    Shane over 7 years
    The real value of "hashed values" is not to provide human-unreadable placeholders. If 'password1' hashed to 'newval', does that still not hide the value in a similar way, although the hash is readable and meaningful? Furthermore, passwords are a BAD example, because they should NEVER be hashed. Assuming the attacker had write access to said database, that is definitely a possibility. However it seems you are merely discarding the proper use for such hashing functions, one example is outlined in the many answers above - message integrity. It is the reason I'm on this thread today, actually.
  • Dave L.
    Dave L. about 7 years
    If you were able to generate an input that produced the given MD5 hash value in any reasonably efficient fashion, that would be a big deal to the crypto community and should be published. That's completely independent of whether a particular input was salted.
  • Justin J Stark
    Justin J Stark over 5 years
    While other answer in this thread are more technically correct, this answer is the most useful. The non-injective function f(x)=1 is non-reversible but uninteresting. The usefulness of hashing lies in preimage resistance where it is difficult to find any input yielding a specific output.