Why is it not possible to reverse a cryptographic hash?

hash cryptography md5 cryptographic-hash-function reversing

18,846

Solution 1

MD5 is designed to be cryptographically irreversible. In this case, the most important property is that it is computationally unfeasible to find the reverse of a hash, but it is easy to find the hash of any data. For example, let's think about just operating on numbers (binary files after all, could be interpreted as just a very long number).

Let's say we have the number "7", and we want to take the hash of it. Perhaps the first thing we try as our hash function is "multiply by two". As we'll see, this is not a very good hash function, but we'll try it, to illustrate a point. In this case, the hash of the number will be "14". That was pretty easy to calculate. But now, if we look at how hard it is to reverse it, we find that it is also just as easy! Given any hash, we can just divide it by two to get the original number! This is not a good hash, because the whole point of a hash is that it is much harder to calculate the inverse than it is to calculate the hash (this is the most important property in at least some contexts).

Now, let's try another hash. For this one, I'm going to have to introduce the idea of clock arithmetic. On a clock, there aren't an infinite amount of number. In fact, it just goes from 0 to 11 (remember, 0 and 12 are the same on a clock). So if you "add one" to 11, you just get zero. You can extend the ideas of multiplication, addition, and exponentiation to a clock. For example, 8+7=15, but 15 on a clock is really just 3! So on a clock, you would say 8+7=3! 6*6=36, but on a clock, 36=0! so 6*6=0! Now, for the concept of powers, you can do the same thing. 2^4=16, but 16 is just 4. So 2^4=4! Now, here's how it ties into hashing. How about we try the hash function f(x)=5^x, but with clock arithmetic. As you'll see, this leads to some interesting results. Let's try taking the hash of 7 as before.

We see that 5^7=78125 but on a clock, that's just 5 (if you do the math, you see that we've wrapped around the clock 6510 times). So we get f(7)=5. Now, the question is, if I told you that the hash of my number was 5, would you be able to figure out that my number was 7? Well, it's actually very hard to calculate the reverse of this function in the general case. People much smarter than me have proved that in certain cases, reversing this function is way harder than calculating it forward. (EDIT: Nemo has pointed out that this in fact has not been "proven"; in fact, the only guarantee you get is that a lot of smart people have tried a long time to find an easy way to do so, and none of them have succeeded.) The problem of reversing this operation is called the "Discrete Logarithm Problem". Look it up for more in depth coverage. This is at least the beginning of a good hash function.

With real world hash functions, the idea is basically the same: You find some function that is hard to reverse. People much smarter than me have engineered MD5 and other hashes to make them provably hard to reverse.

Now, perhaps earlier the thought has occurred to you: "it would be easy to calculate the inverse! I'd just take the hash of every number until I found the one that matched!" Now, for the case where the numbers are all less than twelve, this would be feasible. But for the analog of a real-world hash function, imagine all the numbers involved are huge. The idea is that it is still relatively easy to calculate the hash function for these large numbers, but to search through all possible inputs becomes harder much quicker. But what you've stumbled upon is the still a very important idea though: searching through the input space for an input which will give a matching output. Rainbow tables are a more complex variation on the idea, which use precomputed tables of input-output pairs in smart ways in order to make it possible to quickly search through a large number of possible inputs.

Now let's say that you are using a hash function to store passwords on your computer. The idea is this: The computer just stores the hash of the correct password. When a user tries to login, you compare the hash of the input password to the hash of the correct password. If they match, you assume the user has the correct password. The reason this is advantageous is because if someone steals your computer, they still don't have access to your password, just the hash of it. Because the hash function was designed by smart people to be hard to take the reverse of, they can't easily retrieve your password from it.

An attacker's best bet is a bruteforce attack, where they try a bunch of passwords. Just like you might try the numbers less that 12 in the previous problem, an attacker might try all the passwords just composed of numbers and letters less than 7 characters long, or all words which show up in the dictionary. The important thing here is that he can't try all possible passwords, because there are way too many possible 16 character passwords, for example, to ever test. So the point is that an attacker has to restrict the possible passwords he tests, otherwise he will never even check a small percentage of them.

Now, as for a salt, the idea is this: What if two users had the same password? They would have the same hash. If you think about it, the attacker doesn't really have to crack every users password individually. He simply goes through every possible input password, and compares the hash to all the hashes. If it matches one of them, then he has found a new password. What we'd really like to force him to do is calculate a new hash for every user+password combination he wants to check. That's the idea of a salt, is that you make the hash function be slightly different for every user, so he can't reuse a single set of precomputed values for all users. The most straightforward way to do this is to tack on some random string to each user's password before you take the hash, where the random string is different for each user. So, for example, if my password is "shittypassword", my hash might show up as MD5("6n93nshittypassword") and if your password is "shittypassword", your hash might show up as MD5("fa9elshittypassword"). This little bit "fa9el" is called the "salt", and it's different for every user. For example, my salt is "6n93n". Now, this little bit which is tacked on to your password is just stored on your computer as well. When you try to login with the password X, the computer can just calculate MD5("fa9el"+X) and see if it matches the stored hash.

So the basic mechanics of logging in remain unchanged, but for an attacker, they are now faced with a more daunting challenge: rather than a list of MD5 hashes, they are faced with a list of MD5 sums and salts. They essentially have two options:

They can ignore the fact that the hashes are salted, and try to crack the passwords with their lookup table as is. However, the chances that they'll actually crack a password are much reduced. For example, even if "shittypassword" is on their list of inputs to check, most likely "fa9elshittypassword" isn't. In order to get even a small percentage of the probability of cracking a password that they had before, they'll need to test orders of magnitude more possible passwords.
They can recalculate the hashes on a per-user basis. So rather than calculating MD5(passwordguess), for each user X, they calculate MD5( Salt_of_user_X + passwordguess). Not only does this force them to calculate a new hash for each user they want to crack, but also most importantly, it prevents them from being able to use precalculated tables (like rainbow table, for example), because they can't know what Salt_of_user_X is before hand, so they can't precalculate the hashes to test.

So basically, if they are trying to use precalculated tables, using a salt effectively greatly increases the possible inputs they have to test in order to crack the password, and even if they aren't using precalculated tables, it still slows them down by a factor of N, where N is the number of passwords you are storing.

Hopefully this answers all your questions.

Solution 2

Think of 2 numbers from 1 to 9999. Add them. Now tell me the final digit.

I can't, from that information, deduce which numbers you originally thought of. That is a very simple example of a one-way hash.

Now, I can think of two numbers which give the same result, and this is where this simple example differs from a 'proper' cryptographic hash like MD5 or SHA1. With those algorithms, it should be computationally difficult to come up with an input which produces a specific hash.

Solution 3

One big reason you can't reverse the hash function is because data is lost.

Consider a simple example function: 'OR'. If you apply that to your input data of 1 and 0, it yields 1. But now, if you know the answer is '1', how do you back out the original data? You can't. It could have been 1,1 or maybe 0,1, or maybe 1,0.

As for salting and rainbow tables. Yes, theoretically, you could have a rainbow table which would encompass all possible salts and passwords, but practically, that's just too big. If you tried every possible combination of lower case letters, upper case, numbers, and twelve punctuation symbols, up to 50 characters long, that's (26+26+10+12)^50 = 2.9 x 10^93 different possibilities. That's more than the number of atoms in the visible universe.

The idea behind rainbow tables is to calculate the hash for a bunch of possible passwords in advance, and passwords are much shorter than 50 characters, so it's possible to do so. That's why you want to add a salt in front: if you add on '57sjflk43380h4ljs9flj4ay' to the front of the password. While someone may have already computed the hash for "pa55w0rd", no one will have already calculated the has for '57sjflk43380h4ljs9flj4aypa55w0rd'.

Solution 4

I don't think the md5 gives you the whole result - so you can't work backwards to find the original things that was md5-ed

Solution 5

md5 is 128bit, that's 3.4*10^38 combinations.

the total number of eight character length passwords:

only lowercase characters and numbers: 36^8 = 2.8*10^12
lower&uppercase and numbers: 62^8 = 2.18*10^14

You have to store 8 bytes for the password, 16 for the md5 value, that's 24 bytes total per entry.

So you need approx 67000G or 5200000G storage for your rainbow table. The only reason that it's actually possible to figure out passwords is because people use obvious ones.

View more solutions

18,846

Author by

Keavon

I'm a computer geek who loves dabbling in programming and creative projects including game development, video editing, graphic design, image editing, web design, and web development. I use C#, JavaScript, HTML and CSS and I'm comfortable in Unity and Blender. I like Windows and Android and I highly respect Linux, although I rarely use it. I also collect games and I love playing them, although I don't usually have much time to. If you want to contact me, you may do so on my website.

Updated on June 21, 2022

Comments

Keavon about 2 years

Why can't you just reverse the algorithm like you could reverse a math function? How is it possible to make an algorithm that isn't reversible?

And if you use a rainbow table, what makes using a salt impossible to crack it? If you are making a rainbow table with brute force to generate it, then it invents each plaintext value possible (to a length), which would end up including the salt for each possible password and each possible salt (the salt and password/text would just come together as a single piece of text).
Nemo almost 13 years

Even if it were 1-to-1, it would not necessarily be easy to invert. Consider taking the product of two large prime numbers, for instance... MD5 is hard to invert, but it has nothing to do with being many-to-one.
Keavon almost 13 years

Wow, I see. I am used to seeing salts (for an MMO called Flyff I used to spend some time doing private server work with for fun) super short, like Flyff's was "kikugalanet" which is relativly short. But thanks for the tip on making long and totally random hashes. I'm still a little confused about the first part of the answer about how the data is lost. How does it match up the hashes if it's either of the two, a 1 or a 0? Thank you for the great answer though!
Keavon almost 13 years

Wow, now I get it. That was a very good and simple way of putting it. THANK YOU!
Jeremy Salwen almost 13 years

Not quite. That's how much data you would have to precompute. A rainbow table is not a straightforward lookup table. The whole point is that it is a tradeoff between time and space storage. So, for example, if you had chains of length 1,000,000 it would take up 1,000,000th of the space you suggested, but would take 1,000,000 times as long as a simple lookup (which would still be relatively fast). For the numbers you quoted above, that would only be 67 MB, or 5.2 GB. Not bad, eh? But again, there are two caveats: 1. the time to calculate the tables is still really long 2. (cont)
Jeremy Salwen almost 13 years

2. Rainbow tables are actually statistical. So you would not get any password that is "lower&uppercase and numbers", but statistically a percentage of them, like 99.8 or something, depending on the likelihood of a collision of the reduction function, size padding you give the rainbow tables over how big you'd expect they need to be in order to get better probabilities.
Jeremy Salwen almost 13 years

Yes, I mean chaining, but no, you don't need to store every password. The idea is you have a reduction function which takes a hash output and brings it back into the password domain. So for each chain, I store (Initial Password, Final Hash) Now, if the chain is of length 1,000,000, this means that the final hash is what I get after taking the hash and then reduction function 1,000,000 times in a row. Now, to check a password, I take the hash, compare it to the end of every chain. If not, I take the reduction of the hash, then hash it again, compare again, etc, up to 1,000,000 times.
Jeremy Salwen almost 13 years

If at any point I get a match on one of the chains, then I start with the initial password of that chain, and keep taking the hash and then reduction function until I find a hash that matches the original hash I was looking for. Now I just remember the input I gave to the hash function to get that output, and I have the correct password. The point is that I chain down 1,000,000 times, so if I produce the original password anywhere in those 1,000,000 iterations, I can follow the chain down from the original password to find the value and the hash.
Keavon almost 13 years

Wow, did you write that all yourself? Or did you copy it? That is pretty long. I'll go ahead and read it now. lol. Thanks!
Karoly Horvath almost 13 years

THX! How is this supposed to work with salted passwords? Before reapplying the hash function you salt the hashed value, right?
Keavon almost 13 years

Okay, I read it. Ignore what I wrote above. Here's my new message: Wow, that is a SUPER long thing. Yes, I believe you did answer all my question. But for MD5 is takes each letter and converts it into a number and then does some complicated math algorithms? Also, I thought that salts stayed the same for every user. Because (look for my other comment on a different answer about my experience with Flyff), they used a single salt, "kikugalanet". Also, I don't understand how, if each person has their own salt, it knows what salt to use. If that's stored in the database, can't the hacker just use..
Keavon almost 13 years

... that too when generating their brute force for a single person (if they took access of the whole database of passwords)? Thank you so much for spending so much time writing that!
Jeremy Salwen almost 13 years

Exactly, the point is that you have to know the salt before you can start generating rainbow tables, and you can't reuse your tables, since each table will be specific to each salt. Essentially, this makes rainbow tables useless.
Jeremy Salwen almost 13 years

1. Yes I wrote it myself. Sometimes I just get in the mood for writing up long things. It helps me better understand them myself. 2. Yes, MD5 interprets each letter as a number (in fact, it's basically just doing a conversion from letter=>ascii code. 3. Salts can stay the same, but that isn't as powerful a protection technique as per-user salting. If you have one salt for the entire website, it prevents precomputed tables from effectively working on the hashes, but it still allows the attacker to test the same password with every user at once. So you don't get the additional slowdow
Jeremy Salwen almost 13 years

So per-website salting essentially means you're using a unique hashing function for your website, so they can't reuse other tables. However, per-user salting means that you're giving each user their own hash function, so the attacker has to devote large amounts of additional work for every user he wants to crack. As for storing the salt, yes, you must store it in the database with the hash. So the hacker has the salt if he has the hash. However, the point is that he can't precompute anything before getting the salt. Both per-user or per-website salts are effective in stopping this.
Rory Alsop almost 13 years

And in addition, 'pepper' values are also used in many environments. Check out the latest Security Stack Exchange blog post at security.blogoverflow.com/2011/07/06/… for some links.
Keavon almost 13 years

I get it now. So the hacker can, with the salt, still pre-compute it with that salt, but they have to compute it each time. And there's no point of pre-computing it if it's only used once. Unless somehow they manage to pre-compute everything including the salt for all passwords.
Nemo almost 13 years

"People much smarter than me have proved that in certain cases, reversing this function is way harder than calculating it forward." Not true, actually... Proving lower bounds is notoriously difficult. Nobody knows how to compute discrete log quickly (despite lots of smart people trying), but nobody knows how to prove it is hard, either. Other than that, excellent answer :-)
Jeremy Salwen almost 13 years

@Nemo: Updated to reflect reality. It's a little bit scary if you think about it...
J3STER about 5 years

this is amazing. I have a doubt though. In the first clock hash, you do h(7) = 5. but you cant retrieve the 7 from the 5 as h(x) does not have a feasible inverse. However, i think I dont need to retrieve the 7. Any number that will result in a 5 will do. And there are infinity of them. Shouldnt it be posible to transform the h(x) function into an equation that spits out a domain of all posible predecesors for that hash? like a vectorial space or sth like that.