What's the best way to create a short hash, similar to what tiny Url does?
Solution 1
.NET string object has a GetHashCode() function. It returns an integer. Convert it into a hex and then to an 8 characters long string.
Like so:
string hashCode = String.Format("{0:X}", sourceString.GetHashCode());
More on that: http://msdn.microsoft.com/en-us/library/system.string.gethashcode.aspx
UPDATE: Added the remarks from the link above to this answer:
The behavior of GetHashCode is dependent on its implementation, which might change from one version of the common language runtime to another. A reason why this might happen is to improve the performance of GetHashCode.
If two string objects are equal, the GetHashCode method returns identical values. However, there is not a unique hash code value for each unique string value. Different strings can return the same hash code.
Notes to Callers
The value returned by GetHashCode is platform-dependent. It differs on the 32-bit and 64-bit versions of the .NET Framework.
Solution 2
Is your goal to create a URL shortener or to create a hash function?
If your goal is to create a URL shortener, then you don't need a hash function. In that case, you just want to pre generate a sequence of cryptographically secure random numbers, and then assign each url to be encoded a unique number from the sequence.
You can do this using code like:
using System.Security.Cryptography;
const int numberOfNumbersNeeded = 100;
const int numberOfBytesNeeded = 8;
var randomGen = RandomNumberGenerator.Create();
for (int i = 0; i < numberOfNumbersNeeded; ++i)
{
var bytes = new Byte[numberOfBytesNeeded];
randomGen.GetBytes(bytes);
}
Using the cryptographic number generator will make it very difficult for people to predict the strings you generate, which I assume is important to you.
You can then convert the 8 byte random number into a string using the chars in your alphabet. This is basically a change of base calculation (from base 256 to base 62).
Solution 3
I dont think URL shortening services use hashes, I think they just have a running alphanumerical string that is increased with every new URL and stored in a database. If you really need to use a hash function have a look at this link: some hash functions Also, a bit offtopic but depending on what you are working on this might be interesting: Coding Horror article
Solution 4
Just take a Base36 (case-insensitive) or Base64 of the ID of the entry.
So, lets say I wanted to use Base36:
(ID - Base36)
1 - 1
2 - 2
3 - 3
10 - A
11 - B
12 - C
...
10000 - 7PS
22000 - GZ4
34000 - Q8C
...
1000000 - LFLS
2345000 - 1E9EW
6000000 - 3KLMO
You could keep these even shorter if you went with base64 but then the URL's would be case-sensitive. You can see you still get your nice, neat alphanumeric key and with a guarantee that there will be no collisions!
Solution 5
You cannot use a short hash as you need a one-to-one mapping from the short version to the actual value. For a short hash the chance for a collision would be far too high. Normal, long hashes, would not be very user-friendly (and even though the chance for a collision would probably be small enough then, it still wouldn't feel "right" to me).
TinyURL.com seems to use an incremented number that is converted to Base 36 (0-9, A-Z).
Comments
-
bhb over 2 years
I'm currently using MD5 hashes but I would like to find something that will create a shorter hash that uses just
[a-z][A-Z][0-9]
. It only needs to be around 5-10 characters long.Is there something out there that already does this?
Update 1:
I like the CRC32 hash. Is there a clean way of calculating it in .NET?
Update 2:
I'm using the CRC32 function from the link Joe provided. How can I convert the uInt into the characters defined above?
-
Treb almost 15 yearsOf course you can. Maybe you shouldn't, but it's perfectly possible.
-
Stig Brautaset almost 15 yearsThat's not very unique. The following snippet of code shows that the sequence of numbers from 1 - 1000 has 30 collisions in the first 5 characters:
for f in
seq 0 10000` ; do md5 -s $f ; done | awk '{print substr($4, 0, 5)}' | sort | uniq -c | sort -n` -
Arjan almost 15 yearsYou're very right indeed. :-) One surely shouldn't use a short hash in this situation though. I'll edit my answer and rewrite my "You cannot create a short hash".
-
M4N almost 15 yearsSince he's looking for hash with a length of only 5 characters, I thought that uniqueness is not a strong requirement.
-
Arjan almost 15 yearsWell, referring to TinyURL.com suggest a 100% uniqueness requirement to me. So: no short hashes (or any hash if I'd program it).
-
Simeon Pilgrim almost 15 years-1: a CRC only provides error checking, not unique collision avoidance.
-
Kevin Montrose almost 15 yearsIf you don't need cryptographic guarantees, they do a pretty good job for damn cheap in terms of CPU.
-
Simeon Pilgrim almost 15 yearsbut two urls will make the same CRC, and therefore have the same short-url, which is useless for a shorting service.
-
Arjan almost 15 yearsSee codymanix's answer, stackoverflow.com/questions/1116860/…
-
Simeon Pilgrim almost 15 yearsThat's why tinyUrl etc dont hash or crc the url, they just assign the next number. Really depends on what actually trying to be solved here.
-
Kevin Montrose almost 15 yearsTrue, URL shortening services almost certainly publish a counter not a hash; but the question is for a hash function with a short output.
-
Arjan almost 15 yearsHmmm, wouldn't the variable length make it hard to reverse the encoding? When invoking
num2char
multiple times for longer numbers, the result would need some separator between each encoded value, to tell them apart while decoding again. That makes the result much longer than when using a fixed-length encoding. If one doesn't mind using the + and / characters, then Base 64 encoding is easier I guess. -
Paul almost 15 yearsAccording to the question, he's looking for some hash that's shorter than the MD5 that he's currently using, and that uses alphanumerics. So, the current hash is irreversible; I think that's a requirement, or at the least not a problem. And this doesn't have 'variable length' - you take 4 hex digits from the MD5 hash, then pass it to num2char. Then take the next 4, pass that number to num2char, etc. The MD5 hash has 32 hex digits. The string you get out of my algorithm uses 32/4=8 alphanumeric characters.
-
Arjan almost 15 yearsOf course, the MD5 is irreversible, but isn't the idea that your mapping should be able to decode back to that MD5 value? As for variable length: I was wrong indeed. (I thought 0 would yield "a0", while 25 would yield "a25", but that's obviously "a" and "z" -- don't know how I could be so confused.) However, returning "0" and "1" for 62 and 63 will yield duplicates from the 3rd if(..), right? Base 64 needs the + and / characters for a reason... ;-) (And I guess the 3rd
if
reads(int)x - 52
instead?) -
Paul almost 15 yearshmmm. I didn't consider that my mapping should be decodable back to MD5... I do realize that returning "0" and "1" for 62 and 63 create possible duplicates, which could be a problem, but I was just outlining an idea here. If I can think of a better way, that's easy to interpret and/or elegant, I'll edit my post. Thanks for pointing out my error on the third if statement btw :)
-
Piotr Kula almost 12 yearsShort and sweet. Like .NET intended.
-
Piotr Kula almost 12 yearsI like this. :) +1 but how do we do it in .net- quickly?
-
Brenda Bell almost 12 yearsThe only problem with String.GetHashCode is that it will generate different values on different platforms (32-bit vs. 64-bit). If you're expecting the hash code to be produced and consumed by different applications, you'll need to be careful.
-
eduncan911 over 11 yearsAs Brenda stated, GetHashCode() is different on 32 and 64 systems. And, is even different between .net 1.1 and 2.0 CLRs. But most importantly, GetHashCode() is not guaranteed unique! You can get the same hash from two different strings (I know, it happened to me in a production environment).
-
Alex Kofman over 10 yearsGetHashCode() is not suitable for such tasks. It's not guarantied to have the same value in next .NET version.
-
Victor Stoddard over 9 yearsI like how this gives you easy control over the characters, allowing you to exclude characters that are visually ambiguous, like 0, O, l, I, 1, etc.
-
SAJ14SAJ about 9 yearsThis is a very bad idea, as the exact algorithm by which hash codes are generated for a given class is an implementation detail which should never be persisted, because it can change between .NET versions. In fact, it HAS changed between .NET versions.
-
simhumileco over 4 yearsThank you for mention about Base36