What's the best way to create a short hash, similar to what tiny Url does?

64,197

Solution 1

.NET string object has a GetHashCode() function. It returns an integer. Convert it into a hex and then to an 8 characters long string.

Like so:

string hashCode = String.Format("{0:X}", sourceString.GetHashCode());

More on that: http://msdn.microsoft.com/en-us/library/system.string.gethashcode.aspx

UPDATE: Added the remarks from the link above to this answer:

The behavior of GetHashCode is dependent on its implementation, which might change from one version of the common language runtime to another. A reason why this might happen is to improve the performance of GetHashCode.

If two string objects are equal, the GetHashCode method returns identical values. However, there is not a unique hash code value for each unique string value. Different strings can return the same hash code.

Notes to Callers

The value returned by GetHashCode is platform-dependent. It differs on the 32-bit and 64-bit versions of the .NET Framework.

Solution 2

Is your goal to create a URL shortener or to create a hash function?

If your goal is to create a URL shortener, then you don't need a hash function. In that case, you just want to pre generate a sequence of cryptographically secure random numbers, and then assign each url to be encoded a unique number from the sequence.

You can do this using code like:

using System.Security.Cryptography;

const int numberOfNumbersNeeded = 100;
const int numberOfBytesNeeded = 8;
var randomGen = RandomNumberGenerator.Create();
for (int i = 0; i < numberOfNumbersNeeded; ++i)
{
     var bytes = new Byte[numberOfBytesNeeded];
     randomGen.GetBytes(bytes);
}

Using the cryptographic number generator will make it very difficult for people to predict the strings you generate, which I assume is important to you.

You can then convert the 8 byte random number into a string using the chars in your alphabet. This is basically a change of base calculation (from base 256 to base 62).

Solution 3

I dont think URL shortening services use hashes, I think they just have a running alphanumerical string that is increased with every new URL and stored in a database. If you really need to use a hash function have a look at this link: some hash functions Also, a bit offtopic but depending on what you are working on this might be interesting: Coding Horror article

Solution 4

Just take a Base36 (case-insensitive) or Base64 of the ID of the entry.

So, lets say I wanted to use Base36:

(ID - Base36)
1 - 1
2 - 2
3 - 3
10 - A
11 - B
12 - C
...
10000 - 7PS
22000 - GZ4
34000 - Q8C
...
1000000 - LFLS
2345000 - 1E9EW
6000000 - 3KLMO

You could keep these even shorter if you went with base64 but then the URL's would be case-sensitive. You can see you still get your nice, neat alphanumeric key and with a guarantee that there will be no collisions!

Solution 5

You cannot use a short hash as you need a one-to-one mapping from the short version to the actual value. For a short hash the chance for a collision would be far too high. Normal, long hashes, would not be very user-friendly (and even though the chance for a collision would probably be small enough then, it still wouldn't feel "right" to me).

TinyURL.com seems to use an incremented number that is converted to Base 36 (0-9, A-Z).

Share:
64,197
bhb
Author by

bhb

Web Developer

Updated on October 28, 2021

Comments

  • bhb
    bhb over 2 years

    I'm currently using MD5 hashes but I would like to find something that will create a shorter hash that uses just [a-z][A-Z][0-9]. It only needs to be around 5-10 characters long.

    Is there something out there that already does this?

    Update 1:

    I like the CRC32 hash. Is there a clean way of calculating it in .NET?

    Update 2:

    I'm using the CRC32 function from the link Joe provided. How can I convert the uInt into the characters defined above?

  • Treb
    Treb almost 15 years
    Of course you can. Maybe you shouldn't, but it's perfectly possible.
  • Stig Brautaset
    Stig Brautaset almost 15 years
    That's not very unique. The following snippet of code shows that the sequence of numbers from 1 - 1000 has 30 collisions in the first 5 characters: for f in seq 0 10000` ; do md5 -s $f ; done | awk '{print substr($4, 0, 5)}' | sort | uniq -c | sort -n`
  • Arjan
    Arjan almost 15 years
    You're very right indeed. :-) One surely shouldn't use a short hash in this situation though. I'll edit my answer and rewrite my "You cannot create a short hash".
  • M4N
    M4N almost 15 years
    Since he's looking for hash with a length of only 5 characters, I thought that uniqueness is not a strong requirement.
  • Arjan
    Arjan almost 15 years
    Well, referring to TinyURL.com suggest a 100% uniqueness requirement to me. So: no short hashes (or any hash if I'd program it).
  • Simeon Pilgrim
    Simeon Pilgrim almost 15 years
    -1: a CRC only provides error checking, not unique collision avoidance.
  • Kevin Montrose
    Kevin Montrose almost 15 years
    If you don't need cryptographic guarantees, they do a pretty good job for damn cheap in terms of CPU.
  • Simeon Pilgrim
    Simeon Pilgrim almost 15 years
    but two urls will make the same CRC, and therefore have the same short-url, which is useless for a shorting service.
  • Arjan
    Arjan almost 15 years
    See codymanix's answer, stackoverflow.com/questions/1116860/…
  • Simeon Pilgrim
    Simeon Pilgrim almost 15 years
    That's why tinyUrl etc dont hash or crc the url, they just assign the next number. Really depends on what actually trying to be solved here.
  • Kevin Montrose
    Kevin Montrose almost 15 years
    True, URL shortening services almost certainly publish a counter not a hash; but the question is for a hash function with a short output.
  • Arjan
    Arjan almost 15 years
    Hmmm, wouldn't the variable length make it hard to reverse the encoding? When invoking num2char multiple times for longer numbers, the result would need some separator between each encoded value, to tell them apart while decoding again. That makes the result much longer than when using a fixed-length encoding. If one doesn't mind using the + and / characters, then Base 64 encoding is easier I guess.
  • Paul
    Paul almost 15 years
    According to the question, he's looking for some hash that's shorter than the MD5 that he's currently using, and that uses alphanumerics. So, the current hash is irreversible; I think that's a requirement, or at the least not a problem. And this doesn't have 'variable length' - you take 4 hex digits from the MD5 hash, then pass it to num2char. Then take the next 4, pass that number to num2char, etc. The MD5 hash has 32 hex digits. The string you get out of my algorithm uses 32/4=8 alphanumeric characters.
  • Arjan
    Arjan almost 15 years
    Of course, the MD5 is irreversible, but isn't the idea that your mapping should be able to decode back to that MD5 value? As for variable length: I was wrong indeed. (I thought 0 would yield "a0", while 25 would yield "a25", but that's obviously "a" and "z" -- don't know how I could be so confused.) However, returning "0" and "1" for 62 and 63 will yield duplicates from the 3rd if(..), right? Base 64 needs the + and / characters for a reason... ;-) (And I guess the 3rd if reads (int)x - 52 instead?)
  • Paul
    Paul almost 15 years
    hmmm. I didn't consider that my mapping should be decodable back to MD5... I do realize that returning "0" and "1" for 62 and 63 create possible duplicates, which could be a problem, but I was just outlining an idea here. If I can think of a better way, that's easy to interpret and/or elegant, I'll edit my post. Thanks for pointing out my error on the third if statement btw :)
  • Piotr Kula
    Piotr Kula almost 12 years
    Short and sweet. Like .NET intended.
  • Piotr Kula
    Piotr Kula almost 12 years
    I like this. :) +1 but how do we do it in .net- quickly?
  • Brenda Bell
    Brenda Bell almost 12 years
    The only problem with String.GetHashCode is that it will generate different values on different platforms (32-bit vs. 64-bit). If you're expecting the hash code to be produced and consumed by different applications, you'll need to be careful.
  • eduncan911
    eduncan911 over 11 years
    As Brenda stated, GetHashCode() is different on 32 and 64 systems. And, is even different between .net 1.1 and 2.0 CLRs. But most importantly, GetHashCode() is not guaranteed unique! You can get the same hash from two different strings (I know, it happened to me in a production environment).
  • Alex Kofman
    Alex Kofman over 10 years
    GetHashCode() is not suitable for such tasks. It's not guarantied to have the same value in next .NET version.
  • Victor Stoddard
    Victor Stoddard over 9 years
    I like how this gives you easy control over the characters, allowing you to exclude characters that are visually ambiguous, like 0, O, l, I, 1, etc.
  • SAJ14SAJ
    SAJ14SAJ about 9 years
    This is a very bad idea, as the exact algorithm by which hash codes are generated for a given class is an implementation detail which should never be persisted, because it can change between .NET versions. In fact, it HAS changed between .NET versions.
  • simhumileco
    simhumileco over 4 years
    Thank you for mention about Base36