What is a good Hash Function?

170,278

Solution 1

For doing "normal" hash table lookups on basically any kind of data - this one by Paul Hsieh is the best I've ever used.

http://www.azillionmonkeys.com/qed/hash.html

If you care about cryptographically secure or anything else more advanced, then YMMV. If you just want a kick ass general purpose hash function for a hash table lookup, then this is what you're looking for.

Solution 2

There's no such thing as a “good hash function” for universal hashes (ed. yes, I know there's such a thing as “universal hashing” but that's not what I meant). Depending on the context different criteria determine the quality of a hash. Two people already mentioned SHA. This is a cryptographic hash and it isn't at all good for hash tables which you probably mean.

Hash tables have very different requirements. But still, finding a good hash function universally is hard because different data types expose different information that can be hashed. As a rule of thumb it is good to consider all information a type holds equally. This is not always easy or even possible. For reasons of statistics (and hence collision), it is also important to generate a good spread over the problem space, i.e. all possible objects. This means that when hashing numbers between 100 and 1050 it's no good to let the most significant digit play a big part in the hash because for ~ 90% of the objects, this digit will be 0. It's far more important to let the last three digits determine the hash.

Similarly, when hashing strings it's important to consider all characters – except when it's known in advance that the first three characters of all strings will be the same; considering these then is a waste.

This is actually one of the cases where I advise to read what Knuth has to say in The Art of Computer Programming, vol. 3. Another good read is Julienne Walker's The Art of Hashing.

Solution 3

There are two major purposes of hashing functions:

  • to disperse data points uniformly into n bits.
  • to securely identify the input data.

It's impossible to recommend a hash without knowing what you're using it for.

If you're just making a hash table in a program, then you don't need to worry about how reversible or hackable the algorithm is... SHA-1 or AES is completely unnecessary for this, you'd be better off using a variation of FNV. FNV achieves better dispersion (and thus fewer collisions) than a simple prime mod like you mentioned, and it's more adaptable to varying input sizes.

If you're using the hashes to hide and authenticate public information (such as hashing a password, or a document), then you should use one of the major hashing algorithms vetted by public scrutiny. The Hash Function Lounge is a good place to start.

Solution 4

This is an example of a good one and also an example of why you would never want to write one. It is a Fowler / Noll / Vo (FNV) Hash which is equal parts computer science genius and pure voodoo:

unsigned fnv_hash_1a_32 ( void *key, int len ) {
    unsigned char *p = key;
    unsigned h = 0x811c9dc5;
    int i;

    for ( i = 0; i < len; i++ )
      h = ( h ^ p[i] ) * 0x01000193;

   return h;
}

unsigned long long fnv_hash_1a_64 ( void *key, int len ) {
    unsigned char *p = key;
    unsigned long long h = 0xcbf29ce484222325ULL;
    int i;

    for ( i = 0; i < len; i++ )
      h = ( h ^ p[i] ) * 0x100000001b3ULL;

   return h;
}

Edit:

  • Landon Curt Noll recommends on his site the FVN-1A algorithm over the original FVN-1 algorithm: The improved algorithm better disperses the last byte in the hash. I adjusted the algorithm accordingly.

Solution 5

I'd say that the main rule of thumb is not to roll your own. Try to use something that has been thoroughly tested, e.g., SHA-1 or something along those lines.

Share:
170,278
ARF
Author by

ARF

Updated on November 15, 2021

Comments

  • ARF
    ARF over 2 years

    What is a good Hash function? I saw a lot of hash function and applications in my data structures courses in college, but I mostly got that it's pretty hard to make a good hash function. As a rule of thumb to avoid collisions my professor said that:

    function Hash(key)
      return key mod PrimeNumber
    end
    

    (mod is the % operator in C and similar languages)

    with the prime number to be the size of the hash table. I get that is a somewhat good function to avoid collisions and a fast one, but how can I make a better one? Is there better hash functions for string keys against numeric keys?

  • Chris Harris
    Chris Harris almost 15 years
    Konrad, you're surely correct from a theoretical perspective, but have you ever tried using the Paul Hsieh hash function I mentioned in my comment? It's really quite good against a lot of different kind of data!
  • Cthutu
    Cthutu over 13 years
    You may want to look at this site for some information on why these values are chosen:isthe.com/chongo/tech/comp/fnv/#fnv-prime
  • Tim Partridge
    Tim Partridge over 12 years
    updated link to The Hash Function Lounge: larc.usp.br/~pbarreto/hflounge.html
  • Kevin Hsu
    Kevin Hsu over 12 years
    How well does FNV withstand birthday collision compared to, say, the same number of bits off a SHA1?
  • Myrddin Emrys
    Myrddin Emrys over 12 years
    @Kevin As long as the avalanch characteristics of a hash are good (tiny changes in input = big changes in output) then birthday collisions are simply a function of bits in the hash. FNV-1a is excellent in this regard, and you can have as many or as few bits in the hash as you desire (though it takes a little extra effort to get a bit count that's not a power of 2).
  • Erik
    Erik over 11 years
    He doesn't seem to need anything cryptographically secure so SHA-1 would be way overkill.
  • nawfal
    nawfal about 11 years
    I had read from Jenkins' site that SFH is one of the best then, but I think Murmur might do better, see this excellent answer: programmers.stackexchange.com/questions/49550/…
  • Samuel Allan
    Samuel Allan about 10 years
    by the way even though no collisions for SHA-1 have been found it iss believed to be a matter of years or months before one is found. I would recommend using SHA-256.
  • Andrew Lazarus
    Andrew Lazarus about 8 years
    Hsieh's hash function is awful, with an order of magnitude more collisions than we want. In particular, strings that differ only in the last 4 bytes can collide easily. If you have a 30 character string, that differ in the last 4 bytes, after 28 bytes have been processes, the hashes differ only in the last 2 bytes. That means you are GUARANTEED a collision for one of the remaining two-byte values. (Yeah, it's fast. So what.)
  • Slava
    Slava over 7 years
    I think recommendation of cryptographic hash function is a really bad advise in this case.
  • Let Me Tink About It
    Let Me Tink About It about 6 years
    @Slava: Why? What are your reasons for saying a "cryptographic hash function is a really bad advise in this case?" Why is it bad advice? What are the relative disadvantages that make it so?
  • Slava
    Slava about 6 years
    @Mowzer because a hash function that is used in hash map should be fast and lightweight (assuming it still provides good hash), crypto hashes explicitly were maid to be computationally expensive to prevent brute force attack.
  • Honinbo Shusaku
    Honinbo Shusaku almost 3 years
    There's no such thing as a “good hash function” for universal hashes (ed. yes, I know there's such a thing as “universal hashing” but that's not what I meant). - What's the difference in meaning between "universal hashes" and "universal hashing?"
  • Konrad Rudolph
    Konrad Rudolph almost 3 years
    @Abdul There isn’t one. My choice of words was simply atrocious when I wrote this answer. What I meant is that universal hash functions can only give guarantees about the expected case, i.e. average behaviour, not about worst-case behaviour. But in practice universal hashing is much better than my answer lets it sound. — Frankly, the whole answer isn’t very good and today I wouldn’t have written the initial paragraph like that.