Best hashing algorithm in terms of hash collisions and performance for strings


Solution 1

Forget about the term "best". No matter which hash algorithm anyone might come up with, unless you have a very limited set of data that needs to be hashed, every algorithm that performs very well on average can become completely useless if it is only fed the right (or, from your perspective, "wrong") data.

Instead of wasting too much time thinking about how to make the hash more collision-free without using too much CPU time, I'd rather start thinking about how to make collisions less problematic. E.g. if every hash bucket is in fact a table, and all strings in that table (the ones that collided) are sorted alphabetically, you can search within a bucket using binary search (which is only O(log n)). That means that even when every second hash bucket has 4 collisions, your code will still have decent performance (it will be a bit slower than a collision-free table, but not by much). One big advantage here is that if your table is big enough and your hash is not too simple, two strings resulting in the same hash value will usually look completely different, so the binary search can stop comparing strings after maybe one or two characters on average, making every comparison very fast.
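For illustration, here is a minimal C# sketch of that sorted-bucket idea (the SortedBucketTable name, the fixed bucket count, and the use of the built-in GetHashCode are my own choices, not part of the answer):

    using System;
    using System.Collections.Generic;

    // Hypothetical sketch: a hash table whose buckets are kept sorted so that
    // lookups inside a bucket use binary search instead of a linear scan.
    class SortedBucketTable
    {
        private readonly List<string>[] buckets;

        public SortedBucketTable(int bucketCount = 1024)
        {
            buckets = new List<string>[bucketCount];
        }

        private int BucketIndex(string key) =>
            (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;

        public void Add(string key)
        {
            var bucket = buckets[BucketIndex(key)] ??= new List<string>();
            // BinarySearch returns the bitwise complement of the insertion
            // point when the key is absent, so the bucket stays sorted.
            int pos = bucket.BinarySearch(key, StringComparer.Ordinal);
            if (pos < 0)
                bucket.Insert(~pos, key);
        }

        public bool Contains(string key)
        {
            var bucket = buckets[BucketIndex(key)];
            // Even when several strings collide, lookup within the bucket is
            // O(log n), and each ordinal comparison usually stops after the
            // first differing character.
            return bucket != null
                && bucket.BinarySearch(key, StringComparer.Ordinal) >= 0;
        }
    }

With this layout, a lookup costs one hash computation plus at most a handful of string comparisons, even in buckets that have collected several collisions.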

Actually I had a situation myself where searching directly within a sorted table using binary search turned out to be faster than hashing! Even though my hash algorithm was simple, it took quite some time to hash the values. Performance testing showed that hashing only became faster than binary search once there were more than about 700-800 entries. However, as the table could never grow larger than 256 entries anyway, and as the average table had fewer than 10 entries, benchmarking clearly showed that on every system and every CPU the binary search was faster. Here, the fact that comparing the first byte of the data was usually already enough to move on to the next bsearch iteration (as the data usually differed within the first one or two bytes) turned out to be a big advantage.

So to summarize: I'd take a decent hash algorithm that doesn't cause too many collisions on average and is rather fast (I'd even accept a few more collisions if it's just very fast!), and then optimize my code to keep the performance penalty as small as possible once collisions do occur. And they will occur, unless your hash space is at least as large as your data space and you can map a unique hash value to every possible piece of data.

Solution 2

As Nigel Campbell indicated, there's no such thing as the 'best' hash function, as it depends on the data characteristics of what you're hashing as well as whether or not you need cryptographic quality hashes.

That said, here are some pointers:

  • Since the items you're using as input to the hash are just a set of strings, you could simply combine the hashcodes for each of those individual strings. I've seen the following pseudo-code suggested to do this, but I don't know of any particular analysis of it:

    int hashCode = 0;

    foreach (string s in propertiesToHash) {
        // Multiply by a small prime and add the next string's hash code;
        // unchecked lets the int wrap instead of throwing in a checked context.
        hashCode = unchecked(31 * hashCode + s.GetHashCode());
    }
    

    According to this article, System.Web has an internal method that combines hashcodes using

    combinedHash = ((combinedHash << 5) + combinedHash) ^ nextObj.GetHashCode();
    

    I've also seen code that simply XORs the hashcodes together, but that seems like a bad idea to me (though again I have no analysis to back this up). If nothing else, you end up with a collision whenever the same strings are hashed in a different order. A small order-sensitive combiner built from these snippets is sketched right after this list.

  • I've used FNV to good effect: http://www.isthe.com/chongo/tech/comp/fnv/

  • Paul Hsieh has a decent article: http://www.azillionmonkeys.com/qed/hash.html

  • Another nice article by Bob Jenkins that was originally published in 1997 in Doctor Dobb's Journal (the linked article has updates): http://burtleburtle.net/bob/hash/doobs.html
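Putting the first bullet into practice, here is a small sketch (the HashCombiner name and the seed value 17 are my own choices, not from any of the sources above) that combines the hash codes of several property strings using the shift-add-xor step quoted from System.Web:

    using System.Collections.Generic;

    static class HashCombiner
    {
        // Order-sensitive combination of per-string hash codes using the
        // shift-add-xor step quoted above; the non-zero seed is arbitrary.
        public static int Combine(IEnumerable<string> properties)
        {
            int combinedHash = 17;
            foreach (string s in properties)
            {
                combinedHash = unchecked(((combinedHash << 5) + combinedHash)
                                         ^ s.GetHashCode());
            }
            return combinedHash;
        }
    }

A caller could then use something like HashCombiner.Combine(new[] { obj.Name, obj.Category }) as the index key (obj.Name and obj.Category being hypothetical properties); because the combination is order-sensitive, the same strings in a different order produce a different hash.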

Solution 3

There is no single optimum hashing algorithm. If you have a known, fixed input domain you can use a perfect-hash generator such as gperf to generate a hashing algorithm that is guaranteed collision-free for that particular input set. Otherwise, there is no 'right' answer to this question.

Solution 4

I am going to be lame here and give a more theoretical response rather than a pin-pointed answer, but please take the value in it.

First, there are two distinct problems:

a. Collision probability
b. Performance of hashing (i.e. time, CPU cycles, etc.)

The two problems are mildly correlated. They are not perfectly correlated.

Problem a deals with the difference between the input (hashee) space and the resulting hash space. When you hash a 1 KB (1024-byte) file and the hash has 32 bytes, there will be

1.0907481356194159294629842447338e+2466 (i.e. a number on the order of 10^2466) possible input files,

while the hash space will have

1.1579208923731619542357098500869e+77 (i.e. a number on the order of 10^77) possible values.

The difference IS HUGE: the two counts differ by a factor of about 10^2389. THERE WILL BE COLLISIONS (a collision happens when two DIFFERENT input files end up with the exact same hash), because we are reducing roughly 10^2466 cases to roughly 10^77 cases, and by the pigeonhole principle many different inputs must share a hash.
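If you want to verify those two counts, a quick C# sketch using System.Numerics.BigInteger reproduces them (2467 and 78 decimal digits correspond to numbers on the order of 10^2466 and 10^77):

    using System;
    using System.Numerics;

    class HashSpaceDemo
    {
        static void Main()
        {
            // Distinct 1024-byte inputs: 2^(1024*8) = 2^8192
            BigInteger inputSpace = BigInteger.Pow(2, 1024 * 8);
            // Distinct 32-byte (256-bit) hash values: 2^256
            BigInteger hashSpace = BigInteger.Pow(2, 256);

            Console.WriteLine(inputSpace.ToString().Length); // 2467 digits, ~10^2466
            Console.WriteLine(hashSpace.ToString().Length);  // 78 digits,  ~10^77
        }
    }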

The only way to minimize the collision risk is to enlarge the hash space and therefore make the hash longer. Ideally the hash would be as long as the input itself, but that rather defeats the purpose.


The second problem is performance. This only concerns the hash algorithm itself. Of course a longer hash will most probably require more CPU cycles, but a smarter algorithm might not. I have no clear-cut answer for this question. It's just too tough.

However, you can benchmark/measure different hashing implementations and draw preliminary conclusions from that.
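As a rough starting point for such a benchmark, the sketch below compares the built-in string.GetHashCode with a hand-rolled 32-bit FNV-1a (the key count, key length, and the Stopwatch harness are my own arbitrary choices; the FNV offset basis and prime are the standard 32-bit constants):

    using System;
    using System.Diagnostics;
    using System.Linq;

    class HashBenchmark
    {
        // Simple 32-bit FNV-1a over a string's UTF-16 code units
        // (a sketch, not a reference implementation).
        static uint Fnv1a(string s)
        {
            uint hash = 2166136261;
            foreach (char c in s)
            {
                hash ^= c;
                hash = unchecked(hash * 16777619);
            }
            return hash;
        }

        static void Main()
        {
            var rng = new Random(42);
            // 100,000 random 16-character keys as arbitrary test data.
            string[] keys = Enumerable.Range(0, 100_000)
                .Select(_ => new string(Enumerable.Range(0, 16)
                    .Select(__ => (char)rng.Next('a', 'z' + 1)).ToArray()))
                .ToArray();

            var sw = Stopwatch.StartNew();
            int sink1 = 0;
            foreach (string k in keys) sink1 ^= k.GetHashCode();
            sw.Stop();
            Console.WriteLine($"string.GetHashCode: {sw.ElapsedMilliseconds} ms (checksum {sink1})");

            sw.Restart();
            uint sink2 = 0;
            foreach (string k in keys) sink2 ^= Fnv1a(k);
            sw.Stop();
            Console.WriteLine($"FNV-1a:             {sw.ElapsedMilliseconds} ms (checksum {sink2})");
        }
    }

The checksums are only there to keep the hashing loops from being optimized away; for serious measurements a dedicated harness such as BenchmarkDotNet would be more reliable.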

Good luck ;)

Solution 5

The simple hashCode used by Java's String class might show a suitable algorithm.

Below is the "GNU Classpath" implementation. (License: GPL)

  /**
   * Computes the hashcode for this String. This is done with int arithmetic,
   * where ** represents exponentiation, by this formula:<br>
   * <code>s[0]*31**(n-1) + s[1]*31**(n-2) + ... + s[n-1]</code>.
   *
   * @return hashcode value of this String
   */
  public int hashCode()
  {
    if (cachedHashCode != 0)
      return cachedHashCode;

    // Compute the hash code using a local variable to be reentrant.
    int hashCode = 0;
    int limit = count + offset;
    for (int i = offset; i < limit; i++)
      hashCode = hashCode * 31 + value[i];
    return cachedHashCode = hashCode;
  }
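
Since the question asks for C#, a rough adaptation of the same 31-multiplier scheme might look like this (my own sketch, not library code):

    // Rough C# adaptation of the 31-multiplier scheme above (a sketch, not library code).
    static int SimpleStringHash(string s)
    {
        int hash = 0;
        foreach (char c in s)
        {
            // Horner's-rule evaluation of s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1];
            // unchecked lets the int wrap around instead of throwing in a checked context.
            hash = unchecked(hash * 31 + c);
        }
        return hash;
    }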

Comments

  • dpan
    dpan almost 2 years

    What would be the best hashing algorithm if we had the following priorities (in that order):

    1. Minimal hash collisions
    2. Performance

    It doesn't have to be secure. Basically I'm trying to create an index based on a combination of properties of some objects. All the properties are strings.

    Any references to c# implementations would be appreciated.

  • Jon Skeet
    Jon Skeet over 15 years
    That's the algorithm used for the hashtable itself, after you've calculated the hash. The question is about how to calculate a good hash.
  • Nick Johnson
    Nick Johnson over 15 years
    He's hashing strings, not ints.
  • Steven Sudit
    Steven Sudit almost 15 years
    MurmurHash2 is very fast and well-distributed. murmurhash.googlepages.com
  • Steven Sudit
    Steven Sudit almost 15 years
    No, but there are some wrong ones. Some hashes just perform poorly in terms of distribution, not to mention execution time.
  • Andrei Rînea
    Andrei Rînea about 13 years
    Jon Skeet has spoken. You failed. :P
  • dbkk
    dbkk over 12 years
    Good advice when it comes to hashtables, but not for other uses of hashes (e.g. detecting if items are identical without keeping a copy of the other item).
  • Mecki
    Mecki over 11 years
    @dbkk: You are right, if you need to detect duplicates without keeping the data around, you would need a collision-free hash... in theory. In practice you just use MD5 or SHA1, since these hashes are very good (albeit slow) and the chances of collisions are very, very low. For implementing a hashtable, though, both algorithms are way too slow and produce much too big hash values (32-bit hashes are ideal for hashtables; in some exceptional cases you may need 64-bit values; anything bigger than that is just a waste of time).
  • TMS
    TMS about 7 years
    Yeah, that works, but it needs a lot of computing power.