Fast String Hashing Algorithm with low collision rates with 32 bit integer

93,702

Solution 1

One of the FNV variants should meet your requirements. They're fast, and produce fairly evenly distributed outputs.

Solution 2

Murmur Hash is pretty nice.

Solution 3

There's also a nice article at eternallyconfuzzled.com.

Jenkins' One-at-a-Time hash for strings should look something like this:

#include <stdint.h>

uint32_t hash_string(const char * s)
{
    uint32_t hash = 0;

    for(; *s; ++s)
    {
        hash += *s;
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }

    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);

    return hash;
}

Solution 4

For a fixed string-set use gperf.

If your string-set changes you have to pick one hash function. That topic has been discussed before:

What's the best hashing algorithm to use on a stl string when using hash_map?

Solution 5

Another solution that could be even better depending on your use-case is interned strings. This is how symbols work e.g. in Lisp.

An interned string is a string object whose value is the address of the actual string bytes. So you create an interned string object by checking in a global table: if the string is in there, you initialize the interned string to the address of that string. If not, you insert it, and then initialize your interned string.

This means that two interned strings built from the same string will have the same value, which is an address. So if N is the number of interned strings in your system, the characteristics are:

  • Slow construction (needs lookup and possibly memory allocation)
  • Requires global data and synchronization in the case of concurrent threads
  • Compare is O(1), because you're comparing addresses, not actual string bytes (this means sorting works well, but it won't be an alphabetic sort).

Cheers,

Carl

Share:
93,702
Jason Citron
Author by

Jason Citron

I make games. I play games.

Updated on July 05, 2022

Comments

  • Jason Citron
    Jason Citron almost 2 years

    I have lots of unrelated named things that I'd like to do quick searches against. An "aardvark" is always an "aardvark" everywhere, so hashing the string and reusing the integer would work well to speed up comparisons. The entire set of names is unknown (and changes over time). What is a fast string hashing algorithm that will generate small (32 or 16) bit values and have a low collision rate?

    I'd like to see an optimized implementation specific to C/C++.

  • C. K. Young
    C. K. Young almost 16 years
    Yay for perfect hash generators!
  • Nick Johnson
    Nick Johnson almost 16 years
    CRCs are designed for error detection and correction. Their distribution characteristics are typically not very good.
  • Nils Pipenbrinck
    Nils Pipenbrinck almost 16 years
    Arachnid has obviously never tried CRC32 as hashes. They work well.
  • TimB
    TimB almost 16 years
    Perfect hashing is NOT appropriate for this application, since the set of names is unknown and changes. Therefore, gperf won't work for this.
  • obecalp
    obecalp over 15 years
    Both STL and boost tr1 has extremely weak hash function for strings.
  • obecalp
    obecalp over 15 years
    Yes, this is the current leading general purpose hash function for hash tables. It's non-crypto, of course, with a pair of obvious differential.
  • obecalp
    obecalp over 15 years
    "CRC32 was never intended for hash table use. There is really no good reason to use it for this purpose." cf. home.comcast.net/~bretm/hash/8.html
  • Steven Sudit
    Steven Sudit almost 15 years
    If you're going to use FNV, stick to FNV-1a, since it has acceptable results on the avalanche test (see home.comcast.net/~bretm/hash/6.html). Or just use MurmurHash2, which is better in both speed and distribution (murmurhash.googlepages.com).
  • Steven Sudit
    Steven Sudit almost 15 years
    The hashes are quite solid, and technically interesting, but not necessarily fast. Consider that One-at-a-Time hash processes input byte by byte, where other hashes take 4 or even 8 bytes at a time. The speed differnece is substantial!
  • Sam
    Sam almost 15 years
    Bob's hashes are very fast: azillionmonkeys.com/qed/hash.html
  • Admin
    Admin over 13 years
    @Steven: MurmurHash hash has only been analzyed by its author. I've used it in a few different scenarios and the newer version of FNV seems to do a better job.
  • Steven Sudit
    Steven Sudit over 13 years
    @sonicoder: While I'm not going to oversell MurmurHash, plain FNV is downright terrible and FNV-1a is only passable. As it happens, MurmurHash has been extensively analyzed and found useful. It's still not a cryptographic hash and there are going to be collisions no matter what, but it's still a huge improvement over any type of FNV.
  • Steven Sudit
    Steven Sudit over 13 years
    A perfect hash is a very elegant solution, when available.
  • Admin
    Admin over 13 years
    @Steven Sudit: As I said, it was analyzed "only" by its author and no one else. hence the results of the "analysis" aren't really objective.
  • Steven Sudit
    Steven Sudit over 13 years
    @sonicoder: I'll speak more bluntly: no, you're mistaken. It was analyzed by a number of third parties, including academic ones. Visit Wikipedia for links. What's more important is that, not only did it do well in general, but the specific flaws that were found were addressed through the creation of MurmurHash3.
  • Steven Sudit
    Steven Sudit over 13 years
    Also, I would appreciate it if whoever's been dittoing sonicoder would take the time to explain themselves.
  • jokoon
    jokoon over 11 years
    I don't understand those bitshift values hval += (hval<<1) + (hval<<4) + (hval<<7) + (hval<<8) + (hval<<24);
  • Emil Styrke
    Emil Styrke over 10 years
    Note: the new URL for MurmurHash3 is code.google.com/p/smhasher
  • greybeard
    greybeard about 9 years
    How does that (given a hypertext browser that does display that links contents) map to (32 or 16) bit values, given character sets from, say, 24 to 1.111.998 code points? Base conversion is not a useful hash function.
  • DarthGizka
    DarthGizka almost 8 years
    FNV hashes are slow and of mediocre quality - do some measurements and read the actual papers (which contain quite a bit of hair-raising nonsense) instead of taking the author's claims at face value and helping them perpetuate the myth that FNV hashes are fast and of good quality. Stick with a hash of proven exceptional quality like MurmurHash or pick one of the many simple hashes that beat FNV in speed and quality. About the only thing that FNV has going for it is its simplicity, if performance is not a concern (or if the cost of operations is skewed, as in interpreted languages).
  • Admin
    Admin over 3 years
    @DarthGizka, Example of this "one of the many simple hashes that beat FNV in speed and quality"?? I'm really interested.
  • DarthGizka
    DarthGizka over 3 years
    @Richard: depending on your chosen application you could try self-xoring the rotated accumulator (with a rotation that is relatively prime to the word size) or an xorshift, in any case followed at the end of hashing by a spreading multiplication. CRC has excellent quality and can be really fast if you can use the intrinsics. When things get trickier I'm certain that Bob Jenkins's extensive and thorough research into hashes has covered exactly what you need, somewhere. Prime moduli can be fast if you multiply w/ inverse instead of dividing.