Cuckoo hashing in C


Solution 1

http://www.mpi-inf.mpg.de/~sanders/programs/cuckoo/

HTH

Solution 2

As other answers have pointed out, it's true that the simplest cuckoo hashtable requires that the table be half empty. However, the concept has been generalized to d-ary cuckoo hashing, in which each key has d possible places to nest, as opposed to 2 places in the simple version.

The acceptable load factor increases quickly as d is increased. For only d=3, you can already use around a 75% full table. The downside is that you need d independent hash functions. I'm a fan of Bob Jenkins' hash functions for this purpose (see http://burtleburtle.net/bob/c/lookup3.c), which you might find useful in a cuckoo hashing implementation.
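
To make the idea concrete, here is a minimal sketch in C of a d-ary lookup: every key has d candidate buckets, one per hash function, so a lookup is at most d probes. The names (slot_t, dary_lookup, hash_seeded, NBUCKETS) and the FNV-style hash are illustrative stand-ins, not any particular library; in a real implementation you would plug in something stronger such as lookup3, called with a different seed per position.

    #include <stdint.h>
    #include <string.h>

    #define D        3        /* candidate buckets per key */
    #define NBUCKETS 1024     /* table size */

    typedef struct {
        char key[32];
        int  value;
        int  used;
    } slot_t;

    static slot_t table[NBUCKETS];

    /* Stand-in hash: an FNV-1a-style mix with a per-position seed.
     * A stronger hash (e.g. Bob Jenkins' lookup3) would go here. */
    static uint32_t hash_seeded(const char *key, uint32_t seed)
    {
        uint32_t h = 2166136261u ^ (seed * 0x9E3779B9u);
        while (*key) {
            h ^= (uint8_t)*key++;
            h *= 16777619u;
        }
        return h;
    }

    /* Check the d candidate buckets: at most d probes, so lookup is O(1). */
    static int dary_lookup(const char *key, int *value_out)
    {
        for (uint32_t i = 0; i < D; i++) {
            uint32_t b = hash_seeded(key, i) % NBUCKETS;
            if (table[b].used && strcmp(table[b].key, key) == 0) {
                *value_out = table[b].value;
                return 1;   /* found */
            }
        }
        return 0;           /* not in the table */
    }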

Solution 3

Cuckoo hashing is relatively unused outside of academia (aside from hardware caches, which sometimes borrow ideas from it but don't really implement it fully). It requires a very sparse hash table to get good insertion times: you really need to keep about 51% of your table empty for good performance. So it is either fast and takes a lot of space, or slow and space efficient, but never both. Other algorithms are reasonably good at both time and space at once, although they are worse than cuckoo hashing when only time or only space is considered.
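
For reference, here is a minimal sketch of the classic two-table insert (illustrative names, unrelated to the generator below): each key has one nest in each table, and an insert that finds both occupied evicts the resident and pushes it into its other table. The denser the tables get, the longer these eviction chains run, which is where the roughly-half-empty requirement comes from.

    #include <stdint.h>

    #define NSLOTS    1024
    #define MAX_KICKS 32     /* give up and rehash after this many evictions */

    typedef struct { uint32_t key; int value; int used; } cell_t;

    static cell_t t1[NSLOTS], t2[NSLOTS];

    /* Two cheap stand-in hash functions; real code would use stronger ones. */
    static uint32_t h1(uint32_t k) { return (k * 2654435761u) % NSLOTS; }
    static uint32_t h2(uint32_t k) { return ((k ^ 0x5bd1e995u) * 40503u) % NSLOTS; }

    /* Returns 1 on success, 0 when the eviction chain is too long and the
     * caller should grow the tables or pick new hash functions and rehash. */
    static int cuckoo_insert(uint32_t key, int value)
    {
        cell_t cur = { key, value, 1 };
        for (int kick = 0; kick < MAX_KICKS; kick++) {
            cell_t *slot = (kick % 2 == 0) ? &t1[h1(cur.key)] : &t2[h2(cur.key)];
            if (!slot->used) {            /* empty nest: done */
                *slot = cur;
                return 1;
            }
            cell_t evicted = *slot;       /* kick the resident out */
            *slot = cur;
            cur = evicted;                /* now find the evicted key a home */
        }
        return 0;
    }

(A real implementation would also check for a duplicate key before inserting and track the load factor.)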

Here is a code generator for cuckoo hash tables. Check the license of the generator to verify that the output is non-GPL. It should be, but check anyway.

-Adam

Solution 4

Even though it's an old question, someone might still be interested :)

This paper describes the implementation of a parallel d-ary cuckoo hash on GPUs (CUDA/OpenCL). It's described very well and implementing it based on the description is quite easy. Generally worth reading, if you're interested in this topic. (You'll need an ACM login though.)

Solution 5

I can't speak for software, but cuckoo hashing is certainly used in hardware and is becoming very popular. Major vendors of networking equipment have been looking into cuckoo hashing and some already use it. The attraction of cuckoo hashing comes from the constant lookup time, of course, but also from the near-constant insertion time.

Although insertion can theoretically be unbounded, in practice it can be bounded by O(log n), where n is the number of rows in the table(s), and when measured, the insertion time is about 1.1*d memory accesses on average. That's just 10% more than the absolute minimum! Memory access is often the limiting factor in networking equipment.
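
As an illustration of that bound (my reading of it, not any vendor's actual code): the eviction limit can be tied to the table size, so that a chain longer than roughly c * log2(n) displacements triggers a reseed and full rehash instead of running unbounded. The function name kick_limit and the constant c below are assumptions for the sketch.

    #include <math.h>
    #include <stddef.h>

    /* Cap the eviction chain at roughly c * log2(n) displacements, where n is
     * the number of rows; an insert that exceeds the cap forces new hash
     * seeds and a full rehash. The constant c is a tuning knob (value assumed). */
    static unsigned kick_limit(size_t nrows)
    {
        const double c = 4.0;
        return (unsigned)ceil(c * log2((double)nrows));
    }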

Independent hash functions are a must and selecting them properly is difficult. Good luck.

Comments

  • Remo.D
    Remo.D about 2 years

    Does anybody have an implementation of cuckoo hashing in C? If there were an open-source, non-GPL version it would be perfect!

    Since Adam mentioned it in his comment: does anyone know why it is not much used? Is it just a matter of implementation, or do the good theoretical properties not materialize in practice?

  • Matt J
    Matt J over 15 years
    The generator itself is marked as GPLv3. Haven't figured out if the output is or not.
  • Adam Davis
    Adam Davis over 15 years
    It requires a table that is 51% empty for good performance, so it's time efficient, or space efficient, but not both. Other methods strike a better balance between the two, although they may be worse on one count or the other taken alone. Further, it's tricky to implement well (which is why there's a generator...)
  • Remo.D
    Remo.D over 15 years
    I had a look at the generator. Unfortunately it seems to be aimed at creating static lookup tables rather than generic dynamic containers. Nevertheless, it's a good alternative to gperf, I think!
  • Steve Jessop
    Steve Jessop over 15 years
    Personally I wouldn't call 49% "very sparse": an overhead of approx two words per pointer is comparable to a dlist or tree. But I wonder how often in practice you actually get to max density before an overpopulated cycle forces a rehash.
  • Steve Jessop
    Steve Jessop over 15 years
    Btw, O(1) lookup for hashtables is basically a myth. The minimum bits required to express N distinct values is proportional to log N, so the hashing function is almost always in effect O(log N). If you cache the hash value inside objects, it's O(1) for subsequent repeated lookups of the same object.
  • Steve Jessop
    Steve Jessop over 15 years
    That said, hashtables often are faster than trees. But that's because trees often crawl all over memory and/or use a "non-constant time" comparison function.
  • Remo.D
    Remo.D over 15 years
    The O() was just referring to the number of comparisons needed to check for a key. As for the memory requirement: I think I showed that 50% extra space is (more or less) what one gets by choosing a tree-based dictionary. I'm not sure whether I missed something, so I'd welcome any comment on that!
  • Steve Jessop
    Steve Jessop over 15 years
    I agree with you on memory overhead - 8 bytes per object is often perfectly acceptable. So if a robust cuckoo hash implementation were available off the shelf then it might well be a good option. But people are rightly wary of "stunt" data structures until the boring ones are demonstrably too slow.
  • Steve Jessop
    Steve Jessop over 15 years
    One thing I do question, though, is whether the theoretical 49% bound is typically achieved before collisions force you to rehash. The real-world density of the hash might be considerably less than 49%, I don't know. Presumably the academic papers answer that, I just haven't read 'em.
  • Remo.D
    Remo.D over 15 years
    Thanks, that's very helpful as it comes with the report on d-ary hashing!
  • Justin Poliey
    Justin Poliey over 14 years
    The PHash.c file in the Io source is outdated. It has been deprecated by CHash.c in libbasekit which is packaged with Io. dekorte.com/projects/opensource/libbasekit
  • NateS
    NateS over 13 years
    A 0.49 load factor is needed for 2 hashes. Using 3 hashes improves to 0.91, so sparseness isn't a real argument against cuckoo hashing.
  • NateS
    NateS over 13 years
    Usually the table is restricted to a power-of-two (POT) size so that a bitwise AND can be used on the hash rather than a modulus (see the sketch after these comments).
  • Eyal
    Eyal about 13 years
    51% is only if you're using just 2 hashes. Cuckoo with more hashes is more efficient. At 8 hashes you can get over 95%. And it is well used in hardware applications. (I know of 3 companies using it.)
  • Thomas Ahle
    Thomas Ahle over 12 years
    Well, "different hash functions" can also just be the same function with different seeds.
  • Demi
    Demi over 10 years
    Not so sure about the licensing here.
  • Arundale Ramanathan
    Arundale Ramanathan over 2 years
    Getting a 404 on the URL.
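
Putting two of the comments above together (the power-of-two table size and deriving the "independent" hash functions from one hash with different seeds), here is a small sketch; the names, constants, and the FNV-style hash are illustrative, not taken from any specific implementation.

    #include <stdint.h>
    #include <stddef.h>

    #define TABLE_SIZE 4096u               /* must be a power of two */
    #define TABLE_MASK (TABLE_SIZE - 1u)

    /* One base hash (a stand-in for lookup3 or similar), seeded per position,
     * so "d independent hash functions" are just d different seeds. */
    static uint32_t hash_seeded(const void *p, size_t len, uint32_t seed)
    {
        const uint8_t *s = p;
        uint32_t h = 2166136261u ^ seed;
        while (len--) {
            h ^= *s++;
            h *= 16777619u;
        }
        return h;
    }

    /* Bucket index for the i-th hash function: a bitwise AND with the mask
     * instead of a modulus, which only works for power-of-two table sizes. */
    static uint32_t bucket(const void *key, size_t len, uint32_t i)
    {
        return hash_seeded(key, len, 0x9e3779b9u * (i + 1u)) & TABLE_MASK;
    }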