perfect hash function

12,515

Solution 1

Found One

I tried a few things and found one semi-manually:

(n ^ 28) % 13

The semi-manual part was the following ruby script that I used to test candidate functions with a range of parameters:

t = [10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0]
(1..200).each do |i|
  t2 = t.map { |e| (e ^ i) % 13 }
  puts i if t2.uniq.length == t.length
end

Solution 2

if you know the exact keys then it is trivial to produce a perfect hash function -

int hash (int n) {
  switch (n) {
    case 10:   return 0;
    case 100:  return 1;
    case 32:   return 2;
    // ...
    default:   return -1;
  }
}

Solution 3

On some platforms (e.g. embedded), modulo operation is expensive, so % 13 is better avoided. But AND operation of low-order bits is cheap, and equivalent to modulo of a power-of-2.

I tried writing a simple program (in Python) to search for a perfect hash of your 11 data points, using simple forms such as ((x << a) ^ (x << b)) & 0xF (where & 0xF is equivalent to % 16, giving a result in the range 0..15, for example). I was able to find the following collision-free hash which gives an index in the range 0..15 (expressed as a C macro):

#define HASH(x)    ((((x) << 2) ^ ((x) >> 2)) & 0xF)

Here is the Python program I used:

data = [ 10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0 ]

def shift_right(value, shift_value):
    """Shift right that allows for negative values, which shift left
    (Python shift operator doesn't allow negative shift values)"""
    if shift_value == None:
        return 0
    if shift_value < 0:
        return value << (-shift_value)
    else:
        return value >> shift_value

def find_hash():
    def hashf(val, i, j = None, k = None):
        return (shift_right(val, i) ^ shift_right(val, j) ^ shift_right(val, k)) & 0xF

    for i in xrange(-7, 8):
        for j in xrange(i, 8):
            #for k in xrange(j, 8):
                #j = None
                k = None
                outputs = set()
                for val in data:
                    hash_val = hashf(val, i, j, k)
                    if hash_val >= 13:
                        pass
                        #break
                    if hash_val in outputs:
                        break
                    else:
                        outputs.add(hash_val)
                else:
                    print i, j, k, outputs

if __name__ == '__main__':
    find_hash()

Solution 4

Just some quasi-analytical ramblings:

In your set of numbers, eleven in all, three are odd and eight are even. Looking at the simplest forms of hashing - %13 - will give you the following hash values: 10 - 3, 100 - 9, 32 - 6, 45 - 6, 58 - 6, 126 - 9, 3 - 3, 29 - 3, 200 - 5, 400 - 10, 0 - 0

Which, of course, is unusable due to the number of collisions. Something more elaborate is needed.

Why state the obvious? Considering that the numbers are so few any elaborate - or rather, "less simple" - algorithm will likely be slower than either the switch statement or (which I prefer) simply searching through an unsigned short/long vector of size eleven positions and using the index of the match.

Why use a vector search?

  1. You can fine-tune it by placing the most often occuring values towards the beginning of the vector.
  2. I assume the purpose is to plug in the hash index into a switch with nice, sequential numbering. In that light it seems wasteful to first use a switch to find the index and then plug it into another switch. Maybe you should consider not using hashing at all and go directly to the final switch?
  3. The switch version of hashing cannot be fine-tuned and, due to the widely differing values, will cause the compiler to generate a binary search tree which will result in a lot of comparisons and conditional/other jumps (especially costly) which take time (I've assumed you've turned to hashing for its speed) and require space.
  4. If you want to speed up the vector search additionally and are using an x86-system you can implement a vector search based on the assembler instructions repne scasw (short)/repne scasd (long) which will be much faster. After a setup time of a few instructions you will find the first entry in one instruction and the last in eleven followed by a few instructions cleanup. This means 5-10 instructions best case and 15-20 worst. This should beat the switch-based hashing in all but maybe one or two cases.

Solution 5

Bob Jenkins has a program for this too: http://burtleburtle.net/bob/hash/perfect.html

Unless you're very lucky, there's no "nice" perfect hash function for a given dataset. Perfect hashing algorithms usually use a simple hashing function on the keys (using enough bits so it's collision-free) then use a table to finish it off.

Share:
12,515
gregghz
Author by

gregghz

Scala Programmer.

Updated on June 04, 2022

Comments

  • gregghz
    gregghz about 2 years

    I'm attempting to hash the values

    10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0
    

    I need a function that will map them to an array that has a size of 13 without causing any collisions.

    I've spent several hours thinking this over and googling and can't figure this out. I haven't come close to a viable solution.

    How would I go about finding a hash function of this sort? I've played with gperf, but I don't really understand it and I couldn't get the results I was looking for.

  • GManNickG
    GManNickG over 13 years
    Sorry, I edited it to be C++, then the C++ tag was removed from the question. You can revert if you want.
  • tobyodavies
    tobyodavies over 13 years
    meh, it was supposed to be C-like pseudo-code, it is better now, ty
  • Exectron
    Exectron almost 13 years
    That is effectively a table search. One of the benefits of a hash function is supposed to be: it allows a faster evaluation of hit/miss than a table search.
  • tobyodavies
    tobyodavies almost 13 years
    @Craig this is demonstrably faster than a table search, even with gcc -O0 this is 5 times faster than a simple linear table search on my machine, with -O2 the linear search takes over a second and time reports a total time of 0.00 for 1 million lookups... It is nearly identical to the accepted answer in speed over 100 million iterations with -O0 and within 0.2s with -O2. And this hash function is faster if all you need to do is work out if the key is present/valid - hash(n)==-1 requires no memory access... and you can safely add keys with the function remaining perfect.
  • Exectron
    Exectron almost 13 years
    Your hash might be faster than a data table look-up agred, but that's not what I meant. To clarify: Your hash might work well for this small data set, agreed. But for a data set of size n, your hash is O(n), whereas the accepted answer is O(1). The point of a hash function is to provide an O(1) solution. So in the particular case of the data set of the original question, your solution is satisfactory, but in the more general case of "finding a perfect hash function for data sets" (in particular, larger than some threshold), your answer isn't suitable.
  • tobyodavies
    tobyodavies almost 13 years
    But a perfect hash function does not make sense for anything larger than this number of values, and it's not a question about hash functions in general. Yet this is an effective method for efficiently implementing perfect hash functions in the range where they are appropriate. Also, the compiler can optimize to a jump table or search tree. It's not necessarily O(n) but often O(1) or O(log(n))
  • Exectron
    Exectron almost 13 years
    "But a perfect hash function does not make sense for anything larger than this number of values"—I've just managed to make a [non-minimal] perfect hash function that hashes my fixed data set of 154 16-bit values into an 8-bit output, without collisions. Then if I want to make a minimal perfect hash function out of that, I can do it with an extra 256-entry look-up table.
  • axel22
    axel22 about 7 years
    A good optimizing compiler will in the worst case produce an O(log(n)) lookup to find the correct switch case, which defeats the purpose of using the hash function for a hash table. In addition, you might want to derive the perfect hash function at runtime sometimes.