String to Integer Hashing Function with Precision

12,301

Solution 1

By definition, a hash function maps a large input space onto a smaller range of hash values, so some distinct inputs must produce the same value.

In theory, a 32-bit hash has enough range to give every string of up to 5 characters (A-Z, a-z, 0-9 only) its own value: 62^5 is about 9.2 x 10^8, which is below 2^32 (about 4.3 x 10^9), whereas 62^6 (about 5.7 x 10^10) no longer fits. In practice, hashes are not a perfect permutation of the input. Given a 32-bit hash, you can expect collisions after hashing roughly 2^16 (about 65,000) random inputs, due to the birthday paradox.
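The birthday estimate can be made concrete. As a rough sketch, assuming the hash values are uniformly distributed over N = 2^32 outputs (real hash functions only approximate this), the probability of at least one collision among n inputs is about:

```latex
p(n) \approx 1 - \exp\!\left(-\frac{n(n-1)}{2N}\right), \qquad N = 2^{32}
% n = 2^{16}:  p \approx 1 - e^{-1/2} \approx 0.39
% p = 0.5  at  n \approx \sqrt{2 N \ln 2} \approx 1.18 \cdot 2^{16} \approx 77{,}000
```

So at around 65,000 random inputs a collision is already quite likely, and at around 77,000 it is more likely than not.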

Given a static set of data values, it's always possible to construct a hash function designed specifically for them that never maps two of them to the same value (of course, its output must be at least log2(|data set|) bits wide). However, it requires you to know all the possible data values ahead of time. This is called perfect hashing.
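One simple way to illustrate the idea, as a sketch rather than a production technique, is to search for a seed under which a seeded hash places every known key in its own slot. The names `seeded_hash` and `find_perfect_seed`, the FNV-1a based mixing and the seed limit are all assumptions made for this example; dedicated perfect hash generators use more refined constructions.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Seeded FNV-1a with an extra xor-fold. The seeding scheme is an
// illustrative choice, not a standard one.
std::uint32_t seeded_hash(const std::string& key, std::uint32_t seed)
{
    std::uint32_t h = 2166136261u ^ seed;   // FNV offset basis, perturbed by the seed
    for (unsigned char c : key) {
        h ^= c;
        h *= 16777619u;                     // FNV prime
    }
    return h ^ (h >> 16);                   // let the high bits influence the low bits
}

// Try seeds 0..max_seed-1 until every key lands in its own slot of a table
// with `slots` entries. Returns true and stores the seed on success.
// The keys must be distinct, otherwise no seed can work.
bool find_perfect_seed(const std::vector<std::string>& keys,
                       std::size_t slots,
                       std::uint32_t& seed_out,
                       std::uint32_t max_seed = 100000)
{
    if (slots == 0 || keys.size() > slots) return false;
    for (std::uint32_t seed = 0; seed < max_seed; ++seed) {
        std::vector<bool> used(slots, false);
        bool ok = true;
        for (const auto& k : keys) {
            std::size_t slot = seeded_hash(k, seed) % slots;
            if (used[slot]) { ok = false; break; }   // two keys share a slot
            used[slot] = true;
        }
        if (ok) { seed_out = seed; return true; }
    }
    return false;
}
```

Once a seed is found, `seeded_hash(key, seed) % slots` is collision-free for exactly that key set. Any key outside the set can still collide with one inside it.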

That being said, here are a few alternatives which should get you started (they are designed to minimize collisions)
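The comments on the question mention djb2 and sdbm as being among the suggested functions. As a minimal sketch of what such an alternative might look like when reduced to a given number of decimal digits, here is djb2 with a precision parameter. The wrapper name `hash_with_precision` is hypothetical, and the precision is assumed to be between 0 and 9 so that the modulus fits in 32 bits.

```cpp
#include <cstdint>

// djb2 (Bernstein's "times 33" string hash) over a NUL-terminated string.
std::uint32_t djb2(const char* key)
{
    std::uint32_t h = 5381;
    while (*key) {
        // h * 33 + c, reading the character as unsigned
        h = h * 33u + static_cast<unsigned char>(*key++);
    }
    return h;
}

// Reduce the hash to `precision` decimal digits (assumed range: 0..9).
int hash_with_precision(const char* key, int precision)
{
    std::uint32_t modulus = 1;
    for (int i = 0; i < precision; ++i) modulus *= 10u;  // 10^precision, integers only
    return static_cast<int>(djb2(key) % modulus);
}
```

For example, `hash_with_precision("SAEUI1210.00000010_1", 6)` yields a value in the range 0 to 999999. The reduction step still limits the number of distinct outputs, so collisions remain possible.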

Solution 2

Every hash will have collisions. Period. How quickly they show up is governed by the birthday problem.

You may want to check cryptographic hash functions like MD5 (relatively fast, and you don't care that it's insecure), but it will also have collisions.

Solution 3

Hashes generate the same value for different inputs -- that's what they do. All you can do is create a hash function with sufficient distribution or bit depth (or both) to minimize those collisions. Since you have this additional constraint of precision (0-5 ?), the output range shrinks to at most 10^precision values, so you are going to hit collisions far more often.
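To get a feel for how much the precision constraint costs, the birthday approximation can be evaluated for a range of 10^precision values. This is a rough sketch that assumes the hash values are uniformly distributed; the function names are made up for the example.

```cpp
#include <cmath>

// Approximate probability of at least one collision among n uniformly random
// hash values drawn from `buckets` possible values (birthday approximation).
double collision_probability(double n, double buckets)
{
    return 1.0 - std::exp(-n * (n - 1.0) / (2.0 * buckets));
}

// Number of distinct outputs available at a given decimal precision.
double buckets_for_precision(int precision)
{
    return std::pow(10.0, precision);
}
```

Under that assumption, `collision_probability(1177, buckets_for_precision(6))` comes out at about 0.5, so with six digits a collision is more likely than not once roughly 1,200 strings have been hashed.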

Solution 4

MD5 or SHA. There are many open implementations, and the outcome is very unlikely to produce a duplicate result.
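An MD5 or SHA digest is a sequence of raw bytes (16 bytes for MD5); the familiar mix of digits and letters is only its hexadecimal printout. Turning it into an integer can therefore be done by combining some of those bytes. The helper below is a sketch with a hypothetical name. It assumes the digest has already been computed by whichever MD5/SHA implementation is in use, and it simply takes the first four bytes.

```cpp
#include <cstdint>

// Combine the first four bytes of a binary digest (big-endian order) into a
// 32-bit integer. `digest` is assumed to point to at least 4 bytes.
std::uint32_t digest_to_u32(const unsigned char* digest)
{
    return (static_cast<std::uint32_t>(digest[0]) << 24) |
           (static_cast<std::uint32_t>(digest[1]) << 16) |
           (static_cast<std::uint32_t>(digest[2]) << 8)  |
            static_cast<std::uint32_t>(digest[3]);
}
```

Keeping only 32 bits, or only a few decimal digits of them, discards most of the digest, so the truncated value collides far more readily than the full digest does.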

Author: Gayan

Updated on June 13, 2022

Comments

  • Gayan
    Gayan about 2 years

    I want to hash a char array into an int or a long. The resulting value has to adhere to a given precision value. The function I've been using is given below:

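    int GetHash(const char* zKey, int iPrecision /*= 6*/)
    {
            /////FROM : http://courses.cs.vt.edu/~cs2604/spring02/Projects/4/elfhash.cpp

            unsigned long h = 0;

            // Modulus 10^iPrecision, built with integer arithmetic so that
            // floating-point rounding in pow() cannot make it off by one.
            long M = 1;
            for (int i = 0; i < iPrecision; ++i) M *= 10;

            while(*zKey)
            {
                    // ELF hash step: shift in the next character, then fold
                    // bits 28-31 back into bits 4-7 and clear them.
                    h = (h << 4) + *zKey++;
                    unsigned long g = h & 0xF0000000L;
                    if (g) h ^= g >> 24;
                    h &= ~g;
            }

            // Keep only the last iPrecision decimal digits.
            return (int) (h % M);
    }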
    

    The string to be hashed is similar to "SAEUI1210.00000010_1".

    However, this produces duplicate values in some cases. Are there any good alternatives which wouldn't produce the same hash for different string values?

  • Gayan
    Gayan about 15 years
    Which is the best hashing function to use out of the ones given in the link you've provided and the one that I'm using right now? The function that I'm using seems to be more complex than djb2 and sdbm. Does that mean it's better at avoiding collisions?
  • Gayan
    Gayan about 15 years
    Yes. But my requirement also includes the fact that the result has to be an integer. MD5 hashes contain both ints and chars. I think it's the same for SHA algorithms.
  • MSalters
    MSalters about 15 years
    Perfect hashes by definition don't.
  • ASk
    ASk about 15 years
    The only way to test which hash function is "best" for your purposes is to run a benchmark on a data sample that matches your expected real data. The function that you are using does not mix the input bits together very thoroughly: at each step, at most the 4 topmost bits are folded back in, and for strings shorter than 7 characters even that does not happen, so your hash simply accumulates all the characters with a slight bit overlap.
  • Adam Matan
    Adam Matan about 15 years
    True, but the conversion is trivial: from a 128-bit digest to a 32-bit integer. You'll get two lines of code (hash, int conversion) that produce a de facto collision-free hash.
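The benchmark suggested in the comments can be sketched as a small collision counter: feed a sample of distinct keys through a candidate hash function and count how many results have already been seen. The function name and signature below are assumptions for illustration; the sample keys would need to come from data that resembles the real input.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Count how many keys hash to a value already produced by an earlier key.
// `keys` is assumed to contain distinct strings; `hash_fn` is whichever
// candidate function is being evaluated.
std::size_t count_collisions(const std::vector<std::string>& keys,
                             const std::function<int(const char*)>& hash_fn)
{
    std::unordered_set<int> seen;
    std::size_t collisions = 0;
    for (const auto& key : keys) {
        // insert() reports whether the value was new
        if (!seen.insert(hash_fn(key.c_str())).second) ++collisions;
    }
    return collisions;
}
```

With the function from the question, a call could look like `count_collisions(sample, [](const char* k) { return GetHash(k, 6); })`. Running the same sample through each candidate gives a direct comparison on the data that actually matters.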