Hash function for floats

26,216

Solution 1

It depends on the application but most of time floats should not be hashed because hashing is used for fast lookup for exact matches and most floats are the result of calculations that produce a float which is only an approximation to the correct answer. The usually way to check for floating equality is to check if it is within some delta (in absolute value) of the correct answer. This type of check does not lend itself to hashed lookup tables.

EDIT:

Normally, because of rounding errors and inherent limitations of floating point arithmetic, if you expect that floating point numbers a and b should be equal to each other because the math says so, you need to pick some relatively small delta > 0, and then you declare a and b to be equal if abs(a-b) < delta, where abs is the absolute value function. For more detail, see this article.

Here is a small example that demonstrates the problem:

float x = 1.0f;
x = x / 41;
x = x * 41;
if (x != 1.0f)
{
    std::cout << "ooops...\n";
}

Depending on your platform, compiler and optimization levels, this may print ooops... to your screen, meaning that the mathematical equation x / y * y = x does not necessarily hold on your computer.

There are cases where floating point arithmetic produces exact results, e.g. reasonably sized integers and rationals with power-of-2 denominators.

Solution 2

If your hash function did the following you'd get some degree of fuzziness on the hash lookup

unsigned int Hash( float f )
{
    unsigned int ui;
    memcpy( &ui, &f, sizeof( float ) );
    return ui & 0xfffff000;
}

This way you'll mask off the 12 least significant bits allowing for a degree of uncertainty ... It really depends on yout application however.

Solution 3

You can use the std hash, it's not bad:

 std::size_t myHash = std::cout << std::hash<float>{}(myFloat);

Solution 4

unsigned hash(float x)
{
    union
    {
        float f;
        unsigned u;
    };
    f = x;
    return u;
}

Technically undefined behavior, but most compilers support this. Alternative solution:

unsigned hash(float x)
{
    return (unsigned&)x;
}

Both solutions depend on the endianness of your machine, so for example on x86 and SPARC, they will produce different results. If that doesn't bother you, just use one of these solutions.

Solution 5

You can of course represent a float as an int type of the same size to hash it, however this naive approach has some pitfalls you need to be careful of...

Simply converting to a binary representation is error prone since values which are equal wont necessarily have the same binary representation.

An obvious case: -0.0 wont match 0.0 for example. *

Further, simply converting to an int of the same size wont give very even distribution, which is often important (implementing a hash/set that uses buckets for example).

Suggested steps for implementation:

  • filter out non-finite cases (nan, inf) and (0.0, -0.0 whether you need to do this explicitly or not depends on the method used).
  • convert to an int of the same size
    (that is - use a union for example to represent the float as an int, not simply cast to an int).
  • re-distribute the bits, (intentionally vague here!), this is basically a speed vs quality tradeoff. But if you have many values in a small range you probably don't want them to in a similar range too.

*: You may wan't to check for (nan and -nan) too. How to handle those exactly depends on your use case (you may want to ignore sign for all nan's as CPython does).

Python's _Py_HashDouble is a good reference for how you might hash a float, in production code (ignore the -1 check at the end, since that's a special value for Python).

Share:
26,216
Pacane
Author by

Pacane

wat?

Updated on July 14, 2022

Comments

  • Pacane
    Pacane almost 2 years

    I'm currently implementing a hash table in C++ and I'm trying to make a hash function for floats...

    I was going to treat floats as integers by padding the decimal numbers, but then I realized that I would probably reach the overflow with big numbers...

    Is there a good way to hash floats?

    You don't have to give me the function directly, but I'd like to see/understand different concepts...

    Notes:

    1. I don't need it to be really fast, just evenly distributed if possible.

    2. I've read that floats should not be hashed because of the speed of computation, can someone confirm/explain this and give me other reasons why floats should not be hashed? I don't really understand why (besides the speed)