hash function providing unique uint from an integer coordinate pair

25,637

Solution 1

a hash function that is GUARANTEED collision-free is not a hash function :)

Instead of using a hash function, you could consider using binary space partition trees (BSPs) or XY-trees (closely related).

If you want to hash two uint32's into one uint32, do not use things like Y & 0xFFFF because that discards half of the bits. Do something like

(x * 0x1f1f1f1f) ^ y

(you need to transform one of the variables first to make sure the hash function is not commutative)

Solution 2

Cantor's enumeration of pairs

   n = ((x + y)*(x + y + 1)/2) + y

might be interesting, as it's closest to your original canvaswidth * y + x but will work for any x or y. But for a real world int32 hash, rather than a mapping of pairs of integers to integers, you're probably better off with a bit manipulation such as Bob Jenkin's mix and calling that with x,y and a salt.

Solution 3

Like Emil, but handles 16-bit overflows in x in a way that produces fewer collisions, and takes fewer instructions to compute:

hash = ( y << 16 ) ^ x;

Solution 4

You can recursively divide your XY plane into cells, then divide these cells into sub-cells, etc.

Gustavo Niemeyer invented in 2008 his Geohash geocoding system.

Amazon's open source Geo Library computes the hash for any longitude-latitude coordinate. The resulting Geohash value is a 63 bit number. The probability of collision depends of the hash's resolution: if two objects are closer than the intrinsic resolution, the calculated hash will be identical.

enter image description here

Read more:

https://en.wikipedia.org/wiki/Geohash https://aws.amazon.com/fr/blogs/mobile/geo-library-for-amazon-dynamodb-part-1-table-structure/ https://github.com/awslabs/dynamodb-geo

Solution 5

Your "ideal" is impossible.

You want a mapping (x, y) -> i where x, y, and i are all 32-bit quantities, which is guaranteed not to generate duplicate values of i.

Here's why: suppose there is a function hash() so that hash(x, y) gives different integer values. There are 2^32 (about 4 billion) values for x, and 2^32 values of y. So hash(x, y) has 2^64 (about 16 million trillion) possible results. But there are only 2^32 possible values in a 32-bit int, so the result of hash() won't fit in a 32-bit int.

See also http://en.wikipedia.org/wiki/Counting_argument

Generally, you should always design your data structures to deal with collisions. (Unless your hashes are very long (at least 128 bit), very good (use cryptographic hash functions), and you're feeling lucky).

Share:
25,637
AndreasT
Author by

AndreasT

Updated on September 10, 2020

Comments

  • AndreasT
    AndreasT almost 4 years

    The problem in general: I have a big 2d point space, sparsely populated with dots. Think of it as a big white canvas sprinkled with black dots. I have to iterate over and search through these dots a lot. The Canvas (point space) can be huge, bordering on the limits of int and its size is unknown before setting points in there.

    That brought me to the idea of hashing:

    Ideal: I need a hash function taking a 2D point, returning a unique uint32. So that no collisions can occur. You can assume that the number of dots on the Canvas is easily countable by uint32.

    IMPORTANT: It is impossible to know the size of the canvas beforehand (it may even change), so things like

    canvaswidth * y + x

    are sadly out of the question.

    I also tried a very naive

    abs(x) + abs(y)

    but that produces too many collisions.

    Compromise: A hash function that provides keys with a very low probability of collision.

    Any ideas anybody? Thanks for any help.

    Best regards, Andreas T.

    Edit: I had to change something in the question text: I changed the assumption "able to count the number of points of the canvas with uint32" into "able to count the dots on the canvas (or the number of coordinate pairs to store" by uint32. My original question didn't make much sense, because I would have had a sqrt(max(uint32))xsqrt(max(uint32)) sized canvas, which is uniquely representable by a 16 bit shift and OR.

    I hope this is ok, since all answers still make most sense with the updated assumptions

    Sorry for that.