What is a good hash function for a collection (i.e., multi-set) of integers?


Solution 1

I asked this same question on cstheory.stackexchange.com and got a good answer:

https://cstheory.stackexchange.com/questions/3390/is-there-a-hash-function-for-a-collection-i-e-multi-set-of-integers-that-has

Solution 2

I agree with Dzmitry on using the arithmetic SUM of hashes, but I'd recommend applying a hash function with good output distribution to the input integers, rather than just reversing their bits. Reversing bits doesn't improve the output distribution; it can even worsen it, since the probability that the high-order bits will be lost to sum overflow is much higher than the probability that the low-order bits will be lost. Here is an example of a fast hash function with good output distribution: http://burtleburtle.net/bob/c/lookup3.c . Also read the paper describing how hash functions should be constructed: http://burtleburtle.net/bob/hash/evahash.html .
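In that spirit, a minimal sketch: mix each element's hash well before summing. The mixer below is the fmix32 finalizer from MurmurHash3, used here as a stand-in for lookup3 (which is C); the function names are mine:

```scala
// fmix32 finalizer from MurmurHash3: an invertible mixing step that
// spreads input bits across the whole 32-bit output.
def mix(x: Int): Int = {
  var h = x
  h ^= h >>> 16
  h *= 0x85ebca6b
  h ^= h >>> 13
  h *= 0xc2b2ae35
  h ^= h >>> 16
  h
}

// Order-independent multiset hash: the sum of the mixed element hashes.
def multisetHash(xs: Iterable[Int]): Int = xs.iterator.map(mix(_)).sum
```

Because `mix` is a bijection, distinct elements never collide before the sum, and the sum itself stays order-independent and O(1)-updatable.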

Using the SUM of the hash values of each element in the set satisfies the requirements in the question:

  • Memory usage is constant: we store one ordinary integer holding the hash value per set. This integer is used for O(1) updates of the hash when adding/removing elements.
  • Adding a new element only requires adding the element's hash value to the existing hash value, i.e. the operation is O(1).
  • Removing an existing element only requires subtracting the element's hash value from the existing hash value, i.e. the operation is O(1).
  • Unlike XOR, the hash differs between sets that differ only by a pair of identical elements.

SUM and SUB are safe operations in the face of integer overflow, since they are reversible in modular arithmetic, where the modulus is 2^32 or 2^64 for integers in Java.
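A minimal sketch of this incremental scheme: one running Int sum is kept per multiset, add/remove are O(1), and wrap-around on overflow is harmless because + and - are exact inverses modulo 2^32. The class name is mine, and `elemHash` is a placeholder for any well-distributed per-element hash (bit reversal is used only as a default):

```scala
// Incremental multiset hash: a single running sum of element hashes.
class MultisetHash(elemHash: Int => Int = x => java.lang.Integer.reverse(x)) {
  private var sum = 0
  def add(x: Int): Unit = sum += elemHash(x)    // O(1) insert
  def remove(x: Int): Unit = sum -= elemHash(x) // O(1) delete, exact inverse of add
  def value: Int = sum
}
```

Adding and then removing an element restores the previous hash exactly, even if the sum overflowed in between.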

Solution 3

Reverse-bits.

For example, 00001011 becomes 11010000. Then just SUM all the reversed set elements.


If we need O(1) insert/delete, the plain SUM will work (that's how the hashCode of Sets is defined in Java), though it is not well distributed over sets of small integers.

If our set is not uniformly distributed (as is usually the case), we need a mapping N -> f(N) such that f(N) is uniformly distributed for the expected data sample. Usually a data sample contains many more close-to-zero numbers than close-to-maximum numbers. In that case, the reverse-bits hash distributes them uniformly.

Example in Scala:

def hash(v: Int): Int = {
  // Reverse the bit order of v: bit 0 moves to bit 31, and so on.
  // (The JDK provides the same operation as java.lang.Integer.reverse.)
  var h = v & 1
  for (i <- 1 to 31) {
    h <<= 1             // make room for the next bit
    h |= (v >>> i) & 1  // append bit i of v
  }
  h
}

def hash(a: Set[Int]): Int = {
  // Order-independent multiset hash: the sum of the element hashes.
  var h = 0
  for (e <- a) {
    h += hash(e)
  }
  h
}

The hash of our multi-set still won't be perfectly uniform, but it is much better than a simple SUM.
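The hand-rolled loop above computes the same thing the JDK ships as `java.lang.Integer.reverse`. A quick demo (the function name is mine) of why the reversal helps: small inputs cluster near zero under a plain sum, while reverse-bits moves their information into the high bits.

```scala
// Reverse-bits multiset hash via the JDK's built-in bit reversal.
def reverseBitsHash(a: Set[Int]): Int =
  a.iterator.map(java.lang.Integer.reverse(_)).sum

println(Set(1, 2, 3).sum)           // 6: plain sum stays near zero
println(reverseBitsHash(Set(1, 2, 3))) // high bits set: spread across the range
```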



Author: jonderry

Updated on January 06, 2020

Comments

  • jonderry over 4 years

    I'm looking for a function that maps a multi-set of integers to an integer, hopefully with some kind of guarantee like pairwise independence.

    Ideally, memory usage would be constant, and the hash value could be updated in O(1) time after an insert/delete. (This forbids doing something like sorting the integers and using a hash function like h(x) = h_1(x_1, h_2(x_2, h_3(x_3, x_4))).)

    XORing hashes together doesn't work because h({1,1,2}) = h({2})

    I think multiplying hashes together modulo a prime might work if the underlying hash function had an unrealistically strong guarantee, such as n-independence.

  • jonderry over 13 years
    You can't just concatenate because the order of integers in a collection isn't supposed to affect the hash value.
  • Dzmitry Lazerka over 13 years
    Also, concatenation won't work well, because h({12, 3}) == h({1, 23}).
  • jonderry over 13 years
    This is similar to what I suggested in the question. You can delete by multiplying by the modular inverse of the hash of the element to be removed. This requires keeping the hash of the collection modulo a prime (and ensuring that the hash of an element is non-zero). The question I have about this method is how to get a theoretical guarantee.
  • jonderry over 13 years
    This may work in practice, though it's easy to raise the odds of a collision fairly significantly. Consider the odds that {1, 1,..., 1} collides with {2, 2,..., 2} if there are, say, 2^16 copies in each set. This can be helped by taking the sum modulo a prime. Still, there's no theoretical guarantee, which is something else I'm curious about (whether one can get a good theoretical guarantee on the distribution of a hash function meeting the above specs for a collection).
  • Dzmitry Lazerka over 13 years
    > Reversing bits doesn't improve output distribution. It does: if the majority of input values are close to zero, as I stated, then it does so in the best way.
  • Dzmitry Lazerka over 13 years
    "... high order bits will be lost": no bits are lost while doing SUM. You have proved that yourself in the last sentence :)
  • Dzmitry Lazerka over 13 years
    But your proposal is also good, I think. It is tolerant of cases where the input data is all even or all odd numbers (or multiples of greater powers of 2).
  • Dzmitry Lazerka over 13 years
    But if the input data is free of such patterns, reverse-bits is the best function for general hash tables: changes in the lower bits of input numbers place the hashes in different buckets of the hash table, while your proposal does not distinguish whether the input data is close to zero (as is usually the case) or not.
  • ma11hew28 over 8 years
    Concatenation only works if you sort the integers first, which the question forbids.
  • jeffreyveon almost 4 years
    Is there a way this hash function can be modified to support order dependence? i.e. [a, b, c] and [b, c, a] producing different hash values?
  • martinus almost 4 years
    @jeffreyveon if you need order dependence then you can use any hash.
  • jeffreyveon almost 4 years
    I have a list of numbers. Each number in the list can be hashed independently, but how do I then combine those hashes to a single hash that signifies the list as a whole? One approach I have taken is to use the index position as part of the hash update, like this: this->hash *= (1779033703 + 2*x*i); where i is the index of element x in the list.
  • martinus almost 4 years
    you could use something like boost::hash_combine, but I don't see a reason for using anything special? why not just use e.g. murmurhash to hash the whole data block?
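The multiplicative scheme jonderry raises in the comments (multiply nonzero element hashes modulo a prime; divide out a removed element via its modular inverse) can be sketched as follows. The prime 2^31 - 1, the multiplier 2654435761, and the zero handling are illustrative choices of mine, not from the thread:

```scala
val P = 2147483647L // 2^31 - 1, a Mersenne prime (any sufficiently large prime works)

// Map an element to a nonzero residue mod P. A real implementation should use
// a stronger per-element hash; elements hashing to 0 are mapped to 1 (a no-op
// under multiplication), which is a known weakness of this simple sketch.
def elem(x: Int): Long = {
  val h = ((x.toLong & 0xffffffffL) % P) * 2654435761L % P
  if (h == 0) 1L else h
}

// Modular exponentiation; since P is prime, the inverse of a is a^(P-2) (Fermat).
def modPow(a: Long, e: Long, m: Long): Long = {
  var base = a % m
  var exp = e
  var r = 1L
  while (exp > 0) {
    if ((exp & 1L) == 1L) r = r * base % m
    base = base * base % m
    exp >>= 1
  }
  r
}

class MulMultisetHash {
  private var h = 1L
  def add(x: Int): Unit = h = h * elem(x) % P                      // O(1)
  def remove(x: Int): Unit = h = h * modPow(elem(x), P - 2, P) % P // O(1)
  def value: Long = h
}
```

Unlike XOR, this distinguishes multisets that differ by a pair of identical elements, and removal undoes insertion exactly; the open question from the thread, a theoretical guarantee on its distribution, remains open in this sketch too.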