What is a good 64bit hash function in Java for textual strings?

56,109

Solution 1

Why don't you use a long variant of the default String.hashCode() (where some really smart guys certainly put effort into making it efficient - not mentioning the thousands of developer eyes that already looked at this code)?

// adapted from String.hashCode()
public static long hash(String string) {
  long h = 1125899906842597L; // prime
  int len = string.length();

  for (int i = 0; i < len; i++) {
    h = 31*h + string.charAt(i);
  }
  return h;
}

If you're looking for even more bits, you could probably use a BigInteger Edit:

As I mentioned in a comment to the answer of @brianegge, there are not much usecases for hashes with more than 32 bits and most likely not a single one for hashes with more than 64 bits:

I could imagine a huge hashtable distributed across dozens of servers, maybe storing tens of billions of mappings. For such a scenario, @brianegge still has a valid point here: 32 bit allow for 2^32 (ca. 4.3 billion) different hash keys. Assuming a strong algorithm, you should still have quite few collisions. With 64 bit (18,446,744,073 billion different keys) your certainly save, regardless of whatever crazy scenario you need it for. Thinking of usecases for 128 bit keys (340,282,366,920,938,463,463,374,607,431 billion possible keys) is pretty much impossible though.

To combine the hash for several fields, simply do an XOR multiply one with a prime and add them:

long hash = MyHash.hash(string1) * 31 + MyHash.hash(string2);

The small prime is in there to avoid equal hash code for switched values, i.e. {'foo','bar'} and {'bar','foo'} aren't equal and should have a different hash code. XOR is bad as it returns 0 if both values are equal. Therefore, {'foo','foo'} and {'bar','bar'} would have the same hash code.

Solution 2

An answer for today (2018). SipHash.

It will be much faster than most of the answers here, and significantly higher quality than all of them.

The Guava library has one: https://google.github.io/guava/releases/23.0/api/docs/com/google/common/hash/Hashing.html#sipHash24--

Solution 3

Create an SHA-1 hash and then mask out the lowest 64bits.

Solution 4

Why not use a CRC64 polynomial. These are reasonably efficient and optimized to make sure all bits are counted and spread over the result space.

There are plenty of implementations available on the net if you google "CRC64 Java"

Solution 5

long hash = string.hashCode();

Yes, the top 32 bits will be 0, but you will probably run out of hardware resources before you run into problems with hash collisions. The hashCode in String is quite efficient and well tested.

Update I think the above satisfies the simplest thing which could possibly work, however, I agree with @sfussenegger idea of extending the existing String hashCode.

In addition to having a good hashCode for your String, you may want to consider rehashing the hash code in your implementation. If your storage is used by other developers, or used with other types, this can help distributed your keys. For example, Java's HashMap is based on power-of-two length hash tables, so it adds this function to ensure the lower bits are sufficiently distributed.

    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
Share:
56,109
ripper234
Author by

ripper234

See blog or LinkedIn Profile

Updated on January 16, 2020

Comments

  • ripper234
    ripper234 over 4 years

    I'm looking for a hash function that:

    1. Hashes textual strings well (e.g. few collisions)
    2. Is written in Java, and widely used
    3. Bonus: works on several fields (instead of me concatenating them and applying the hash on the concatenated string)
    4. Bonus: Has a 128-bit variant.
    5. Bonus: Not CPU intensive.