Good Hash Function for Strings

432,017

Solution 1

Usually hashes wouldn't do sums, otherwise stop and pots will have the same hash.

and you wouldn't limit it to the first n characters because otherwise house and houses would have the same hash.

Generally hashs take values and multiply it by a prime number (makes it more likely to generate unique hashes) So you could do something like:

int hash = 7;
for (int i = 0; i < strlen; i++) {
    hash = hash*31 + charAt(i);
}

Solution 2

If it's a security thing, you could use Java crypto:

import java.security.MessageDigest;

MessageDigest messageDigest = MessageDigest.getInstance("SHA-256");
messageDigest.update(stringToHash.getBytes());
String stringHash = new String(messageDigest.digest());

Solution 3

You should probably use String.hashCode().

If you really want to implement hashCode yourself:

Do not be tempted to exclude significant parts of an object from the hash code computation to improve performance -- Joshua Bloch, Effective Java

Using only the first five characters is a bad idea. Think about hierarchical names, such as URLs: they will all have the same hash code (because they all start with "http://", which means that they are stored under the same bucket in a hash map, exhibiting terrible performance.

Here's a war story paraphrased on the String hashCode from "Effective Java":

The String hash function implemented in all releases prior to 1.2 examined at most sixteen characters, evenly spaced throughout the string, starting with the first character. For large collections of hierarchical names, such as URLs, this hash function displayed terrible behavior.

Solution 4

If you are doing this in Java then why are you doing it? Just call .hashCode() on the string

Solution 5

Guava's HashFunction (javadoc) provides decent non-crypto-strong hashing.

Share:
432,017

Related videos on Youtube

Leif Andersen
Author by

Leif Andersen

Updated on October 07, 2020

Comments

  • Leif Andersen
    Leif Andersen over 3 years

    I'm trying to think up a good hash function for strings. And I was thinking it might be a good idea to sum up the unicode values for the first five characters in the string (assuming it has five, otherwise stop where it ends). Would that be a good idea, or is it a bad one?

    I am doing this in Java, but I wouldn't imagine that would make much of a difference.

    • WhirlWind
      WhirlWind about 14 years
      Good hash functions depend heavily on the input to the hash, and the requirements of the algorithm. Such a hash will not be very good if all your strings start with the same five characters, for example. It will also tend to result in a normal distribution.
    • Michael Mrozek
      Michael Mrozek about 14 years
      Possible duplicate of 98153
    • Bart Kiers
      Bart Kiers about 14 years
      Why can't you use String's own hashCode()?
    • Leif Andersen
      Leif Andersen about 14 years
      @WhirlWind, true, I'm not sure what the strings will have, other than that it will probably english text.
    • Leif Andersen
      Leif Andersen about 14 years
      @Barl, mainly because my professor told us to implement our own hash functor...and the reason I didn't want to use Java's, was because it was generic, and I would imagine a more specific hash functor would be better.
    • Ayxan Haqverdili
      Ayxan Haqverdili over 4 years
    • Dmitry Gryazin
      Dmitry Gryazin about 4 years
      I have just been fighting against a mysterious bug for several hours. Turned out it's a hash collision with String.hashCode(). Hashing to int seems a bad idea.
    • Gabriel Staples
      Gabriel Staples over 2 years
      See also this Q&A with lots of good hash algorithms mentioned here: hash function for string
  • Leif Andersen
    Leif Andersen about 14 years
    I'm doing it as part of the class, and part of the assignment is to write up several different hash functions. The professor told us to get outside help for the 'better' ones.
  • Daniel Liu
    Daniel Liu over 13 years
    Nice. I have a machine-learning application, doing statistical NLP over a large corpus. After a few initial passes of morphological normalization on the original words in the text, I throw away the string values and use hash codes instead. Throughout my entire corpus, there are about 600,000 unique words, and using the default java hashcode function, I was getting about 3.5% collisions. But if I SHA-256 the string value and then generate a hashcode from the digested string, the collision ratio is less than 0.0001%. Thanks!
  • Stephen Ostermiller
    Stephen Ostermiller about 11 years
    If you need your has to be consistent across JVM versions and implementations, you should not rely on .hashCode(). Rather, use some known algorithm.
  • devsda
    devsda almost 11 years
    @jonathanasdf How can you say that it always gives you a unique hash key. Is there any mathematical proof ? I think we have to take mod of hash with another bigger prime number, otherwise overflow problem occurs.
  • Pharap
    Pharap over 10 years
    @devsda He didn't say always unique, he said more likely to be unique. As for why, a quick search on google reveals this article: computinglife.wordpress.com/2008/11/20/… explaining why 31 was used for Java string hashing. There is no mathematical proof given, but it does explain the general concept as to why primes work better.
  • CornSmith
    CornSmith over 10 years
    I think it's just a prime number to start at, so that we have fewer collisions.
  • Tim Sylvester
    Tim Sylvester over 10 years
    @benjismith One in a million is far too large... is "less than 0.0001%" an oblique way of saying "exactly 0"? I really doubt that you saw a SHA-256 collision because that has never been observed, anywhere, ever; not even for 160-bit SHA-1. If you have two strings that produce the same SHA-256, the security community would love to see them; you'll be world-famous... in a very obscure way. See Comparison of SHA Functions
  • Daniel Liu
    Daniel Liu over 10 years
    @TimSylvester, you misunderstood. I didn't find SHA-256 collisions. I computed the SHA-256 and then fed the resultant byte sequences into a typical Java "hashCode" function, because I needed a 32-bit hash. That's where I found the collisions. Nothing remarkable :)
  • supercat
    supercat over 10 years
    If one is using a double-hashed collection, it may be worthwhile to have the first hash be really quick and dirty. If one has a thousand long strings, half of which a are mapped by a crummy function to one particular value, and half of which are mapped to distinct values, performance in a single-hashed table would be bad, but performance in a double-hashed table, where the second hash examined the entire string, could be almost twice that of a singly-hashed table (since half the strings wouldn't have to be fully hashed). None of the standard Java collections do double hashing, though.
  • szgal
    szgal over 9 years
    You could just pass the byte array to messageDigest.update().
  • yshavit
    yshavit almost 9 years
    The algorithm for String::hashCode is specified in the JDK, so it's as portable as the very existence of the class java.lang.String.
  • whitehat
    whitehat over 8 years
    Thanks a lot for clarifying the idea of doing better hashing. Just to double check - The hashCode() return value will be used by Java to map to some table index before storing the object. So, if the hashCode() returns m, it does something like (m mod k) to get an index of the table of size k. Is that right?
  • Nav
    Nav over 7 years
    Isn't there a difference between 'hashing' and 'encrypting'? I understand MessageDigest is a one way hashing function, right? Also, when I used the function, I got the hashed string as a lot of junk UTF characters when I opened the file in LibreOffice. Is it possible to get the hashed string as a random bunch of alphanumeric characters instead of junk UTF characters?
  • Shawn
    Shawn about 7 years
    String encryptedString and stringToEncrypt.getBytes() refer to encryption, when this really is a hashing algorithm.
  • Dimitri Kopriwa
    Dimitri Kopriwa over 6 years
    My generated values look like this: "��'x�-E#��G����]%�c�3��u�" , is this normal thing and can I rely on this?
  • Jonathan Peterson
    Jonathan Peterson over 6 years
    A good hash function distributes the values equally across the buckets.
  • Mr Redstoner
    Mr Redstoner over 6 years
    if you need reasonable characters, try for instance base64 (I use apache commons)
  • clankill3r
    clankill3r over 4 years
    This hash function is so much better then the one that comes with java.
  • Bipin Maharjan
    Bipin Maharjan about 3 years
    you can convert the byte array into hex string to get an understandable string. here is how you do it. new BigInteger(1,byteArr).toString(16)
  • Taziamoma Abraham
    Taziamoma Abraham about 3 years
    This was amazing, you have no idea how much it helped. I understand that using 31 gives the best unique results, but is the 7 also the best possible? Or did you just pick a random prime number?
  • Zachary F
    Zachary F over 2 years
    I decreased my collisions by taking the final result mod the length of the string. (Im working in python so I had to change it up a little bit)