Fastest Hash algorithm in Java for Strings


This amazing answer on Programmers Stack Exchange tells you all you need to know.

The short version is: use FNV-1a, a.k.a. the Fowler–Noll–Vo hash function. It has excellent performance, high randomness, and a low collision rate.
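
For reference, a minimal sketch of 64-bit FNV-1a over a String's UTF-8 bytes is shown below; the constants are the standard 64-bit FNV offset basis and prime, and the class and method names are just placeholders:

    import java.nio.charset.StandardCharsets;

    // Minimal 64-bit FNV-1a sketch: XOR in each byte, then multiply by the FNV prime.
    public final class Fnv1a64 {
        private static final long OFFSET_BASIS = 0xcbf29ce484222325L; // standard 64-bit offset basis
        private static final long PRIME = 0x100000001b3L;             // standard 64-bit FNV prime

        public static long hash(String s) {
            long hash = OFFSET_BASIS;
            for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
                hash ^= (b & 0xff); // fold in the next byte
                hash *= PRIME;      // mix
            }
            return hash;
        }
    }

The whole algorithm is one XOR and one multiply per byte, which is why it tends to be fast on short strings.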

Any further explanation I might offer on this question would just be a copy and paste from that Programmers.SE answer, which, incidentally, is the second-highest-voted answer on the entire site.

Some other thoughts:

  • Ultimately, you have a pretty niche use case. Most people aren't dealing with 1-billion-entry datasets regularly. As such, you may have to do your own benchmarking (a rough timing harness is sketched after this list).
  • That said, the high randomness suggests that the algorithm is likely to scale well when hashing English text.
  • You haven't really talked about other issues; are you able to keep the entire data set in memory? What are your footprint requirements?
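
If you do end up benchmarking, a very rough harness like the sketch below (a plain System.nanoTime loop; for trustworthy numbers use a proper JMH benchmark) is enough for a first comparison. The sample data and class name are placeholders; swap the call inside the loop for whichever hash you want to compare (e.g. the Fnv1a64 sketch above):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class HashTiming {
        public static void main(String[] args) {
            // Build placeholder ~200-character strings, roughly the size mentioned in the question.
            Random rnd = new Random(42);
            List<String> samples = new ArrayList<>();
            for (int i = 0; i < 100_000; i++) {
                StringBuilder sb = new StringBuilder(200);
                for (int j = 0; j < 200; j++) {
                    sb.append((char) ('a' + rnd.nextInt(26)));
                }
                samples.add(sb.toString());
            }

            long sink = 0; // accumulate results so the JIT cannot discard the work
            long start = System.nanoTime();
            for (String s : samples) {
                sink ^= s.hashCode(); // candidate hash goes here
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("hashed " + samples.size() + " strings in "
                    + elapsedMs + " ms (sink=" + sink + ")");
        }
    }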

See also: Fastest Hash Algorithm for Text Data


Comments

  • Henri Lapierre
    Henri Lapierre about 2 years

    To make it simple, my question is: how to hash a String (about 200 characters) as quickly as possible. Security is not important, but collisions ARE a big deal.

    Note: After a quick investigation, it seems like MurmurHash3 might be the best choice. I am open to any comment saying otherwise, though.

    First, I know that there are plenty of other similar questions, but I haven't found a convincing answer yet.

    I have a list of objects, each containing a list of about 3k paragraphs, which is saved to a database. Every X hours, those paragraphs are regenerated and I need to find out whether any paragraph has changed and, if so, push only those new paragraphs.

    The quickest way I have found to detect the differences (knowing that most of the time the content will be identical) is to build a Merkle tree, save it to the DB, and iterate over the Merkle tree to find the differences, instead of comparing the paragraphs themselves.

    This implies, in my case, that I will be creating tens of thousands of hashes per second to compare with what is in the DB. Therefore, I need a very efficient way to create those hashes. I don't care about security; I only need to ensure that the number of collisions remains very, very low.

    What would be the best algorithm available in Java for that?


    In my case, the main object is composed of Sections, which are composed of Languages, which are composed of Paragraphs. The comparison strategy is as follows (a rough sketch of this rollup appears after the comments below):

    1) If the object hash is identical, stop; otherwise, go to 2)

    2) Loop over all Sections, keeping only the Sections with a different hash

    3) Loop over all Languages of those Sections, keeping only the Languages with a different hash

    4) Loop over all the Paragraphs of those Languages; if the hash is different, push the new content.

  • Henri Lapierre
    Henri Lapierre almost 9 years
    Sounds cool, but I am a bit disappointed to see collisions on a dataset of only 250k. To be clear, collisions are a big deal for me, and I have over 1 billion entries. With an algorithm that has over 2^128 possible values, wouldn't you expect no collisions at all on such a small dataset? (A quick birthday-bound estimate appears after these comments.)
  • Olivier Grégoire
    Olivier Grégoire almost 9 years
    If you think about the reason for the collisions, they're rather normal. The collisions happen on one-word data, so the input is actually very compact and collisions are to be expected. The bigger the data, the fewer the collisions. You say you have entire paragraphs: test the algorithms on the first 250k paragraphs you have and check the collisions in your actual context rather than in that author's specific context.
  • Henri Lapierre
    Henri Lapierre almost 9 years
    I am willing to buy that. That being said, do you have an explanation of why a shorter String would be more likely to collide, or is that just a theory?
  • Olivier Grégoire
    Olivier Grégoire almost 9 years
    For properly done hashes, the value being hashed should carry more entropy than the hash size. You say you want to hash text. Text is usually defined as 26 characters (lowercase) + 26 characters (uppercase) + punctuation and space (roughly 10 more characters). That's roughly 6 bits of entropy per character. If your hash has a space of 64 bits, then for your hashes to be relevant you want at least 11 characters (ceil(64/6)). The guy in the link did his test on a dictionary. I'm sure 90% of words have fewer than 11 chars. So his test was good at testing speed. But testing real hash distribution? Nope. Way too little entropy.
  • Olivier Grégoire
    Olivier Grégoire almost 9 years
    I won't write any answer: it wouldn't answer the question itself, but thanks for the appreciation ;)
  • durron597
    durron597 almost 9 years
    @OlivierGrégoire It's much less than 6 bits of entropy, by the way: what-if.xkcd.com/34/…
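
As a back-of-the-envelope check on the collision discussion above: assuming a uniformly distributed b-bit hash over n inputs, the expected number of colliding pairs is roughly n(n-1)/2^(b+1). A tiny sketch of that arithmetic:

    // Birthday-bound estimate: expected colliding pairs ~= n*(n-1) / 2^(b+1)
    // for n uniformly hashed inputs and a b-bit hash.
    public class BirthdayBound {
        public static void main(String[] args) {
            double n = 1e9; // ~1 billion entries, as in the comments above
            System.out.println(n * (n - 1) / Math.pow(2, 65));  // 64-bit hash: ~0.027 expected collisions
            System.out.println(n * (n - 1) / Math.pow(2, 129)); // 128-bit hash: ~1.5e-21, effectively zero
        }
    }

So with a well-distributed 128-bit hash you would not expect any collisions on a billion entries; the collisions discussed above are only a concern for narrower (e.g. 32-bit) hashes or for hashes that do not distribute short inputs well.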
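
Picking up the MurmurHash3 note and the Section/Language/Paragraph rollup described in the question comment above, below is a rough sketch using Guava's Hashing.murmur3_128() for the paragraph (leaf) hashes and Hashing.combineOrdered(...) to roll child hashes up into parent hashes, Merkle-tree style. The nested lists and all names here are placeholders for the real Section/Language/Paragraph classes:

    import com.google.common.hash.HashCode;
    import com.google.common.hash.Hashing;

    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import java.util.stream.Collectors;

    public class ParagraphTreeHashes {

        // Leaf hash: one paragraph of text.
        static HashCode paragraphHash(String paragraph) {
            return Hashing.murmur3_128().hashString(paragraph, StandardCharsets.UTF_8);
        }

        // Language hash: rolled up from its (non-empty) list of paragraphs.
        static HashCode languageHash(List<String> paragraphs) {
            return Hashing.combineOrdered(
                    paragraphs.stream().map(ParagraphTreeHashes::paragraphHash).collect(Collectors.toList()));
        }

        // Section hash: rolled up from its languages.
        static HashCode sectionHash(List<List<String>> languages) {
            return Hashing.combineOrdered(
                    languages.stream().map(ParagraphTreeHashes::languageHash).collect(Collectors.toList()));
        }

        // Object hash: rolled up from its sections. Comparing this against the stored
        // value is step 1) of the strategy; if it differs, descend a level and repeat
        // with the section hashes, then the language hashes, then the paragraphs.
        static HashCode objectHash(List<List<List<String>>> sections) {
            return Hashing.combineOrdered(
                    sections.stream().map(ParagraphTreeHashes::sectionHash).collect(Collectors.toList()));
        }
    }

Note that combineOrdered is order-sensitive and requires a non-empty list of equal-length hash codes, which fits a Merkle-style rollup where reordering children should change the parent hash.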