At which length is a String key of a HashMap considered bad-practice?

15,210

Solution 1

Not really, 150 chars String is relatively trivial to calculate an hashCode for.

That being said, in circumstances like this I would advise you to test it!

Create a routine that populates an HashMap with, say, insert a size here that is representative of your use scenario random values with 5 character strings as keys. Measure how long it takes. Then do the same for 15 character keys, and see how it scales.

Also, Strings in Java are immutable, which means that hashCode can be cached for each String that is stored in the String Constant Pool, and doesn't need to be recalculated when you call hashCode on the same String object.

This means that although you're calculating larger hash codes when creating your map, on access many of those will already be pre-calculated and cached, making the size of the original String even less relevant.

Solution 2

Is there an unwritten law to the length of the HashMap key?

If there is, it is also unspoken. I would measure your use case in a profiler and only worry about the things you can measure as a problem, not the things you can imagine might be a problem.

Is it considered bad practice to have String keys of let's say 150 characters?

I doubt it.

Does it affect performance? At which length?

Everything affects performance, usually to small to matter or sometimes even measure. The question should be; do you need 150 character keys. If you do, then use them.


There is an exotic case where adding strings with hashCode() of zero is a bad idea. This is because in Java 1.0 to 6 doesn't optimise the use case of a hashCode of zero and it can be predicted for denial of service attacks. Java 7 fixes this by having a secondary, less predictable hashcode.

Why doesn't String's hashCode() cache 0?

Solution 3

Long answer: A quick look at the source code of String::hashCode() reveals that the hash is cached after the first call. Meanwhile, String::equals() is O(n) if the strings are equal yet not identical (ie, equals() is true but == is false because they're allocated at different addresses).

So the impacts on performance you will see are with:

  • Passing never-before-hashed strings in calls to HashMap functions. However, generating lots of new strings will impact performance in itself.

  • Calls to HashMap::get() and HashMap::put()using a string key that is equal to a key already in the HashMap (because if the key isn't in the collection, then most likely only hashCode() will be called. But if it is, equals() will compare all characters until it determines the strings are equal). But only if the strings passed to these functions are not the same objects that are already in the HashMap, because in that case equals() is very fast.

  • Moreover, string literals, string constants, and manually intern()'d strings join the String Constant Pool, in which all "equal" strings are the same object with the same address. So if working exclusively with such strings, hashCode and equals are very fast.

Of course, the performance impact won't be at all noticeable unless you're doing the aforementioned operations in a tight loop (because 150 chars isn't long and hashCode() and equals() are both efficient).

Short answer: Benchmark.

Solution 4

First, there is no "unwritten rule". If long strings as keys make sense from an algorithmic perspective, use them. If profiling indicates that there is a problem, then you optimize.

So how can long strings affect hash table performance?

  • Long strings take more memory than short ones, and that could result in measurably longer garbage collection times, and other secondary performance effects related to hardware memory caches, TLBs and (potentially) physical memory page contention.

  • The hashcode algorithm for String uses all characters of the string and therefore its cost is proportional to the string length. This is mitigated by the fact that String hashcodes are cached. (The 2nd and subsequent time you call hashcode on a String, you get the cached value.) However, that only helps (here) if you do multiple hash table operations with the same String object as a key.

  • When you get a hash collision, the hash table falls back to using String.equals() to compare keys while searching the selected hash chain. In the worst case (e.g. when the strings are equal but not ==), String.equals() involves comparing all characters of the 2 strings.

As you can see, these effects are going to be specific to the actual application, and hence they are hard to predict. Hence an "unwritten rule" is unlikely to be helpful.

Share:
15,210
Caroline
Author by

Caroline

Graduated Master of Science: Computerscience: Software Engineering. Currently Visiting Researcher at Carnegie Mellon University.

Updated on June 05, 2022

Comments

  • Caroline
    Caroline about 2 years

    I try to pay attention to good performance and clean code all the time.

    I'm having difficulties trying to grasp whether it's sane to have a HashMap with keys of 150 characters.

    • Is there an unwritten law to the length of the HashMap key?
    • Is it considered bad practice to have String keys of let's say 150 characters?
    • Does it affect performance? At which length?
    • Marko Topolnik
      Marko Topolnik about 11 years
      Each successful get operation is inevitably linear in the key length. I think that is the only concern that needs taking into account. That still doesn't mean that 15-char keys will be 10 times faster to look up.
  • radai
    radai about 11 years
    nit-picking here, but java uses substantially more than 1 byte per character.
  • Vishy
    Vishy about 11 years
    @radai True, even with -XX:+UseCompressedStrings which can use one byte per character.
  • Vishy
    Vishy about 11 years
    I would pick a size which is representative of the sort of size you are likely to have.
  • Marko Topolnik
    Marko Topolnik about 11 years
    Another nitpick: Java 7's HashMap fixes it, not String. String#hashCode is basically unchanged.
  • Vishy
    Vishy about 11 years
    @MarkoTopolnik Java 7's String fixes it by supporting hash32() java-performance.info/changes-to-string-java-1-7-0_06 Changing String is key to how this is cached.
  • Marko Topolnik
    Marko Topolnik about 11 years
    True... however int32 is just a private helper method and the whole mechanism is not a part of the public API, or even publicly accessible.
  • Aleksandr Dubinsky
    Aleksandr Dubinsky about 11 years
    Not constructive. Standard tautological "benchmark it yourself" reply.
  • Vishy
    Vishy about 11 years
    @MarkoTopolnik In that case, the same applied to HashMap.
  • Marko Topolnik
    Marko Topolnik about 11 years
    My point: the whole mechanism is inaccessible outside of HashMap and other Java Collection Framework classes. Whatever logic you have of your own that relies on Object#hashCode, you can't profit from the enhancement. This is something to be aware of.
  • hongliang
    hongliang over 10 years
    I did a quick test on my machine. It took about 24 ms to compute the hash code of 1000 strings, each with the length of 10000 characters.
  • maaartinus
    maaartinus almost 10 years
    1. String.hashCode gets cached upon the first computation, unrelated to the pool. The computation itself takes 4 cycles per char, which is 4 times more than the optimum. 2. Because of this caching, the equals method will probably dominate the time.
  • maaartinus
    maaartinus almost 10 years
    And Java 8 unfixes it by removing hash32. Which is OK due to using tree bins in case of many conflicts.