Java HashMap performance optimization / alternative


Solution 1

As many people pointed out, the hashCode() method was to blame. It was only generating around 20,000 codes for 26 million distinct objects. That is an average of 1,300 objects per hash bucket, which is very, very bad. However, if I turn the two arrays into a number in base 52, I am guaranteed to get a unique hash code for every object:

public int hashCode() {       
    // assume that both a and b are sorted       
    return a[0] + powerOf52(a[1], 1) + powerOf52(b[0], 2) + powerOf52(b[1], 3) + powerOf52(b[2], 4);
}

public static int powerOf52(byte b, int power) {
    int result = b;
    for (int i = 0; i < power; i++) {
        result *= 52;
    }
    return result;
}

The arrays are sorted to ensure this method fulfills the hashCode() contract that equal objects have the same hash code. Using the old method, the average number of puts per second, measured over blocks of 100,000 puts from 100,000 up to 2,000,000 insertions, was:

168350.17
109409.195
81344.91
64319.023
53780.79
45931.258
39680.29
34972.676
31354.514
28343.062
25562.371
23850.695
22299.22
20998.006
19797.799
18702.951
17702.434
16832.182
16084.52
15353.083

Using the new method gives:

337837.84
337268.12
337078.66
336983.97
313873.2
317460.3
317748.5
320000.0
309704.06
310752.03
312944.5
265780.75
275540.5
264350.44
273522.97
270910.94
279008.7
276285.5
283455.16
289603.25

Much, much better. The old method tailed off very quickly, while the new one keeps up a good throughput.

Solution 2

One thing I notice in your hashCode() method is that the order of the elements in the arrays a[] and b[] doesn't matter. Thus (a[]={1,2,3}, b[]={99,100}) will hash to the same value as (a[]={3,1,2}, b[]={100,99}). In fact, all keys k1 and k2 where sum(k1.a)==sum(k2.a) and sum(k1.b)==sum(k2.b) will result in collisions. I suggest assigning a weight to each position of the array:

hash = hash * 5381 + (c0*a[0] + c1*a[1]);
hash = hash * 5381 + (c0*b[0] + c1*b[1] + c2*b[2]);

where c0, c1, and c2 are distinct constants (you can use different constants for b if necessary). That should even out things a bit more.
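
A minimal sketch of what that could look like in full, assuming the same byte[] a (length 2) and byte[] b (length 3) fields as in the question. The constants are arbitrary small primes chosen for illustration, and the arrays would still need to be sorted beforehand (for example in the constructor) so that equal objects produce the same hash:

public int hashCode() {
    // assumes a and b were sorted when the object was created,
    // so the weighted sum is identical for equal objects
    int hash = 503;
    hash = hash * 5381 + (7 * a[0] + 11 * a[1]);
    hash = hash * 5381 + (13 * b[0] + 17 * b[1] + 19 * b[2]);
    return hash;
}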

Solution 3

To elaborate on Pascal: Do you understand how a HashMap works? You have some number of slots in your hash table. The hash value for each key is found, and then mapped to an entry in the table. If two hash values map to the same entry -- a "hash collision" -- HashMap builds a linked list.

Hash collisions can kill the performance of a hash map. In the extreme case, if all your keys have the same hash code, or if they have different hash codes but they all map to the same slot, then your hash map turns into a linked list.

So if you're seeing performance problems, the first thing I'd check is: am I getting a random-looking distribution of hash codes? If not, you need a better hash function. Well, "better" in this case may mean "better for my particular set of data". Suppose, for example, you were working with strings, and you took the length of the string for the hash value. (Not how Java's String.hashCode works, but I'm just making up a simple example.) If your strings have widely varying lengths, from 1 to 10,000, and are fairly evenly distributed across that range, then this could be a very good hash function. But if your strings are all 1 or 2 characters, it would be a very bad hash function.
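
As a rough check, you could count how many distinct hash codes your keys actually produce. Here is a hypothetical helper (not from the original post) that does just that:

import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: pass in your keys and compare the result to the number
// of keys. A much smaller number means the hash function is the problem.
final class HashDistributionCheck {
    static int countDistinctHashCodes(Iterable<?> keys) {
        Set<Integer> hashes = new HashSet<>();
        for (Object key : keys) {
            hashes.add(key.hashCode());
        }
        return hashes.size();
    }
}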

Edit: I should add: every time you add a new entry, HashMap checks if this is a duplicate. When there's a hash collision, it has to compare the incoming key against every key that mapped to that slot. So in the worst case where everything hashes to a single slot, the second key is compared to the first key, the third key is compared to #1 and #2, the fourth key is compared to #1, #2, and #3, etc. By the time you get to key #1 million, you've done roughly half a trillion compares.

@Oscar: Umm, I don't see how that's a "not really". It's more like a "let me clarify". But yes, it's true that if you make a new entry with the same key as an existing entry, this overwrites the first entry. That's what I meant when I talked about looking for duplicates in the last paragraph: whenever a key hashes to the same slot, HashMap must check whether it's a duplicate of an existing key, or whether the two are just in the same slot by coincidence of the hash function. I don't know that that's the "whole point" of a HashMap: I would say that the "whole point" is that you can retrieve elements by key quickly.

But anyway, that doesn't affect the "whole point" that I was trying to make: when you have two keys -- yes, different keys, not the same key showing up again -- that map to the same slot in the table, HashMap builds a linked list. Then, because it has to check whether each new key is a duplicate of an existing key, every attempt to add an entry that maps to that same slot must chase the linked list, examining each existing entry to decide whether it is a duplicate of a previously-seen key or a genuinely new key.

Update long after the original post

I just got an up-vote on this answer 6 years after posting, which led me to re-read the question.

The hash function given in the question is not a good hash for 26 million entries.

It adds together a[0]+a[1] and b[0]+b[1]+b[2]. He says values of each byte range from 0 to 51, so that gives only (51*2+1)*(51*3+1)=15,862 possible hash values. With 26 million entries, this means an average of about 1639 entries per hash value. That is lots and lots of collisions, requiring lots and lots of sequential searches through linked lists.
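
A quick throwaway check (not from the original answer; the class name HashRangeCheck is made up) that enumerates every possible pair of sums and counts the distinct values the original formula can produce:

import java.util.HashSet;
import java.util.Set;

public class HashRangeCheck {
    public static void main(String[] args) {
        // The sum of two values in 0..51 spans 0..102; the sum of three spans 0..153.
        // Enumerate every (sumA, sumB) pair and count the distinct hash results.
        Set<Integer> hashes = new HashSet<>();
        for (int sumA = 0; sumA <= 102; sumA++) {
            for (int sumB = 0; sumB <= 153; sumB++) {
                int hash = 503;
                hash = hash * 5381 + sumA;
                hash = hash * 5381 + sumB;
                hashes.add(hash);
            }
        }
        System.out.println(hashes.size());   // prints 15862
    }
}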

The OP says that different orders within array a and array b should be considered equal, i.e. [[1,2],[3,4,5]].equals([[2,1],[5,3,4]]), and so to fulfill the contract they must have equal hash codes. Okay. Still, even with that constraint there are far more than 15,000 possible hash values available. His second proposed hash function is much better, giving a broader range.

Though as someone else commented, it seems inappropriate for a hash function to modify the object's own data. It would make more sense to "normalize" the object when it is created, or to have the hash function work from copies of the arrays. Also, using a loop to recompute the powers of 52 every time through the function is inefficient. As there are only four values here, I would have either written

return a[0] + a[1]*52 + b[0]*(52*52) + b[1]*(52*52*52) + b[2]*(52*52*52*52);

which would cause the compiler to perform the calculation once at compile time; or have 4 static constants defined in the class.
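
A sketch of the static-constant variant (the names P1 through P4 are invented for illustration):

// Powers of 52 computed once, when the class is loaded.
private static final int P1 = 52;
private static final int P2 = 52 * 52;
private static final int P3 = 52 * 52 * 52;
private static final int P4 = 52 * 52 * 52 * 52;

public int hashCode() {
    // assumes a and b are already sorted, as in the first solution
    return a[0] + a[1] * P1 + b[0] * P2 + b[1] * P3 + b[2] * P4;
}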

Also, the first draft at a hash function has several calculations that do nothing to add to the range of outputs. Note he first sets hash = 503, then multiplies by 5381 before even considering values from the class. So ... in effect this just adds a fixed constant to every hash value. What does this accomplish? Adding a constant to every hash value just burns CPU cycles without accomplishing anything useful. Lesson here: adding complexity to a hash function is not the goal. The goal is to get a broad range of different values, not just to add complexity for the sake of complexity.

Solution 4

I'd suggest a three-pronged approach:

  1. Run Java with more memory: java -Xmx256M, for example, runs with 256 megabytes. Use more if needed and you have lots of RAM.

  2. Cache your calculated hash values as suggested by another poster, so each object only calculates its hash value once (a sketch appears at the end of this answer).

  3. Use a better hashing algorithm. The one you posted would return the same hash where a = {0, 1} as it would where a = {1, 0}, all else being equal.

Utilise what Java gives you for free.

public int hashCode() {
    return 31 * Arrays.hashCode(a) + Arrays.hashCode(b);
}

I'm pretty sure this has much less chance of clashing than your existing hashCode method, although it depends on the exact nature of your data.
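
To illustrate point 2, here is a hypothetical sketch (the class name Key and its fields are invented, not taken from the question) that sorts defensive copies of the arrays once in the constructor and caches the hash, combining the caching suggestion with the sort-in-the-constructor idea from the comments:

import java.util.Arrays;

public final class Key {
    private final byte[] a;
    private final byte[] b;
    private final int hash;

    public Key(byte[] a, byte[] b) {
        // defensive copies, sorted once so that element order never matters again
        this.a = a.clone();
        this.b = b.clone();
        Arrays.sort(this.a);
        Arrays.sort(this.b);
        // hash computed exactly once, when the object is created
        this.hash = 31 * Arrays.hashCode(this.a) + Arrays.hashCode(this.b);
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Key)) return false;
        Key other = (Key) o;
        return Arrays.equals(a, other.a) && Arrays.equals(b, other.b);
    }
}

With something like this, put() and get() never pay for re-sorting or re-hashing, and the order-insensitive equality the question describes is preserved.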

Solution 5

Getting into the gray area of "on/off topic", but necessary to eliminate confusion regarding Oscar Reyes' suggestion that more hash collisions are a good thing because they reduce the number of elements in the HashMap. I may misunderstand what Oscar is saying, but I don't seem to be the only one: kdgregory, delfuego, Nash0, and I all seem to share the same (mis)understanding.

If I understand what Oscar is saying about the same class with the same hashcode, he's proposing that only one instance of a class with a given hashcode will be inserted into the HashMap. Eg, if I have an instance of SomeClass with a hashcode of 1 and a second instance of SomeClass with a hashcode of 1, only one instance of SomeClass is inserted.

The Java pastebin example at http://pastebin.com/f20af40b9 seems to indicate the above correctly summarizes what Oscar is proposing.

Regardless of any understanding or misunderstanding, what actually happens is that different instances of the same class with the same hashcode are not collapsed into a single entry in the HashMap; that only happens if the keys are also equal. The hashcode contract requires that equal objects have the same hashcode; however, it doesn't require that unequal objects have different hashcodes (although this may be desirable for other reasons)[1].

The pastebin.com/f20af40b9 example (which Oscar refers to at least twice) follows, modified slightly to use JUnit assertions rather than println statements. This example is used to support the proposal that identical hashcodes cause collisions and that, when the classes are the same, only one entry is created (eg, only one String in this specific case):

@Test
public void shouldOverwriteWhenEqualAndHashcodeSame() {
    String s = new String("ese");
    String ese = new String("ese");
    // same hash right?
    assertEquals(s.hashCode(), ese.hashCode());
    // same class
    assertEquals(s.getClass(), ese.getClass());
    // AND equal
    assertTrue(s.equals(ese));

    Map map = new HashMap();
    map.put(s, 1);
    map.put(ese, 2);
    SomeClass some = new SomeClass();
    // still  same hash right?
    assertEquals(s.hashCode(), ese.hashCode());
    assertEquals(s.hashCode(), some.hashCode());

    map.put(some, 3);
    // what would we get?
    assertEquals(2, map.size());

    assertEquals(2, map.get("ese"));
    assertEquals(3, map.get(some));

    assertTrue(s.equals(ese) && s.equals("ese"));
}

class SomeClass {
    public int hashCode() {
        return 100727;
    }
}

However, the hashcode isn't the complete story. What the pastebin example neglects is the fact that both s and ese are equal: they are both the string "ese". Thus, inserting or getting the contents of the map using s or ese or "ese" as the key are all equivalent because s.equals(ese) && s.equals("ese").

A second test demonstrates that it is erroneous to conclude that identical hashcodes on the same class are the reason the key -> value s -> 1 is overwritten by ese -> 2 when map.put(ese, 2) is called in test one. In test two, s and ese still have the same hashcode (as verified by assertEquals(s.hashCode(), ese.hashCode());) AND they are the same class. However, s and ese are MyString instances in this test, not Java String instances, and the only difference relevant for this test is equality: String s equals String ese in test one above, whereas MyString s does not equal MyString ese in test two:

@Test
public void shouldInsertWhenNotEqualAndHashcodeSame() {
    MyString s = new MyString("ese");
    MyString ese = new MyString("ese");
    // same hash right?
    assertEquals(s.hashCode(), ese.hashCode());
    // same class
    assertEquals(s.getClass(), ese.getClass());
    // BUT not equal
    assertFalse(s.equals(ese));

    Map map = new HashMap();
    map.put(s, 1);
    map.put(ese, 2);
    SomeClass some = new SomeClass();
    // still  same hash right?
    assertEquals(s.hashCode(), ese.hashCode());
    assertEquals(s.hashCode(), some.hashCode());

    map.put(some, 3);
    // what would we get?
    assertEquals(3, map.size());

    assertEquals(1, map.get(s));
    assertEquals(2, map.get(ese));
    assertEquals(3, map.get(some));
}

/**
 * NOTE: equals is not overridden so the default implementation is used
 * which means objects are only equal if they're the same instance, whereas
 * the actual Java String class compares the value of its contents.
 */
class MyString {
    String i;

    MyString(String i) {
        this.i = i;
    }

    @Override
    public int hashCode() {
        return 100727;
    }
}

Based on a later comment, Oscar seems to reverse what he's said earlier and acknowledges the importance of equals. However, it still seems the notion that equals is what matters, not the "same class", is unclear (emphasis mine):

"Not really. The list is created only if the hash is the same, but the key is different. For instance if a String give hashcode 2345 and and Integer gives the same hashcode 2345, then the integer is inserted into the list because String.equals( Integer ) is false. But if you have the same class ( or at least .equals returns true ) then the same entry is used. For instance new String("one") and `new String("one") used as keys, will use the same entry. Actually this is the WHOLE point of HashMap in first place! See for yourself: pastebin.com/f20af40b9 – Oscar Reyes"

versus earlier comments that explicitly address the importance of identical class and same hashcode, with no mention of equals:

"@delfuego: See for yourself: pastebin.com/f20af40b9 So, in this question the same class is being used ( wait a minute, the same class is being used right? ) Which implies that when the same hash is used the same entry is used and there is not "list" of entries. – Oscar Reyes"

or

"Actually this would increase the performance. The more collisions eq less entries in the hashtable eq. less work to do. Is not the hash ( which looks fine ) nor the hashtable ( which works great ) I'd bet it is on the object creation where the performance is degrading. – Oscar Reyes"

or

"@kdgregory: Yes, but only if the collision happens with different classes, for the same class ( which is the case ) the same entry is used. – Oscar Reyes"

Again, I may misunderstand what Oscar was actually trying to say. However, his original comments have caused enough confusion that it seems prudent to clear everything up with some explicit tests so there are no lingering doubts.


[1] - From Effective Java, Second Edition by Joshua Bloch:

  • Whenever it is invoked on the same object more than once during an execution of an application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.

  • If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

  • It is not required that if two objects are unequal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.


Comments

  • pepe450
    pepe450 almost 2 years

    I want to create a large HashMap but the put() performance is not good enough. Any ideas?

    Other data structure suggestions are welcome but I need the lookup feature of a Java Map:

    map.get(key)

    In my case I want to create a map with 26 million entries. Using the standard Java HashMap the put rate becomes unbearably slow after 2-3 million insertions.

    Also, does anyone know if using different hash code distributions for the keys could help?

    My hashcode method:

    byte[] a = new byte[2];
    byte[] b = new byte[3];
    ...
    
    public int hashCode() {
        int hash = 503;
        hash = hash * 5381 + (a[0] + a[1]);
        hash = hash * 5381 + (b[0] + b[1] + b[2]);
        return hash;
    }
    

    I am using the commutative property of addition to ensure that equal objects have the same hashcode. The arrays are bytes with values in the range 0 - 51. Values are only used once in either array. The objects are equal if the a arrays contain the same values (in either order) and the same goes for the b arrays. So a = {0,1} b = {45,12,33} and a = {1,0} b = {33,45,12} are equal.

    EDIT, some notes:

    • A few people have criticized using a hash map or other data structure to store 26 million entries. I cannot see why this would seem strange. It looks like a classic data structures and algorithms problem to me. I have 26 million items and I want to be able to quickly insert them into and look them up from a data structure: give me the data structure and algorithms.

    • Setting the initial capacity of the default Java HashMap to 26 million decreases the performance.

    • Some people have suggested using databases, in some other situations that is definitely the smart option. But I am really asking a data structures and algorithms question, a full database would be overkill and much slower than a good datastructure solution (after all the database is just software but would have communication and possibly disk overhead).

    • Pascal Cuoq
      Pascal Cuoq over 14 years
      If HashMap becomes slow, in all likelihood your hash function is not good enough.
    • GregB
      GregB over 14 years
      Are you specifying the proper capacity and load factor in the map constructor? Did you implement the hashCode method of the objects you are inserting?
    • skaffman
      skaffman over 14 years
      doctor, it hurts when I do this
    • Henning
      Henning over 14 years
      Would you mind posting your hashCode() function?
    • ReneS
      ReneS over 14 years
      If setting the initial capacity decreases performance, I bet on memory problems, not hashmap problems. The large initial array occupies more memory from the start and therefore leaves less room to create and insert objects early on. Can you share some server details and memory settings?
    • ReneS
      ReneS over 14 years
      What are your keys? Strings? Integers?
    • Peter Recore
      Peter Recore over 14 years
      how well does the get() performance scale? if you have a bad hash function, resulting in clusters, I would expect the get performance to also be poor.
    • pepe450
      pepe450 over 14 years
      I added the hashcode method, after reading the answers and comments I have a feeling it is to blame. I was hoping a naive method like the one above would work but I guess not. I will try some variations and report back later.
    • delfuego
      delfuego over 14 years
      Not knowing what's contained in the a and b arrays, it's hard to know whether your hashCode() method gives you sufficiently random coverage...
    • ReneS
      ReneS over 14 years
      If you have large arrays which start with the same values, they will end up in the same hash bucket. Check out java.lang.String.hashCode()
    • Juha Syrjälä
      Juha Syrjälä over 14 years
      What are the contents of the a array? How long is it? Should you use more than the first 3 elements?
    • OscarRyz
      OscarRyz over 14 years
      You could also try to allocate more initial memory. The JVM will try to garbage collect dead objects and will request more and more memory each time if your initial and maximum are too different. See my answer, I have inserted 26M ( of integers I know ) in 16s I know it is not the same, but may be a good start for troubleshooting.
    • Juha Syrjälä
      Juha Syrjälä over 14 years
      What kind of equals() method you have for key objects?
    • notnoop
      notnoop over 14 years
      If your object is immutable, have you tried caching hashCode?
    • OscarRyz
      OscarRyz over 14 years
      @Nash0. It is NOT the HashMap where you're having problems, it is somewhere else. I have implemented a small class with your hashCode and there is a big number of collisions ( which may be good, because you're finding repeated objects ). The greater the number of collisions, the fewer entries the map uses, which increases its performance. See my edit below. You have to profile and find out where the problem really is ( perhaps in the item creation )
    • kdgregory
      kdgregory over 14 years
      @Nash0 - I commented on MAK's answer, but will repeat here: you apparently have the requirement that any permutation of array values will give the same hashcode (eg, {1,2} is the same as {2,1}). This dramatically reduces the space of your hashcodes, and as a result a hashed structure is not appropriate for your needs.
    • oxbow_lakes
      oxbow_lakes over 14 years
      This is a really good question; a nice demonstration of why hashing algorithms matter and what effects they may have on performance
    • Vishy
      Vishy over 14 years
      The sum of the a's has a range of 0 to 102 and sum of b's have a range of 0 to 153 so you have only 15,606 possible hash values and an average of 1,666 keys with the same hashCode. You should change your hashcode so the number of possible hashCodes is much greater than the number of keys.
    • bacar
      bacar over 11 years
      I've psychically determined you're modelling Texas Hold 'Em Poker ;-)
    • interestedparty333
      interestedparty333 almost 7 years
      @notnoop The objects need to be immutable if they're being put in a HashMap. Even if an object changes, it will remain in the same bucket. As a result, if the object changes, it makes it much harder to get to it again -- You can get to it by looking up a copy of the old key, but it will fail the equality check.
  • Henning
    Henning over 14 years
    I don't think they are recomputed, ever. The table size is increased, the hashes are kept.
  • Henning
    Henning over 14 years
    Hashmap just does a bit-wise AND for every entry: newIndex = storedHash & (newLength - 1);
  • Jay
    Jay over 14 years
    Henning: Perhaps poor wording on delfuego's part, but the point is valid. Yes, the hash values are not recomputed in the sense that the output of hashCode() is not recalculated. But when the table size is increased, all the keys must be re-inserted into the table, that is, the hash value must be re-mapped to get a new slot number in the table.
  • delfuego
    delfuego over 14 years
    Jay, yep -- poor wording indeed, and what you said. :)
  • Henning
    Henning over 14 years
    Jay: But that's a relatively cheap operation, and as the size doubles every time, this can under no circumstances explain poor performance.
  • ReneS
    ReneS over 14 years
    Oh, another thing. Your hash codes have to be evenly distributed to avoid large linked lists at single positions in the map.
  • pepe450
    pepe450 over 14 years
    The strange thing is that setting the initial capacity equal to the number of elements I want to insert actually decreases the performance.
  • Henning
    Henning over 14 years
    Yup, a bad hash function would result in this kind of behavior. +1
  • Jay
    Jay over 14 years
    @Henning: Well, the AND function itself is cheap, but you do have to loop through all the slots in the table, then chase the linked list on every slot, and then for each entry do the AND (plus a "length-1" -- seems to me they should have cached the subtraction, but whatever). None of that is expensive of itself, but doing it for millions of entries could add up. I haven't done any benchmarks, but I'd think it's worth looking at.
  • Henning
    Henning over 14 years
    @Jay: I'm not against properly initializing the capacity, quite to the contrary. But if you start with the default capacity (16), you will only have to increase about 20 times for 26 million entries. That's a few million AND and copy operations total, sure, but could that explain a significant drop in performance to the point where it's noticeable, let alone unbearable? The bad hashCode() scenario you describe in your answer sounds much better.
  • delfuego
    delfuego over 14 years
    As is pointed out below, there's no recomputation of the hashcodes of objects in a HashMap when it's resized, so this doesn't gain you anything.
  • Juha Syrjälä
    Juha Syrjälä over 14 years
    Even if the keys are random, you could use (key.hashCode() % 28) to select the map in which to store that key-value pair.
  • ReneS
    ReneS over 14 years
    RAM might be way too small for these kinds of maps and arrays, so I already suspected a memory limitation problem.
  • pepe450
    pepe450 over 14 years
    Although I should also add that it won't work for me because I want the property that arrays with the same elements in different orders give the same hashcode.
  • kdgregory
    kdgregory over 14 years
    In that case, you have 52C2 + 52C3 hashcodes (23426 according to my calculator), and a hashmap is very much the wrong tool for the job.
  • OscarRyz
    OscarRyz over 14 years
    Actually this would increase the performance. The more collisions eq less entries in the hashtable eq. less work to do. Is not the hash ( which looks fine ) nor the hashtable ( which works great ) I'd bet it is on the object creation where the performance is degrading.
  • kdgregory
    kdgregory over 14 years
    @Oscar - more collisions equals more work to do, because now you have to do a linear search of the hash chain. If you have 26,000,000 distinct values per equals(), and 26,000 distinct values per hashCode(), then the bucket chains will have 1,000 objects each.
  • delfuego
    delfuego over 14 years
    Oscar, as pointed out elsewhere above (in response to your comments), you seem to be assuming that more collisions is GOOD; it's very much NOT good. A collision means that the slot at a given hash goes from containing a single entry to containing a list of entries, and this list has to be searched/traversed every time the slot is accessed.
  • oxbow_lakes
    oxbow_lakes over 14 years
    Berkeley DB is nowhere near fast enough for this number of entries due to the serialization/IO overhead; it could never be faster than a hashmap and the OP doesn't care about persistence. Your suggestion is not a good one.
  • MAK
    MAK over 14 years
    @Nash0: You appear to be saying that you want these to have the same hashCode but at the same time not being equal (as defined by the equals() method). Why would you want that?
  • OscarRyz
    OscarRyz over 14 years
    @delfuego: Not really, that happens only when you have a collision using different classes but for the same class the same entry is used ;)
  • OscarRyz
    OscarRyz over 14 years
    @delfuego: See for yourself: pastebin.com/f20af40b9 So, in this question the same class is being used ( wait a minute, the same class is being used right? ) Which implies that when the same hash is used the same entry is used and there is not "list" of entries.
  • OscarRyz
    OscarRyz over 14 years
    @kdgregory: Yes, but only if the collision happens with different classes, for the same class ( which is the case ) the same entry is used.
  • OscarRyz
    OscarRyz over 14 years
    @kdgregory: I've answered about the "more collisions implies more work" somewhere else :)
  • OscarRyz
    OscarRyz over 14 years
    @delfuego and @nash0: Yeap, setting the initial capacity equal to the number of elements decreases the performance because you're having tons of millions of collisions and thus you're only using a small amount of that capacity. Even if you use all the available entries, setting the same capacity will make it worse, because due to the load factor more space will be requested. You'll have to use initialCapacity = maxEntries / loadFactor ( such as 30M, 0.95 for 26M entries ), but this is NOT your case, since you're having all those collisions you're using only about 20k entries or less.
  • OscarRyz
    OscarRyz over 14 years
    Not really. The list is created only if the hash is the same, but the key is different. For instance if a String give hashcode 2345 and and Integer gives the same hashcode 2345, then the integer is inserted into the list because String.equals( Integer ) is false. But if you have the same class ( or at least .equals returns true ) then the same entry is used. For instance new String("one") and `new String("one") used as keys, will use the same entry. Actually this is the WHOLE point of HashMap in first place! See for yourself: pastebin.com/f20af40b9
  • pepe450
    pepe450 over 14 years
    Actually I am adding 26 million items; the number of items is (52 choose 5) * (5 choose 3) =~ 26 million. I am sorry, but my loop is not doing much besides calling put(); it is the HashMap performance.
  • pepe450
    pepe450 over 14 years
    Good point Oscar, it is the hash method that needs to be fixed.
  • pepe450
    pepe450 over 14 years
    I think I see what you mean by "not storing 26 million objects": I am storing 26 million, but they are only going into 20k hash buckets. From my experimentation that is true.
  • OscarRyz
    OscarRyz over 14 years
    @Nash0: MMhh I see, yeah you should know your code much better. I can only tell you about my findings. I think you should definitely use a profiler and let it tell you where the problem is. I still think it is not the hash map, but of course I don't have your code. Here's the code for my tests if you find it useful: pastebin.com/f7a6cd5a5 Using your hashcode for 26M with no initial capacity runs in 8s
  • OscarRyz
    OscarRyz over 14 years
    That's the number I've got also, 20K buckets, they are not really being stored.. wait a minute? Are you using something other than RAM? Because if you're serializing somehow, that's where the performance is going.
  • kdgregory
    kdgregory over 14 years
    @Oscar - I'm not sure I follow you: how does class matter? A HashMap works by picking a bucket chain by dividing the hashcode value by the number of buckets in the map (the "table" variable in Sun's implementation). Each bucket holds a linked list of entries, and the map walks that list calling equals() on every node.
  • kdgregory
    kdgregory over 14 years
    @Oscar - see my response to you with MAK's answer. HashMap maintains a linked list of entries at each hash bucket, and walks that list calling equals() on every element. The object's class has nothing to do with it (other than a short-circuit on equals()).
  • kdgregory
    kdgregory over 14 years
    @Oscar - Reading your answer it seems that you are assuming that equals() will return true if the hashcodes are the same. This is not part of the equals/hashcode contract. If I've misunderstood, ignore this comment.
  • pepe450
    pepe450 over 14 years
    Thank you very much for the effort, Oscar, but I think you are confusing the key objects being equal with having the same hash code. Also, in one of your code links you are using equal strings as the key; remember that strings in Java are immutable. I think we both learned a lot about hashing today :)
  • pepe450
    pepe450 over 14 years
    That is a very good idea, and in fact I am doing that for another data storage issue with 1.2 billion elements. In this case I wanted to take the easy way out and use a premade data structure :)
  • MAK
    MAK over 14 years
    @Nash0: To follow up on my last comment: if you cannot change the hashCode method, I think using a TreeMap would give you better performance.
  • OscarRyz
    OscarRyz over 14 years
    @Nash0: OOoook, yeap, we learned a lot :) Now I got it, so you're returning the same hash for objects that are NOT equals, right? Yeap, I missed that point. The javadoc says: It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hashtables. java.sun.com/javase/6/docs/api/java/lang/Object.html#hashCode()
  • Jay
    Jay over 14 years
    @Oscar: See my reply appended to my original post.
  • palantus
    palantus over 14 years
    I suggest not modifying the arrays in the hashCode method. By convention, hashCode does not change the object's state. Perhaps the constructor would be a better place to sort them.
  • rsp
    rsp over 14 years
    I agree that the sorting of the arrays should happen in the constructor. The code shown never seems to set the hashCode. Calculating the code can be done more simply as follows: int result = a[0]; result = result * 52 + a[1]; //etc.
  • Trevor Harrison
    Trevor Harrison over 14 years
    @kdgregory: I get 512*768 = 391680 possible hashes. Am I smoking dope?
  • pepe450
    pepe450 over 14 years
    I agree that sorting in the constructor and then calculating the hash code as mmyers and rsp suggest is better. In my case my solution is acceptable and I wanted to highlight the fact that the arrays must be sorted for hashCode() to work.
  • NateS
    NateS over 13 years
    Note you could also cache the hashcode (and invalidate appropriately if your object is mutable).
  • Tahir Akhtar
    Tahir Akhtar almost 12 years
    I know this is a very old thread, but here is a reference for the term "collision" as it relates to hash codes: link . When you replace a value in a hashmap by putting another value with the same key, it is not called a collision
  • Jay
    Jay almost 12 years
    @Tahir Exactly. Perhaps my post was poorly worded. Thanks for the clarification.
  • Anil Sharma
    Anil Sharma about 10 years
    so that means @nash is traversing through a linked list of size 1,300 to fetch an element in the worst-case scenario
  • jcsahnwaldt Reinstate Monica
    jcsahnwaldt Reinstate Monica over 6 years
    Just use java.util.Arrays.hashCode(). It's simpler (no code to write and maintain by yourself), its calculation is probably faster (fewer multiplications), and the distribution of its hash codes will probably be more even.
  • jcsahnwaldt Reinstate Monica
    jcsahnwaldt Reinstate Monica over 6 years
    Please read "Effective Java" by Josh Bloch. It explains (and helps you avoid or solve) this problem (and many others).
  • Benj
    Benj over 4 years
    Could you please share some link or implementation ?
  • ThrowsError
    ThrowsError over 3 years
    Why did you choose 52?
  • pepe450
    pepe450 over 3 years
    @JavaMan They are representing cards so there are only 52 options and by using powers of 52 I guarantee that every combination will get a unique hash code.
  • ThrowsError
    ThrowsError over 3 years
    @nash Thank you so much for this beautiful explication