Why can hashCode() return the same value for different objects in Java?

33,793

Solution 1

hashing an object means "finding a good, descriptive value (number) that can be reproduced by the very same instance again and again". Because hash codes from Java's Object.hashCode() are of type int, you can only have 2^32 different values. That's why you will have so-called "collisions" depending on the hashing algorithm, when two distinct Objects produce the same hashCode.

Typically, this does not produce any problems, because hashCode() is mostly used together with equals(). For instance, a HashMap will call hashCode() upon its keys, to know whether the keys may already be contained in the HashMap. If the HashMap does not find the hash code, it's obvious the key is not contained in the HashMap yet. But if it does, it will have to double-check all keys having that same hash code using equals().

I.e.

A.hashCode() == B.hashCode() // does not necessarily mean
A.equals(B)

But

A.equals(B) // means
A.hashCode() == B.hashCode()

If equals() and hashCode() are implemented correctly.

For a more precise description of the general hashCode contract, see the Javadoc.

Solution 2

There are only just over 4 billion possible hashcodes (the range of an int) , but the number of objects you could choose to create is much larger. Therefore some objects must share the same hash code, by the pigeonhole principle.

For example the number of possible strings containing 10 letters from A-Z is 26**10 which is 141167095653376. It is impossible to assign all of these strings a unique hash code. Nor is it important - the hash code doesn't need to be unique. It just needs to have not too many collisions for real data.

Solution 3

The idea of a hashtable is that you want to be able to realize a datastructure called a dictionary in an efficient way. A dictionary is a key/value store, i.e., you want to be able to store certain objects under a certain key and later on be able to retrieve them again using the same key.

One of the most efficient ways to access values is to store them in an array. For instance, we could realize a dictionary that uses integers for keys and Strings for values like so:

String[] dictionary = new String[DICT_SIZE];
dictionary[15] = "Hello";
dictionary[121] = "world";

System.out.println(dictionary[15]); // prints "Hello"

Unfortunately, this approach is not very general at all: the index of an array has to be an integer value, but ideally we'd like to be able to use arbitrary kinds of objects for our keys, not only integers.

Now, the way to solve this point is to have a way of mapping arbitrary objects to integer values which we could then use as keys for our array. In Java, that's what hashCode() does. So now, we could try to implement a String->String dictionary:

String[] dictionary = new String[DICT_SIZE];
// "a" -> "Hello"
dictionary["a".hashCode()] = "Hello";

// "b" -> "world"
dictionary["b".hashCode()] = "world";

System.out.println(dictionary["b".hashCode()]); // prints world

But hey, what if there is some object which we'd like to use as a key, but its hashCode method returns a value that's greater than or equal to DICT_SIZE? Then we'd get an ArrayIndexOutOfBoundsException and that would be undesirable. So, let's just make it as big as we can, right?

public static final int DICT_SIZE = Integer.MAX_VALUE // Ooops!

But that would mean that we would have to allocate ginormeous amounts of memory for our array, even if we only intend to store a few items. So that can't be the best solution, and in fact we can do better. Let's assume we had a function h that for any given DICT_SIZE maps arbitrary integers into the range [0, DICT_SIZE[. Then we could just apply h to whatever the hashCode() method of a key object returns and be certain that we stay in the boundaries of the underlying array.

public static int h(int value, int DICT_SIZE) {
    // returns an integer >= 0 and < DICT_SIZE for every value.
}

That function is called a hash function. Now we can adapt our dictionary implementation to avoid the ArrayIndexOutOfBoundsException:

// "a" -> "Hello"
dictionary[h("a".hashCode(), DICT_SIZE)] = "Hello"

// "b" -> "world"
dictionary[h("b".hashCode(), DICT_SIZE)] = "world"

But that introduces another problem: what if h maps two different key indices to the same value? For instance:

int keyA = h("a".hashCode(), DICT_SIZE);
int keyB = h("b".hashCode(), DICT_SIZE);

may yield the same values for keyA and keyB, and in that case we would accidentally overwrite a value in our array:

// "a" -> "Hello"
dictionary[keyA] = "Hello";

// "b" -> "world"
dictionary[keyB] = "world"; // DAMN! This overwrites "Hello"!!

System.out.println(dictionary[keyA]); // prints "world"

Well, you may say, then we just have to make sure that we implement h in such a way that this can never happen. Unfortunately, this isn't possible in general. Consider the following code:

for (int i = 0; i <= DICT_SIZE; i++) {
    dictionary[h(i, DICT_SIZE)] = "dummy";
}

This loop stores DICT_SIZE + 1 values (always the same value, actually, namely the String "dummy") in the dictionary. Mhh, but the array can only store DICT_SIZE different entries! That means, when we use h, we would overwrite (at least) one entry. Or in other words, h will map two different keys to the same value! These "collisions" can't be avoided: if n pigeons try to go into n-1 pigeon holes, at least two of them have to go into the same hole.

But what we can do is to extend our implementation so that the array can store multiple values under the same index. This can easily be done by using lists. So instead of using:

String[] dictionary = new String[DICT_SIZE];

we write:

List<String>[] dictionary = new List<String>[DICT_SIZE];

(Side remark: note that Java doesn't allow the creation of arrays of generic types, so the above line wouldn't compile -- but you get the idea).

That will change the access to the dictionary as follows:

// "a" -> "Hello"
dictionary[h("a".hashCode(), DICT_SIZE)].add("Hello");

// "b" -> "world"
dictionary[h("b".hashCode(), DICT_SIZE)].add("world");

In case our hashfunction h returns different values for all our keys, this will result in lists with only one element each, and retrieving elements is really simple:

System.out.println(dictionary[h("a".hashCode(), DICT_SIZE)].get(0)); // "Hello"

But we already know that in general h will map different keys to the same integer sometimes. In these cases, the lists will contain more than one value. For retrieval, we have to go through the whole list to find the "correct" value, but how would we recognize it?

Well, instead of storing the value alone, we could always store the complete (key,value) pair in the lists. Then lookup would be performed in two steps:

  1. Apply the hashfunction to retrieve the correct list from the array.
  2. Iterate through all pairs stored in the retrieved list: if the pair with the desired key is found, return the value from the pair.

Now adding and retrieving have become so complex that it's not indecent to treat ourselves separate methods for these operations:

List<Pair<String,String>>[] dictionary = List<Pair<String,String>>[DICT_SIZE];

public void put(String key, String value) {
    int hashCode = key.hashCode();
    int arrayIndex = h(hashCode, DICT_SIZE);

    List<Pair<String,String>> listAtIndex = dictionary[arrayIndex];
    if (listAtIndex == null) {
        listAtIndex = new LinkedList<Pair<Integer,String>>();
        dictionary[arrayIndex] = listAtIndex;
    }

    for (Pair<String,String> previouslyAdded : listAtIndex) {
        if (previouslyAdded.getKey().equals(key)) {
            // the key is already used in the dictionary,
            // so let's simply overwrite the associated value
            previouslyAdded.setValue(value);
            return;
        }
    }

    listAtIndex.add(new Pair<String,String>(key, value));
}

public String get(String key) {
    int hashCode = key.hashCode();
    int arrayIndex = h(hashCode, DICT_SIZE);

    List<Pair<String,String>> listAtIndex = dictionary[arrayIndex];
    if (listAtIndex != null) {
        for (Pair<String,String> previouslyAdded : listAtIndex) {
            if (previouslyAdded.getKey().equals(key)) {
                return previouslyAdded.getValue(); // entry found!
            }
        }
    }

    // entry not found
    return null;
}

So, in order for this approach to work, we actually need two comparison operations: the hashCode method to find the list in the array (this works fast if hashCode() and h are both fast) and an equals method which we need when going through the list.

This is the general idea of hashing, and you will recognize the put and get method from java.util.Map. Of course, the above implementation is an oversimplification, but it should illustrate the gist of it all.

Naturally, this approach is not limited to Strings, it works for all kinds of objects, since the methods hashCode() and equals are members of the top-level class java.lang.Object and all other classes inherit from that one.

As you can see, it doesn't really matter if two distinct objects return the same value in their hashCode() method: the above approach will always work! But still it is desirable that they return different values to lower the chances for hash collisions produced by h. We have seen that these can't be avoided 100% in general, but the less collisions we get, the more efficient our hashtable becomes. In the worst case, all keys map to the same array index: in that case, all pairs are stored in a single list and finding a value will then become an operation with costs linear in the size of the hashtable.

Solution 4

The hashCode() value can be used to quickly find an object by using the hash code as an address to a hash table bucket where it is stored.

If multiple objects return the same value from hashCode(), it means that they would be stored in the same bucket. If many objects are stored in the same bucket it means that on average it requires more comparison operations to look up a given object.

Instead use equals() to compare two objects to see whether they are semantically equal.

Share:
33,793
Eugene
Author by

Eugene

Updated on July 09, 2022

Comments

  • Eugene
    Eugene almost 2 years

    A quote from the book I'm reading Head First Java:

    The point is that hashcodes can be the same without necessarily guaranteeing that the objects are equal, because the "hashing algorithm" used in the hashCode() method might happen to return the same value for multiple objects.

    Why might the hashCode() method return the same value for different objects? Does that not cause problems?

  • Thomas
    Thomas over 13 years
    @Lukas Eder: Your answer was not only way more concise (and still correct and easy to understand), you also got way more credit than my tl;dr answer ;-)
  • Eugene
    Eugene over 13 years
    Unbelievable! :) Marked as accepted one! Thank you very much again.
  • Thomas
    Thomas over 13 years
    @AndroidNoob: thanks for the checkmark, but I guess it's a little unfair now: Lukas Eder, whose answer you've accepted before, wrote an answer that's definitely correct too, and he just was a little faster than me. I think he should get his credit back...
  • Lukas Eder
    Lukas Eder over 13 years
    This is clearly the you-scratch-my-back-i'll-scratch-yours post! :)
  • Arjan Tijms
    Arjan Tijms over 13 years
    Over time you can maybe catch up in credits with the other answers. Gave you my vote too ;)
  • supercat
    supercat over 11 years
    If one is using a data structure which can tolerate duplicate hash codes, even if inefficiently, there is apt to be no practical difference between a hash code which would typically cause 100 items in a set of 10,000 to have hash codes that each match one other item in the set, versus one that rarely result in even one duplicate. A fast algorithm which achieves the former metric is apt to be more efficient than a slower algorithm that achieves the second.
  • Tundey
    Tundey over 11 years
    And how does your answer invalidates mine? It's still inefficient just more practical.
  • supercat
    supercat over 11 years
    If the with one algorithm, the average item in a hashed set shares a bucket with 0.1 other items, but an slightly-more-expensive algorithm could eliminate all collisions, the latter algorithm would only be more efficient if its extra cost was less than a tenth the cost of an extra comparison. If a hashing algorithm takes a significant amount of time, a complete lack of collisions could be a sign that a faster algorithm might be more efficient.
  • Tundey
    Tundey over 11 years
    So many ifs in those 2 statements...yes you are right but you are going to extreme lengths to point out a very easy point i.e. the time taken to guarantee no collisions may not be worth it. You could have just said that instead of inventing hypothetical algorithms that take just enough time to make it not worth while. Good grief.
  • supercat
    supercat over 11 years
    I interpreted your original answer as a blanket statement that efficient hash algorithms must map every distinct object to a different hashcode; any algorithm that fails to do so is inefficient. That statement is false. Efficient hash codes are expected to have occasional hash collisions; in many cases it's impossible to eliminate all collisions, and even when it's not impossible it's seldom worthwhile for anything but the simplest types.