Why do we use a hash code in a HashTable instead of an index?


Solution 1

Basically, a hash function digests data and generates a fingerprint (here, an integer) for that data. Unlike an index, this fingerprint depends ONLY on the data, and should be free of any predictable ordering based on the data. Any change to a single bit of the data should also change the fingerprint considerably.

Notice that nothing here guarantees that different data won't give the same hash. In fact, quite the opposite: there are far more possible values than possible hashes, so two different values will sometimes share a hash. This is called a collision. With a well-distributed 32-bit integer hash, though, the chance that two given values collide is roughly 1 in 4 billion (1 in 2^32). If a collision happens, you just compare the actual objects to see whether they match.

This fingerprint can then be used as an index into an array (or ArrayList) of stored values, usually after reducing it to the size of the array. Because the fingerprint depends only on the data, you can compute the hash for something and just check the array element for that hash value to see if it has been stored already. Otherwise, you'd have to go through the whole array, checking each item for a match.

You can also VERY quickly do associative arrays by using 2 arrays, one with Key values (indexed by hash), and a second with values mapped to those keys. If you use a hash, you just need to know the key's hash to find the matching value for the key. This is much faster than doing a binary search on a sorted key list, or a scan of the whole array to find matching keys.
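The two-array idea can be sketched roughly as follows. This is only an illustration, not how .NET's own collections are implemented: the names TinyMap, Put, TryGet and SlotFor are made up for the example, the capacity is fixed, and collisions are handled by simply trying the next slot (linear probing), which is one assumption among several possible strategies.

```csharp
using System;

// Hypothetical sketch: an associative array built from two parallel arrays.
// keys[i] and values[i] belong together; i is derived from the key's hash.
public class TinyMap
{
    private readonly string[] keys;
    private readonly int[] values;

    public TinyMap(int capacity)
    {
        keys = new string[capacity];
        values = new int[capacity];
    }

    // Reduce the hash code to a valid slot. The mask clears the sign bit
    // so the remainder is never negative.
    private int SlotFor(string key)
    {
        return (key.GetHashCode() & 0x7FFFFFFF) % keys.Length;
    }

    public void Put(string key, int value)
    {
        int slot = SlotFor(key);
        for (int probes = 0; probes < keys.Length; probes++)
        {
            // Empty slot, or the same key already stored: write here.
            if (keys[slot] == null || keys[slot] == key)
            {
                keys[slot] = key;
                values[slot] = value;
                return;
            }
            // A different key occupies this slot (a collision): try the next one.
            slot = (slot + 1) % keys.Length;
        }
        throw new InvalidOperationException("TinyMap is full.");
    }

    public bool TryGet(string key, out int value)
    {
        int slot = SlotFor(key);
        for (int probes = 0; probes < keys.Length; probes++)
        {
            if (keys[slot] == null) break;   // nothing was ever stored here
            if (keys[slot] == key)           // compare the actual key, not just the hash
            {
                value = values[slot];
                return true;
            }
            slot = (slot + 1) % keys.Length;
        }
        value = 0;
        return false;
    }
}
```

Calling Put("apple", 1) and then TryGet("apple", out v) touches only the slots on the probe path for "apple", no matter how many other keys are stored, which is where the speed advantage over scanning comes from.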

There are MANY ways to generate a hash, and all of them have various merits, but few are simple. I suggest consulting the Wikipedia page on hash functions for more info.

Solution 2

A HashCode is a pseudo-unique key. We would like to have a truly unique key, but that's not feasible, so we settle for a function that is fast and safe (it throws no exceptions).

A HashTable uses the HashCode to do the initial lookup in O(1) time. An index that relies on ordering, such as a sorted list or a tree, requires O(log(n)) time. But with a poor HashCode function, collision handling can make the HashTable a lot slower.

In .NET there is a default implementation for GetHashCode, but types can override this.

System.String overrides GetHashCode() because it overrides Equals(), and GetHashCode() has to be kept consistent with Equals(): two objects that are equal must return the same hash code.
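A minimal sketch of what "kept consistent" means for a type of your own. The Point class is hypothetical and only for illustration, and the multiplier 31 is an arbitrary choice, not something .NET prescribes.

```csharp
// Hypothetical class used only to illustrate the Equals/GetHashCode contract.
public sealed class Point
{
    public int X { get; }
    public int Y { get; }

    public Point(int x, int y)
    {
        X = x;
        Y = y;
    }

    // Two points are equal when both coordinates match.
    public override bool Equals(object obj)
    {
        return obj is Point other && X == other.X && Y == other.Y;
    }

    // Built from exactly the fields that Equals compares, so equal points
    // always return the same hash code. unchecked lets the arithmetic wrap
    // around instead of throwing on overflow.
    public override int GetHashCode()
    {
        unchecked
        {
            return X * 31 + Y;
        }
    }
}
```

If Equals were overridden without GetHashCode, two equal points would usually get different default hash codes, and a hash-based collection would look for the second one in the wrong place.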

Solution 3

Answering each one of your questions directly:

How is that integer hash generated by the GetHashCode() function? Is it a random value which is not unique?

An integer hash is generated by whatever method is appropriate for the object. The generation method is not random but must follow consistent rules, ensuring that a hash generated for one particular object will equal the hash generated for an equivalent object. As an example, a hash function for an integer would be to simply return that integer.
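The two claims above can be checked with a small sketch. The class and method names are invented for the example. The integer case relies on how current .NET implementations of Int32.GetHashCode() behave, and the string case only holds within a single program run, since newer .NET versions may produce different string hash codes from one run to the next.

```csharp
using System;

// Hypothetical helper class; the names are made up for this illustration.
public static class HashConsistencyDemo
{
    // In current .NET implementations, an int's hash code is the int itself.
    public static bool IntHashIsValue(int n)
    {
        return n.GetHashCode() == n;
    }

    // Builds a second string with the same characters and checks that both
    // the equality test and the hash codes agree.
    public static bool EqualStringsHashEqually(string text)
    {
        string copy = new string(text.ToCharArray());
        return text.Equals(copy) && text.GetHashCode() == copy.GetHashCode();
    }
}
```

Both methods are expected to return true for any input, which is the "consistent rules" requirement in practice: equivalent objects, equal hashes.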

For strings, it is overridden to make sure that there is only one hash code for a particular string. How is that done?

There are many ways this can be done. Here's an example I'm thinking of on the spot:

// theString is the string being hashed.
int hash = 0;
for(int i = 0; i < theString.Length; ++i)
{
    // XOR each character's code into the running hash.
    hash ^= theString[i];
}

This is a valid hash algorithm, because the same sequence of characters will always produce the same hash number. It's not a good hash algorithm (an extreme understatement), because many strings will produce the same hash. A valid hash algorithm doesn't have to guarantee uniqueness. A good hash algorithm makes it extremely unlikely that two differing objects produce the same number.
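To make the difference between "valid" and "good" concrete, here is the XOR loop wrapped in a method next to a slightly better alternative. MultiplyHash is an illustrative assumption (a common multiply-and-add pattern with arbitrarily chosen constants 17 and 31); it is not the algorithm System.String actually uses.

```csharp
// Hypothetical helper class; both method names are made up for this sketch.
public static class StringHashes
{
    // The XOR hash from the answer above. Character order does not matter,
    // so any two strings with the same characters in a different order collide.
    public static int XorHash(string s)
    {
        int hash = 0;
        for (int i = 0; i < s.Length; ++i)
        {
            hash ^= s[i];
        }
        return hash;
    }

    // Multiplying before adding each character makes the result depend on
    // character order. unchecked lets the arithmetic wrap on overflow.
    public static int MultiplyHash(string s)
    {
        int hash = 17;
        unchecked
        {
            for (int i = 0; i < s.Length; ++i)
            {
                hash = hash * 31 + s[i];
            }
        }
        return hash;
    }
}
```

With these definitions, XorHash("ab") and XorHash("ba") are both 3, while MultiplyHash gives 19442 for "ab" and 19472 for "ba". Both functions are valid, since equal strings always hash equally, but only the second separates simple reorderings.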

How is searching for a specific key in a hash table sped up by using a hash code? What are the advantages of using a hash code over using an index directly in the collection (like in arrays)?

A hash code is typically used in hash tables. A hash table is an array, but each entry in the array is a "bucket" of items, not just one item. If you have an object and you want to know which bucket it belongs in, calculate

 hash_value MOD hash_table_size. 

Then you only have to compare the object with the items in that one bucket. As long as the buckets stay small, a hash table lookup will most likely have a search time of O(1), as opposed to O(log(N)) for a sorted list or O(N) for an unsorted list.
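The bucket lookup described above can be sketched like this. BucketTable and its members are hypothetical names, the table never resizes, and clearing the sign bit before taking the remainder is an assumption made here so that negative hash codes still map to a valid bucket.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of a hash table whose array entries are buckets.
public class BucketTable
{
    private readonly List<string>[] buckets;

    public BucketTable(int size)
    {
        buckets = new List<string>[size];
        for (int i = 0; i < size; i++)
        {
            buckets[i] = new List<string>();
        }
    }

    // hash_value MOD hash_table_size, with the sign bit cleared first.
    public int BucketIndex(string item)
    {
        return (item.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
    }

    public void Add(string item)
    {
        List<string> bucket = buckets[BucketIndex(item)];
        if (!bucket.Contains(item))
        {
            bucket.Add(item);
        }
    }

    // Only the one bucket chosen by the hash is searched.
    public bool Contains(string item)
    {
        return buckets[BucketIndex(item)].Contains(item);
    }
}
```

The lookup cost is the cost of scanning a single bucket, so it stays close to constant as long as the hash function spreads items evenly and the table has enough buckets.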

Author by

Jaywith.7

Currently, I'm following BSc. Software Engineering degree (University of Westminster). I'm working and learning C#.NET.

Updated on June 13, 2022

Comments

  • Jaywith.7 almost 2 years
    • How is that integer hash generated by the GetHashCode() function? Is it a random value which is not unique?

    • For strings, it is overridden to make sure that there is only one hash code for a particular string. How is that done?

    • How is searching for a specific key in a hash table sped up by using a hash code?

    • What are the advantages of using a hash code over using an index directly in the collection (like in arrays)?

    Can someone help?

  • Rob Kennedy about 15 years
    The hash is not "more or less random"; it's just less. So less random as to not be random at all. A better word would be "arbitrary." And by saying the hash is "unique to that data," you DO "guarantee that different data won't give the same hash." And since that's obviously false, "unique" is not the right word.
  • BobMcGee about 15 years
    I mean random, as in there's no predictable order to the keys from a hashcode vs. indexes in a List being assigned in order. I'll try to clarify my points by rephrasing that.