Hash table - why is it faster than arrays?

Solution 1

In cases where I have a key for each element and I don't know the index of the element in an array, hashtables perform better than arrays (O(1) vs O(n)).

The hash table search performs O(1) in the average case. In the worst case, the hash table search performs O(n): this happens when you have collisions and the hash function always returns the same slot. One may think "this is a remote situation," but a good analysis should consider it. In that case you must iterate through all the elements, just as in an array or a linked list (O(n)).

Why is that? I mean: I have a key, I hash it... I have the hash... shouldn't the algorithm compare this hash against every element's hash? I think there's some trick behind the memory disposition, isn't there?

You have a key, you hash it... you have the hash: the index into the hash table where the element lives (if it has been inserted before). At this point you can access the hash table record in O(1). If the load factor is small, it's unlikely that there is more than one element there, so the first element you see should be the one you are looking for. Otherwise, if there is more than one element, you must compare the elements you find in that position with the element you are looking for. In this case you have O(1) + O(number_of_elements_in_that_slot).

In the average case, the hash table search complexity is O(1) + O(load_factor) = O(1 + load_factor).

Remember that load_factor = n / number_of_buckets: if the number of buckets is treated as a constant (or, in the extreme, every element collides into the same slot), the chain you have to scan contains up to n elements. So the search complexity is O(n) in the worst case.
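To make the chaining scheme above concrete, here is a minimal Python sketch (the names ChainedHashTable, num_buckets, etc. are illustrative, not from any real library). Selecting the slot is O(1); scanning the chain inside that slot is what contributes the O(load_factor) term:

    # Minimal sketch of a hash table with collision resolution by chaining.
    # All names here are illustrative, not taken from any real library.

    class ChainedHashTable:
        def __init__(self, num_buckets=8):
            self.buckets = [[] for _ in range(num_buckets)]  # one chain (list) per slot

        def _slot(self, key):
            return hash(key) % len(self.buckets)             # O(1): compute the slot from the key

        def put(self, key, value):
            chain = self.buckets[self._slot(key)]
            for i, (k, _) in enumerate(chain):               # overwrite if the key already exists
                if k == key:
                    chain[i] = (key, value)
                    return
            chain.append((key, value))

        def get(self, key):
            chain = self.buckets[self._slot(key)]            # O(1) to reach the right slot
            for k, v in chain:                               # O(chain length) ~ O(load_factor) on average
                if k == key:
                    return v
            raise KeyError(key)

    table = ChainedHashTable()
    table.put("apple", 1)
    table.put("pear", 2)
    print(table.get("pear"))   # 2 - only the (short) chain of one slot was scanned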

I don't know what you mean by "trick behind the memory disposition". From some points of view, the hash table (with its structure and collision resolution by chaining) can be considered a "smart trick".

Of course, the results of this hash table analysis can be proven mathematically.

Solution 2

With arrays: if you know the value, you have to search on average half the values (unless sorted) to find its location.

With hashes: the location is generated based on the value. So, given that value again, you can calculate the same hash you calculated when inserting it. Sometimes more than one value results in the same hash, so in practice each "location" is itself an array (or linked list) of all the values that hash to that location. In that case, only this much smaller array (unless it's a bad hash) needs to be searched.
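As a rough illustration of the difference, here is a hypothetical sketch that just counts equality comparisons rather than measuring time (the bucket count of 2,048 and the key names are arbitrary):

    # Lookup cost: linear search in a plain list vs. lookup through buckets.

    def linear_search(values, target):
        comparisons = 0
        for v in values:                               # may have to walk the whole list
            comparisons += 1
            if v == target:
                break
        return comparisons

    def hashed_search(buckets, target):
        chain = buckets[hash(target) % len(buckets)]   # jump straight to the right bucket
        comparisons = 0
        for v in chain:                                # only this (small) bucket is scanned
            comparisons += 1
            if v == target:
                break
        return comparisons

    values = [f"key{i}" for i in range(10_000)]
    buckets = [[] for _ in range(2_048)]
    for v in values:
        buckets[hash(v) % len(buckets)].append(v)

    print(linear_search(values, "key9999"))   # 10,000 comparisons for a value stored last
    print(hashed_search(buckets, "key9999"))  # just a handful of comparisons

The linear search has to walk the whole list for a value stored near the end, while the hashed search only scans the few values that happened to land in the same bucket.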

Solution 3

Hash tables are a bit more complex. They put elements in different buckets based on their hash % some value. In an ideal situation, each bucket holds very few items and there aren't many empty buckets.

Once you know the key, you compute the hash. Based on the hash, you know which bucket to look for. And as stated above, the number of items in each bucket should be relatively small.

Hash tables are doing a lot of magic internally to make sure buckets are as small as possible while not consuming too much memory for empty buckets. Also, much depends on the quality of the key -> hash function.
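As an illustration of that internal "magic", here is a hedged sketch of load-factor-driven resizing; the 0.75 threshold and the doubling strategy are common choices but not universal, and the class name is made up for the example:

    # When the table gets too full, grow the bucket array and rehash everything,
    # so that each bucket stays small.

    class ResizingHashSet:
        def __init__(self):
            self.buckets = [[] for _ in range(8)]
            self.size = 0

        def add(self, value):
            if self.size / len(self.buckets) > 0.75:      # load factor above the threshold?
                self._resize(2 * len(self.buckets))       # grow, so buckets stay small
            chain = self.buckets[hash(value) % len(self.buckets)]
            if value not in chain:
                chain.append(value)
                self.size += 1

        def _resize(self, new_num_buckets):
            new_buckets = [[] for _ in range(new_num_buckets)]
            for chain in self.buckets:
                for value in chain:                       # every element moves to its new bucket
                    new_buckets[hash(value) % new_num_buckets].append(value)
            self.buckets = new_buckets

    s = ResizingHashSet()
    for i in range(100):
        s.add(i)
    print(len(s.buckets))   # has grown well beyond the initial 8 buckets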

Wikipedia provides a very comprehensive description of hash tables.

Solution 4

A hash table will not have to compare every element in the table. It will calculate a hash code from the key; for example, if the key is 4, the hash code might be something like 4*x*y for some constants x and y. From that hash code the table knows exactly which slot to pick.

Whereas if it had been an array, it would have had to traverse the whole array to find that element.
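A tiny worked version of the idea above (x, y and the bucket count are arbitrary constants invented for the illustration, mirroring the answer's 4*x*y example):

    x, y = 31, 17                   # arbitrary constants for the toy hash function
    num_buckets = 16

    def toy_hash(key):
        return key * x * y          # e.g. key = 4  ->  hash code = 4 * x * y

    key = 4
    slot = toy_hash(key) % num_buckets
    print(slot)                     # jump straight to this slot instead of scanning an array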

Solution 5

Why is [it] that [hashtables perform lookups by key better than arrays (O(1) vs O(n))]? I mean: I have a key, I hash it... I have the hash... shouldn't the algorithm compare this hash against every element's hash? I think there's some trick behind the memory disposition, isn't there?

Once you have the hash, it lets you calculate an "ideal" or expected location in the array of buckets; commonly:

ideal bucket = hash % num_buckets
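For example, with Python's built-in hash() and an arbitrary count of 8 buckets, two different keys can end up with the same ideal bucket (in CPython, hash(n) == n for small non-negative integers):

    num_buckets = 8                 # arbitrary table size for the example
    for key in (1, 9):
        print(key, "-> ideal bucket", hash(key) % num_buckets)
    # 1 -> ideal bucket 1
    # 9 -> ideal bucket 1   (both keys want the same bucket: a collision)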

The problem is then that another value may have already hashed to that bucket, in which case the hash table implementation has two main choices:

1) try another bucket

2) let several distinct values "belong" to one bucket, perhaps by making the bucket hold a pointer into a linked list of values

For implementation 1, known as open addressing or closed hashing, you jump around the other buckets: if you find your value, great; if you find a never-used bucket, then you can store your value there if inserting, or you know you'll never find your value when searching. There's a potential for the search to be even worse than O(n) if the way you traverse alternative buckets ends up searching the same bucket multiple times; for example, with quadratic probing you try the ideal bucket index +1, then +4, then +9, then +16 and so on - but you must avoid out-of-bounds bucket access using e.g. % num_buckets, so if there are say 12 buckets then ideal+4 and ideal+16 search the same bucket. It can be expensive to track which buckets have been searched, so it can also be hard to know when to give up: the implementation can be optimistic and assume it will always find either the value or an unused bucket (risking spinning forever), or it can keep a counter and, after a threshold of tries, either give up or fall back to a linear bucket-by-bucket search.
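A minimal sketch of a lookup under this scheme (quadratic probing, a sentinel for never-used buckets, and a probe counter as the give-up threshold; there is no deletion support, and all names are made up for the illustration):

    EMPTY = object()                        # sentinel for a never-used bucket

    def probe_lookup(buckets, key, max_probes=None):
        n = len(buckets)
        if max_probes is None:
            max_probes = n                  # threshold of tries before giving up
        ideal = hash(key) % n
        for i in range(max_probes):
            slot = (ideal + i * i) % n      # ideal, +1, +4, +9, ... (wrapped with % n)
            entry = buckets[slot]
            if entry is EMPTY:              # never-used bucket: the key cannot be present
                return None
            if entry == key:
                return slot
        return None                         # gave up after max_probes tries

    # Insert a few keys with the same probing rule, then look them up.
    buckets = [EMPTY] * 12
    for key in ("a", "b", "c"):
        ideal = hash(key) % len(buckets)
        for i in range(len(buckets)):
            slot = (ideal + i * i) % len(buckets)
            if buckets[slot] is EMPTY:
                buckets[slot] = key
                break

    print(probe_lookup(buckets, "b"))       # the slot where "b" ended up
    print(probe_lookup(buckets, "zzz"))     # None: hits an empty bucket or the probe limit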

For implementation 2, known as closed addressing or separate chaining, you have to search inside the container/data structure of values that all hashed to the ideal bucket. How efficient this is depends on the type of container used. It's generally expected that the number of elements colliding at one bucket will be small, which is true of a good hash function with non-adversarial inputs, and typically true enough of even a mediocre hash function, especially with a prime number of buckets. So a linked list or contiguous array is often used, despite the O(n) search properties: linked lists are simple to implement and operate on, and arrays pack the data together for better memory cache locality and access speed. The worst possible case, though, is that every value in your table hashed to the same bucket, and the container at that bucket now holds all the values: your entire hash table is then only as efficient as that bucket's container. Some Java hash table implementations have started using binary trees if the number of elements hashing to the same bucket passes a threshold, to make sure complexity is never worse than O(log2 n).

Python's hash tables (dict and set) are an example of 1 = open addressing = closed hashing. C++'s std::unordered_set is an example of 2 = closed addressing = separate chaining.

Author by Johnny Pauling

Updated on October 07, 2020

Comments


  • Zoran Pavlovic
    Zoran Pavlovic almost 12 years
    Hash tables don't always use buckets.
  • bitfox
    bitfox almost 12 years
    Thank you guys :-) @ZoranPavlovic: I write answers out of passion. The upvotes are like saying "I like your reply!" and they help to highlight the good answers for the community.
  • Isaac
    Isaac over 8 years
    How does the load factor = n in the worst case? If the load_factor is n/k where n is the number of elements and k is the size of the array, won't the load_factor be 1 in the worst case, since you cant store more than k elements in the hash table?
  • Ryan McArthur
    Ryan McArthur about 8 years
    I know this is an ancient topic but thank you for your answer! This simple straightforward explanation is what made it finally click back into place for me. I kept thinking to myself, why the heck do I want to hash these things when I can just retrieve them from a simple array... Well because maybe I don't know its index, duh smacks forehead. :).
  • Michael
    Michael about 8 years
    Thank you! I've known for years that Hash Tables stored the values in various buckets (ideally with one value per bucket), however in all the reading I've done, no one has ever explained that the location of the bucket itself is essentially determined by the object being stored - so the very act of rehashing the value lets you know where to look for it. I suddenly get why Hash Tables have the potential to be so much faster than arrays. Short but excellent answer.
  • Julien Zakaib
    Julien Zakaib over 7 years
    Something I don't get : you say the worst case for hashtables is O(n) , where n is the total number of elements. However, the only circumstance in which you have to iterate linearly over elements is when there was a collision (in which case the bucket is some sort of array/linked list/tree). So are you saying that in the worst case, every single element in the hashtable caused a collision, and thus lookup is O(n) ? Is that even possible? Shouldn't it be O(m) where m is number of inserted elements that hash to the same value ?
  • Kyle Delaney
    Kyle Delaney about 6 years
    This makes more sense to me than the accepted answer. Thanks.
  • Yousef
    Yousef over 4 years
    You don't compare hashes when searching. You compute the hash, which gives you the index where the value resides, then you compare the values in the list you've found with the value you're looking for.