Hash table vs Balanced binary tree


Solution 1

This question cannot be answered, in general, I fear.

The issue is that there are many types of hash tables and balanced binary trees, and their performances vary widely.

So, the naive answer is: it depends on the functionality you need. Use a hash table if you do not need ordering and a balanced binary tree otherwise.
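
For instance, in C++ this rule of thumb maps directly onto the standard containers: std::unordered_map is a hash table, while std::map is in practice a balanced tree (typically red-black, as noted in the comments below). A minimal sketch:

    #include <iostream>
    #include <map>
    #include <string>
    #include <unordered_map>

    int main() {
        // No ordering needed: hash table, O(1) average per lookup.
        std::unordered_map<std::string, int> ages{{"bob", 30}, {"alice", 25}};
        std::cout << ages["alice"] << '\n';  // 25

        // Ordering needed: balanced tree, O(log n) per operation,
        // but iteration visits the keys in sorted order.
        std::map<std::string, int> sorted{{"bob", 30}, {"alice", 25}};
        for (const auto& [name, age] : sorted)  // prints alice, then bob
            std::cout << name << ": " << age << '\n';
    }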

For a more elaborate answer, let's consider some alternatives.

Hash Table (see Wikipedia's entry for some basics)

  • Not all hash tables use a linked list for each bucket. A popular alternative is to use a "better" bucket: for example a binary tree, or another hash table (with a different hash function), and so on.
  • Some hash tables do not use buckets at all: see Open Addressing (they come with other issues, obviously)
  • There is something called incremental (or "linear") rehashing, a quality-of-implementation detail, which avoids the "stop-the-world-and-rehash" pitfall. Basically, during the migration phase you only insert into the "new" table, and each operation also moves one "old" entry into the "new" table. Of course, the migration phase means double look-ups, etc. (see the sketch below).
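
A rough sketch of that migration idea (illustrative only: the class is made up, and std::unordered_map stands in for the two underlying raw tables a real implementation would manage, along with growth thresholds):

    #include <optional>
    #include <string>
    #include <unordered_map>

    // Incremental ("linear") rehashing sketch: while a migration is in
    // progress, inserts go to the new table and each one also moves a
    // single entry out of the old table, spreading the rehash cost
    // across many operations instead of paying it all at once.
    class IncrementalTable {
        std::unordered_map<std::string, int> old_table, new_table;
    public:
        void insert(const std::string& key, int value) {
            new_table[key] = value;
            if (!old_table.empty()) {        // migrate one entry per insert
                auto it = old_table.begin();
                new_table.insert(*it);       // does not overwrite newer values
                old_table.erase(it);
            }
        }
        // The cost: during migration, every lookup may hit both tables.
        std::optional<int> find(const std::string& key) const {
            if (auto it = new_table.find(key); it != new_table.end())
                return it->second;
            if (auto it = old_table.find(key); it != old_table.end())
                return it->second;
            return std::nullopt;
        }
    };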

Binary Tree

  • Re-balancing is costly; you may consider a skip list (also better for multi-threaded access) or a splay tree.
  • A good allocator can "pack" nodes together in memory (better caching behavior), even though this does not alleviate the pointer-chasing issue.
  • B-trees and their variants also offer "packing"

Let's not forget that O(1) is an asymptotic complexity. For few elements, the constant factor is usually more important (performance-wise). This is especially true if your hash function is slow...
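
A crude way to see those constant factors for yourself (a sketch, not a rigorous benchmark; numbers will vary by platform and standard-library implementation):

    #include <chrono>
    #include <iostream>
    #include <map>
    #include <unordered_map>

    volatile long long sink;  // defeats dead-code elimination of the loop

    // Times `reps` rounds of lookups over n integer keys. For small n,
    // hashing cost and cache behavior often dominate the asymptotics.
    template <typename Map>
    long long lookup_micros(int n, int reps) {
        Map m;
        for (int i = 0; i < n; ++i) m[i] = i;
        long long sum = 0;
        auto start = std::chrono::steady_clock::now();
        for (int r = 0; r < reps; ++r)
            for (int i = 0; i < n; ++i) sum += m.find(i)->second;
        auto stop = std::chrono::steady_clock::now();
        sink = sum;
        return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    }

    int main() {
        for (int n : {8, 64, 4096})
            std::cout << "n=" << n
                      << "  std::map: " << lookup_micros<std::map<int, int>>(n, 1000) << "us"
                      << "  std::unordered_map: "
                      << lookup_micros<std::unordered_map<int, int>>(n, 1000) << "us\n";
    }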

Finally, for sets, you may also wish to consider probabilistic data structures, like Bloom Filters.
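
As a taste of that approach, here is a minimal (deliberately simplistic) Bloom filter sketch; the 8192-bit size and the derived second hash are arbitrary illustration choices:

    #include <bitset>
    #include <functional>
    #include <string>

    // A Bloom filter answers "possibly present" or "definitely absent"
    // in O(1) over a fixed bit array. False positives are possible;
    // false negatives are not, and elements cannot be removed.
    class BloomFilter {
        std::bitset<8192> bits;
        std::size_t h1(const std::string& s) const {
            return std::hash<std::string>{}(s) % bits.size();
        }
        std::size_t h2(const std::string& s) const {
            // Cheap second hash derived from the first; a real filter
            // would use k independent hash functions.
            return (std::hash<std::string>{}(s) * 0x9e3779b97f4a7c15ULL) % bits.size();
        }
    public:
        void add(const std::string& s) { bits.set(h1(s)); bits.set(h2(s)); }
        bool possibly_contains(const std::string& s) const {
            return bits.test(h1(s)) && bits.test(h2(s));
        }
    };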

Solution 2

Hash tables are generally better if there isn't any need to keep the data in any sort of sequence. Binary trees are better if the data must be kept sorted.

Solution 3

A point worth noting on modern architectures: a hash table will usually, if its load factor is low, require fewer memory reads than a binary tree. Since memory accesses tend to be rather costly compared to burning CPU cycles, the hash table is often faster.

In the following, the binary tree is assumed to be self-balancing, like a red-black tree, an AVL tree, or a treap.

On the other hand, if you need to rehash everything in the hash table when you decide to extend it, this may be a costly operation (though its cost is amortized over all inserts). Binary trees do not have this limitation.

Binary trees are easier to implement in purely functional languages.

Binary trees have a natural sort order and a natural way to walk the tree for all elements.
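
That natural order is directly usable; with std::map, for example:

    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        std::map<int, std::string> events{{3, "c"}, {1, "a"}, {2, "b"}};

        // Iteration is an in-order walk: keys come out sorted for free.
        for (const auto& [key, value] : events)
            std::cout << key << ' ';              // prints: 1 2 3

        // Trees also support order-based queries a hash table cannot:
        auto it = events.lower_bound(2);          // first key >= 2
        std::cout << "\nfirst key >= 2: " << it->first << '\n';
    }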

When the load factor of the hash table is low, you may be wasting a lot of memory space; but with two pointers per node, binary trees tend to take up more space overall.

Hash tables are nearly O(1) (depending on how you handle the load factor) vs. binary trees' O(log n).

Trees tend to be the "average performer": there is nothing they do particularly well, but then nothing they do particularly badly.

Solution 4

A binary search tree requires a total order relationship among the keys. A hash table requires only an equivalence or identity relationship with a consistent hash function.

If a total order relationship is available, then a sorted array has lookup performance comparable to binary trees, worst-case insert performance in the order of hash tables, and less complexity and memory use than both.
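
A small illustration of the sorted-array option, using std::vector and the standard binary-search algorithms:

    #include <algorithm>
    #include <iostream>
    #include <vector>

    int main() {
        // Sorted array: O(log n) lookups like a tree, but contiguous
        // storage, so less memory overhead and better cache behavior.
        std::vector<int> keys{2, 3, 5, 7, 11, 13};
        std::cout << std::boolalpha
                  << std::binary_search(keys.begin(), keys.end(), 7) << '\n';  // true

        // The trade-off: inserting shifts everything after the position,
        // O(n), on the order of a hash table's worst-case (resizing) insert.
        keys.insert(std::lower_bound(keys.begin(), keys.end(), 6), 6);
    }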

The worst-case insertion complexity for a hash table can be kept at O(1) (or O(log K), with K the number of elements sharing the same hash) if it's acceptable to increase the worst-case lookup complexity to O(K), or to O(log K) if the colliding elements can be kept sorted.

Invariants for both trees and hash tables are expensive to restore if the keys change, but still cheaper than the O(n log n) a sorted array would require.

These are factors to take into account in deciding which implementation to use:

  1. Availability of a total order relationship.
  2. Availability of a good hashing function for the equivalence relationship.
  3. A-priori knowledge of the number of elements.
  4. Knowledge about the rate of insertions, deletions, and lookups.
  5. Relative complexity of the comparison and hashing functions.

Solution 5

Hash tables have faster lookups:

  • You need a key that hashes to an even distribution (otherwise you'll miss a lot and have to fall back on something other than the hash, like a linear search; see the sketch after this list).
  • Hash tables can use a lot of empty space: you may reserve 256 entries but only need 8 (so far).
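
To make that "miss" behavior concrete, here is a toy open-addressing table with linear probing (illustrative only: no resizing or deletion, and it assumes the table never fills):

    #include <array>
    #include <optional>

    // Toy open-addressing hash table with linear probing. With an even
    // key distribution most lookups hit on the first probe; with a bad
    // one, probing degenerates toward a linear search.
    class ProbingTable {
        struct Slot { int key = 0; int value = 0; bool used = false; };
        std::array<Slot, 256> slots{};  // 256 entries reserved up front
    public:
        void insert(int key, int value) {
            std::size_t i = static_cast<std::size_t>(key) % slots.size();
            while (slots[i].used && slots[i].key != key)
                i = (i + 1) % slots.size();  // collision: probe next slot
            slots[i] = {key, value, true};
        }
        std::optional<int> find(int key) const {
            std::size_t i = static_cast<std::size_t>(key) % slots.size();
            while (slots[i].used) {
                if (slots[i].key == key) return slots[i].value;
                i = (i + 1) % slots.size();
            }
            return std::nullopt;
        }
    };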

Binary trees:

  • Deterministic. O(log n) I think...
  • Don't need extra space the way hash tables can
  • Must be kept sorted. Adding an element in the middle means moving the rest around.

Comments

  • peoro
    peoro almost 4 years

    What factors should I take into account when I need to choose between a hash table or a balanced binary tree in order to implement a set or an associative array?

  • Admin
    Admin about 13 years
    While they don't maintain sorted order, hash tables that maintain (insertion) order are somewhat trivial.
  • Daniel Egeberg
    Daniel Egeberg about 13 years
    What do you mean when you say that binary trees are deterministic? Hash tables are deterministic as well. Also, operations on binary trees are O(h) where h is the height. If it's a balanced binary tree, then h=O(log(n)).
  • whitey04
    whitey04 about 13 years
    Not true! Hash tables can "miss". For instance, if you have an array of 10 and use a phone number to index into it (with a modulo, for instance), you could get a hash that points you to the first element of the array. However, if 9 other numbers with the same hash were inserted first when the array was built, you actually have to go all the way to the last element. In a binary search you are guaranteed to get O(log n) no matter what. !DISCLAIMER! It all depends on how you build up your hash sort/search. There are many ways...
  • peoro
    peoro about 13 years
    That's not so easy. I'm afraid of a couple of things: 1. hash tables have bad performance (O(n)) in the worst case; 2. in order to resize the hash table I've got to rehash everything, which is pretty expensive. This question is to find out how I can avoid such issues and to be informed about the other issues I'm missing.
  • supercat
    supercat about 13 years
    pst: Maintaining insert order is possible with almost any 'black-box' collection; to what extent can one maintain sort order with a hash table better than with a 'black-box'?
  • MAK
    MAK about 13 years
    Adding an element in the middle does not mean moving the rest around. It's a linked data structure, not an array (maybe you are confusing a Binary Search Tree with Binary Search, which are two very different things). All operations are O(log(n)); if adding/removing in the middle meant moving the rest, it would have been O(n).
  • whitey04
    whitey04 about 13 years
    It all depends on how you implement it... Using a linked tree is a good way to bypass the insertion problem of a binary search. However, the binary search (with a tree underneath it or not) will always return a result in O(log n). A hash can't unless the input key is 1:1 with the generated hash.
  • Haozhun
    Haozhun about 13 years
    @peoro: O(n) is not practically possible unless someone knows your implementation details and simply wants to break you. Even considering the resize operation (which happens at a reasonable interval), a hash table costs much less than a balanced tree.
  • supercat
    supercat about 13 years
    @peoro: To amplify Gene's point, if each resize on a hash table doubles the size (pretty typical), then immediately after a resize, half the items will have been rehashed once, a quarter of them twice, an eighth of them three times, 1/16 of them four times, etc. so no matter how big the table gets, the average number of rehashes will be less than two. The point about degenerate hashing situations is a good one, though. For example, if one makes a Dictionary of a struct type without overriding GetHashCode, and many keys have the same value in the first field, performance will be bad.
  • supercat
    supercat about 13 years
    The worst-case time for adding an element to a hash table is O(n) because of the resize, but if a hash table doubles in size each time, the fraction of additions that require a rehash will drop as the table size increases. The average number of rehash operations per element will never exceed two, no matter how big the table gets.
  • David Weiser
    David Weiser about 13 years
    If the hash table size is doubling, then I'd be surprised if the number of collisions decreased because hash tables work best (i.e a low number of collisions) when the size of the table is prime. Also, if you're asking the system to give you twice as much memory each time you resize, you'll quickly run out of memory (or slow the system down if the system rearranges its memory to give you the amount of contiguous memory you're asking for).
  • Matthieu M.
    Matthieu M. about 13 years
    @peoro: rehashing is usually not necessary (the hash can be stored alongside the data), growing the hash table can be done "online" (migration phase), and the O(n) worst case can be alleviated by the hash table structure (and there are many to choose from).
  • rlibby
    rlibby about 13 years
    doubling is a common strategy but it isn't required. What is required is exponential growth. You can pick a smaller exponent if you like, it will just mean that the average number of rehash operations will be higher. In any case the amortized cost of n inserts in a table with exponential growth is O(n), while self-balancing binary search trees cost O(n*log(n)).
  • rlibby
    rlibby about 13 years
    "A binary search tree requires a total order relationship among the keys. A hash table requires only an equivalence or identity relationship with a consistent hash function." This is misleading. A binary search tree could always just use the same keys as the hash table: hash values. It is not a restriction on cases when trees may be used, compared to hash tables.
  • Apalala
    Apalala about 13 years
    @rlibby Though most implementations of hash keys by default use types on which a total order is defined (integers or pointers), only equivalence is required if you provide your own hashes. So, in general, you cannot use a binary search tree upon hash keys, because you don't know what the hashes are, where they came from, or much less if they support a total order relationship.
  • rlibby
    rlibby about 13 years
    but if I'm understanding your suggestion correctly, then such a hash value also cannot be used in a hash table. Surely if it can be used in a hash table then it can also be used in a tree set. If it can be used in a table, then it must map to some index in the table. One could use the function that generates this index to generate keys for the tree set.
  • supercat
    supercat almost 12 years
    @MatthieuM.: The O(n) worst case for a hash table is unavoidable unless the number of different values an element might take is limited. Otherwise, there's no way any hash table can achieve better than O(n) behavior if the elements it is given are all indistinguishable except by an equality-check operation (i.e. they all have the same hash code, and no ranking comparisons, partial-bit scans, or other such operations are available).
  • Matthieu M.
    Matthieu M. almost 12 years
    @supercat: you're right, of course. However a "simple" hashtable is just an array of linked lists, and linked lists have a tendency to degenerate quickly, while if you use an array of open-addressed hashtables you have less chance of hitting that worst case. This is what I meant by alleviated; not that it was completely out of the picture, but that by choosing a different structure you could greatly diminish the chances of it happening.
  • supercat
    supercat almost 12 years
    @MatthieuM.: Some hash tables are prone to degenerate into linear performance if the number of items gets close to the table size, no matter how good the hash function is. Effective algorithm design can reduce that danger. A hash table is going to be doomed to O(n) performance if the hash function is bad, though, no matter what sort of algorithm the table tries to use.
  • Apalala
    Apalala over 11 years
    @rlibby A hash table requires that elements that are equal have the same hash, but it doesn't require that elements that are different have different hashes. If different elements have the same hash, then there's no total-order relationship.
  • Matthieu M.
    Matthieu M. over 10 years
    @ProfVersaggi: Actually, that's not even true; some hash tables handle duplicates poorly, but some do well. I advise you to read Joaquín M López Muñoz's entries on the topic. He authored, and is maintaining, Boost MultiIndex.
  • Erik Alapää
    Erik Alapää about 9 years
    In C++, the std::map is a red-black tree, and std::unordered_map is the hash table. I usually prefer std::map since I am guaranteed I will never run into trouble with re-hashing that can e.g. destroy hard real-time systems.
  • SOFe
    SOFe about 5 years
    If different elements often have the same hashes, the hashtable would be very slow anyway. It is indeed possible to store a linked list on each node of a binary tree just like a linked list on each entry of a hash table.