How std::unordered_map is implemented

57,093

The Standard effectively mandates that implementations of std::unordered_set and std::unordered_map - and their "multi" brethren - use open hashing aka separate chaining, which means an array of buckets, each of which holds the head of a linked list†. That requirement is subtle: it is a consequence of:

  • the default max_load_factor() being 1.0 (which means the table will resize whenever size() would otherwise exceed 1.0 times the bucket_count()), and
  • the guarantee that the table will not be rehashed unless grown beyond that load factor.

That would be impractical without chaining, because with the other main category of hash table implementation - closed hashing aka open addressing - collisions become overwhelming as the load_factor() (https://en.cppreference.com/w/cpp/container/unordered_map/load_factor) approaches 1.
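As a rough illustration of the load-factor behaviour described above, the following sketch inserts keys one at a time and watches bucket_count(). The helper name count_rehashes is made up for this example; the initial bucket count and the growth policy are implementation-defined, so only the invariant (load factor never above the maximum) is checked.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical helper: inserts n distinct keys and returns how many times
// bucket_count() changed along the way.
std::size_t count_rehashes(std::size_t n) {
    std::unordered_map<int, int> m;
    assert(m.max_load_factor() == 1.0f);  // default required by the Standard

    std::size_t rehashes = 0;
    std::size_t buckets = m.bucket_count();  // implementation-defined start value
    for (std::size_t i = 0; i < n; ++i) {
        m.emplace(static_cast<int>(i), 0);
        if (m.bucket_count() != buckets) {   // the table grew and was rehashed
            buckets = m.bucket_count();
            ++rehashes;
        }
        // size() / bucket_count() stays at or below max_load_factor().
        assert(m.load_factor() <= m.max_load_factor());
    }
    return rehashes;
}
```

Calling count_rehashes(1000) should report at least one rehash on any implementation, since the table needs at least 1000 buckets to hold 1000 elements at a load factor of 1.0.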

References:

23.2.5/15: The insert and emplace members shall not affect the validity of iterators if (N+n) < z * B, where N is the number of elements in the container prior to the insert operation, n is the number of elements inserted, B is the container’s bucket count, and z is the container’s maximum load factor.

amongst the Effects of the constructor at 23.5.4.2/1: max_load_factor() returns 1.0.
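The iterator guarantee quoted above can be observed with a small experiment. This is only a sketch: the function name is invented here, and reserve(128) is used so that inserting 100 elements stays strictly below z * B on any conforming implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical check: returns true if an iterator obtained before a batch of
// inserts is still usable afterwards, given that no rehash was needed.
bool iterator_survives_inserts() {
    std::unordered_map<int, std::string> m;
    m.reserve(128);  // enough buckets for 128 elements at max_load_factor() 1.0
    const std::size_t buckets = m.bucket_count();

    auto it = m.emplace(0, "first").first;
    for (int i = 1; i < 100; ++i) {
        m.emplace(i, "other");  // N + n = 100 < z * B, so no rehash is allowed
    }

    // With the bucket count unchanged, 'it' must still refer to the same element.
    return m.bucket_count() == buckets && it == m.find(0) && it->second == "first";
}
```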

† To allow optimal iteration without passing over any empty buckets, GCC's implementation fills the buckets with iterators into a single singly-linked list holding all the values: the iterators point to the element immediately before that bucket's elements, so the next pointer there can be rewired if erasing the bucket's last value.
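The layout described in the footnote can be sketched roughly as follows. This is not GCC's code: it is a simplified, fixed-size, insert-only set of ints with invented names (ChainedSet, before_begin_), meant only to show how each bucket can store a pointer to the node before its first element while all elements live in one singly-linked list.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Simplified sketch (hypothetical names); no erase, no rehash.
class ChainedSet {
    struct Node { int value; Node* next; };

    Node before_begin_{0, nullptr};  // sentinel in front of the single list
    std::vector<Node*> buckets_;     // node *before* each bucket's first node, or nullptr

    std::size_t index(int v) const { return std::hash<int>{}(v) % buckets_.size(); }

public:
    explicit ChainedSet(std::size_t bucket_count) : buckets_(bucket_count, nullptr) {
        assert(bucket_count > 0);
    }
    ChainedSet(const ChainedSet&) = delete;  // buckets may point at the sentinel member
    ChainedSet& operator=(const ChainedSet&) = delete;
    ~ChainedSet() {
        for (Node* n = before_begin_.next; n != nullptr;) {
            Node* next = n->next;
            delete n;
            n = next;
        }
    }

    bool contains(int v) const {
        const std::size_t b = index(v);
        Node* prev = buckets_[b];
        if (prev == nullptr) return false;  // empty bucket
        // A bucket's nodes are contiguous in the list, starting right after 'prev'.
        for (Node* n = prev->next; n != nullptr && index(n->value) == b; n = n->next) {
            if (n->value == v) return true;
        }
        return false;
    }

    void insert(int v) {
        if (contains(v)) return;
        const std::size_t b = index(v);
        Node* node = new Node{v, nullptr};
        if (buckets_[b] != nullptr) {
            // Non-empty bucket: link in right after the bucket's "before" node.
            node->next = buckets_[b]->next;
            buckets_[b]->next = node;
        } else {
            // Empty bucket: push at the front of the whole list.
            node->next = before_begin_.next;
            before_begin_.next = node;
            // The old front node's bucket is now preceded by the new node.
            if (node->next != nullptr) buckets_[index(node->next->value)] = node;
            buckets_[b] = &before_begin_;
        }
    }

    // Full iteration walks the list only; empty buckets are never visited.
    std::size_t size() const {
        std::size_t n = 0;
        for (Node* p = before_begin_.next; p != nullptr; p = p->next) ++n;
        return n;
    }
};
```

Storing the preceding node rather than the first node is what would let an erase operation rewire the next pointer without a doubly-linked list; erase is left out here to keep the sketch short.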

Regarding the text you quote:

No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements

There is no "oversight"... what was done was very deliberate and done with full awareness. It's true that other compromises could have been struck, but the open hashing / chaining approach is a reasonable compromise for general use, that copes reasonably elegantly with collisions from mediocre hash functions, isn't too wasteful with small or large key/value types, and handles arbitrarily-many insert/erase pairs without gradually degrading performance the way many closed hashing implementations do.

As evidence of the awareness, from Matthew Austern's proposal here:

I'm not aware of any satisfactory implementation of open addressing in a generic framework. Open addressing presents a number of problems:

• It's necessary to distinguish between a vacant position and an occupied one.

• It's necessary either to restrict the hash table to types with a default constructor, and to construct every array element ahead of time, or else to maintain an array some of whose elements are objects and others of which are raw memory.

• Open addressing makes collision management difficult: if you're inserting an element whose hash code maps to an already-occupied location, you need a policy that tells you where to try next. This is a solved problem, but the best known solutions are complicated.

• Collision management is especially complicated when erasing elements is allowed. (See Knuth for a discussion.) A container class for the standard library ought to allow erasure.

• Collision management schemes for open addressing tend to assume a fixed size array that can hold up to N elements. A container class for the standard library ought to be able to grow as necessary when new elements are inserted, up to the limit of available memory.

Solving these problems could be an interesting research project, but, in the absence of implementation experience in the context of C++, it would be inappropriate to standardize an open-addressing container class.

Specifically for insert-only tables with data small enough to store directly in the buckets, a convenient sentinel value for unused buckets, and a good hash function, a closed hashing approach may be roughly an order of magnitude faster and use dramatically less memory, but that's not general purpose.
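For comparison, a minimal sketch of the special case just described might look like this: an insert-only set whose values sit directly in the bucket array, with a sentinel marking unused slots. Everything here is an assumption made for illustration (the names ProbingSet and kEmptySlot, -1 as the sentinel, linear probing, a fixed power-of-two capacity, and the particular mixing step); it makes no claim about the performance figures above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed sentinel: -1 is never stored as a real key.
constexpr std::int32_t kEmptySlot = -1;

// Insert-only closed-hashing set with linear probing (hypothetical sketch).
class ProbingSet {
    std::vector<std::int32_t> slots_;  // values stored directly in the buckets
    std::size_t size_ = 0;

    // Illustrative mixing step: multiplicative hash, then fold in the high bits.
    static std::size_t mix(std::int32_t key) {
        std::uint32_t h = static_cast<std::uint32_t>(key) * 2654435769u;
        h ^= h >> 16;
        return h;
    }

public:
    explicit ProbingSet(std::size_t capacity_pow2) : slots_(capacity_pow2, kEmptySlot) {
        assert(capacity_pow2 >= 2 && (capacity_pow2 & (capacity_pow2 - 1)) == 0);
    }

    // Returns true if the key was newly inserted, false if already present.
    bool insert(std::int32_t key) {
        assert(key != kEmptySlot);
        const std::size_t mask = slots_.size() - 1;
        for (std::size_t i = mix(key) & mask;; i = (i + 1) & mask) {
            if (slots_[i] == key) return false;
            if (slots_[i] == kEmptySlot) {
                // Always keep one slot empty so every probe sequence terminates.
                assert(size_ + 1 < slots_.size());
                slots_[i] = key;
                ++size_;
                return true;
            }
        }
    }

    bool contains(std::int32_t key) const {
        const std::size_t mask = slots_.size() - 1;
        for (std::size_t i = mix(key) & mask;; i = (i + 1) & mask) {
            if (slots_[i] == key) return true;
            if (slots_[i] == kEmptySlot) return false;  // hit an unused slot: not present
        }
    }

    std::size_t size() const { return size_; }
};
```

Note what the sketch leaves out: there is no erase, no growth, and no support for types lacking a spare sentinel value, which are the same gaps the proposal quoted above lists as obstacles for a general-purpose container.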

A full comparison and elaboration of hash table design options and their implications is off topic for S.O. as it's way too broad to address properly here.

Author by

ralzaul

Sexiest c++ coder alive after bjarne stroustrup.

Updated on July 08, 2022

Comments

  • ralzaul
    ralzaul almost 2 years

    c++ unordered_map collision handling, resize and rehash

    This is a previous question of mine, and I have realised that I am quite confused about how unordered_map is implemented. I am sure many other people share that confusion. Based on what I know without having read the standard:

    Every unordered_map implementation stores a linked list to external nodes in the array of buckets... No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements

    I was hoping that someone might explain the implementation and how it fits with the C++ standard's definition (in terms of performance requirements) and, if it really is not the most efficient way to implement a hash map data structure, how it can be improved.

  • Jerry Coffin
    Jerry Coffin almost 7 years
    "... open hashing / chaining ... handles arbitrarily-many insert/erase pairs without gradually degrading performance the way many closed hashing implementations do." Actually, it does gradually degrade performance. What it avoids is the precipitous performance loss typical of most closed hashing. Either way, the performance degrades from O(1) to O(N), but with closed hashing, that happens from roughly 90% full to 100% full, whereas with open hashing it's typically from (say) 100% full to 1000% full (i.e., you've inserted ten times as many items as there are slots in the table).
  • Jerry Coffin
    Jerry Coffin almost 7 years
    You can also do open hashing with a balanced tree instead of a list for each bucket. In this case, the degradation is from O(1) to O(log N). In this case, even with 10x as many items as slots in the table, there's still only minimal performance degradation (with the proviso that it's generally somewhat slower, even when minimally utilized).
  • Tony Delroy
    Tony Delroy almost 7 years
    @JerryCoffin: true - I've heard that's what (one of?) the key Java implementations does, but it would adversely affect iteration over the full container compared to the singly linked list approach GCC uses (outlined in the answer).
  • Tony Delroy
    Tony Delroy about 6 years
    @JerryCoffin: my reply above addressed your second comment, but I just noticed I never addressed your first. Regarding "Actually, it does gradually degrade...from (say) 100% full to 1000% full" - that doesn't happen due to the max load factor - the table resizes at 100%. But you're not talking about the performance issue with closed hashing that I was alluding to either, which is that when you erase an element you either have to mark the bucket as having been in use and keep searching past it (small lasting cost), or "compact" a collision chain into the bucket (larger up-front cost).
  • underscore_d
    underscore_d over 5 years
    @MohitShah Nonsense. It's an extremely good answer, which reasons clearly and - crucially - cites the original proposal. If you had trouble reading it, that's your problem, not the answer's.