How does C++ STL unordered_map resolve collisions?

c++ stl unordered-map

33,840

Solution 1

The standard defines a little more about this than most people seem to realize.

Specifically, the standard requires (§23.2.5/9):

The elements of an unordered associative container are organized into buckets. Keys with the same hash code appear in the same bucket.

The interface includes a bucket_count that runs in constant time. (table 103). It also includes a bucket_size that has to run in time linear on the size of the bucket.

That's basically describing an implementation that uses collision chaining. When you do use collision chaining, meeting all the requirements is somewhere between easy and trivial. bucket_count() is the number of elements in your array. bucket_size() is the number of elements in the collision chain. Getting them in constant and linear time respectively is simple and straightforward.

By contrast, if you use something like linear probing or double hashing, those requirements become all but impossible to meet. Specifically, all the items that hashed to a specific value need to land in the same bucket, and you need to be able to count those buckets in constant time.

But, if you use something like linear probing or double hashing, finding all the items that hashed to the same value means you need to hash the value, then walk through the "chain" of non-empty items in your table to find how many of those hashed to the same value. That's not linear on the number of items that hashed to the same value though--it's linear on the number of items that hashed to the same or a colliding value.

With enough extra work and a fair amount of stretching the meaning of some of the requirements almost to the breaking point, it might be barely possible to create a hash table using something other than collision chaining, and still at least sort of meet the requirements--but I'm not really certain it's possible, and it would certain involve quite a lot of extra work.

Summary: all practical implementations of std::unordered_set (or unordered_map) undoubtedly use collision chaining. While it might be (just barely) possible to meet the requirements using linear probing or double hashing, such an implementation seems to lose a great deal and gain nearly nothing in return.

Solution 2

I found this answer looking for how to detect when my types are colliding, so I will post this in case that is the intent of the question.:

I believe there's some misconception about "Unique keys No two elements in the container can have equivalent keys."

look at the code below

//pseudocode
std::unordered_map<int, char> hashmap;
hashmap[5] = 'a';
hashmap[5] = 'b'; //replace 'a' with 'b', there is no collision being handled.

I think the Jerry's answer is referring to the internal system that it uses to shrink keys to appropriate array indices.

If you want collisions to be handled for your types (with buckets), you need std::unordered_multimap and will have to iterate over

Hopefully this code can be read without the context I generated it with. it basically checks to see if any element in the bucket associated with the hash is the element I'm looking for.

//sp is std::shared_ptr
//memo is std::unordered_multimap< int, sp<AStarNode> >

//there's probably multiple issues with this code in terms of good design (like using int keys rather than unsigned)

bool AStar_Incremental::hasNodeBeenVisited(sp<AStarNode> node)
{
    using UMIter = std::unordered_multimap<int, sp<AStarNode> >::iterator;

    bool bAlreadyVisited = false;

    //get all values for key in O(1*)
    int hash = WorldGrid::hashGrid(node->location);
    std::pair<UMIter, UMIter> start_end = memo.equal_range(hash); //bucket range
    UMIter start = start_end.first;
    UMIter end = start_end.second;

    //hopefully this is implemented to be O(m) where m is the bucket size.
    for(UMIter bucketIter = start; bucketIter != end; ++bucketIter)
    {
        sp<AStarNode> previousNode = bucketIter->second;
        sf::Vector2i& previousVisit = previousNode->location;
        if (previousVisit == node->location)
        {
            bAlreadyVisited = true;
            break;
        }
    }

    return bAlreadyVisited;
}

33,840

Author by

whiteSkar

Software Engineer Game Developer C++, C#, Python

Updated on March 16, 2020

Comments

whiteSkar about 4 years

How does C++ STL unordered_map resolve collisions?

Looking at the http://www.cplusplus.com/reference/unordered_map/unordered_map/, it says "Unique keys No two elements in the container can have equivalent keys."

That should mean that the container is indeed resolving collisions. However, that page does not tell me how it is doing it. I know some ways to resolve collisions like using linked lists and/or probing. What I want to know is how the c++ STL unordered_map is resolving it.
Tony Delroy over 10 years

"It's still possible to meet the requirement [using linear probing or double hashing]" / "Then you walk through the "chain" of occupied items to find a slot where you can insert this item." isn't compatible with the "Keys with the same hash code appear in the same bucket." requirement. Of course, it's debatable if collision chaining puts anything "in the same bucket" anyway, but you're clearly proposing putting keys with the same hash in different buckets. 23.2.5/9 seems an unfortunate restriction... probing is useful for more than "load factor low" - many benefits from avoiding heap....
Jerry Coffin over 10 years

@TonyD: not really--in this case, a "bucket" is no longer a physical thing, but simply all the slots that hashed to the same value. You'd have to build enough intelligence into local_iterator to skip across any colliding items, but it still looks entirely possible to me.
Tony Delroy over 10 years

If it's that flexible, "Keys with the same hash code appear in the same bucket." has no meaning at all. Agree to disagree etc.. Cheers.
Jerry Coffin over 10 years

@TonyD: Not really. A bucket is the set of items with keys that hashed to a particular value, and which can be iterated with a local_iterator. Given that they're specifically avoiding specifying an implementation, it really shouldn't mean much about the implementation.
Tony Delroy over 10 years

Well, finding a mutually-agreed authoritative definition of "bucket" from e.g. Dijkstra isn't easy with Google, so let's just consider whether your postulated implementation can meet other requirements. Per your rationale for needing count, b.find(k) can't be worst-case O(b.size()) where b is a "logical" bucket size(). Further, b.begin(n), b.end(n) et al are required to have constant complexity, so would need to be stored alongside your count at the originally hashed-to bucket.
rb612 about 6 years

@JerryCoffin great explanation thank you - I'm curious though what you mean by being "linear on the number of items that hashed to the same or a colliding value". Colliding values, by definition, are ones that hash to the same value, right? So what does "or a colliding value" mean?
Jerry Coffin over 5 years

@rb612: Hashing is usually done in two pieces. You start by computing some large, fixed-sized value (e.g., 64 bits). Then you reduce that to the range of indices for your hash table. As I was using the terms, "hashing to the same value" would mean the original hashes were equal, and colliding referred to the reduced range one being equal.
underscore_d over 5 years

I perceive another heavy implication about implementation: Chaining seems inextricably related (by cause or consequence, I don't know; I haven't read any original rationale) with the Standard's guarantee that references stay valid forever, doesn't it? By my understanding, with open addressing, erasing would mean keeping around 'dummy' elements to avoid shifting others (rather than only optionally optimising through use of this like some open addressing maps), and insertion is precluded by not being able to put elements at the right place (at least without rebuilding or over-reserving buckets).
Jerry Coffin over 5 years

@underscore_d: Yeah, that would get...ugly as well. Haven't thought through carefully enough to be sure it's truly impossible, but even at best it would be ugly.
underscore_d over 5 years

Yeah, it might be possible to guarantee reference validity under open addressing, for highly specific contents and/or spans of time. However, it all seems highly contrary to ever being a general-purpose container, so it naturally follows that we must choose one or the other. I see someone else came to the same conclusion as me - and doesn't like it! - here, although I'm not so critical given that my current use for std::unordered_map uses references to avoid a 2nd round of lookups. Then again, if lookup were faster, I wouldn't need that... haha
underscore_d over 5 years

Getting back to how/why the Standard does what it does, this answer quotes the original 2003 proposal, which did consider and ultimately discount open addressing due to technical complexities and a lack of proven implementation experience. I guess things might be different today, at least judging by the number and benchmarks of open addressing implementations I've been reading about.