What is the most efficient std container for non-duplicated items?


Solution 1

Both set and map have O(log N) performance for looking up keys; vector has O(N).

The difference between set and map, as far as you should be concerned, is whether you need to associate a key with a value or just store a value directly. If you need the former, use a map; if you need the latter, use a set.

In both cases, you should just use insert() instead of doing a find() first.

The reason is that insert() will add the value to the container if and only if the container does not already contain that value (in the case of map, if it does not already contain that key). This might look like

Map.insert(std::make_pair(Element, ID));

for a map or

Set.insert(Element);

for a set.

You can consult the return value to determine whether or not an insertion was actually performed.
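
For example, a minimal sketch (the concrete types and the values of Element and ID here are invented for illustration):

    #include <map>
    #include <utility>

    int main() {
        std::map<int, int> Map;
        int Element = 7, ID = 1;  // hypothetical values, for illustration only

        // insert() returns a pair: an iterator to the element with that key,
        // and a bool that is true only if a new element was actually inserted.
        std::pair<std::map<int, int>::iterator, bool> Result =
            Map.insert(std::make_pair(Element, ID));
        if (Result.second) {
            // Element was not present before and has now been added.
        } else {
            // Element already existed; Result.first points at the existing entry.
        }
    }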


If you're using C++11, you have two more choices: std::unordered_map and std::unordered_set. Both offer average O(1) performance for insertions and lookups. However, they also require that the key (or value, in the case of set) be hashable, which means you'll need to specialize std::hash<> for any custom key type. Conversely, std::map and std::set require that your key (or value, in the case of set) support operator<().
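
As a sketch, a specialization for a hypothetical key type might look like this (the Point type is invented for illustration; the XOR-and-shift combiner is just one simple choice):

    #include <cstddef>
    #include <functional>
    #include <unordered_set>

    // Hypothetical user-defined key type.
    struct Point {
        int x, y;
        // Unordered containers also need equality comparison.
        bool operator==(const Point& other) const {
            return x == other.x && y == other.y;
        }
    };

    // Specializing std::hash<> for the custom key type.
    namespace std {
        template <>
        struct hash<Point> {
            std::size_t operator()(const Point& p) const {
                // A simple hash combination; real code may want a better mix.
                return std::hash<int>()(p.x) ^ (std::hash<int>()(p.y) << 1);
            }
        };
    }

    std::unordered_set<Point> Points;  // now compiles and works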

Solution 2

If you're using C++11, you can use std::unordered_set. That gives you O(1) existence checking on average (O(n) in the worst case).

std::set would probably be your second choice with O(lg n).

Basically, std::unordered_set is a hash table and std::set is a tree structure (a red-black tree in every implementation I've ever seen).¹

Depending on how well your hashes distribute and how many items you have, a std::set might actually be faster. If it's truly performance critical, then as always, you'll want to do benchmarking.
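
A rough sketch of such a benchmark (the element count and the sequential-integer workload are arbitrary choices, not part of the original answer; substitute data shaped like your real workload):

    #include <chrono>
    #include <cstdio>
    #include <set>
    #include <unordered_set>

    // Times n insertions into a container, in microseconds.
    template <typename Container>
    long long time_inserts(int n) {
        Container c;
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < n; ++i)
            c.insert(i);  // duplicates would simply be ignored
        auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    }

    int main() {
        const int n = 1000000;
        std::printf("set:           %lld us\n", time_inserts<std::set<int>>(n));
        std::printf("unordered_set: %lld us\n", time_inserts<std::unordered_set<int>>(n));
    }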

¹ Technically speaking, I don't believe either is required to be implemented as a hash table or as a balanced BST. If I remember correctly, the standard just mandates the run-time bounds, not the implementation; it just turns out that those are the only viable implementations that fit the bounds.

Solution 3

You should use a std::set; it is a container designed to hold a single (equivalent) copy of an object and is typically implemented as a self-balancing binary search tree. Therefore, lookup is O(log N), not O(N), in the size of the container.

std::set and std::map often share a large part of their underlying implementation; you should check out your local STL implementation.

Having said all this, complexity is only one measure of performance. You may have better performance using a sorted vector, as it keeps the data contiguous in memory and therefore more likely to hit the caches. Cache locality is a large part of data structure design these days.
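
As a sketch of that idea, a sorted vector kept unique with std::lower_bound gives O(log N) lookup plus contiguous, cache-friendly storage (insert_unique is a made-up helper name, not a standard function):

    #include <algorithm>
    #include <vector>

    // Inserts value only if it is not already present, keeping the vector
    // sorted. Lookup is O(log N); insertion is O(N) because later elements
    // must shift, but the contiguous storage is cache-friendly.
    bool insert_unique(std::vector<int>& v, int value) {
        std::vector<int>::iterator it = std::lower_bound(v.begin(), v.end(), value);
        if (it != v.end() && *it == value)
            return false;  // already present
        v.insert(it, value);
        return true;
    }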

Solution 4

Sounds like you want to use a std::set. Its elements are unique, so you don't need to care about uniqueness when adding elements, and a.find(k) (where a is a std::set and k is a value) is defined as being logarithmic in complexity.
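
A minimal sketch of that usage (the value 42 is a placeholder):

    #include <set>

    int main() {
        std::set<int> a;
        a.insert(42);                          // added
        a.insert(42);                          // ignored; a.size() is still 1
        bool present = a.find(42) != a.end();  // O(log N) lookup
        return present ? 0 : 1;
    }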

Solution 5

If your elements can be hashed in O(1), it is better to use an unordered_map or unordered_set (not a map/set, which are typically implemented as red-black trees and have O(log N) find complexity).

Comments

  • Roy almost 2 years

    What is the most efficient way of adding non-repeated elements into an STL container, and what kind of container is the fastest? I have a large amount of data and I'm afraid each time I try to check if it is a new element or not, it takes a lot of time. I hope map will be very fast.

    // 1- Map: add only if the key is not already present
    map<int, int> Map;
    ...
    if(Map.find(Element)==Map.end()) Map[Element]=ID;

    // 2- Vector: add only if the value is not already present
    vector<int> Vec;
    ...
    if(find(Vec.begin(), Vec.end(), Element)==Vec.end()) Vec.push_back(Element);

    // 3- Set
    // Edit: I made a mistake: set::find is O(LogN) not O(N)
    
  • Alex Chamberlain over 11 years
    There are a lot of issues with using a hash map though, such as the size of your objects and the security of the hash. They are good, but they aren't the one and only solution to every problem.
  • Corbin over 11 years
    @AlexChamberlain Yes, I noted that (though implicitly). Also, the hashing algorithm used for hash maps has nothing to do with security. While collisions are still seen as bad, there are different categories of hash functions. A hash table hash needs to be extremely fast, which rarely coincides with being secure. Also, can you expand on the size comment? I'm not sure I understand that one. There will be extra value_type instances, but the keys have to be stored in a hash map the same way that they're stored in a set.
  • Alex Chamberlain over 11 years
    Security - If a known hash has a known pattern in its collisions and an attacker can encourage your hash map to store them, the buckets will become very large and could cause your program to become super unresponsive. In a server program, for example, this is a security problem and one that Java has suffered (many times?)
  • Alex Chamberlain over 11 years
    Size - if your objects are huge and you are only storing a few of them, then hashing the objects - an O(m) runtime in the size of the object - becomes rather significant and you could benefit from using a std::set which may be able to get away with only comparing part of an object.
  • Corbin over 11 years
    @AlexChamberlain Ah, true on both points. As you said, a less-than comparison can short circuit whereas a hashing operation cannot. Typically the hashing is only done on a relatively small part of an object though (and often fixed in size). As for purposely causing collisions and making huge buckets (and thus long linear time operations), that would be quite an interesting security hole. While very feasible, it seems like that would be pretty rare in the wild. I suppose 'rare' still exists though :).
  • Alex Chamberlain over 11 years
    Unfortunately, it took down several servers based on Java's default hash functions with strings like a, aa, aaa, etc. I'll try to turn up some references tomorrow (UK time).
  • Alex Chamberlain over 11 years
    You also lose cache locality compared to a sorted vector and/or well-designed trees. Again, I'll see if I can find a reference.
  • Corbin over 11 years
    @AlexChamberlain a, aa, aaa, etc. used to hash to the same thing?! That's just terrible design. Hopefully collisions wouldn't be that trivial to find even for a non-security-oriented hashing function. As for cache locality, what it all really comes down to is that the OP needs to benchmark. We could consider details all day on performance, but benchmarking is the only way to truly know (well, without getting waayyy carried away with code analysis).