Which is the fastest STL container for find?

c++ optimization stl find

59,186

Solution 1

For searching a particular value, with std::set and std::map it takes O(log N) time, while with the other two it takes O(N) time; So, std::set or std::map are probably better. Since you have access to C++0x, you could also use std::unordered_set or std::unordered_map which take constant time on average.

For find_if, there's little difference between them, because it takes an arbitrary predicate and containers cannot optimize arbitrarily, of course.

However if you will be calling find_if frequently with a certain predicate, you can optimize yourself: use a std::map or std::set with a custom comparator or special keys and use find instead.

Solution 2

A sorted vector using std::lower_bound can be just as fast as std::set if you're not updating very often; they're both O(log n). It's worth trying both to see which is better for your own situation.

Solution 3

Since from your (extended) requirements you need to search on multiple fields, I would point you to Boost.MultiIndex.

This Boost library lets you build one container (with only one exemplary of each element it contains) and index it over an arbitrary number of indices. It also lets you precise which indices to use.

To determine the kind of index to use, you'll need extensive benchmarks. 500 is a relatively low number of entries, so constant factors won't play nicely. Furthermore, there can be a noticeable difference between single-thread and multi-thread usage (most hash-table implementations can collapse on MT usage because they do not use linear-rehashing, and thus a single thread ends up rehashing the table, blocking all others).

I would recommend a sorted index (skip-list like, if possible) to accomodate range requests (all names beginning by Abc ?) if the performance difference is either unnoticeable or simply does not matter.

Solution 4

If you only want to search for distinct values, one specific column in the table, then std::hash is fastest.

If you want to be able to search using several different predicates, you will need some kind of index structure. It can be implemented by extending your current vector based approach with several hash tables or maps, one for each field to search for, where the value is either an index into the vector, or a direct pointer to the element in the vector.

Going further, if you want to be able to search for ranges, such as all occasions having a date in July you need an ordered data structure, where you can extract a range.

View more solutions

59,186

Author by

AJG85

Software Engineering primarily with C++ is my day job. Fields of experience in this realm are 3d graphics, embedded systems, telecom, GPS, web, multimedia, server/client architecture, multi-threading, OOP, and design patterns. Audio Engineering primarily with drums is my night job. Fields of experience in this realm are recording, mixing, sampling, mastering, live performances, and drum lessons. Everything else falls somewhere between hobby and habit.

Updated on May 11, 2020

Comments

AJG85 almost 4 years

Alright as a preface I have a need to cache a relatively small subset of rarely modified data to avoid querying the database as frequently for performance reasons. This data is heavily used in a read-only sense as it is referenced often by a much larger set of data in other tables.

I've written a class which will have the ability to store basically the entirety of the two tables in question in memory while listening for commit changes in conjunction with a thread safe callback mechanism for updating the cached objects.

My current implementation has two std::vectors one for the elements of each table. The class provides both access to the entirety of each vector as well as convenience methods for searching for a specific element of table data via std::find, std::find_if, etc.

Does anyone know if using std::list, std::set, or std::map over std::vector for searching would be preferable? Most of the time that is what will be requested of these containers after populating once from the database when a new connection is made.

I'm also open to using C++0x features supported by VS2010 or Boost.
Kerrek SB almost 13 years

Even without C++0x you can look for the unordered containers in TR1, i.e. #include <tr1/unordered_map> and std::tr1::unordered_map<K,V>. You really have to test, though, because the asymptotics (O(1) vs O(log n)) don't tell you the size of the constant, and n might have to be titanic before the unordered container is more efficient!
R. Martinho Fernandes almost 13 years

Yes, testing is important! If it isn't a lot of data, the "slower" containers may work better. And when using the unordered ones, the quality of the hash function can also make a difference.
AJG85 almost 13 years

For almost all cases the vectors combined size will be less than 500 elements. They are separate for organization only. I need to be able to search on a name field as well as an id field thus the predicates and std::find_if using lambdas.
AJG85 almost 13 years

I failed to mention I would need to find on multiple values which would require re-sorting by name or by id.
Howard Hinnant almost 13 years

This is by far the best answer to the question actually asked.
Sumudu Fernando almost 13 years

In my experience a sorted vector is noticeably faster than a set (and less memory intensive too); the obvious trade-off being that modifying the stored data is expensive. If you need to look up by different fields, the simplest solution is to have a different sorted vector for each field (probably vectors of pointers to avoid duplicating a bunch of information)
AJG85 almost 13 years

How would the typedef make a difference other than less typing possibly? Template instantiations are resolved at compile time and I use C++0x auto keyword liberally for STL iterators to make code more readable.
AJG85 almost 13 years

The elements are complex so caching of a large amount of elements would be counterproductive it's only useful in this one case as I know the tables have limited data and limited writes. While currently not heavily implemented multi-threading is a real possibility later so I had pretty much ruled hash tables out. There does not need to be a range as there are context based RE2 regex template functions elsewhere for that functionality across table string fields. However I will give Boost.MultiIndex a perusal as I'm unfamiliar but that sounds quite useful.
Rob K over 12 years

Because when you decide you used the wrong container type and you want to change it, you only have to change it in one place.
AJG85 over 12 years

Containers while similar usually have differing methods and usages. For example if I were to change from std::vector to std::set I would need to change all references to push_back and empty at a minimum. So the typedef doesn't buy anything auto isn't already providing.
William almost 12 years

For anyone just coming to this question like I am. I needed to do roughly a million+ lookups on a set of roughly 1 million values. Using an unordered_set it took around 7 seconds to preform all the lookups. It took over 30 min (I stopped it after finding this answer) using a vector. This is why I love stackoverflow! Thank you R. Martinho Fernandes!
serge about 4 years

Yes, it does many advantages, i.e. using as the type of some function parameter or of a class member.
Rob K about 4 years

The typedef (now using would be even better) provides naming, an explanation. It tells you what purpose the container serves, what it is, instead of how it's implemented. std::vector<string> tells me how it's implemented but it doesn't tell me what it's for. employee_names and customer_names might both be std::vector<string> but shows a distinction in code such as function signatures as to what's expected.
Rob K about 4 years

Also, before C++20, you couldn't use auto as function parameter, and in most cases, you still don't want to.
Zebrafish over 2 years

The thing about std::vectors being less memory intensive than a linked list (that has to store indices or pointers to other nodes), is that std::vector can waste a lot of memory themselves. Under a standard allocation model for the vector if it grows to size 257 then it'll next reserve 512 * sizeof('element') bytes of memory, and then all that memory will be wasted, at least the unused portion.
Mark Ransom over 2 years

@Zebrafish wasted memory won't affect the running time. And you're assuming that the vector grows by a factor of 2 every time it increases, but that's implementation dependent; I've heard of 1.5x being used instead to lower the maximum unused memory.