Which is the fastest STL container for find?
Solution 1
For searching a particular value, with std::set
and std::map
it takes O(log N) time, while with the other two it takes O(N) time; So, std::set
or std::map
are probably better. Since you have access to C++0x, you could also use std::unordered_set
or std::unordered_map
which take constant time on average.
For find_if
, there's little difference between them, because it takes an arbitrary predicate and containers cannot optimize arbitrarily, of course.
However if you will be calling find_if
frequently with a certain predicate, you can optimize yourself: use a std::map
or std::set
with a custom comparator or special keys and use find
instead.
Solution 2
A sorted vector using std::lower_bound
can be just as fast as std::set
if you're not updating very often; they're both O(log n). It's worth trying both to see which is better for your own situation.
Solution 3
Since from your (extended) requirements you need to search on multiple fields, I would point you to Boost.MultiIndex.
This Boost library lets you build one container (with only one exemplary of each element it contains) and index it over an arbitrary number of indices. It also lets you precise which indices to use.
To determine the kind of index to use, you'll need extensive benchmarks. 500
is a relatively low number of entries, so constant factors won't play nicely. Furthermore, there can be a noticeable difference between single-thread and multi-thread usage (most hash-table implementations can collapse on MT usage because they do not use linear-rehashing, and thus a single thread ends up rehashing the table, blocking all others).
I would recommend a sorted index (skip-list like, if possible) to accomodate range requests (all names beginning by Abc
?) if the performance difference is either unnoticeable or simply does not matter.
Solution 4
If you only want to search for distinct values, one specific column in the table, then std::hash
is fastest.
If you want to be able to search using several different predicates, you will need some kind of index structure. It can be implemented by extending your current vector based approach with several hash tables or maps, one for each field to search for, where the value is either an index into the vector, or a direct pointer to the element in the vector.
Going further, if you want to be able to search for ranges, such as all occasions having a date in July you need an ordered data structure, where you can extract a range.
AJG85
Software Engineering primarily with C++ is my day job. Fields of experience in this realm are 3d graphics, embedded systems, telecom, GPS, web, multimedia, server/client architecture, multi-threading, OOP, and design patterns. Audio Engineering primarily with drums is my night job. Fields of experience in this realm are recording, mixing, sampling, mastering, live performances, and drum lessons. Everything else falls somewhere between hobby and habit.
Updated on May 11, 2020Comments
-
AJG85 almost 4 years
Alright as a preface I have a need to cache a relatively small subset of rarely modified data to avoid querying the database as frequently for performance reasons. This data is heavily used in a read-only sense as it is referenced often by a much larger set of data in other tables.
I've written a class which will have the ability to store basically the entirety of the two tables in question in memory while listening for commit changes in conjunction with a thread safe callback mechanism for updating the cached objects.
My current implementation has two
std::vectors
one for the elements of each table. The class provides both access to the entirety of each vector as well as convenience methods for searching for a specific element of table data viastd::find
,std::find_if
, etc.Does anyone know if using
std::list
,std::set
, orstd::map
overstd::vector
for searching would be preferable? Most of the time that is what will be requested of these containers after populating once from the database when a new connection is made.I'm also open to using C++0x features supported by VS2010 or Boost.
-
Kerrek SB almost 13 yearsEven without C++0x you can look for the unordered containers in TR1, i.e.
#include <tr1/unordered_map>
andstd::tr1::unordered_map<K,V>
. You really have to test, though, because the asymptotics (O(1) vs O(log n)) don't tell you the size of the constant, and n might have to be titanic before the unordered container is more efficient! -
R. Martinho Fernandes almost 13 yearsYes, testing is important! If it isn't a lot of data, the "slower" containers may work better. And when using the unordered ones, the quality of the hash function can also make a difference.
-
AJG85 almost 13 yearsFor almost all cases the vectors combined size will be less than 500 elements. They are separate for organization only. I need to be able to search on a name field as well as an id field thus the predicates and
std::find_if
using lambdas. -
AJG85 almost 13 yearsI failed to mention I would need to find on multiple values which would require re-sorting by name or by id.
-
Howard Hinnant almost 13 yearsThis is by far the best answer to the question actually asked.
-
Sumudu Fernando almost 13 yearsIn my experience a sorted vector is noticeably faster than a set (and less memory intensive too); the obvious trade-off being that modifying the stored data is expensive. If you need to look up by different fields, the simplest solution is to have a different sorted vector for each field (probably vectors of pointers to avoid duplicating a bunch of information)
-
AJG85 almost 13 yearsHow would the typedef make a difference other than less typing possibly? Template instantiations are resolved at compile time and I use C++0x
auto
keyword liberally for STL iterators to make code more readable. -
AJG85 almost 13 yearsThe elements are complex so caching of a large amount of elements would be counterproductive it's only useful in this one case as I know the tables have limited data and limited writes. While currently not heavily implemented multi-threading is a real possibility later so I had pretty much ruled hash tables out. There does not need to be a range as there are context based RE2 regex template functions elsewhere for that functionality across table string fields. However I will give Boost.MultiIndex a perusal as I'm unfamiliar but that sounds quite useful.
-
Rob K over 12 yearsBecause when you decide you used the wrong container type and you want to change it, you only have to change it in one place.
-
AJG85 over 12 yearsContainers while similar usually have differing methods and usages. For example if I were to change from
std::vector
tostd::set
I would need to change all references topush_back
andempty
at a minimum. So thetypedef
doesn't buy anythingauto
isn't already providing. -
William almost 12 yearsFor anyone just coming to this question like I am. I needed to do roughly a million+ lookups on a set of roughly 1 million values. Using an unordered_set it took around 7 seconds to preform all the lookups. It took over 30 min (I stopped it after finding this answer) using a vector. This is why I love stackoverflow! Thank you R. Martinho Fernandes!
-
serge about 4 yearsYes, it does many advantages, i.e. using as the type of some function parameter or of a class member.
-
Rob K about 4 yearsThe
typedef
(nowusing
would be even better) provides naming, an explanation. It tells you what purpose the container serves, what it is, instead of how it's implemented.std::vector<string>
tells me how it's implemented but it doesn't tell me what it's for.employee_names
andcustomer_names
might both bestd::vector<string>
but shows a distinction in code such as function signatures as to what's expected. -
Rob K about 4 yearsAlso, before C++20, you couldn't use
auto
as function parameter, and in most cases, you still don't want to. -
Zebrafish over 2 yearsThe thing about std::vectors being less memory intensive than a linked list (that has to store indices or pointers to other nodes), is that std::vector can waste a lot of memory themselves. Under a standard allocation model for the vector if it grows to size 257 then it'll next reserve 512 * sizeof('element') bytes of memory, and then all that memory will be wasted, at least the unused portion.
-
Mark Ransom over 2 years@Zebrafish wasted memory won't affect the running time. And you're assuming that the vector grows by a factor of 2 every time it increases, but that's implementation dependent; I've heard of 1.5x being used instead to lower the maximum unused memory.