What's the most efficient way to erase duplicates and sort a vector?
Solution 1
I agree with R. Pate and Todd Gardner; a std::set might be a good idea here. Even if you're stuck using vectors, if you have enough duplicates, you might be better off creating a set to do the dirty work.
Let's compare three approaches:
Just using vector, sort + unique
sort( vec.begin(), vec.end() );
vec.erase( unique( vec.begin(), vec.end() ), vec.end() );
Convert to set (manually)
set<int> s;
unsigned size = vec.size();
for( unsigned i = 0; i < size; ++i ) s.insert( vec[i] );
vec.assign( s.begin(), s.end() );
Convert to set (using a constructor)
set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );
Here's how these perform as the number of duplicates changes:
Summary: when the number of duplicates is large enough, it's actually faster to convert to a set and then dump the data back into a vector.
And for some reason, doing the set conversion manually seems to be faster than using the set constructor -- at least on the toy random data that I used.
Solution 2
I redid Nate Kohl's profiling and got different results. For my test case, directly sorting the vector is always more efficient than using a set. I added a new, more efficient method: using an unordered_set.
Keep in mind that the unordered_set method only works if you have a good hash function for the type you need uniqued and sorted. For ints, this is easy! (The standard library provides a default hash which is simply the identity function.) Also, don't forget to sort at the end, since unordered_set is, well, unordered :)
I did some digging inside the set and unordered_set implementations and discovered that the constructor actually constructs a new node for every element before checking its value to determine if it should actually be inserted (in the Visual Studio implementation, at least).
Here are the 5 methods:
f1: Just using vector, sort + unique
sort( vec.begin(), vec.end() );
vec.erase( unique( vec.begin(), vec.end() ), vec.end() );
f2: Convert to set (using a constructor)
set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );
f3: Convert to set (manually)
set<int> s;
for (int i : vec)
s.insert(i);
vec.assign( s.begin(), s.end() );
f4: Convert to unordered_set (using a constructor)
unordered_set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );
sort( vec.begin(), vec.end() );
f5: Convert to unordered_set (manually)
unordered_set<int> s;
for (int i : vec)
s.insert(i);
vec.assign( s.begin(), s.end() );
sort( vec.begin(), vec.end() );
I did the test with a vector of 100,000,000 ints chosen randomly in the ranges [1,10], [1,1000], and [1,100000].
The results (in seconds, smaller is better):
range        f1        f2        f3        f4       f5
[1,10]       1.6821    7.6804    2.8232    6.2634   0.7980
[1,1000]     5.0773   13.3658    8.2235    7.6884   1.9861
[1,100000]   8.7955   32.1148   26.5485   13.3278   3.9822
Solution 3
std::unique only removes duplicate elements if they're neighbours: you have to sort the vector first before it will work as you intend.
std::unique is defined to be stable, so the vector will still be sorted after running unique on it.
Solution 4
I'm not sure what you are using this for, so I can't say this with 100% certainty, but normally when I think "sorted, unique" container, I think of a std::set. It might be a better fit for your use case:
std::set<Foo> foos(vec.begin(), vec.end()); // both sorted & unique already
Otherwise, sorting prior to calling unique (as the other answers pointed out) is the way to go.
Solution 5
std::unique only works on consecutive runs of duplicate elements, so you'd better sort first. However, it is stable, so your vector will remain sorted.
Kyle Ryan
Senior undergraduate in Computer Science at Washington State University.
Updated on December 06, 2021
Comments
-
Kyle Ryan over 2 years
I need to take a C++ vector with potentially a lot of elements, erase duplicates, and sort it.
I currently have the below code, but it doesn't work.
vec.erase( std::unique(vec.begin(), vec.end()), vec.end());
std::sort(vec.begin(), vec.end());
How can I correctly do this?
Additionally, is it faster to erase the duplicates first (similar to the code above) or to perform the sort first? If I do perform the sort first, is it guaranteed to remain sorted after std::unique is executed? Or is there another (perhaps more efficient) way to do all this?
-
Admin almost 15 years
Does unique require a sorted container, or does it simply rearrange the input sequence so it contains no adjacent duplicates? I thought the latter.
-
notnoop almost 15 years
Well to the point! std::set is specified to be a sorted unique set. Most implementations use an efficient ordered binary tree or something similar.
-
Tom almost 15 years
+1 Thought of set as well. Didn't want to duplicate this answer
-
Bill Lynch almost 15 years
@Pate, you're correct. It doesn't require one. It removes adjacent duplicates.
-
Logan Capaldo almost 15 years
Why convert back to a vector?
-
Faisal Vali almost 15 years
+1 Templatize away! - but you can just write removeDuplicates(vec), without explicitly specifying the template arguments
-
MadCoder almost 15 years
Is std::set guaranteed to be sorted? It makes sense that in practice it would be, but does the standard require it?
-
Todd Gardner almost 15 years
Yup, see 23.1.4.9: "The fundamental property of iterators of associative containers is that they iterate through the containers in the non-descending order of keys where non-descending is defined by the comparison that was used to construct them"
-
bala.s almost 15 years
I'm shocked that the constructor approach is consistently measurably worse than manual. You would think that apart from some tiny constant overhead, it would just do the manual thing. Can anyone explain this?
-
newacct almost 15 years
@MadCoder: It doesn't necessarily "make sense" that a set is implemented in a way that is sorted. There are also sets implemented using hash tables, which are not sorted. In fact, most people prefer using hash tables when available. But the naming convention in C++ just so happens that the sorted associative containers are named simply "set" / "map" (analogous to TreeSet / TreeMap in Java); and the hashed associative containers, which were left out of the standard, are called "hash_set" / "hash_map" (SGI STL) or "unordered_set" / "unordered_map" (TR1) (analogous to HashSet and HashMap in Java)
-
Max Lybbert almost 15 years
If you have a container that may have duplicates, and you want a container that doesn't have any duplicated values anywhere in the container, then you must first sort the container, then pass it to unique, and then use erase to actually remove the duplicates. If you simply want to remove adjacent duplicates, then you won't have to sort the container. But you will end up with duplicated values: 1 2 2 3 2 4 2 5 2 will be changed to 1 2 3 2 4 2 5 2 if passed to unique without sorting, 1 2 3 4 5 if sorted, passed to unique and erase.
-
Nate Kohl almost 15 years
@Dan: Sure, I think resize would work. It didn't seem any faster on a little test data, but a more thorough experiment might prove otherwise.
-
Kyle Ryan almost 15 years
Or even better, just have it take templated iterators directly (begin and end), and you can run it on other structures besides a vector.
-
Kyle Ryan almost 15 years
Cool, thanks for the graph. Could you give a sense of what the units are for Number of Duplicates? (i.e., around how large is "large enough")?
-
Nate Kohl almost 15 years
@Kyle: It's pretty large. I used datasets of 1,000,000 randomly drawn integers between 1 and 1000, 100, and 10 for this graph.
-
kccqzy about 11 years
I don't understand the rationale here. So if you have a container of pointers, and you want to remove duplicates, how does that affect the objects pointed to by the pointers? No memory leaks would happen because there is at least one pointer (and exactly one in this container) that points to them. Well, well, I guess your method might have some merit with some strange overloaded operators or strange comparison functions that do require special consideration.
-
joe about 11 years
Not sure if I understand your point. Take a simple case of a vector<int*>, where the 4 pointers point to integers {1, 2, 2, 3}. It's sorted, but after you call std::unique, the 4 pointers are pointers to integers {1, 2, 3, 3}. Now you have two identical pointers to 3, so if you call delete, it does a duplicate delete. BAD! Second, notice that the 2nd 2 is missing, a memory leak.
-
joe about 11 years
kccqzy, here's the example program for you to understand my answer better:
-
joe about 11 years
PS: I also ran "valgrind ./Main10", and valgrind found no problems. I HIGHLY recommend all C++ programmers using LINUX to use this very productive tool, especially if you are writing real-time applications that must run 24x7, and never leak or crash!
-
joe about 11 years
The heart of the problem with std::unique can be summed up by this statement: "std::unique returns duplicates in unspecified state"! Why the standards committee did this, I'll never know. Committee members... any comments?
-
Viktor Dahl about 11 years
I agree with @Ari. What compiler was used for this?
-
Nate Kohl about 11 years
@ViktorDahl: it's been a while, but I'm guessing it was some version of gcc on cygwin. I just tried again and saw a similar result with clang-3.2, but not with gcc-4.8, so it could very well be a compiler peculiarity.
-
Viktor Dahl about 11 years
@NateKohl Thank you for your reply. I found it quite surprising that the constructor method was slower. Even more surprising is that it still is in 2013.
-
Vitali almost 11 years
I found that for 5000 elements (on my machine), inserting into an unordered_set, then copying to a vector & sorting the vector is faster than using set.
-
Yang Chi over 10 years
I have a question about this statement: "vec.erase( unique( vec.begin(), vec.end() ), vec.end() );". Does the correctness of this one rely on the order of function argument evaluation? What if a compiler decides to evaluate vec.end() first then the unique() function?
-
Nate Kohl over 10 years
@YangChi: argument evaluation order doesn't matter because std::unique only moves elements around without changing the physical container size. The value returned by std::unique is the new logical end of the container, but the container will still have (unspecified) values up to vec.end().
-
Yang Chi over 10 years
@NateKohl Right. I just realized the mistake I made. So obvious. :) Thank you!
-
YoungJohn about 10 years
Perhaps resize the vector after clearing it so that there is only 1 memory allocation when building the vector. Maybe prefer std::move instead of std::copy to move the ints into the vector instead of copying them, since the set will not be needed later.
-
alexk7 almost 10 years
Yes, "std::unique returns duplicates in unspecified state". So, simply don't rely on an array that has been "uniqued" to manually manage memory! The simplest way to do this is to use std::unique_ptr instead of raw pointers.
-
davidnr almost 10 years
I think your results are wrong. In my tests, the more duplicated elements there are, the faster the vector (comparatively) is; it actually scales the other way around. Did you compile with optimizations on and runtime checks off? On my side vector is always faster, up to 100x depending on the number of duplicates. VS2013, cl /Ox -D_SECURE_SCL=0.
-
Nate Kohl almost 10 years
@davidnr: it's been a while, so I don't have the original setup around anymore. I seem to remember that a set-based approach only became better when the number of duplicates was fairly high -- have you tried an extreme case with e.g. almost all duplicates?
-
davidnr almost 10 years
Yes, it is consistently faster (I tried from rand()%2 to rand()%32000), except in the case where the size of the object is large. I've been thinking, and the allocator you're using may improve performance on the set<> tests. Anyway, set is faster when using large objects instead of int (bigger than 200 bytes in my tests). Also checked std::binary_search(), and it is faster on a sorted vector than set::find().
-
MFH over 9 years
@joe: Even if after std::unique you had [1, 2, 3, 2] you can't call delete on 2, as that would leave a dangling pointer to 2! => Simply don't call delete on the elements between newEnd = std::unique and std::end, as you still have pointers to these elements in [std::begin, newEnd)!
-
qqqqq over 9 years
@Todd Can anyone please confirm or unconfirm the following statement? If you want stable and are happy with n.log(n), go for std::set (or, from scratch, a red-black tree implementation). If you are OK with unstable and happy with log(n), go for std::unordered_set (or, from scratch, a hash-table implementation). Note that n.log(n) and log(n) are related to computation in my statement, and I assume plenty of space.
-
Todd Gardner over 9 years
@qqqqq 'Stable' and 'unstable' aren't relevant with unique sets. The difference between set and unordered_set is the sorting; if you need to access the elements in sorted order, then use set. If sorting is unimportant to you, use unordered_set (or for custom types your decision might be driven by which operators are implemented). If you need average low-latency lookup AND sorted traversal, use both. I'm not sure what log(n) you mean for the second one. Constructing a set is n.log(n), but an unordered_set would be on average n. Also, construction is generally less relevant than find/insert/delete.
-
Arne Vogel over 8 years
vector<T*> with owning pointers is usually an anti-pattern in C++11. unique works fine with vector<unique_ptr<T>>.
-
QuantumKarl over 8 years
Hells yeah, templates! quick fix for small lists, full STL style. +1 thx
-
Giovanni Funchal over 8 years
Link to the image is broken, does anyone have a new one?
-
Nate Kohl over 8 years
@GiovanniFunchal: seems fine now, but I moved the image over to imgur.com just for safety's sake.
-
erip over 8 years
Is there a reason in the first experiment that the vector is first sorted, then made unique? It seems like there might be some performance to be gained if it were made unique first then sorted.
-
BartoszKP almost 8 years
Description of the x-axis seems to be missing.
-
Changming Sun almost 8 years
For integers, you can use radix sort, which is much faster than std::sort.
-
Ben Voigt over 7 years
@ArneVogel: For trivial values of "works fine", perhaps. It's rather pointless to call unique on a vector<unique_ptr<T>>, as the only duplicated value such a vector can contain is nullptr.
-
Toby Speight over 7 years
@Kyle - only on other containers that have an erase() method; otherwise you have to return the new end iterator and have the calling code truncate the container.
-
Toby Speight over 7 years
This appears to be a response to a different answer; it does not answer the question (in which the vector contains integers, not pointers, and does not specify a comparator).
-
Arne Vogel almost 7 years
@BenVoigt: Overload operator== or pass a BinaryPredicate. The default behavior would be pointless indeed.
-
Davmrtl almost 7 years
Quick tip: to use sort or unique, you have to #include <algorithm>.
-
Ilya Popov about 6 years
cplusplus.com is not in any way official documentation.
-
sandthorn over 5 years
@ChangmingSun I wonder why the optimizer seemed to fail on f4? The numbers are dramatically different to f5. It doesn't make any sense to me.
-
alexk7 over 5 years
@sandthorn As explained in my answer, the implementation builds a node (including dynamic allocation) for each element from the input sequence, which is wasteful for every value that ends up being a duplicate. There is no way the optimizer could know it could skip that.
-
sandthorn over 5 years
Ah, that reminds me of one of Scott Meyers's talks about the CWUK scenario, which can probabilistically slow down the emplace kind of construction.
-
galactica about 5 years
Interesting again that the manual conversion f5 runs much faster than using a constructor f4!
-
gast128 over 4 years
Perhaps use unordered_set instead of set (and boost::remove_erase_if if available)
-
Das_Geek over 4 years
Welcome to StackOverflow! Please edit your question to add an explanation of how your code works, and why it's equivalent or better than the other answers. This question is more than ten years old, and already has many good, well-explained answers. Without an explanation on yours, it's not as useful and has a good chance of being downvoted or removed.
-
roberto over 4 years
For a tuple-indexed vector, solution #1 is the fastest.
-
Jakub Klinkovský about 4 years
This is not the same benchmark as in the other answer, because the number of duplicates is almost surely different. Unfortunately, neither answer actually says how many duplicates there are in the random data used for the benchmarks...
-
v.oddou about 4 years
Contiguous duplicates. OK, so it needs a std::sort first.
-
Fureeish over 3 years
No actions in C++20, unfortunately.
-
Gaspa79 over 3 years
This would be amazing info if we had the number of elements. Whenever I see this I think of imgs.xkcd.com/comics/convincing.png
-
johv about 3 years
@erip You have to sort before, since unique assumes the container is already sorted. (Or rather, it only removes adjacent duplicates!)
-
underscore_d about 2 years
Not sure this is valid. std::ranges::remove_if has pred constrained to std::indirect_binary_predicate, subsuming std::predicate, whose component regular_invocable "requires [it...] be equality-preserving" - and depending on state like values makes it not equality-preserving. I would presume the same applies to the old remove_if as you use.