What's the most efficient way to erase duplicates and sort a vector?


Solution 1

I agree with R. Pate and Todd Gardner; a std::set might be a good idea here. Even if you're stuck using vectors, if you have enough duplicates, you might be better off creating a set to do the dirty work.

Let's compare three approaches:

Just using vector, sort + unique

sort( vec.begin(), vec.end() );
vec.erase( unique( vec.begin(), vec.end() ), vec.end() );

Convert to set (manually)

set<int> s;
unsigned size = vec.size();
for( unsigned i = 0; i < size; ++i ) s.insert( vec[i] );
vec.assign( s.begin(), s.end() );

Convert to set (using a constructor)

set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );

Here's how these perform as the number of duplicates changes:

[Graph: comparison of the vector and set approaches as the number of duplicates increases]

Summary: when the number of duplicates is large enough, it's actually faster to convert to a set and then dump the data back into a vector.

And for some reason, doing the set conversion manually seems to be faster than using the set constructor -- at least on the toy random data that I used.

Solution 2

I redid Nate Kohl's profiling and got different results. For my test case, directly sorting the vector is always more efficient than using a set. I also added a new, more efficient method: using an unordered_set.

Keep in mind that the unordered_set method only works if you have a good hash function for the type you need uniqued and sorted. For ints, this is easy! (The standard library provides a default hash which is simply the identity function.) Also, don't forget to sort at the end since unordered_set is, well, unordered :)

I did some digging inside the set and unordered_set implementations and discovered that the constructor actually constructs a new node for every element before checking its value to determine whether it should actually be inserted (in the Visual Studio implementation, at least).

Here are the 5 methods:

f1: Just using vector, sort + unique

sort( vec.begin(), vec.end() );
vec.erase( unique( vec.begin(), vec.end() ), vec.end() );

f2: Convert to set (using a constructor)

set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );

f3: Convert to set (manually)

set<int> s;
for (int i : vec)
    s.insert(i);
vec.assign( s.begin(), s.end() );

f4: Convert to unordered_set (using a constructor)

unordered_set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );
sort( vec.begin(), vec.end() );

f5: Convert to unordered_set (manually)

unordered_set<int> s;
for (int i : vec)
    s.insert(i);
vec.assign( s.begin(), s.end() );
sort( vec.begin(), vec.end() );

I did the test with a vector of 100,000,000 ints chosen randomly in the ranges [1,10], [1,1000], and [1,100000].

The results (in seconds, smaller is better):

range         f1       f2       f3       f4      f5
[1,10]      1.6821   7.6804   2.8232   6.2634  0.7980
[1,1000]    5.0773  13.3658   8.2235   7.6884  1.9861
[1,100000]  8.7955  32.1148  26.5485  13.3278  3.9822

Solution 3

std::unique only removes duplicate elements if they're neighbours: you have to sort the vector first before it will work as you intend.

std::unique is defined to be stable, so the vector will still be sorted after running unique on it.

Solution 4

I'm not sure what you are using this for, so I can't say this with 100% certainty, but normally when I think "sorted, unique" container, I think of a std::set. It might be a better fit for your use case:

std::set<Foo> foos(vec.begin(), vec.end()); // both sorted & unique already

Otherwise, sorting prior to calling unique (as the other answers pointed out) is the way to go.

Solution 5

std::unique only works on consecutive runs of duplicate elements, so you'd better sort first. However, it is stable, so your vector will remain sorted.

Author: Kyle Ryan

Senior undergraduate in Computer Science at Washington State University.

Updated on December 06, 2021

Comments

  • Kyle Ryan
    Kyle Ryan over 2 years

    I need to take a C++ vector with potentially a lot of elements, erase duplicates, and sort it.

    I currently have the below code, but it doesn't work.

    vec.erase(
          std::unique(vec.begin(), vec.end()),
          vec.end());
    std::sort(vec.begin(), vec.end());
    

    How can I correctly do this?

    Additionally, is it faster to erase the duplicates first (similar to coded above) or perform the sort first? If I do perform the sort first, is it guaranteed to remain sorted after std::unique is executed?

    Or is there another (perhaps more efficient) way to do all this?

  • Admin
    Admin almost 15 years
    Does unique require a sorted container, or does it simply only rearrange the input sequence so it contains no adjacent duplicates? I thought the latter.
  • notnoop
    notnoop almost 15 years
    Well to the point! std::set is specified to be a sorted unique set. Most implementations use an efficient ordered binary tree or something similar.
  • Tom
    Tom almost 15 years
+1 Thought of set as well. Didn't want to duplicate this answer
  • Bill Lynch
    Bill Lynch almost 15 years
    @Pate, you're correct. It doesn't require one. It removes adjacent duplicates.
  • Logan Capaldo
    Logan Capaldo almost 15 years
    Why convert back to a vector?
  • Faisal Vali
    Faisal Vali almost 15 years
    +1 Templatize away! - but you can just write removeDuplicates(vec), without explicitly specifying the template arguments
  • MadCoder
    MadCoder almost 15 years
    Is std::set guaranteed to be sorted? It makes sense that in practice it would be, but does the standard require it?
  • Todd Gardner
    Todd Gardner almost 15 years
    Yup, see 23.1.4.9 "The fundamental property of iterators of associative containers is that they iterate through the containers in the non-descending order of keys where non-descending is defined by the comparison that was used to construct them"
  • bala.s
    bala.s almost 15 years
I'm shocked that the constructor approach is consistently measurably worse than manual. You would think that apart from some tiny constant overhead, it would just do the manual thing. Can anyone explain this?
  • newacct
    newacct almost 15 years
    @MadCoder: It doesn't necessarily "make sense" that a set is implemented in a way that is sorted. There are also sets implemented using hash tables, which are not sorted. In fact, most people prefer using hash tables when available. But the naming convention in C++ just so happens that the sorted associative containers are named simply "set" / "map" (analogous to TreeSet / TreeMap in Java); and the hashed associative containers, which were left out of the standard, are called "hash_set" / "hash_map" (SGI STL) or "unordered_set" / "unordered_map" (TR1) (analogous to HashSet and HashMap in Java)
  • Max Lybbert
    Max Lybbert almost 15 years
If you have a container that may have duplicates, and you want a container that doesn't have any duplicated values anywhere, then you must first sort the container, then pass it to unique, and then use erase to actually remove the duplicates. If you simply want to remove adjacent duplicates, then you won't have to sort the container, but you will end up with duplicated values: 1 2 2 3 2 4 2 5 2 will be changed to 1 2 3 2 4 2 5 2 if passed to unique without sorting, or to 1 2 3 4 5 if sorted, passed to unique, and erased.
  • Nate Kohl
    Nate Kohl almost 15 years
    @Dan: Sure, I think resize would work. It didn't seem any faster on a little test data, but a more thorough experiment might prove otherwise.
  • Kyle Ryan
    Kyle Ryan almost 15 years
    Or even better, just have it take templated iterators directly (begin and end), and you can run it on other structures besides a vector.
  • Kyle Ryan
    Kyle Ryan almost 15 years
    Cool, thanks for the graph. Could you give a sense of what the units are for Number of Duplicates? (ie, around how large is "large enough")?
  • Nate Kohl
    Nate Kohl almost 15 years
    @Kyle: It's pretty large. I used datasets of 1,000,000 randomly drawn integers between 1 and 1000, 100, and 10 for this graph.
  • kccqzy
    kccqzy about 11 years
    I don't understand the rationale here. So if you have a container of pointers, and you want to remove duplicates, how does that affect the objects pointed to by the pointers? No memory leaks would happen because there is at least one pointer (and exactly one in this container) that points to them. Well, well, I guess your method might have some merit with some strange overloaded operators or strange comparison functions that do require special consideration.
  • joe
    joe about 11 years
Not sure if I understand your point. Take a simple case of a vector<int*>, where the 4 pointers point to integers {1, 2, 2, 3}. It's sorted, but after you call std::unique, the 4 pointers are pointers to integers {1, 2, 3, 3}. Now you have two identical pointers to 3, so if you call delete, it does a duplicate delete. BAD! Second, notice that the 2nd 2 is missing, a memory leak.
  • joe
    joe about 11 years
kccqzy, here's the example program for you to understand my answer better:
  • joe
    joe about 11 years
PS: I also ran "valgrind ./Main10", and valgrind found no problems. I HIGHLY recommend all C++ programmers using Linux to use this very productive tool, especially if you are writing real-time applications that must run 24x7, and never leak or crash!
  • joe
    joe about 11 years
The heart of the problem with std::unique can be summed up by this statement: "std::unique returns duplicates in unspecified state"! Why the standards committee did this, I'll never know. Committee members... any comments?
  • Viktor Dahl
    Viktor Dahl about 11 years
    I agree with @Ari. What compiler was used for this?
  • Nate Kohl
    Nate Kohl about 11 years
@ViktorDahl: it's been a while, but I'm guessing it was some version of gcc on cygwin. I just tried again and saw a similar result with clang-3.2, but not with gcc-4.8, so it could very well be a compiler peculiarity.
  • Viktor Dahl
    Viktor Dahl about 11 years
@NateKohl Thank you for your reply. I found it quite surprising that the constructor method was slower. Even more surprising is that it still is in 2013.
  • Vitali
    Vitali almost 11 years
    I found that for 5000 elements (on my machine), inserting into an unordered_set, then copying to a vector & sorting the vector is faster than using set.
  • Yang Chi
    Yang Chi over 10 years
    I have a question about this statement: "vec.erase( unique( vec.begin(), vec.end() ), vec.end() );". Does the correctness of this one rely on the order of function argument evaluation? What if a compiler decides to evaluate vec.end() first then the unique() function?
  • Nate Kohl
    Nate Kohl over 10 years
    @YangChi: argument evaluation order doesn't matter because std::unique only moves elements around without changing the physical container size. The value returned by std::unique is the new logical end of the container, but the container will still have (unspecified) values up to vec.end().
  • Yang Chi
    Yang Chi over 10 years
    @NateKohl Right. I just realize the mistake I made. So obvious. :) Thank you!
  • YoungJohn
    YoungJohn about 10 years
    perhaps resize the vector after clearing it so that there is only 1 memory allocation when building the vector. Maybe prefer std::move instead of std::copy to move the ints into the vector instead of copying them since the set will not be needed later.
  • alexk7
    alexk7 almost 10 years
    Yes, "std::unique returns duplicates in unspecified state". So, simply don't rely on a array that has been "uniqued" to manually manage memory! The simplest way to do this is to use std::unique_ptr instead of raw pointers.
  • davidnr
    davidnr almost 10 years
I think your results are wrong. In my tests, the more duplicated elements there are, the faster the vector is (comparatively); it actually scales the other way around. Did you compile with optimizations on and runtime checks off? On my side vector is always faster, up to 100x depending on the number of duplicates. VS2013, cl /Ox -D_SECURE_SCL=0.
  • Nate Kohl
    Nate Kohl almost 10 years
    @davidnr: it's been a while, so I don't have the original setup around anymore. I seem to remember that a set-based approach only became better when the number of duplicates was fairly high -- have you tried an extreme case with e.g. almost all duplicates?
  • davidnr
    davidnr almost 10 years
    Yes, it is consistently faster, (I tried from rand()%2 to rand()%32000). Except in the case where the size of the object is large. I've been thinking, and the allocator you're using may improve performance on the set<> tests. Anyway set is faster when using large objects instead of int (bigger than 200 bytes in my tests). Also checked std::binary_search() and is faster on sorted vector than set::find().
  • MFH
    MFH over 9 years
    @joe: Even if after std::unique you had [1, 2, 3, 2] you can't call delete on 2 as that would leave a dangling pointer to 2! => Simply don't call delete on the elements between newEnd = std::unique and std::end as you still have pointers to these elements in [std::begin, newEnd)!
  • qqqqq
    qqqqq over 9 years
@Todd Can anyone please confirm or disconfirm the following statement? If you want stable and are happy with n.log(n), go for std::set (or, for from-scratch code, a red-black tree implementation). If you are OK with unstable and happy with log(n), go for std::unordered_set (or, for from-scratch code, a hash-table implementation). Note that n.log(n) and log(n) refer to computation in my statement, and I assume plenty of space.
  • Todd Gardner
    Todd Gardner over 9 years
@qqqqq 'Stable' and 'unstable' aren't relevant with unique sets. The difference between set and unordered_set is the sorting; if you need to access the elements in sorted order, then use set. If sorting is unimportant to you, use unordered_set (or for custom types your decision might be driven by which operators are implemented). If you need average low-latency lookup AND sorted traversal, use both. I'm not sure what log(n) you mean for the second one. Constructing a set is n.log(n), but an unordered_set would be on average n. Also, construction is generally less relevant than find/insert/delete.
  • Arne Vogel
    Arne Vogel over 8 years
    vector<T*> with owning pointers is usually an anti-pattern in C++11. unique works fine with vector<unique_ptr<T>>.
  • QuantumKarl
    QuantumKarl over 8 years
    Hells yeah, templates! quick fix for small lists, full STL style. +1 thx
  • Giovanni Funchal
    Giovanni Funchal over 8 years
    Link to the image is broken, anyone has a new one?
  • Nate Kohl
    Nate Kohl over 8 years
    @GiovanniFunchal: seems fine now, but I moved the image over to imgur.com just for safety's sake.
  • erip
    erip over 8 years
    Is there a reason in the first experiment that the vector is first sorted, then made unique? It seems like there might be some performance to be gained if it were made unique first then sorted.
  • BartoszKP
    BartoszKP almost 8 years
    Description of the x-axis seems to be missing.
  • Changming Sun
    Changming Sun almost 8 years
    For integers, you can use radix sort, which is much faster than std::sort.
  • Ben Voigt
    Ben Voigt over 7 years
    @ArneVogel: For trivial values of "works fine", perhaps. It's rather pointless to call unique on a vector<unique_ptr<T>>, as the only duplicated value such a vector can contain is nullptr.
  • Toby Speight
    Toby Speight over 7 years
    @Kyle - only on other containers that have an erase() method, else you have to return the new end iterator and have the calling code truncate the container.
  • Toby Speight
    Toby Speight over 7 years
    This appears to be a response to a different answer; it does not answer the question (in which the vector contains integers, not pointers, and does not specify a comparator).
  • Arne Vogel
    Arne Vogel almost 7 years
    @BenVoigt: Overload operator == or pass a BinaryPredicate. The default behavior would be pointless indeed.
  • Davmrtl
    Davmrtl almost 7 years
Quick tip: to use std::sort or std::unique, you have to #include <algorithm>
  • Ilya Popov
    Ilya Popov about 6 years
    cplusplus.com is not in any way official documentation.
  • sandthorn
    sandthorn over 5 years
    @ChangmingSun I wonder why the optimizer seemed to fail on f4? The numbers are dramatically different to f5. It doesn't make any sense to me.
  • alexk7
    alexk7 over 5 years
@sandthorn As explained in my answer, the implementation builds a node (including dynamic allocation) for each element from the input sequence, which is wasteful for every value that ends up being a duplicate. There is no way the optimizer could know it could skip that.
  • sandthorn
    sandthorn over 5 years
Ah, that reminds me of one of Scott Meyers's talks about the CWUK scenario, which can probabilistically slow down emplace-style construction.
  • galactica
    galactica about 5 years
    interesting again that using manual conversion f5 runs much faster than using a constructor f4!
  • gast128
    gast128 over 4 years
    Perhaps use unordered_set instead of set (and boost::remove_erase_if if available)
  • Das_Geek
    Das_Geek over 4 years
Welcome to StackOverflow! Please edit your question to add an explanation of how your code works, and why it's equivalent or better than the other answers. This question is more than ten years old, and already has many good, well-explained answers. Without an explanation on yours, it's not as useful and has a good chance of being downvoted or removed.
  • roberto
    roberto over 4 years
For a tuple-indexed vector, solution #1 is the fastest.
  • Jakub Klinkovský
    Jakub Klinkovský about 4 years
This is not the same benchmark as in the other answer, because the number of duplicates is almost surely different. Unfortunately, neither answer actually says how many duplicates there are in the random data used for the benchmarks...
  • v.oddou
    v.oddou about 4 years
    contiguous duplicates. ok, so it needs a std::sort first.
  • Fureeish
    Fureeish over 3 years
    No actions in C++20, unfortunately.
  • Gaspa79
    Gaspa79 over 3 years
    This would be amazing info if we had the number of elements. Whenever I see this I think of imgs.xkcd.com/comics/convincing.png
  • johv
    johv about 3 years
    @erip You have to sort before, since unique assumes the container is already sorted. (Or rather, it only removes adjacent duplicates!)
  • underscore_d
    underscore_d about 2 years
    Not sure this is valid. std::ranges::remove_if has pred constrained to std::indirect_binary_predicate, subsuming std::predicate, whose component regular_invocable "requires [it...] be equality-preserving" - & depending on state like values makes it not equality-preserving. I would presume the same applies to the old remove_if as you use.