Most efficient way of erasing/deleting multiple std::vector elements while retaining original order?


Solution 1

In <algorithm> there is a std::remove_if function which shifts all the kept values to the front while maintaining their order. This works if those 200 elements can be determined purely from their values, not their indexes.

This is essentially the erase-remove idiom you linked to. remove_if is guaranteed to perform O(N) comparisons (and at most O(N) copies), which is more efficient than sorting (O(N log N)), although your last option doesn't actually require sorting if the indices are determined from the values (just scan in the reverse direction while copying).

Nevertheless, using remove_if (if you can) is better than the other two options, because the implementation has already been written for you: there is less chance of a logic error, and it conveys what is being done rather than how.

Solution 2

How about looping through the vector, and for each element that needs to be removed, copying the next element that doesn't need to be removed into that position. Then when you get to the end, truncate it.

std::size_t last = 0;
for (std::size_t i = 0; i < vec.size(); ++i, ++last)
{
   // bound check guards against running past the end when a
   // tail of elements is all marked for removal
   while (i < vec.size() && needs_to_be_removed(i))
      ++i;
   if (i >= vec.size()) break;

   vec[last] = vec[i];
}

vec.resize(last);

Solution 3

First thing is, don't call erase more times than you have to, because for a vector it shuffles all the later elements down, giving the whole operation an Ω(n*m) worst case run time (n the size of the vector, m the size of the list of indexes to remove).

I think the first thing I'd try would be similar to your current code:

  • sort the indexes
  • create a new vector of size n - m
  • iterate over the original vector, copying the first indexes[0] elements, skipping one element, then copying the next indexes[1] - indexes[0] - 1 elements, skipping one, and so on.
  • swap the original vector with the new one.

You might be able to do the third step with remove_copy_if and a predicate that contains state (counting how many items it has copied and how far it is through the sorted list of indexes), but for extremely tedious and obscure reasons this isn't guaranteed to work: algorithm predicates with mutable state are problematic, and the consensus seems to be that the standard doesn't guarantee the same copy of the predicate is used throughout the algorithm. So I really don't advise trying it, but it may help to bear in mind that what you're writing is basically a modified version of remove_copy_if.

You could avoid the second step using a back_inserter rather than presizing the vector, although you'd presumably still reserve the space in advance.

[Edit: come to think of it, why am I copying anything? Rather than implementing a modified remove_copy_if, implement a modified remove_if, and just copy to an earlier point in the vector. Then erase/resize at the end. I wouldn't worry about the O(m log m) sort of the indexes until proven to be a problem, because it's unlikely to be significantly slower than the Ω(m) operation to read all the values to be removed, and store them in some kind of container. Then, using this container in the predicate to remove_if may or may not be O(1). Sorting might turn out faster for plausible values of m.]

Solution 4

You can copy all elements of the vector to a list, except those whose index is in your second container, and then copy them back to a vector. Even with your algorithm of going from the end of the vector to the front, there's a lot of work going on behind the scenes in your vector.

Make your second container a map so it keeps the indices sorted for you automatically.

edit:

To respond to the comment

The worst-case cost of maintaining a map is the same as maintaining another structure (a list or vector) and then sorting it. If you're already doing that, you might as well keep it as a map; it doesn't make sense to complain about the overhead of a map versus the overhead of sorting a list.

As for the performance of my suggested algorithm: if m is the number of elements to be deleted and n is the total number of elements, then it performs O(n - m) copies.

Of course, this is mostly just humoring your attempt to optimize with a vector.

1 - You shouldn't be using a vector if you want to do random-access deletes. That's not what they're good at; use a list if at all possible. And since you seem much more interested in relative order than absolute index, I wonder why a vector is needed at all. If you gave the entire problem, there's probably a known solution that lets you use the most efficient data structure to solve it.

2 - Instead of maintaining a second data structure, mark elements that need to be deleted directly in their container. A trivial way: instead of a container<T>, use a container<std::pair<T, char>> and use the char to keep track of the element's status.

If you do 1 and 2, you remove all copying completely and get a much more efficient implementation.

Solution 5

Elements of what? Maybe I'm taking your post too seriously, but if you have a vector of 1000 elements, why not mark the ones that are no longer valid and do away with erasing in the first place? Obviously I'm assuming here that your elements don't demand a lot of memory.

I only bring this up because you seem to be concerned with speed. If the suggestions already given don't do the trick, maybe this idea is worth a thought: in essence, speed things up by not doing the operation in the first place.

Author: sascha
Updated on January 01, 2021

Comments

  • sascha
    sascha over 3 years


    I have a std::vector<int> and a second container holding iterators or indexes (no keys; I want constant access to the elements) to this vector for deletion purposes. Let's assume I have a vector of 1000 elements and want to erase 200 of them. The order of the non-removed elements should be the same after the deletion operations as before.

    One more thing I missed in the first version of my question: the values are unique. They are identities.

    How would you do that in a safe (regarding the STL rules) and efficient manner (the decision for a vector shall be final)?

    Possibilities or Methods i thought about:

    • the erase-remove idiom (http://en.wikipedia.org/wiki/Erase-remove_idiom): originally for the deletion of elements which fulfill a condition (including a linear search), but I think with ranges of size 1 this method could be used with already-given iterators and a dummy condition. Question: is the original order of elements kept, and is it more performant than the last method?
    • loop over the indexes and erase the elements with vector.erase(vector.begin()+index+offset), while keeping the removed indexes in a container for calculating the offset. This offset could be determined for every remove iteration with a std::lower_bound in the container of already-removed elements. The problem: a lot of binary searches for getting the offset, and a lot of move operations because of random-location deletion.
    • At the moment I'm doing the following: get all the iterators for the elements to remove, sort them in descending order according to their location in the vector, and loop over them for the final deletion with vector.erase. Now I'm not invalidating any iterators, and there are no vector rearrange operations except for the deletion itself. The problem: a lot of sorting.

    So, how would you tackle this? Any new ideas? Any recommendations?

    Thanks for your input.

    Sascha

    Edit / Update / Own results: I implemented the erase-remove idiom, which was also mentioned by KennyTM, with a predicate based on a lookup in a boost::dynamic_bitset, and it's insanely fast. Furthermore, I tried PigBen's move-and-truncate method (also mentioned by Steve Jessop), which also accesses the bitset in its while-loop. Both seem to be equally fast with my kind of data. I tried to delete 100 of 1000 elements (unsigned ints), did these 100 deletes 1M times, and there was no significant difference. Because I think the STL-based erase-remove idiom is kinda more "natural", I'm choosing this method (an argument also mentioned by KennyTM).

  • sascha
    sascha over 13 years
    Okay, thanks for your answer. Now my only problem is that I don't have a "fast" predicate for the remove_if function, only already-decided indexes/iterators. It seems I have to convert my decisions into a bitset which tells me, for each element, whether it should be removed, so that the predicate runs in constant time. O(n) more work, but asymptotically it should be way better. Maybe I should rewrite my decision algorithm to output a bitset. Thanks again for your answer, especially for mentioning that the original order of elements is kept.
  • sascha
    sascha over 13 years
    Doesn't seem too efficient to me, but I could be wrong. How does the approach you mention compare to the asymptotic guarantees that KennyTM gave in his answer? If you're talking about resizing operations, then I would say there are no ops before the final deletions (because remove only swaps). And a map always brings a huge overhead (personal opinion; no big fan of balanced search trees).
  • sascha
    sascha over 13 years
    Still not convinced. I don't need the sorting step anymore (with the erase-remove idiom), so the asymptotic argument for holding a map is gone. And as for "don't use a vector": I wrote that this decision is final. I know the advantages and disadvantages of a vector (at least I think so). The reason for using a vector: raw iteration performance and cache efficiency (which is bad for list AND map, of course). I do a lot more iterating than deleting. My question is: do you think your solution is faster than the erase-remove idiom with a constant-time predicate?
  • sascha
    sascha over 13 years
    That sounds simple. I will try it. Thanks for your answer.
  • sascha
    sascha over 13 years
    Your edited solution kinda sounds like the one proposed by PigBen, or did I misunderstand? So there's no more sorting involved, or is there?
  • Steve Jessop
    Steve Jessop over 13 years
    @sascha: yes, it's pretty similar. I hadn't seen PigBen's answer. The difference is that he has a function needs_to_be_removed, where I count. Also, I waffle more.
  • sascha
    sascha over 13 years
    As assumed, and mentioned in the opening, the elements are small integral types (int). I think the suggestions by the others did the trick and I'm satisfied now, but nonetheless I like your solution. The only question is whether, when iterating through these elements, cache efficiency stays high (because each element now carries extra data). But it seems like a good solution for some applications.
  • Admin
    Admin over 9 years
    This is equivalent to vec.erase(std::remove_if(vec.begin(), vec.end(), needs_to_be_removed), vec.end());.
  • Benjamin Lindley
    Benjamin Lindley over 9 years
    @robinjam: No it isn't. remove_if removes based on the values of the elements. This solution removes based on the element's position in the vector.
  • Admin
    Admin over 9 years
    @BenjaminLindley: Right you are. Apologies, I misread one of the lines in your code snippet.
  • Christopher Johnson
    Christopher Johnson about 8 years
    Can the while(needs_to_be_removed(i)) loop overrun the range [0,vec.size()) here? If the entire vector is marked for deletion, i will take on the value of vec.size(), which needs_to_be_removed will have to handle.
  • Benjamin Lindley
    Benjamin Lindley about 8 years
    @ChristopherJohnson: needs_to_be_removed is a hypothetical function. It can be written to handle the case of i >= vec.size(), and return false in that case.
  • smead
    smead almost 6 years
    I like this implementation except for the fact that it alters the indexes variable by sorting it. That is not the variable we are operating on, so it should not be changed. Better to pass by value and sort a local copy instead.
  • Valeriy Ivanov
    Valeriy Ivanov almost 6 years
    You are right. But vectors could be large. The vector with indexes could be almost as large as the vector with values. So I decided to pass the vector with indices by reference.