How to make elements of vector unique? (remove non adjacent duplicates)

46,115

Solution 1

Without using a temporary set it's possible to do this with (possibly) some loss of performance:

template<class Iterator>
Iterator Unique(Iterator first, Iterator last)
{
    while (first != last)
    {
        Iterator next(first);
        last = std::remove(++next, last, *first);
        first = next;
    }

    return last;
}

used as in:

vec.erase( Unique( vec.begin(), vec.end() ), vec.end() );

For smaller data sets, the implementation simplicity and lack of extra allocation required may offset the theoretical higher complexity of using an additional set. Measurement with a representative input is the only way to be sure, though.

Solution 2

I think you would do it like this:

I would use two iterators on the vector :

The first of one reads the data and inserts it a temporary set.

When the read data was not in the set you copy it from the first iterator to the second and increment it.

At the end you keep only the data up to the second iterator.

The complexity is O( n .log( n ) ) as the lookup for duplicated elements uses the set, not the vector.

#include <vector>
#include <set>
#include <iostream>

int main(int argc, char* argv[])
{
    std::vector< int > k ;

    k.push_back( 2 );
    k.push_back( 1 );
    k.push_back( 6 );
    k.push_back( 1 );
    k.push_back( 4 );
    k.push_back( 6 );
    k.push_back( 2 );
    k.push_back( 1 );
    k.push_back( 1 );

{
    std::vector< int >::iterator r , w ;

    std::set< int > tmpset ;

    for( r = k.begin() , w = k.begin() ; r != k.end() ; ++r )
    {
        if( tmpset.insert( *r ).second )
        {
            *w++ = *r ;
        }
    }

    k.erase( w , k.end() );
}


    {
        std::vector< int >::iterator r ;

        for( r = k.begin() ; r != k.end() ; ++r )
        {
            std::cout << *r << std::endl ;
        }
    }
}

Solution 3

As the question was "is there any STL algorithm...? what is its complexity?" it makes sense to implement the function like std::unique:

template <class FwdIterator>
inline FwdIterator stable_unique(FwdIterator first, FwdIterator last)
{
    FwdIterator result = first;
    std::unordered_set<typename FwdIterator::value_type> seen;

    for (; first != last; ++first)
        if (seen.insert(*first).second)
            *result++ = *first;
    return result;
}

So this is how std::unique is implemented plus an extra set. The unordered_set shall be faster than a regular set. All elements are removed that compare equal to the element right preceding them (the first element is kept because we cannot unify to nothing). The iterator returned points to the new end within the range [first,last).

EDIT: The last sentence means that the container itself is NOT modified by unique. This can be confusing. The following example actually reduces the container to the unified set.

1: std::vector<int> v(3, 5);
2: v.resize(std::distance(v.begin(), unique(v.begin(), v.end())));
3: assert(v.size() == 1);

Line 1 creates a vector { 5, 5, 5 }. In line 2 calling unique returns an iterator to the 2nd element, which is the first element that is not unique. Hence distance returns 1 and resize prunes the vector.

Solution 4

You can remove some of the loops in fa's answer using remove_copy_if:

class NotSeen : public std::unary_function <int, bool>
{
public:
  NotSeen (std::set<int> & seen) : m_seen (seen) { }

  bool operator ()(int i) const  {
    return (m_seen.insert (i).second);
  }

private:
  std::set<int> & m_seen;
};

void removeDups (std::vector<int> const & iv, std::vector<int> & ov) {
  std::set<int> seen;
  std::remove_copy_if (iv.begin ()
      , iv.end ()
      , std::back_inserter (ov)
      , NotSeen (seen));
}

This has no affect on the complexity of the algorithm (ie. as written it's also O(n log n)). You can improve upon this using unordered_set, or if the range of your values is small enough you could simply use an array or a bitarray.

Solution 5

Based on the answer of @fa. It can also get rewritten using the STL algorithm std::stable_partition:

struct dupChecker_ {
    inline dupChecker_() : tmpSet() {}
    inline bool operator()(int i) {
        return tmpSet.insert(i).second;
    }
private:
    std::set<int> tmpSet;
};

k.erase(std::stable_partition(k.begin(), k.end(), dupChecker_()), k.end());

This way it is more compact and we don't need to care of the iterators.

It seems even not to introduce to much performance penalty. I use it in my project which needs to handle quite large vectors of complex types often and it makes no real difference.

Another nice feature is, that it is possible to adjust the uniqueness by using std::set<int, myCmp_> tmpSet;. For instance, in my project to ignore certain rounding errors.

Share:
46,115
aJ.
Author by

aJ.

I am a C++ developer from INDIA, interested in C, C#.NET, Perl.

Updated on August 12, 2020

Comments

  • aJ.
    aJ. almost 4 years

    I have a vector containing few non-adjacent duplicates.

    As a simple example, consider:

    2 1 6 1 4 6 2 1 1
    

    I am trying to make this vector unique by removing the non-adjacent duplicates and maintaining the order of elements.

    Result would be:

    2 1 6 4 
    

    The solutions I tried are:

    1. Inserting into a std::set but the problem with this approach is that it will disturb the order of elements.
    2. Use the combination of std::sort and std::unique. But again same order problem.
    3. Manual duplicate elimination:

          Define a temporary vector TempVector.
          for (each element in a vector)
          {
              if (the element does not exists in TempVector)
              {
                  add to TempVector;
              }
          }
          swap orginial vector with TempVector.
      

    My question is:

    Is there any STL algorithm which can remove the non-adjacent duplicates from the vector ? what is its complexity?