how to find duplicates in std::vector<string> and return a list of them?

16,241

Solution 1

In 3 lines (not counting the vector and list creation nor the superfluous line-breaks in name of readability):

vector<string> vec{"words", "words", "are", "fun", "fun"};
list<string> output;

sort(vec.begin(), vec.end());
set<string> uvec(vec.begin(), vec.end());
set_difference(vec.begin(), vec.end(),
               uvec.begin(), uvec.end(),
               back_inserter(output));

EDIT

Explanation of the solution:

  1. Sorting the vector is needed in order to use set_difference() later.

  2. The uvec set will automatically keep elements sorted, and eliminate duplicates.

  3. The output list will be populated by the elements of vec - uvec.

Solution 2

  1. Make an empty std::unordered_set<std::string>
  2. Iterator your vector, checking whether each item is a member of the set
  3. If it's already in the set, this is a duplicate, so add to your result list
  4. Otherwise, add to the set.

Since you want each duplicate only listed once in the results, you can use a hashset (not list) for the results as well.

Solution 3

IMO, Ben Voigt started with a good basic idea, but I would caution against taking his wording too literally.

In particular, I dislike the idea of searching for the string in the set, then adding it to your set if it's not present, and adding it to the output if it was present. This basically means every time we encounter a new word, we search our set of existing words twice, once to check whether a word is present, and again to insert it because it wasn't. Most of that searching will be essentially identical -- unless some other thread mutates the structure in the interim (which could give a race condition).

Instead, I'd start by trying to add it to the set of words you've seen. That returns a pair<iterator, bool>, with the bool set to true if and only if the value was inserted -- i.e., was not previously present. That lets us consolidate the search for an existing string and the insertion of the new string together into a single insert:

while (input >> word)
    if (!(existing.insert(word)).second)
        output.insert(word);

This also cleans up the flow enough that it's pretty easy to turn the test into a functor that we can then use with std::remove_copy_if to produce our results quite directly:

#include <set>
#include <iterator>
#include <algorithm>
#include <string>
#include <vector>
#include <iostream>

class show_copies {
    std::set<std::string> existing;
public:
    bool operator()(std::string const &in) {
        return existing.insert(in).second;
    }
};

int main() {
    std::vector<std::string> words{ "words", "words", "are", "fun", "fun" };
    std::set<std::string> result;

    std::remove_copy_if(words.begin(), words.end(),
        std::inserter(result, result.end()), show_copies());

    for (auto const &s : result)
        std::cout << s << "\n";
}

Depending on whether I cared more about code simplicity or execution speed, I might use an std::vector instead of the set for result, and use std::sort followed by std::unique_copy to produce the final result. In such a case I'd probably also replace the std::set inside of show_copies with an std::unordered_set instead:

#include <unordered_set>
#include <iterator>
#include <algorithm>
#include <string>
#include <vector>
#include <iostream>

class show_copies {
    std::unordered_set<std::string> existing;
public:
    bool operator()(std::string const &in) {
        return existing.insert(in).second;
    }
};

int main() {
    std::vector<std::string> words{ "words", "words", "are", "fun", "fun" };
    std::vector<std::string> intermediate;

    std::remove_copy_if(words.begin(), words.end(),
        std::back_inserter(intermediate), show_copies());

    std::sort(intermediate.begin(), intermediate.end());
    std::unique_copy(intermediate.begin(), intermediate.end(),
        std::ostream_iterator<std::string>(std::cout, "\n"));
}

This is marginally more complex (one whole line longer!) but likely to be substantially faster when/if the number of words gets very large. Also note that I'm using std::unique_copy primarily to produce visible output. If you just want the result in a collection, you can use the standard unique/erase idiom to get unique items in intermediate.

Solution 4

In place (no additional storage). No string copying (except to result list). One sort + one pass:

#include <string>
#include <vector>
#include <list>
#include <iostream>
#include <algorithm>
using namespace std;

int main() {
        vector<string> vec{"words", "words", "are", "fun", "fun"};
        list<string> dup;

        sort(vec.begin(), vec.end());

        const string  empty{""};
        const string* prev_p = &empty;

        for(const string& s: vec) {
                if (*prev_p==s) dup.push_back(s);
                prev_p = &s;
        }

        for(auto& w: dup) cout << w << ' '; 
        cout << '\n';
}
Share:
16,241
Marina Golubtsova
Author by

Marina Golubtsova

Updated on July 24, 2022

Comments

  • Marina Golubtsova
    Marina Golubtsova almost 2 years

    So if I have a vector of words like:

    Vec1 = "words", "words", "are", "fun", "fun"
    

    resulting list: "fun", "words"

    I am trying to determine which words are duplicated, and return an alphabetized vector of 1 copy of them. My problem is that I don't even know where to start, the only thing close to it I found was std::unique_copy which doesn't exactly do what I need. And specifically, I am inputting a std::vector<std::string> but outputting a std::list<std::string>. And if needed, I can use functor.

    Could someone at least push me in the right direction please? I already tried reading stl documentation,but I am just "brain" blocked right now.

  • Marina Golubtsova
    Marina Golubtsova almost 11 years
    hmm, what do you think the complexity would be then? I am trying to keep it under O(n^2), all and all, it's still better than what I had in mind >.>
  • user541686
    user541686 almost 11 years
    set is a horrible idea (bad performance). set is for when you insert dynamically, which you don't need here. So just sort the vector and use unique().
  • Marina Golubtsova
    Marina Golubtsova almost 11 years
    @Mehrdad it wont do the task it needs to do, which is to count duplicates, not make the vector without duplicates
  • jrok
    jrok almost 11 years
    A functor like this won't work correctly if it gets copied by the algorithm function.
  • user541686
    user541686 almost 11 years
    @MarinaGolubtsova: I was thinking that calling unique once to remove the unique values would leave you with duplicates; calling unique again will remove the duplicates of those, giving you the result you need. However, it seems like unique actually destroys the rest of the array, so you'd need to modify things a bit to make it work. You still don't need set, though.
  • user541686
    user541686 almost 11 years
    @MarinaGolubtsova: See my answer below.
  • Ben Voigt
    Ben Voigt almost 11 years
    @Mehrdad: It depends on the size of the input. My solution's asymptotic performance is far better than sorting.
  • Ben Voigt
    Ben Voigt almost 11 years
    Well yes, you use the test-whether-it-is-present-and-make-it-present operation. My answer was intended to be a guide, not a 1:1 mapping to member function calls. You can't say I chose poor methods for the details when I didn't give details. And existing should probably be an unordered_set as well.
  • Ben Voigt
    Ben Voigt almost 11 years
    What makes an algorithm better to you? O(N lg N) complexity when an O(N) solution exists is better?
  • user541686
    user541686 almost 11 years
    @BenVoigt: (1) asymptotic performance, (2) empirical performance, (3) the requirements imposed on the caller (e.g. no need for copy constructors, exception-safety), (4) need for auxiliary memory (mine doesn't need to store copies of the input). In any case, your solution also uses set, so it's O(N log N) too, except that set has far worse empirical performance than vector, making this solution better.
  • user541686
    user541686 almost 11 years
    @BenVoigt: WTF, you change your answer then you tell me my comment is wrong? Also FYI, your new solution (1) requires the elements to be hashable and have copy-constructors, (2) isn't compatible with pre-C++11, (3) doesn't guarantee good performance, (4) requires O(N) auxiliary memory, (5) performs lots of memory allocation, which is slow in practice. Mine has none of those problems.
  • Ben Voigt
    Ben Voigt almost 11 years
    @Mehrdad: That's not a substantive change. If you meant to comment "Did you perhaps mean unordered_set?" that would have been useful, but your claims that std::vector is the best are simply crazy.
  • user541686
    user541686 almost 11 years
    If it's "not a substantive change" then why did you bother to make it, just to make the claim that your "solution's asymptotic performance is far better than sorting"? The same wasn't true for set, but you made the change anyway, so you could claim that and render my comment obsolete. And I never said vector was the "best", I just said it was better than your solution.
  • Ben Voigt
    Ben Voigt almost 11 years
    @Mehrdad: What does "set is for when you insert dynamically" mean? Yes I made a mistake on the name of the hashset collection, but my intent was hashset from the very beginning. And no, unordered_set is not "incompatible with pre-C++11", you just need to get it from a different namespace (boost)
  • user541686
    user541686 almost 11 years
    @BenVoigt: It means that you need set if you intend to check for duplicates while you're adding new elements. That's not always the case; sometimes you don't need to check for duplicates until you've finished adding all the items -- and the entire point of std::unique is to handle the other cases.
  • Ben Voigt
    Ben Voigt almost 11 years
    @Mehrdad: Sure, except that std::unique is designed to eliminate duplicates, not return them. And I would expect a version that works on unsorted data to use a hashset internally anyway.
  • user541686
    user541686 almost 11 years
    @BenVoigt: "I would expect"... why? Just because that's the only way you've seen it done? The function I wrote does exactly the same thing as std::unique, except it doesn't destroy the duplicates, which is sufficient for what we need.
  • Ben Voigt
    Ben Voigt almost 11 years
    @Mehrdad: I quote my comment: "a version that works on unsorted data". Yours doesn't. There's no comparison.
  • user541686
    user541686 almost 11 years
    @BenVoigt: Huh? Mine calls sort, so obviously it works on unsorted data too.
  • Ben Voigt
    Ben Voigt almost 11 years
  • user541686
    user541686 almost 11 years
    @BenVoigt: I came to chat, you were away.
  • IntoTheDeep
    IntoTheDeep about 7 years
    Any possible counting for this solution?
  • IntoTheDeep
    IntoTheDeep about 7 years
    Any possible counting for this solution?
  • Jerry Coffin
    Jerry Coffin about 7 years
    @TeodorKolev: I'm not sure exactly what you're asking. If you want a list of unique words, and a count of each, you'd typically use a map<string, int>. Read words and increment the count for each. Then walk through the map and write out the words and the count for each.
  • DanielKO
    DanielKO about 7 years
    @TeodorKolev I'm not sure I understand you.