Finding duplicate entries in Collection

14,912

Solution 1

It depends on the semantic of the criterion:

If your criterion is always the same for a given class, and is inherent to the underlying concept, you should just implement equals and hashCode and use a set.

If your criterion depend on the context, org.apache.commons.collections.CollectionUtils.select(java.util.Collection, org.apache.commons.collections.Predicate) might be the right solution for you.

Solution 2

If you want to find duplicates, rather than just removing them, one approach would be to throw the Collection into an array, sort the array via a Comparator that implements your criteria, then linearly walk through the array, looking for adjacent duplicates.

Here's a sketch (not tested):

   MyComparator myComparator = new MyComparator();
   MyType[] myArray = myList.toArray();
   Arrays.sort( myArray, myComparator );
   for ( int i = 1; i < myArray.length; ++i ) {
      if ( 0 == myComparator.compare( myArray[i - 1], myArray[i] )) {
         // Found a duplicate!
      }
   }

Edit: From your comment, you just want to know if there are duplicates. The approach above works for this too. But you could more simply just create a java.util.SortedSet with a custom Comparator. Here's a sketch:

   MyComparator myComparator = new MyComparator();
   TreeSet treeSet = new TreeSet( myComparator );
   treeSet.addAll( myCollection );
   boolean containsDuplicates = (treeSet.size() != myCollection.size()); 

Solution 3

You can adapt a Java set to search for duplicates among objects of an arbitrary type: wrap your target class in a private wrapper that evaluates equality based on your criteria, and construct a set of wrappers.

Here is a somewhat lengthy example that illustrates the technique. It considers two people with the same first name to be equal, and so it detects three duplicates in the array of five objects.

import java.util.*;
import java.lang.*;

class Main {
    static class Person {
        private String first;
        private String last;
        public String getFirst() {return first;}
        public String getLast() {return last;}
        public Person(String f, String l) {
            first = f;
            last = l;
        }
        public String toString() {
            return first+" "+last;
        }
    }
    public static void main (String[] args) throws java.lang.Exception {
        List<Person> people = new ArrayList<Person>();
        people.add(new Person("John", "Smith"));
        people.add(new Person("John", "Scott"));
        people.add(new Person("Jack", "First"));
        people.add(new Person("John", "Walker"));
        people.add(new Person("Jack", "Black"));
        Set<Object> seen = new HashSet<Object>();
        for (Person p : people) {
            final Person thisPerson = p;
            class Wrap {
                public int hashCode() { return thisPerson.getFirst().hashCode(); }
                public boolean equals(Object o) {
                    Wrap other = (Wrap)o;
                    return other.wrapped().getFirst().equals(thisPerson.getFirst());
                }
                public Person wrapped() { return thisPerson; }
            };
            Wrap wrap = new Wrap();
            if (seen.add(wrap)) {
                System.out.println(p + " is new");
            } else {
                System.out.println(p + " is a duplicate");
            }
        }
    }
}

You can play with this example on ideone [link].

Solution 4

You could use a map and while iterating over the collection put the elements into the map (the predicates would form the key) and if there's already an entry you've found a duplicate.

For more information see here: Finding duplicates in a collection

Share:
14,912
Admin
Author by

Admin

Updated on July 18, 2022

Comments

  • Admin
    Admin almost 2 years

    Is there a tool or library to find duplicate entries in a Collection according to specific criteria that can be implemented?


    To make myself clear: I want to compare the entries to each other according to specific criteria. So I think a Predicate returning just true or false isn't enough.


    I can't use equals.

  • Admin
    Admin almost 12 years
    I want to compare the entries between each other, not to arbitrary criteria.
  • Sergey Kalinichenko
    Sergey Kalinichenko almost 12 years
    @dragon66 If your hash function is good, the efficiency is the same as with any hash table, which is O(1) for each item, or O(N) for the entire collection.
  • dragon66
    dragon66 almost 12 years
    dasblinkenlight: I am a bit concerned about the wrap object creation although I know they will be gone outside the loop.
  • Sergey Kalinichenko
    Sergey Kalinichenko almost 12 years
    @dragon66 Java is quite efficient at creating small objects (and these objects are tiny). Unfortunately, Java does not have a concept parallel to .NET's equality comparer - that would allow for a solution that avoids temporary objects altogether.
  • Christian Hujer
    Christian Hujer about 9 years
    As the OP says, he can't use equals(). A HashSet uses hashCode() and equals(). Therefore he cannot use a HashSet.