Algorithm - How to delete duplicate elements in a list efficiently?

21,505

Solution 1

Assuming order matters:

  • Create an empty set S and an empty list M.
  • Scan the list L one element at a time.
  • If the element is in the set S, skip it.
  • Otherwise, add it to M and to S.
  • Repeat for all elements in L.
  • Return M.

In Python:

>>> L = [2, 1, 4, 3, 5, 1, 2, 1, 1, 6, 5]
>>> S = set()
>>> M = []
>>> for e in L:
...     if e in S:
...         continue
...     S.add(e)
...     M.append(e)
... 
>>> M
[2, 1, 4, 3, 5, 6]

If order does not matter:

M = list(set(L))

Solution 2

Special Case: Hashing and Equality

Firstly, we need to determine something about the assumptions, namely the existence of an equals and has function relationship. What do I mean by this? I mean that for the set of source objects S, given any two objects x1 and x2 that are elements of S there exists a (hash) function F such that:

if (x1.equals(x2)) then F(x1) == F(x2)

Java has such a relationship. That allows you to check to duplicates as a near O(1) operation and thus reduces the algorithm to a simple O(n) problem. If order is unimportant, it's a simple one liner:

List result = new ArrayList(new HashSet(inputList));

If order is important:

List outputList = new ArrayList();
Set set = new HashSet();
for (Object item : inputList) {
  if (!set.contains(item)) {
    outputList.add(item);
    set.add(item);
  }
}

You will note that I said "near O(1)". That's because such data structures (as a Java HashMap or HashSet) rely on a method where a portion of the hash code is used to find an element (often called a bucket) in the backing storage. The number of buckets is a power-of-2. That way the index into that list is easy to calculate. hashCode() returns an int. If you have 16 buckets you can find which one to use by ANDing the hashCode with 15, giving you a number from 0 to 15.

When you try and put something in that bucket it may already be occupied. If so then a linear comparison of all entries in that bucket will occur. If the collision rate gets too high or you try to put too many elements in the structure will be grown, typically doubled (but always by a power-of-2) and all the items are placed in their new buckets (based on the new mask). Thus resizing such structures is relatively expensive.

Lookup may also be expensive. Consider this class:

public class A {
  private final int a;

  A(int a) { this.a == a; }

  public boolean equals(Object ob) {
    if (ob.getClass() != getClass()) return false;
    A other = (A)ob;
    return other.a == a;
  }

  public int hashCode() { return 7; }
}

This code is perfectly legal and it fulfills the equals-hashCode contract.

Assuming your set contains nothing but A instances, your insertion/search now turns into an O(n) operation, turning the entire insertion into O(n2).

Obviously this is an extreme example but it's useful to point out that such mechanisms also rely on a relatively good distribution of hashes within the value space the map or set uses.

Finally, it must be said that this is a special case. If you're using a language without this kind of "hashing shortcut" then it's a different story.

General Case: No Ordering

If no ordering function exists for the list then you're stuck with an O(n2) brute-force comparison of every object to every other object. So in Java:

List result = new ArrayList();
for (Object item : inputList) {
  boolean duplicate = false;
  for (Object ob : result) {
    if (ob.equals(item)) {
      duplicate = true;
      break;
    }
  }
  if (!duplicate) {
    result.add(item);
  }
}

General Case: Ordering

If an ordering function exists (as it does with, say, a list of integers or strings) then you sort the list (which is O(n log n)) and then compare each element in the list to the next (O(n)) so the total algorithm is O(n log n). In Java:

Collections.sort(inputList);
List result = new ArrayList();
Object prev = null;
for (Object item : inputList) {
  if (!item.equals(prev)) {
    result.add(item);
  }
  prev = item;
}

Note: the above examples assume no nulls are in the list.

Solution 3

If the order does not matter, you might want to try this algorithm written in Python:

>>> array = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6]
>>> unique = set(array)
>>> list(unique)
[1, 2, 3, 4, 5, 6]

Solution 4

in haskell this would be covered by the nub and nubBy functions

nub :: Eq a => [a] -> [a]
nub [] = []
nub (x:xs) = x : nub (filter (/= x) xs)

nubBy :: (a -> a -> Bool) -> [a] -> [a]
nubBy f [] = []
nubBy f (x:xs) = x : nub (filter (not.f x) xs)

nubBy relaxes the dependence on the Eq typeclass, instead allowing you to define your own equality function to filter duplicates.

These functions work over a list of consistent arbitrary types (e.g. [1,2,"three"] is not allowed in haskell), and they are both order preserving.

In order to make this more efficient, using Data.Map (or implementing a balanced tree) could be used to gather the data into a set (key being the element, and value being the index into the original list in order to be able to get the original ordering back), then gathering the results back into a list and sorting by index. I will try and implement this later.


import qualified Data.Map as Map

undup x = go x Map.empty
    where
        go [] _ = []
        go (x:xs) m case Map.lookup x m of
                         Just _  -> go xs m
                         Nothing -> go xs (Map.insert x True m)

This is a direct translation of @FogleBird's solution. Unfortunately it doesn't work without the import.


a Very basic attempt at replacing Data.Map import would be to implement a tree, something like this

data Tree a = Empty
            | Node a (Tree a) (Tree a)
            deriving (Eq, Show, Read)

insert x Empty = Node x Empty Empty
insert x (Node a left right)
    | x < a = Node a (insert x left) right
    | otherwise = Node a left (insert x right)

lookup x Empty = Nothing --returning maybe type to maintain compatibility with Data.Map
lookup x (Node a left right)
    | x == a = Just x
    | x < a = lookup x left
    | otherwise = lookup x right

an improvement would be to make it autobalancing on insert by maintaining a depth attribute (keeps the tree from degrading into a linked list). This nice thing about this over a hash table is that it only requires your type to be in the typeclass Ord, which is easily derivable for most types.


I take requests it seems. In response to @Jonno_FTWs inquiry here is a solution which completely removes duplicates from the result. It's not entirely dissimilar to the original, simply adding an extra case. However the runtime performance will be much slower since you are going through each sub-list twice, once for the elem, and the second time for the recusion. Also note that now it will not work on infinite lists.

nub [] = []
nub (x:xs) | elem x xs = nub (filter (/=x) xs)
           | otherwise = x : nub xs

Interestingly enough you don't need to filter on the second recursive case because elem has already detected that there are no duplicates.

Solution 5

In Python

>>> L = [2, 1, 4, 3, 5, 1, 2, 1, 1, 6, 5]
>>> a=[]
>>> for i in L:
...   if not i in a:
...     a.append(i)
...
>>> print a
[2, 1, 4, 3, 5, 6]
>>>
Share:
21,505

Related videos on Youtube

psihodelia
Author by

psihodelia

Software Engineer

Updated on July 09, 2022

Comments

  • psihodelia
    psihodelia almost 2 years

    There is a list L. It contains elements of arbitrary type each. How to delete all duplicate elements in such list efficiently? ORDER must be preserved

    Just an algorithm is required, so no import any external library is allowed.

    Related questions

    • codebliss
      codebliss over 14 years
      This is impossible in haskell, as only types can start with capitals =P.
    • dfeuer
      dfeuer about 9 years
      Data.List.nub is not efficient.
  • John La Rooy
    John La Rooy over 14 years
    The method given by FogleBird is O(n), since e in S, S.add and M.append are all O(1)
  • Lordn__n
    Lordn__n over 14 years
    And FYI, I mention that O(1) case (for Java) but, like in Python, it's based on the assumption of there existing an equals-hashcode relationship, which is fine, but it's not the general case.
  • Moishe Lettvin
    Moishe Lettvin over 14 years
    I was about to downvote based on your first sentence "if no ordering you're stuck with O(n^2)" b/c you can solve it with a hashtable. Then I saw your last section about the ArrayList of a HashSet and, well, there ya go. Maybe downvoters didn't read your whole response...?
  • inspectorG4dget
    inspectorG4dget over 14 years
    In your first solution, the set S is not necessary. You should be able to append elements from L M if they are not already in M. That does the same thing without requiring another data structure.
  • David Crawshaw
    David Crawshaw over 14 years
    The set S is necessary to make this algorithm O(n*log(n)), and not O(n^2). Searching for an element in a list is O(n), but it is O(1) in a Set.
  • psihodelia
    psihodelia over 14 years
    it's copy-paste of @FogleBird, isn't?
  • ghostdog74
    ghostdog74 over 14 years
    Only the data L. Can't you see? I am not using sets, just normal list appending.
  • psihodelia
    psihodelia over 14 years
    what if some elements are not hashable?
  • ghostdog74
    ghostdog74 over 14 years
    its better if you put it in your original post to say that you found the solution, since the question was asked by you in the first place
  • Mike Ottum
    Mike Ottum over 14 years
    If the elements are not hashable then you can implement your set using a search tree (as in the STL) and the algorithm will be O( n*log n).
  • Lordn__n
    Lordn__n over 14 years
    Technically it's "near O(1)", which isn't quite the same thing. See my answer.
  • Randall Schulz
    Randall Schulz over 14 years
    For the tree solution to work the elements must be mutually comparable. Only the "naive" n^2 algorithm requires only equality testing, which is the minimum assumption for any problem about uniqueness. (By the way, does the phrasing of the question suggest a homework problem?)
  • Jonno_FTW
    Jonno_FTW over 14 years
    On a sidenote, how can you modify nub toremove both elements, if they are repeated, ie. [1,2,2,3] -> [1,3] ?
  • jfs
    jfs over 14 years
    [(M.append(e) or e) for e in L if e not in M] is less ugly and has the same efficiency (O(n**2)) as 'zip' variant. It is applicable when you can't use set or sort i.e., almost never.
  • jfs
    jfs over 14 years
    Actually M contains the result therefore if you must to do it in one line then: collections.deque((M.append(e) for e in L if e not in M), maxlen=0). Here I've used itertools recipe: consume = lambda it: deque(it, maxlen=0) It performs iterations until the iterator is exhausted. Final result is in the M list. It uses half the memory but time efficiency is the same O(n**2).
  • jfs
    jfs over 14 years
    Your solution for 'General Case: Ordering' doesn't preserve original order (OP requirement). btw, prev = item can be lifted to the if suite.
  • Boris
    Boris over 13 years
    To clarify, what is meant by "near O(1)": A bucket may hold more than one entry, but with proper resizing, it is possible to maintain a load factor below a given threshold, eg. 0.7, giving an O(1) expected or average lookup time. Insertion and deletion time may be O(n) worst case due to resizing, but with the doubling strategy described above, it will run in O(1) amortized time, meaning that a string of n insertions/deletions will run in O(n) time, even though each individual operation may not be O(1).
  • u0b34a0f6ae
    u0b34a0f6ae over 12 years
    Even Data.List.sort is only 20 lines of haskell. See haskell.org/ghc/docs/latest/html/libraries/base-4.4.0.0/src/‌​…