Removing duplicates in lists

Solution 1

The common approach to get a unique collection of items is to use a set. Sets are unordered collections of distinct objects. To create a set from any iterable, you can simply pass it to the built-in set() function. If you later need a real list again, you can similarly pass the set to the list() function.

The following example should cover whatever you are trying to do:

>>> t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> t
[1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> list(set(t))
[1, 2, 3, 5, 6, 7, 8]
>>> s = [1, 2, 3]
>>> list(set(t) - set(s))
[8, 5, 6, 7]

As you can see from the example result, the original order is not maintained. As mentioned above, sets themselves are unordered collections, so the order is lost. When converting a set back to a list, an arbitrary order is created.

Maintaining order

If order is important to you, then you will have to use a different mechanism. A very common solution for this is to rely on OrderedDict to keep the order of keys during insertion:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]

Starting with Python 3.7, the built-in dictionary is guaranteed to maintain the insertion order as well, so you can also use that directly if you are on Python 3.7 or later (or CPython 3.6):

>>> list(dict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]

Note that this approach has the overhead of creating a dictionary first and then creating a list from it. If you don’t actually need to preserve the order, you’re often better off using a set, especially because it gives you a lot more operations to work with, as illustrated below. Check out this question for more details and alternative ways to preserve the order when removing duplicates.
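
As an illustration of those extra operations, here is a small REPL sketch reusing t and s from above (the display order of a set is arbitrary, so your output may differ):

>>> set(t) & set(s)          # intersection: items present in both
{1, 2, 3}
>>> set(t) | set(s)          # union: items present in either
{1, 2, 3, 5, 6, 7, 8}
>>> set(t) ^ set(s)          # symmetric difference: items in exactly one
{5, 6, 7, 8}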


Finally, note that both the set and the OrderedDict/dict solutions require your items to be hashable. This usually means that they have to be immutable. If you have to deal with items that are not hashable (e.g. list objects), then you will have to fall back to a slow approach in which you basically compare every item with every other item in a nested loop.
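
A minimal sketch of that fallback, using a hypothetical helper name (quadratic time, but it only needs == comparisons, not hashing):

def unique_unhashable(items):
    result = []
    for item in items:
        if item not in result:  # "in" compares with ==, so no hashing required
            result.append(item)
    return result

For example, unique_unhashable([[1, 2], [3], [1, 2]]) returns [[1, 2], [3]].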

Solution 2

In Python 2.7, the new way of removing duplicates from an iterable while keeping the original order is:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

In Python 3.5, OrderedDict gained a C implementation. My timings show that this is now both the fastest and shortest of the various approaches for Python 3.5.

In Python 3.6, the regular dict became both ordered and compact. (This holds for CPython and PyPy but may not be present in other implementations.) That gives us a new fastest way of deduping while retaining order:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

In Python 3.7, the regular dict is guaranteed to be ordered across all implementations. So, the shortest and fastest solution is:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
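
If you want to reproduce such timings yourself, here is a rough sketch using the standard timeit module (the numbers depend heavily on machine, interpreter, and input, so treat the claims above as empirical, not universal; the data value is just an illustrative input):

import timeit

# Sketch only: absolute numbers vary by machine and Python version.
setup = "from collections import OrderedDict; data = 'abracadabra' * 100"
print(timeit.timeit("list(OrderedDict.fromkeys(data))", setup=setup, number=10_000))
print(timeit.timeit("list(dict.fromkeys(data))", setup=setup, number=10_000))
print(timeit.timeit("list(set(data))", setup=setup, number=10_000))  # loses order; for comparison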

Solution 3

It's a one-liner: list(set(source_list)) will do the trick.

A set is something that can't possibly have duplicates.

Update: an order-preserving approach is two lines:

from collections import OrderedDict
list(OrderedDict((x, True) for x in source_list).keys())

Here we use the fact that OrderedDict remembers the insertion order of keys and does not change it when a value at a particular key is updated. We insert True as the values, but we could insert anything; the values are just not used. (A set works a lot like a dict with ignored values, too.) The list() call is needed on Python 3, where keys() returns a view rather than a list.
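
To see that value updates really do leave the key order untouched, a quick REPL check:

>>> from collections import OrderedDict
>>> d = OrderedDict((x, True) for x in ['b', 'a', 'b'])
>>> d['b'] = False          # updating a value; 'b' keeps its first-insertion slot
>>> list(d.keys())
['b', 'a']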

Solution 4

>>> t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> t
[1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> s = []
>>> for i in t:
...     if i not in s:
...         s.append(i)
...
>>> s
[1, 2, 3, 5, 6, 7, 8]
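
Since i not in s rescans the result list on every iteration, this approach is quadratic overall, as several commenters below point out. A minimal sketch of the usual linear-time variant, which tracks already-seen items in an auxiliary set (this assumes the items are hashable):

>>> seen = set()
>>> result = []
>>> for i in t:
...     if i not in seen:    # O(1) membership test instead of scanning a list
...         seen.add(i)
...         result.append(i)
...
>>> result
[1, 2, 3, 5, 6, 7, 8]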

Solution 5

If you don't care about the order, just do this:

def remove_duplicates(l):
    return list(set(l))

A set is guaranteed to not have duplicates.
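
For example (the order of the result is arbitrary in principle, although CPython happens to list small integers in ascending order):

>>> remove_duplicates([1, 2, 2, 3, 3, 3])
[1, 2, 3]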


Comments

  • Neemaximo
    Neemaximo almost 2 years

Pretty much, I need to write a program to check if a list has any duplicates, and if it does, remove them and return a new list with the items that weren't duplicated/removed. This is what I have, but to be honest I do not know what to do.

    def remove_duplicates():
        t = ['a', 'b', 'c', 'd']
        t2 = ['a', 'c', 'd']
        for t in t2:
            t.append(t.remove())
        return t
    
  • Herberth Amaral
    Herberth Amaral over 11 years
    I think this is the only way to keep the items in order.
  • Martijn Pieters
    Martijn Pieters over 10 years
    @HerberthAmaral: That is very far from true, see How do you remove duplicates from a list in Python whilst preserving order?
  • Herberth Amaral
    Herberth Amaral over 10 years
    @MartijnPieters Correcting: I think this is the only simple way to keep items in order.
  • dotancohen
    dotancohen over 10 years
    Note that this method works in O(n^2) time and is thus very slow on large lists.
  • user227666
    user227666 almost 10 years
This will convert the list to a numpy array, which is a mess and won't work for strings.
  • G M
    G M almost 10 years
@user227666 thanks for your review, but that's not true: it works even with strings, and you can add .tolist() if you want to get a list...
  • Joop
    Joop over 9 years
    Nice using set quick lookup to speed up the looped comparison. If order does not matter list(set(x)) is still 6x faster than this
  • volcano
    volcano over 9 years
@Joop, that was my first question for my colleague - the order does matter; otherwise, it would have been a trivial issue
  • abarnert
    abarnert over 9 years
    map with a side effect is even more misleading than a listcomp with a side effect. Also, lambda x: unique_list.append(x) is just a clunkier and slower way to pass unique_list.append.
  • Martijn Pieters
    Martijn Pieters about 8 years
    Use enumerate() to get the index faster: for i, value in enumerate(t): if value in t[i + 1:]: t.remove(value)
  • jermenkoo
    jermenkoo about 8 years
    quadratic complexity nonetheless - in is O(n) operation and your cleanlist will have at most n numbers => worst-case ~O(n^2)
  • loxaxs
    loxaxs almost 8 years
Yet, this solution requires orderable elements. I will use it to uniquify my list of lists: it is a pain to tuple() lists and to hash them. Generally speaking, the hash process takes time proportional to the size of the whole data, while this solution takes O(n log(n)) time, depending only on the length of the list.
  • dylnmc
    dylnmc over 7 years
    If you don't care about order, then this takes significantly longer. list(set(..)) (over 1 million passes) will beat this solution by about 10 whole seconds - whereas this approach takes about 12 seconds, list(set(..)) only takes about 2 seconds!
  • dylnmc
    dylnmc over 7 years
    seen = set(iterable); for item in seen: yield item is almost certainly faster. (I haven't tried this specific case, but that would be my guess.)
  • Debosmit Ray
    Debosmit Ray over 7 years
    I think this is kinda like trying to kill a bee with a sledgehammer. Works, sure! But, importing a library for just this purpose might be a little overkill, no?
  • G M
    G M over 7 years
    @DebosmitRay it could be useful if you work in Data Science where usually you work with numpy and many times you need to work with numpy array.
  • Cyphase
    Cyphase over 7 years
    @dylnmc, that's a batch operation, and it also loses the ordering. My answer was specifically intended to be on-the-fly and in order of first occurrence. :)
  • Davide
    Davide about 7 years
    For this too, the content of the original list must be hashable
  • blubberdiblub
    blubberdiblub about 7 years
    This has a time complexity of O(n ^ 2). The answers with set and OrderedDict may have lower amortized time complexity.
  • ZLNK
    ZLNK almost 7 years
    Very useful way to append elements in just one line, thanks!
  • 9000
    9000 almost 7 years
    I think that the set-based approach is equally cheap (O(n log n)), or cheaper, than sorting + detection of uniques. (This approach would parallelize much better, though.) It also does not exactly preserve the initial order, but it gives a predictable order.
  • Eli Korvigo
    Eli Korvigo almost 7 years
    @9000 That is true. I've never mentioned time-complexity of a hash-table-based approach, which is obviously O(n). Here you can find many answers incorporating hash-tables. They are not universal, though, because they require objects to be hashable. Moreover, they are a lot more memory-intensive.
  • Dustin Wyatt
    Dustin Wyatt over 6 years
    Note that in modern Python versions (2.7+ I think, but I don't recall for sure), keys() returns a dictionary view object, not a list.
  • Eli Korvigo
    Eli Korvigo over 6 years
    1. You should never shadow builtin names (at least, as important as list); 2. Your method scales extremely bad: it is quadratic in the number of elements in list.
  • Eli Korvigo
    Eli Korvigo over 6 years
    @dylnmc this is also a duplicate of a significantly older answer
  • cgf
    cgf about 6 years
    1. Correct, but this was an example; 2. Correct, and that's exactly the reason why I offered it. All solutions posted here have pros and cons. Some sacrifice simplicity or order, mine sacrifices scalability.
  • DaFois
    DaFois about 6 years
    Rather than putting the row code in this way, can you explain what your code does?
  • Eli Korvigo
    Eli Korvigo about 6 years
    This is horribly inefficient. list.index is a linear-time operation, making your solution quadratic.
  • Gerasimos Ragavanis
    Gerasimos Ragavanis about 6 years
I used this solution in my code and it worked great, but I think it is time-consuming
  • Anurag Misra
    Anurag Misra about 6 years
    @MeetZaveri glad.!
  • CraZ
    CraZ almost 6 years
As @Davide mentioned, the original list's items must be hashable. This means that this does not work for a list of dictionaries. TypeError: unhashable type: 'dict'
  • ShadowRanger
    ShadowRanger almost 6 years
    sorted(list(...)) is redundant (sorted already implicitly converts its argument to a new list, sorts it, then returns the new list, so using both means making an unnecessary temporary list). Use only list if the result need not be sorted, use only sorted if the result needs to be sorted.
  • ShadowRanger
    ShadowRanger almost 6 years
    Why compute/store the index if you never use it? This looks like a solution intended to preserve order (by storing last seen index of each value) that forgot to do so. list(set(lst)) would achieve the same logical result.
  • ShadowRanger
    ShadowRanger almost 6 years
    A note: There are built-ins that will break the assumption stated in the final paragraph; frozenset is hashable, set is not, and if they have the same values, they're equal, but you'll treat them as non-equal in this code.
  • Willem Van Onsem
    Willem Van Onsem almost 6 years
@ShadowRanger: yes, I agree with that; as said, it does not solve all the problems. Nevertheless, by using a set(..) this will simply not work at all, and by using a list, this will result in linear lookup time. So it is meant as a "better" set, but with some pitfalls.
  • Willem Van Onsem
    Willem Van Onsem almost 6 years
Furthermore, a set(..) also in rare cases returns objects that are not equal. For example math.nan is not equal to math.nan, but the dictionary will return it, since it checks first for reference equality.
  • Nelewout
    Nelewout over 5 years
    There appear to be quite a few other answers here. What does this answer offer over the other solutions posted? Furthermore, while this code may answer the question, it lacks explanation. Please consider adding text to explain what it does, and why it answers the question posed.
  • Gajendra D Ambi
    Gajendra D Ambi over 5 years
it is a oneliner which needs no explanation. Some like/want/understand answers which are like an essay, a few others like answers which use python's inbuilt library, some others like answers which don't use python's library, but it is for those who like oneliners which need no explanation.
  • Atonal
    Atonal over 5 years
    You're right. But also I believe it's fairly obvious the solution is intended to be a one liner that preserves the order. Everything else is already in here.
  • Jean-François Fabre
    Jean-François Fabre over 5 years
    list comprehensions shouldn't be used for side effects.
  • Erik Campobadal
    Erik Campobadal over 5 years
    You can just do: mylist = sorted(list(set(mylist)))
  • ilias iliadis
    ilias iliadis over 5 years
@blubberdiblub can you explain what more efficient mechanism exists in set and OrderedDict that could make them less time-consuming? (excluding the overhead of loading them)
  • blubberdiblub
    blubberdiblub over 5 years
    @iliasiliadis The usual implementations of set and dict use hashes or (some form of balanced) trees. You have to consider building the set or dict and searching in it (multiple times), but their amortized complexity usually is still lower than O(n ^ 2). "Amortized" in simple terms means on average (they can have worst cases with higher complexity than the average case). This is only relevant when you have a big number of items.
  • Eli Korvigo
    Eli Korvigo about 5 years
    @ZLNK please, don't ever use that. Apart from being conceptually ugly, it's also extremely inefficient, because you actually create a potentially large list and throw it away just to perform basic iteration.
  • Glutexo
    Glutexo about 5 years
    Though this might work well, using a heavy library like pandas for this purpose seems like an overkill.
  • vineeshvs
    vineeshvs about 5 years
    Also try it out here
  • Israel Teixeira
    Israel Teixeira almost 5 years
    My answer details a more efficient way of skipping repeated values when the constraint of having a pre-ordered list is appreciated
  • Asclepius
    Asclepius almost 5 years
    If the original list is not hashable, the more-itertools package has unique_everseen which works with both hashable and unhashable items.
  • 9000
    9000 over 4 years
@AdrianKeister: This is true. There are objects that have reasonable equality semantics but are not hashable, e.g. lists. OTOH if we can't have a shortcut like a hashtable, we end up with a quadratic algorithm of just comparing every element with all currently known unique elements. This can be totally OK for short inputs, especially with a lot of duplicates.
  • Adrian Keister
    Adrian Keister over 4 years
    Right, exactly. I think your answer would be higher quality if you took this very common use case into account.
  • sailfish009
    sailfish009 over 4 years
    add this to example, t = [3, 2, 1, 1, 2, 5, 6, 7, 8], shows the difference clearly!
  • millerdev
    millerdev over 4 years
    "...overhead of creating a dictionary first... If you don’t actually need to preserve the order, you’re better off using a set." — I profiled this because I was curious if it was actually true. My timings show that indeed the set is slightly faster: 1.12 µs per loop (set) vs 1.53 µs per loop (dict) over 1M loops with an absolute time difference of about 4s over 1M iterations. So if you're doing this in a tight inner loop you may care, otherwise probably not.
  • poke
    poke over 4 years
    @millerdev I was going to say something like “overhead does not only mean timing” but then I checked and it appears that a keyed dictionary is actually smaller in memory than a set with the same elements. At least in current versions of Python. That’s really surprising – but yes, it’s a good point! Thanks!
  • Fredrik Erlandsson
    Fredrik Erlandsson over 4 years
    This solves the issue with unhashable types (where t is a list of dicts): [dict(d) for d in set([frozenset(i.items()) for i in t])]
  • poke
    poke over 4 years
    @FredrikErlandsson Note that this highly depends on the actual shape of your dictionaries. If you have a dictionary that contains unhashable values itself, then this won’t be enough.
  • Z4-tier
    Z4-tier over 4 years
    Instantiating new lists and sets is not free. What happens if we do this many times in quick succession (ie. in a very tight loop), and the lists are very small?
  • DrD
    DrD about 4 years
optimized version of ordered set, for anyone who is interested: def unique(iterable): seen = set(); seen_add = seen.add; return [item for item in iterable if not item in seen and not seen_add(item)]
  • Egos
    Egos about 4 years
    the best answer in 2020 @DebosmitRay i hope you change your mind and use numpy / pandas every time you can
  • Brayoni
    Brayoni about 4 years
    You could just do list(dict.fromkeys(dup_list))
  • Brayoni
    Brayoni about 4 years
    You could just do list(dict.fromkeys(lst))
  • Brayoni
    Brayoni about 4 years
    Takes time to read and understand this answer. Is there a point in enumerating when you are not using the indices? The reduce() is already working on a sorted collection srt_enum, why did you apply sorted again?
  • Eli Korvigo
    Eli Korvigo about 4 years
    @Brayoni the first sort is there to group equal values, the second sort is there to restore initial order. The enumeration is needed to keep track of original relative order.
  • BigDreamz
    BigDreamz over 3 years
    @poke time complexity of list(dict.fromkeys(t))?
  • poke
    poke over 3 years
    @BigDreamz dict.fromkeys() creates a dictionary in linear time, and list() will create a list from it also in linear time.
  • ingyhere
    ingyhere about 3 years
    This only catches duplicates in an ordered list.
  • ingyhere
    ingyhere about 3 years
    Unsuitable for large lists as it creates a duplicate.
  • ingyhere
    ingyhere about 3 years
    I don't think this deserves a downvote as it truly is a one-liner whereas the others aren't. Improvement: Redefine the same list so that a new list is not held in memory.
  • Cybernetic
    Cybernetic about 3 years
    @ingyhere The OP did not suggest anything re: large lists. There is a always a tradeoff to every type of implementation, so the premise that every answer must default to "most scalable" is false.
  • Tomerikoo
    Tomerikoo about 3 years
    What's the point of the double loop? You can already do the if x not in check when taking the input. i.e. if n not in lst: lst.append(n) inside the while True loop and then lst will already have the result. No need for uniq at all...
  • questionto42standswithUkraine
    questionto42standswithUkraine over 2 years
Hard to read. Better to have a summary list at the bottom with the results wrapped up. Thus, for unordered hashables: Do not use: #- ii for n,ii in enumerate(seq) if ii not in seq[:n] #- cnt = Counter(); cnt[Container(x)] += 1 #- cnt = OrderedCounter(); cnt[Container(x)] += 1 #- if i not in new for i in seq. Better use: #- list(set(seq)) #- dict.fromkeys(seq) #- added = set(); for val in seq: if val not in added #- OrderedDict.fromkeys(seq) #- OrderedDict((x, True) for x in seq).keys() #- functools.reduce(lambda r, v: v in r[1] and r or ... or ..., ([], set()))[0]
  • mike rodent
    mike rodent over 2 years
    Nice... the one problem is that it's so clever that you kind of have to add a comment to say what it does.
  • Guido
    Guido over 2 years
Does passing via set() preserve list order?
  • Timothy C. Quinn
    Timothy C. Quinn almost 2 years
    Instead of using key=lambda x: x, you could use key=None and put in k = key(item) if key else item. This should be a tiny bit faster and yield the same result.
  • Keta
    Keta almost 2 years
    Nice answer, it works if the elements are not hashable. However, if the elements are Numpy arrays, you may get surprises, because the in operator doesn't work as one might expect (at least as I was expecting).