Removing duplicates in lists
Solution 1
The common approach to get a unique collection of items is to use a `set`. Sets are unordered collections of distinct objects. To create a set from any iterable, you can simply pass it to the built-in `set()` function. If you later need a real list again, you can similarly pass the set to the `list()` function.
The following example should cover whatever you are trying to do:
>>> t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> t
[1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> list(set(t))
[1, 2, 3, 5, 6, 7, 8]
>>> s = [1, 2, 3]
>>> list(set(t) - set(s))
[8, 5, 6, 7]
As you can see from the example result, the original order is not maintained. As mentioned above, sets themselves are unordered collections, so the order is lost. When converting a set back to a list, an arbitrary order is created.
Maintaining order
If order is important to you, then you will have to use a different mechanism. A very common solution for this is to rely on `OrderedDict` to keep the order of keys during insertion:
>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]
Starting with Python 3.7, the built-in dictionary is guaranteed to maintain the insertion order as well, so you can also use that directly if you are on Python 3.7 or later (or CPython 3.6):
>>> list(dict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]
Note that this may have some overhead of creating a dictionary first, and then creating a list from it. If you don’t actually need to preserve the order, you’re often better off using a set, especially because it gives you a lot more operations to work with. Check out this question for more details and alternative ways to preserve the order when removing duplicates.
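If you want to check that overhead yourself, the two approaches can be compared with the standard `timeit` module. This is only a rough sketch: the absolute numbers depend on your interpreter version and hardware, so treat them as relative.

```python
import timeit

t = [1, 2, 3, 1, 2, 5, 6, 7, 8]

# Time both approaches; exact numbers vary by machine and Python
# version, so only the relative difference is meaningful.
set_time = timeit.timeit(lambda: list(set(t)), number=100_000)
dict_time = timeit.timeit(lambda: list(dict.fromkeys(t)), number=100_000)
print(f"set:  {set_time:.3f}s")
print(f"dict: {dict_time:.3f}s")
```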
Finally, note that both the `set` and the `OrderedDict`/`dict` solutions require your items to be hashable. This usually means that they have to be immutable. If you have to deal with items that are not hashable (e.g. list objects), then you will have to use a slow approach in which you basically compare every item with every other item in a nested loop.
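For completeness, that slow fallback can be sketched as follows (the helper name `dedupe_unhashable` is my own, not from any library). It preserves first-occurrence order and needs only `==` comparisons, so it works on unhashable items such as lists:

```python
def dedupe_unhashable(items):
    """Remove duplicates from a list whose elements may be unhashable.

    Uses a nested linear scan (the `in` check compares with ==),
    so it runs in O(n^2) time in the worst case.
    """
    result = []
    for item in items:
        if item not in result:
            result.append(item)
    return result

print(dedupe_unhashable([[1, 2], [3], [1, 2], [3], [4]]))  # [[1, 2], [3], [4]]
```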
Solution 2
In Python 2.7, the new way of removing duplicates from an iterable while keeping it in the original order is:
>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
In Python 3.5, the OrderedDict has a C implementation. My timings show that this is now both the fastest and shortest of the various approaches for Python 3.5.
In Python 3.6, the regular dict became both ordered and compact. (This holds for CPython and PyPy but may not be present in other implementations.) That gives us a new fastest way of deduping while retaining order:
>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
In Python 3.7, the regular dict is guaranteed to be ordered across all implementations. So, the shortest and fastest solution is:
>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
Solution 3
It's a one-liner: `list(set(source_list))` will do the trick.
A `set` is something that can't possibly have duplicates.
Update: an order-preserving approach is two lines:
from collections import OrderedDict
OrderedDict((x, True) for x in source_list).keys()
Here we use the fact that `OrderedDict` remembers the insertion order of keys, and does not change it when a value at a particular key is updated. We insert `True` as values, but we could insert anything; values are just not used. (A `set` works a lot like a `dict` with ignored values, too.)
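One caveat, assuming Python 3: there, `.keys()` returns a view object rather than a list, so you may want to wrap the expression in `list()`. A minimal sketch:

```python
from collections import OrderedDict

source_list = [1, 2, 2, 3, 1]
# On Python 3, .keys() returns a view; wrap in list() for a real list.
unique = list(OrderedDict((x, True) for x in source_list).keys())
print(unique)  # [1, 2, 3]
```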
Solution 4
>>> t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> t
[1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> s = []
>>> for i in t:
...     if i not in s:
...         s.append(i)
...
>>> s
[1, 2, 3, 5, 6, 7, 8]
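Note that the `i not in s` membership test above scans the output list on every iteration, which is quadratic overall. A common variant (a sketch, not part of the original answer) keeps an auxiliary set for O(1) lookups while still preserving first-occurrence order; it requires the items to be hashable:

```python
t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
seen = set()   # fast O(1) membership tests
s = []         # preserves first-occurrence order
for i in t:
    if i not in seen:
        seen.add(i)
        s.append(i)
print(s)  # [1, 2, 3, 5, 6, 7, 8]
```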
Solution 5
If you don't care about the order, just do this:
def remove_duplicates(l):
    return list(set(l))
A `set` is guaranteed to not have duplicates.
Neemaximo
Updated on July 14, 2022

Comments
-
Neemaximo almost 2 years
Pretty much I need to write a program to check if a list has any duplicates, and if it does, it removes them and returns a new list with the items that weren't duplicated/removed. This is what I have but to be honest I do not know what to do.

def remove_duplicates():
    t = ['a', 'b', 'c', 'd']
    t2 = ['a', 'c', 'd']
    for t in t2:
        t.append(t.remove())
    return t
-
Herberth Amaral over 11 yearsI think this is the only way to keep the items in order.
-
Martijn Pieters over 10 years@HerberthAmaral: That is very far from true, see How do you remove duplicates from a list in Python whilst preserving order?
-
Herberth Amaral over 10 years@MartijnPieters Correcting: I think this is the only simple way to keep items in order.
-
dotancohen over 10 yearsNote that this method works in O(n^2) time and is thus very slow on large lists.
-
user227666 almost 10 yearsThis will convert the list to numpy array which is a mess and won't work for strings.
-
G M almost 10 years@user227666 thanks for your review, but that's not true: it works even with strings, and you can add `.tolist()` if you want to get a list...
-
Joop over 9 yearsNice use of a set's quick lookup to speed up the looped comparison. If order does not matter, `list(set(x))` is still 6x faster than this.
-
volcano over 9 years@Joop, that was my first question for my colleague - the order does matter; otherwise, it would have been a trivial issue.
-
abarnert over 9 years
`map` with a side effect is even more misleading than a listcomp with a side effect. Also, `lambda x: unique_list.append(x)` is just a clunkier and slower way to pass `unique_list.append`.
-
Martijn Pieters about 8 yearsUse `enumerate()` to get the index faster: `for i, value in enumerate(t): if value in t[i + 1:]: t.remove(value)`
-
jermenkoo about 8 yearsquadratic complexity nonetheless - `in` is an O(n) operation and your `cleanlist` will have at most `n` numbers => worst-case ~O(n^2)
-
loxaxs almost 8 yearsYet, this solution requires orderable elements. I will use it to uniquify my list of lists: it is a pain to `tuple()` lists and to hash them. Generally speaking, the hash process takes a time proportional to the size of the whole data, while this solution takes a time O(nlog(n)), depending only on the length of the list.
-
dylnmc over 7 yearsIf you don't care about order, then this takes significantly longer. `list(set(..))` (over 1 million passes) will beat this solution by about 10 whole seconds - whereas this approach takes about 12 seconds, `list(set(..))` only takes about 2 seconds!
-
dylnmc over 7 years
`seen = set(iterable); for item in seen: yield item` is almost certainly faster. (I haven't tried this specific case, but that would be my guess.)
Debosmit Ray over 7 yearsI think this is kinda like trying to kill a bee with a sledgehammer. Works, sure! But, importing a library for just this purpose might be a little overkill, no?
-
G M over 7 years@DebosmitRay it could be useful if you work in Data Science where usually you work with numpy and many times you need to work with numpy array.
-
Cyphase over 7 years@dylnmc, that's a batch operation, and it also loses the ordering. My answer was specifically intended to be on-the-fly and in order of first occurrence. :)
-
Davide about 7 yearsFor this too, the content of the original list must be hashable
-
blubberdiblub about 7 yearsThis has a time complexity of O(n ^ 2). The answers with `set` and `OrderedDict` may have lower amortized time complexity.
-
ZLNK almost 7 yearsVery useful way to append elements in just one line, thanks!
-
9000 almost 7 yearsI think that the set-based approach is equally cheap (O(n log n)), or cheaper, than sorting + detection of uniques. (This approach would parallelize much better, though.) It also does not exactly preserve the initial order, but it gives a predictable order.
-
Eli Korvigo almost 7 years@9000 That is true. I've never mentioned time-complexity of a hash-table-based approach, which is obviously O(n). Here you can find many answers incorporating hash-tables. They are not universal, though, because they require objects to be hashable. Moreover, they are a lot more memory-intensive.
-
Dustin Wyatt over 6 yearsNote that in modern Python versions (2.7+ I think, but I don't recall for sure), `keys()` returns a dictionary view object, not a list.
-
Eli Korvigo over 6 years1. You should never shadow builtin names (at least, as important as `list`); 2. Your method scales extremely badly: it is quadratic in the number of elements in `list`.
-
Eli Korvigo over 6 years@dylnmc this is also a duplicate of a significantly older answer
-
cgf about 6 years1. Correct, but this was an example; 2. Correct, and that's exactly the reason why I offered it. All solutions posted here have pros and cons. Some sacrifice simplicity or order, mine sacrifices scalability.
-
DaFois about 6 yearsRather than putting the raw code in this way, can you explain what your code does?
-
Eli Korvigo about 6 yearsThis is horribly inefficient. `list.index` is a linear-time operation, making your solution quadratic.
-
Gerasimos Ragavanis about 6 yearsI used this solution in my code and it worked great, but I think it is time-consuming.
-
Anurag Misra about 6 years@MeetZaveri glad.!
-
CraZ almost 6 yearsAs @Davide mentioned, the original list must be hashable. This means that this does not work for a list of dictionaries. `TypeError: unhashable type: 'dict'`
-
ShadowRanger almost 6 years
`sorted(list(...))` is redundant (`sorted` already implicitly converts its argument to a new `list`, sorts it, then returns the new `list`, so using both means making an unnecessary temporary `list`). Use only `list` if the result need not be sorted, use only `sorted` if the result needs to be sorted.
-
ShadowRanger almost 6 yearsWhy compute/store the index if you never use it? This looks like a solution intended to preserve order (by storing the last seen index of each value) that forgot to do so. `list(set(lst))` would achieve the same logical result.
-
ShadowRanger almost 6 yearsA note: There are built-ins that will break the assumption stated in the final paragraph; `frozenset` is hashable, `set` is not, and if they have the same values, they're equal, but you'll treat them as non-equal in this code.
-
Willem Van Onsem almost 6 years@ShadowRanger: yes, I agree with that; as I said, it does not solve all the problems. Nevertheless, by using a `set(..)` this will simply not work at all, and by using a `list`, this will result in linear lookup time. So it is meant as a "better" set, but with some pitfalls.
-
Willem Van Onsem almost 6 yearsFurthermore, a `set(..)` also in rare cases returns objects that are not equal. For example `math.nan` is not equal to `math.nan`, but the dictionary will return it, since it checks first for reference equality.
-
Nelewout over 5 yearsThere appear to be quite a few other answers here. What does this answer offer over the other solutions posted? Furthermore, while this code may answer the question, it lacks explanation. Please consider adding text to explain what it does, and why it answers the question posed.
-
Gajendra D Ambi over 5 yearsit is a one-liner which needs no explanation. Some like/want/understand answers which are like an essay, a few others like answers which use Python's inbuilt library, some others like answers which don't use Python's library, but it is for those who like one-liners which need no explanation.
-
Atonal over 5 yearsYou're right. But also I believe it's fairly obvious the solution is intended to be a one liner that preserves the order. Everything else is already in here.
-
Jean-François Fabre over 5 yearslist comprehensions shouldn't be used for side effects.
-
Erik Campobadal over 5 yearsYou can just do: mylist = sorted(list(set(mylist)))
-
ilias iliadis over 5 years@blubberdiblub can you explain what more efficient mechanism exists in set and OrderedDict that could make them less time-consuming? (excluding the overhead of loading them)
-
blubberdiblub over 5 years@iliasiliadis The usual implementations of set and dict use hashes or (some form of balanced) trees. You have to consider building the set or dict and searching in it (multiple times), but their amortized complexity usually is still lower than O(n ^ 2). "Amortized" in simple terms means on average (they can have worst cases with higher complexity than the average case). This is only relevant when you have a big number of items.
-
Eli Korvigo about 5 years@ZLNK please, don't ever use that. Apart from being conceptually ugly, it's also extremely inefficient, because you actually create a potentially large list and throw it away just to perform basic iteration.
-
Glutexo about 5 yearsThough this might work well, using a heavy library like pandas for this purpose seems like overkill.
-
vineeshvs about 5 yearsAlso try it out here
-
Israel Teixeira almost 5 yearsMy answer details a more efficient way of skipping repeated values when the constraint of having a pre-ordered list is appreciated
-
Asclepius almost 5 yearsIf the original list is not hashable, the more-itertools package has `unique_everseen`, which works with both hashable and unhashable items.
-
9000 over 4 years@AdrianKeister: This is true. There are objects that have reasonable equality semantics but are not hashable, e.g. lists. OTOH if we can't have a shortcut like a hastable, we end up with a quadratic algorithm of just comparing every element with all currently known unique elements. This can be totally OK for short inputs, especially with a lot of duplicates.
-
Adrian Keister over 4 yearsRight, exactly. I think your answer would be higher quality if you took this very common use case into account.
-
sailfish009 over 4 yearsadd this to the example: t = [3, 2, 1, 1, 2, 5, 6, 7, 8]; it shows the difference clearly!
-
millerdev over 4 years"...overhead of creating a dictionary first... If you don’t actually need to preserve the order, you’re better off using a set." — I profiled this because I was curious if it was actually true. My timings show that indeed the set is slightly faster: 1.12 µs per loop (set) vs 1.53 µs per loop (dict) over 1M loops with an absolute time difference of about 4s over 1M iterations. So if you're doing this in a tight inner loop you may care, otherwise probably not.
-
poke over 4 years@millerdev I was going to say something like “overhead does not only mean timing” but then I checked and it appears that a keyed dictionary is actually smaller in memory than a set with the same elements. At least in current versions of Python. That’s really surprising – but yes, it’s a good point! Thanks!
-
Fredrik Erlandsson over 4 yearsThis solves the issue with unhashable types (where t is a list of dicts):
[dict(d) for d in set([frozenset(i.items()) for i in t])]
-
poke over 4 years@FredrikErlandsson Note that this highly depends on the actual shape of your dictionaries. If you have a dictionary that contains unhashable values itself, then this won’t be enough.
-
Z4-tier over 4 yearsInstantiating new lists and sets is not free. What happens if we do this many times in quick succession (ie. in a very tight loop), and the lists are very small?
-
DrD about 4 yearsoptimized version of ordered set, for anyone who is interested:
def unique(iterable):
    seen = set()
    seen_add = seen.add
    return [item for item in iterable if item not in seen and not seen_add(item)]
-
Egos about 4 yearsthe best answer in 2020 @DebosmitRay i hope you change your mind and use numpy / pandas every time you can
-
Brayoni about 4 yearsYou could just do `list(dict.fromkeys(dup_list))`
-
Brayoni about 4 yearsYou could just do `list(dict.fromkeys(lst))`
-
Brayoni about 4 yearsIt takes time to read and understand this answer. Is there a point in enumerating when you are not using the indices? The `reduce()` is already working on a sorted collection `srt_enum`; why did you apply `sorted` again?
-
Eli Korvigo about 4 years@Brayoni the first sort is there to group equal values, the second sort is there to restore initial order. The enumeration is needed to keep track of original relative order.
-
BigDreamz over 3 years@poke time complexity of list(dict.fromkeys(t))?
-
poke over 3 years@BigDreamz `dict.fromkeys()` creates a dictionary in linear time, and `list()` will create a list from it, also in linear time.
-
ingyhere about 3 yearsThis only catches duplicates in an ordered list.
-
ingyhere about 3 yearsUnsuitable for large lists as it creates a duplicate.
-
ingyhere about 3 yearsI don't think this deserves a downvote as it truly is a one-liner whereas the others aren't. Improvement: Redefine the same list so that a new list is not held in memory.
-
Cybernetic about 3 years@ingyhere The OP did not suggest anything re: large lists. There is a always a tradeoff to every type of implementation, so the premise that every answer must default to "most scalable" is false.
-
Tomerikoo about 3 yearsWhat's the point of the double loop? You can already do the `if x not in` check when taking the input, i.e. `if n not in lst: lst.append(n)` inside the `while True` loop, and then `lst` will already have the result. No need for `uniq` at all...
-
questionto42standswithUkraine over 2 yearsHard to read. Better have a top list at the bottom with the results wrapped up. Thus, for unordered hashables: Do not use: #- ii for n,ii in enumerate(seq) if ii not in seq[:n] #- cnt = Counter(); cnt[Container(x)] += 1 #- cnt = OrderedCounter(); cnt[Container(x)) += 1 #- if i not in new for i in seq. Better use: #- list(set(seq)) #- dict.fromkeys(seq) #- added = set(); for in seq: if not val in added #- OrderedDict.fromkeys(seq) #- OrderedDict((x, True) for x in seq).keys() #- functools.reduce(lambda r, v: v in r[1] and r or ... or ..., ([], set[]))[0]
-
mike rodent over 2 yearsNice... the one problem is that it's so clever that you kind of have to add a comment to say what it does.
-
Guido over 2 yearsPassing via set() preserve list order?
-
Timothy C. Quinn almost 2 yearsInstead of using `key=lambda x: x`, you could use `key=None` and put in `k = key(item) if key else item`. This should be a tiny bit faster and yield the same result.
-
Keta almost 2 yearsNice answer, it works if the elements are not hashable. However, if the elements are Numpy arrays, you may get surprises, because the `in` operator doesn't work as one might expect (at least not as I was expecting).