Removing duplicates in lists

Solution 1

The common approach to get a unique collection of items is to use a set. Sets are unordered collections of distinct objects. To create a set from any iterable, you can simply pass it to the built-in set() function. If you later need a real list again, you can similarly pass the set to the list() function.

The following example should cover whatever you are trying to do:

>>> t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> t
[1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> list(set(t))
[1, 2, 3, 5, 6, 7, 8]
>>> s = [1, 2, 3]
>>> list(set(t) - set(s))
[8, 5, 6, 7]

As you can see from the example result, the original order is not maintained. As mentioned above, sets themselves are unordered collections, so the order is lost. When converting a set back to a list, an arbitrary order is created.

Maintaining order

If order is important to you, then you will have to use a different mechanism. A very common solution for this is to rely on OrderedDict to keep the order of keys during insertion:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]

Starting with Python 3.7, the built-in dictionary is guaranteed to maintain the insertion order as well, so you can also use that directly if you are on Python 3.7 or later (or CPython 3.6):

>>> list(dict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]

Note that this approach has the overhead of creating a dictionary first and then creating a list from it. If you don’t actually need to preserve the order, you’re often better off using a set, especially because it gives you a lot more operations to work with, as illustrated below. Check out this question for more details and alternative ways to preserve the order when removing duplicates.
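
As an illustration of those extra operations, here is a small REPL sketch reusing t and s from above (the display order of a set is arbitrary, so your output may differ):

>>> set(t) & set(s)          # intersection: items present in both
{1, 2, 3}
>>> set(t) | set(s)          # union: items present in either
{1, 2, 3, 5, 6, 7, 8}
>>> set(t) ^ set(s)          # symmetric difference: items in exactly one
{5, 6, 7, 8}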


Finally, note that both the set and the OrderedDict/dict solutions require your items to be hashable. This usually means that they have to be immutable. If you have to deal with items that are not hashable (e.g. list objects), then you will have to fall back to a slow approach in which you basically compare every item with every other item in a nested loop.
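
A minimal sketch of that fallback, using a hypothetical helper name (quadratic time, but it only needs == comparisons, not hashing):

def unique_unhashable(items):
    result = []
    for item in items:
        if item not in result:  # "in" compares with ==, so no hashing required
            result.append(item)
    return result

For example, unique_unhashable([[1, 2], [3], [1, 2]]) returns [[1, 2], [3]].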

Solution 2

In Python 2.7, the new way of removing duplicates from an iterable while keeping the original order is:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

In Python 3.5, OrderedDict gained a C implementation. My timings show that this is now both the fastest and shortest of the various approaches for Python 3.5.

In Python 3.6, the regular dict became both ordered and compact. (This holds for CPython and PyPy but may not be present in other implementations.) That gives us a new fastest way of deduping while retaining order:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

In Python 3.7, the regular dict is guaranteed to be ordered across all implementations. So, the shortest and fastest solution is:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
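
If you want to reproduce such timings yourself, here is a rough sketch using the standard timeit module (the numbers depend heavily on machine, interpreter, and input, so treat the claims above as empirical, not universal; the data value is just an illustrative input):

import timeit

# Sketch only: absolute numbers vary by machine and Python version.
setup = "from collections import OrderedDict; data = 'abracadabra' * 100"
print(timeit.timeit("list(OrderedDict.fromkeys(data))", setup=setup, number=10_000))
print(timeit.timeit("list(dict.fromkeys(data))", setup=setup, number=10_000))
print(timeit.timeit("list(set(data))", setup=setup, number=10_000))  # loses order; for comparison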

Solution 3

It's a one-liner: list(set(source_list)) will do the trick.

A set is something that can't possibly have duplicates.

Update: an order-preserving approach is two lines:

from collections import OrderedDict
list(OrderedDict((x, True) for x in source_list).keys())

Here we use the fact that OrderedDict remembers the insertion order of keys and does not change it when a value at a particular key is updated. We insert True as the values, but we could insert anything; the values are just not used. (A set works a lot like a dict with ignored values, too.) The list() call is needed on Python 3, where keys() returns a view rather than a list.
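
To see that value updates really do leave the key order untouched, a quick REPL check:

>>> from collections import OrderedDict
>>> d = OrderedDict((x, True) for x in ['b', 'a', 'b'])
>>> d['b'] = False          # updating a value; 'b' keeps its first-insertion slot
>>> list(d.keys())
['b', 'a']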

Solution 4

>>> t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> t
[1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> s = []
>>> for i in t:
...     if i not in s:
...         s.append(i)
...
>>> s
[1, 2, 3, 5, 6, 7, 8]
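
Since i not in s rescans the result list on every iteration, this approach is quadratic overall, as several commenters below point out. A minimal sketch of the usual linear-time variant, which tracks already-seen items in an auxiliary set (this assumes the items are hashable):

>>> seen = set()
>>> result = []
>>> for i in t:
...     if i not in seen:    # O(1) membership test instead of scanning a list
...         seen.add(i)
...         result.append(i)
...
>>> result
[1, 2, 3, 5, 6, 7, 8]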

Solution 5

If you don't care about the order, just do this:

def remove_duplicates(l):
    return list(set(l))

A set is guaranteed to not have duplicates.
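
For example (the order of the result is arbitrary in principle, although CPython happens to list small integers in ascending order):

>>> remove_duplicates([1, 2, 2, 3, 3, 3])
[1, 2, 3]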


Comments

  • Neemaximo
    Neemaximo almost 2 years

Pretty much, I need to write a program to check if a list has any duplicates, and if it does, remove them and return a new list with the items that weren't duplicated/removed. This is what I have, but to be honest I do not know what to do.

    def remove_duplicates():
        t = ['a', 'b', 'c', 'd']
        t2 = ['a', 'c', 'd']
        for t in t2:
            t.append(t.remove())
        return t
    
  • Herberth Amaral
    Herberth Amaral over 11 years
    I think this is the only way to keep the items in order.
  • Martijn Pieters
    Martijn Pieters over 10 years
    @HerberthAmaral: That is very far from true, see How do you remove duplicates from a list in Python whilst preserving order?
  • Herberth Amaral
    Herberth Amaral over 10 years
    @MartijnPieters Correcting: I think this is the only simple way to keep items in order.
  • dotancohen
    dotancohen over 10 years
    Note that this method works in O(n^2) time and is thus very slow on large lists.
  • user227666
    user227666 almost 10 years
This will convert the list to a numpy array, which is a mess and won't work for strings.
  • G M
    G M almost 10 years
@user227666 thanks for your review, but that's not true: it works even with strings, and you can add .tolist() if you want to get a list...
  • Joop
    Joop over 9 years
    Nice using set quick lookup to speed up the looped comparison. If order does not matter list(set(x)) is still 6x faster than this
  • volcano
    volcano over 9 years
@Joop, that was my first question for my colleague - the order does matter; otherwise, it would have been a trivial issue
  • abarnert
    abarnert over 9 years
    map with a side effect is even more misleading than a listcomp with a side effect. Also, lambda x: unique_list.append(x) is just a clunkier and slower way to pass unique_list.append.
  • Martijn Pieters
    Martijn Pieters about 8 years
    Use enumerate() to get the index faster: for i, value in enumerate(t): if value in t[i + 1:]: t.remove(value)
  • jermenkoo
    jermenkoo about 8 years
    quadratic complexity nonetheless - in is O(n) operation and your cleanlist will have at most n numbers => worst-case ~O(n^2)
  • loxaxs
    loxaxs almost 8 years
Yet, this solution requires orderable elements. I will use it to uniquify my list of lists: it is a pain to tuple() lists and to hash them. Generally speaking, the hash process takes time proportional to the size of the whole data, while this solution takes O(n log(n)) time, depending only on the length of the list.
  • dylnmc
    dylnmc over 7 years
    If you don't care about order, then this takes significantly longer. list(set(..)) (over 1 million passes) will beat this solution by about 10 whole seconds - whereas this approach takes about 12 seconds, list(set(..)) only takes about 2 seconds!
  • dylnmc
    dylnmc over 7 years
    seen = set(iterable); for item in seen: yield item is almost certainly faster. (I haven't tried this specific case, but that would be my guess.)
  • Debosmit Ray
    Debosmit Ray over 7 years
    I think this is kinda like trying to kill a bee with a sledgehammer. Works, sure! But, importing a library for just this purpose might be a little overkill, no?
  • G M
    G M over 7 years
    @DebosmitRay it could be useful if you work in Data Science where usually you work with numpy and many times you need to work with numpy array.
  • Cyphase
    Cyphase over 7 years
    @dylnmc, that's a batch operation, and it also loses the ordering. My answer was specifically intended to be on-the-fly and in order of first occurrence. :)
  • Davide
    Davide about 7 years
    For this too, the content of the original list must be hashable
  • blubberdiblub
    blubberdiblub about 7 years
    This has a time complexity of O(n ^ 2). The answers with set and OrderedDict may have lower amortized time complexity.
  • ZLNK
    ZLNK almost 7 years
    Very useful way to append elements in just one line, thanks!
  • 9000
    9000 almost 7 years
    I think that the set-based approach is equally cheap (O(n log n)), or cheaper, than sorting + detection of uniques. (This approach would parallelize much better, though.) It also does not exactly preserve the initial order, but it gives a predictable order.
  • Eli Korvigo
    Eli Korvigo almost 7 years
    @9000 That is true. I've never mentioned time-complexity of a hash-table-based approach, which is obviously O(n). Here you can find many answers incorporating hash-tables. They are not universal, though, because they require objects to be hashable. Moreover, they are a lot more memory-intensive.
  • Dustin Wyatt
    Dustin Wyatt over 6 years
    Note that in modern Python versions (2.7+ I think, but I don't recall for sure), keys() returns a dictionary view object, not a list.
  • Eli Korvigo
    Eli Korvigo over 6 years
    1. You should never shadow builtin names (at least, as important as list); 2. Your method scales extremely bad: it is quadratic in the number of elements in list.
  • Eli Korvigo
    Eli Korvigo over 6 years
    @dylnmc this is also a duplicate of a significantly older answer
  • cgf
    cgf about 6 years
    1. Correct, but this was an example; 2. Correct, and that's exactly the reason why I offered it. All solutions posted here have pros and cons. Some sacrifice simplicity or order, mine sacrifices scalability.
  • DaFois
    DaFois about 6 years
    Rather than putting the row code in this way, can you explain what your code does?
  • Eli Korvigo
    Eli Korvigo about 6 years
    This is horribly inefficient. list.index is a linear-time operation, making your solution quadratic.
  • Gerasimos Ragavanis
    Gerasimos Ragavanis about 6 years
I used this solution in my code and it worked great, but I think it is time-consuming
  • Anurag Misra
    Anurag Misra about 6 years
    @MeetZaveri glad.!
  • CraZ
    CraZ almost 6 years
As @Davide mentioned, the original list's items must be hashable. This means that this does not work for a list of dictionaries. TypeError: unhashable type: 'dict'
  • ShadowRanger
    ShadowRanger almost 6 years
    sorted(list(...)) is redundant (sorted already implicitly converts its argument to a new list, sorts it, then returns the new list, so using both means making an unnecessary temporary list). Use only list if the result need not be sorted, use only sorted if the result needs to be sorted.
  • ShadowRanger
    ShadowRanger almost 6 years
    Why compute/store the index if you never use it? This looks like a solution intended to preserve order (by storing last seen index of each value) that forgot to do so. list(set(lst)) would achieve the same logical result.
  • ShadowRanger
    ShadowRanger almost 6 years
    A note: There are built-ins that will break the assumption stated in the final paragraph; frozenset is hashable, set is not, and if they have the same values, they're equal, but you'll treat them as non-equal in this code.
  • Willem Van Onsem
    Willem Van Onsem almost 6 years
@ShadowRanger: yes, I agree with that; as said, it does not solve all the problems. Nevertheless, by using a set(..) this will simply not work at all, and by using a list, this will result in linear lookup time. So it is meant as a "better" set, but with some pitfalls.
  • Willem Van Onsem
    Willem Van Onsem almost 6 years
Furthermore, a set(..) also in rare cases returns objects that are not equal. For example math.nan is not equal to math.nan, but the dictionary will return it, since it checks first for reference equality.
  • Nelewout
    Nelewout over 5 years
    There appear to be quite a few other answers here. What does this answer offer over the other solutions posted? Furthermore, while this code may answer the question, it lacks explanation. Please consider adding text to explain what it does, and why it answers the question posed.
  • Gajendra D Ambi
    Gajendra D Ambi over 5 years
it is a oneliner which needs no explanation. Some like/want/understand answers which are like an essay, a few others like answers which use python's inbuilt library, some others like answers which don't use python's library, but it is for those who like oneliners which need no explanation.
  • Atonal
    Atonal over 5 years
    You're right. But also I believe it's fairly obvious the solution is intended to be a one liner that preserves the order. Everything else is already in here.
  • Jean-François Fabre
    Jean-François Fabre over 5 years
    list comprehensions shouldn't be used for side effects.
  • Erik Campobadal
    Erik Campobadal over 5 years
    You can just do: mylist = sorted(list(set(mylist)))
  • ilias iliadis
    ilias iliadis over 5 years
@blubberdiblub can you explain what more efficient mechanism exists in set and OrderedDict that could make them less time-consuming? (excluding the overhead of loading them)
  • blubberdiblub
    blubberdiblub over 5 years
    @iliasiliadis The usual implementations of set and dict use hashes or (some form of balanced) trees. You have to consider building the set or dict and searching in it (multiple times), but their amortized complexity usually is still lower than O(n ^ 2). "Amortized" in simple terms means on average (they can have worst cases with higher complexity than the average case). This is only relevant when you have a big number of items.
  • Eli Korvigo
    Eli Korvigo about 5 years
    @ZLNK please, don't ever use that. Apart from being conceptually ugly, it's also extremely inefficient, because you actually create a potentially large list and throw it away just to perform basic iteration.
  • Glutexo
    Glutexo about 5 years
    Though this might work well, using a heavy library like pandas for this purpose seems like an overkill.
  • vineeshvs
    vineeshvs about 5 years
    Also try it out here
  • Israel Teixeira
    Israel Teixeira almost 5 years
    My answer details a more efficient way of skipping repeated values when the constraint of having a pre-ordered list is appreciated
  • Asclepius
    Asclepius almost 5 years
    If the original list is not hashable, the more-itertools package has unique_everseen which works with both hashable and unhashable items.
  • 9000
    9000 over 4 years
@AdrianKeister: This is true. There are objects that have reasonable equality semantics but are not hashable, e.g. lists. OTOH if we can't have a shortcut like a hashtable, we end up with a quadratic algorithm of just comparing every element with all currently known unique elements. This can be totally OK for short inputs, especially with a lot of duplicates.
  • Adrian Keister
    Adrian Keister over 4 years
    Right, exactly. I think your answer would be higher quality if you took this very common use case into account.
  • sailfish009
    sailfish009 over 4 years
    add this to example, t = [3, 2, 1, 1, 2, 5, 6, 7, 8], shows the difference clearly!
  • millerdev
    millerdev over 4 years
    "...overhead of creating a dictionary first... If you don’t actually need to preserve the order, you’re better off using a set." — I profiled this because I was curious if it was actually true. My timings show that indeed the set is slightly faster: 1.12 µs per loop (set) vs 1.53 µs per loop (dict) over 1M loops with an absolute time difference of about 4s over 1M iterations. So if you're doing this in a tight inner loop you may care, otherwise probably not.
  • poke
    poke over 4 years
    @millerdev I was going to say something like “overhead does not only mean timing” but then I checked and it appears that a keyed dictionary is actually smaller in memory than a set with the same elements. At least in current versions of Python. That’s really surprising – but yes, it’s a good point! Thanks!
  • Fredrik Erlandsson
    Fredrik Erlandsson over 4 years
    This solves the issue with unhashable types (where t is a list of dicts): [dict(d) for d in set([frozenset(i.items()) for i in t])]
  • poke
    poke over 4 years
    @FredrikErlandsson Note that this highly depends on the actual shape of your dictionaries. If you have a dictionary that contains unhashable values itself, then this won’t be enough.
  • Z4-tier
    Z4-tier over 4 years
    Instantiating new lists and sets is not free. What happens if we do this many times in quick succession (ie. in a very tight loop), and the lists are very small?
  • DrD
    DrD about 4 years
optimized version of ordered set, for anyone who is interested: def unique(iterable): seen = set(); seen_add = seen.add; return [item for item in iterable if not item in seen and not seen_add(item)]
  • Egos
    Egos about 4 years
    the best answer in 2020 @DebosmitRay i hope you change your mind and use numpy / pandas every time you can
  • Brayoni
    Brayoni about 4 years
    You could just do list(dict.fromkeys(dup_list))
  • Brayoni
    Brayoni about 4 years
    You could just do list(dict.fromkeys(lst))
  • Brayoni
    Brayoni about 4 years
    Takes time to read and understand this answer. Is there a point in enumerating when you are not using the indices? The reduce() is already working on a sorted collection srt_enum, why did you apply sorted again?
  • Eli Korvigo
    Eli Korvigo about 4 years
    @Brayoni the first sort is there to group equal values, the second sort is there to restore initial order. The enumeration is needed to keep track of original relative order.
  • BigDreamz
    BigDreamz over 3 years
    @poke time complexity of list(dict.fromkeys(t))?
  • poke
    poke over 3 years
    @BigDreamz dict.fromkeys() creates a dictionary in linear time, and list() will create a list from it also in linear time.
  • ingyhere
    ingyhere about 3 years
    This only catches duplicates in an ordered list.
  • ingyhere
    ingyhere about 3 years
    Unsuitable for large lists as it creates a duplicate.
  • ingyhere
    ingyhere about 3 years
    I don't think this deserves a downvote as it truly is a one-liner whereas the others aren't. Improvement: Redefine the same list so that a new list is not held in memory.
  • Cybernetic
    Cybernetic about 3 years
    @ingyhere The OP did not suggest anything re: large lists. There is a always a tradeoff to every type of implementation, so the premise that every answer must default to "most scalable" is false.
  • Tomerikoo
    Tomerikoo about 3 years
    What's the point of the double loop? You can already do the if x not in check when taking the input. i.e. if n not in lst: lst.append(n) inside the while True loop and then lst will already have the result. No need for uniq at all...
  • questionto42standswithUkraine
    questionto42standswithUkraine over 2 years
Hard to read. Better to have a summary list at the bottom with the results wrapped up. Thus, for unordered hashables: Do not use: #- ii for n,ii in enumerate(seq) if ii not in seq[:n] #- cnt = Counter(); cnt[Container(x)] += 1 #- cnt = OrderedCounter(); cnt[Container(x)] += 1 #- if i not in new for i in seq. Better use: #- list(set(seq)) #- dict.fromkeys(seq) #- added = set(); for val in seq: if val not in added #- OrderedDict.fromkeys(seq) #- OrderedDict((x, True) for x in seq).keys() #- functools.reduce(lambda r, v: v in r[1] and r or ... or ..., ([], set()))[0]
  • mike rodent
    mike rodent over 2 years
    Nice... the one problem is that it's so clever that you kind of have to add a comment to say what it does.
  • Guido
    Guido over 2 years
Does passing via set() preserve list order?
  • Timothy C. Quinn
    Timothy C. Quinn almost 2 years
    Instead of using key=lambda x: x, you could use key=None and put in k = key(item) if key else item. This should be a tiny bit faster and yield the same result.
  • Keta
    Keta almost 2 years
    Nice answer, it works if the elements are not hashable. However, if the elements are Numpy arrays, you may get surprises, because the in operator doesn't work as one might expect (at least as I was expecting).