Remove duplicate JSON objects from list in Python

22,591

Solution 1

You can remove dicts with a duplicate "Name" using a dictionary comprehension, since a dictionary cannot hold duplicate keys. Note that only the last dict for each Name is kept:

te = [
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
          "Name": "Bala1",
          "phone": "None"
      }      
    ]

unique = list({each['Name']: each for each in te}.values())

print(unique)

Output (Python 3.7+, where dicts preserve insertion order):

[{'Name': 'Bala', 'phone': 'None'}, {'Name': 'Bala1', 'phone': 'None'}]
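
The comprehension above keeps the last dict seen for each Name, so earlier entries with the same name but different data are silently dropped. If the first occurrence should win instead, a minimal sketch could use dict.setdefault. The helper name unique_by_key and the sample records are hypothetical and not part of the original answer; the ordering relies on Python 3.7+ dicts preserving insertion order.

```python
def unique_by_key(items, key):
    """Keep the first dict seen for each value of `key` (hypothetical helper)."""
    first_seen = {}
    for item in items:
        # setdefault stores the item only if this key value is new
        first_seen.setdefault(item[key], item)
    return list(first_seen.values())


records = [
    {"Name": "Bala", "phone": "None"},
    {"Name": "Bala", "phone": "1234"},
    {"Name": "Bala1", "phone": "None"},
]

first_wins = unique_by_key(records, "Name")
last_wins = list({each["Name"]: each for each in records}.values())
```

Here first_wins keeps the "None" phone entry for Bala, while last_wins keeps the "1234" entry. Either way only one dict per Name survives, which may not be what you want if whole-dict equality is the goal.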

Solution 2

Your function fails because you can't add a dict to a set. From this question:

You're trying to use a dict as a key to another dict or in a set. That does not work because the keys have to be hashable.

As a general rule, only immutable objects (strings, integers, floats, frozensets, tuples of immutables) are hashable (though exceptions are possible).

>>> foo = dict()
>>> bar = set()
>>> bar.add(foo)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: unhashable type: 'dict'
>>> 
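
To see which objects qualify, a small check can help. The sketch below uses a hypothetical helper, is_hashable, that simply calls hash() and reports whether it raised TypeError:

```python
def is_hashable(obj):
    """Return True if obj can be used as a dict key or set member."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


d = {"Name": "Bala", "phone": "None"}

dict_ok = is_hashable(d)                       # dicts are mutable
tuple_ok = is_hashable(tuple(d.items()))       # tuple of (str, str) pairs
frozen_ok = is_hashable(frozenset(d.items()))  # frozenset of the same pairs
nested_ok = is_hashable(({"a": 1},))           # tuple that contains a dict
```

The last case shows why the rule says "tuples of immutables": a tuple is only hashable if everything inside it is.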

Since you're already testing membership with if x not in seen, just use a list for seen instead of a set:

>>> te = [
...       {
...         "Name": "Bala",
...         "phone": "None"
...       },
...       {
...         "Name": "Bala",
...         "phone": "None"
...       },
...       {
...         "Name": "Bala",
...         "phone": "None"
...       },
...       {
...         "Name": "Bala",
...         "phone": "None"
...       }
...     ]

>>> def removeduplicate(it):
...     seen = []
...     for x in it:
...         if x not in seen:
...             yield x
...             seen.append(x)

>>> removeduplicate(te)
<generator object removeduplicate at 0x7f3578c71ca8>

>>> list(removeduplicate(te))
[{'phone': 'None', 'Name': 'Bala'}]
>>> 

Solution 3

You can still use a set for duplicate detection; you just need to convert each dictionary into something hashable, such as a tuple. A dictionary d can be converted with tuple(d.items()), provided its values are themselves hashable. Applying that to your generator function:

def removeduplicate(it):
    seen = set()
    for x in it:
        t = tuple(x.items())
        if t not in seen:
            yield x
            seen.add(t)

>>> for d in removeduplicate(te):
...    print(d)
{'phone': 'None', 'Name': 'Bala'}

>>> te.append({'Name': 'Bala', 'phone': '1234567890'})
>>> te.append({'Name': 'Someone', 'phone': '1234567890'})

>>> for d in removeduplicate(te):
...    print(d)
{'phone': 'None', 'Name': 'Bala'}
{'phone': '1234567890', 'Name': 'Bala'}
{'phone': '1234567890', 'Name': 'Someone'}

This provides faster lookup (average O(1)) than a "seen" list (O(n)). Whether it is worth the extra cost of converting every dict into a tuple depends on how many dictionaries you have and how many of them are distinct. If there are many distinct dicts, the "seen" list grows large, and testing whether a dict has already been seen becomes expensive. That might justify the tuple conversion, but you would have to test or profile it.
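
One caveat: tuple(x.items()) depends on key insertion order, so on Python 3.7+ two equal dicts whose keys were inserted in a different order produce different tuples and are not detected as duplicates. A sketch of an order-insensitive variant, assuming every value is hashable, uses frozenset instead. The name removeduplicate_unordered is made up for illustration.

```python
def removeduplicate_unordered(it):
    """Yield each distinct dict once, ignoring key insertion order."""
    seen = set()
    for x in it:
        # A frozenset of (key, value) pairs compares equal regardless of
        # order; this assumes every value in the dict is hashable.
        key = frozenset(x.items())
        if key not in seen:
            seen.add(key)
            yield x


a = {"Name": "Bala", "phone": "None"}
b = {"phone": "None", "Name": "Bala"}  # equal to a, different key order

tuple_keys_differ = tuple(a.items()) != tuple(b.items())
deduped = list(removeduplicate_unordered([a, b]))
```

With these inputs a == b is true, yet the tuple keys differ; the frozenset key treats them as the same dict, so deduped contains only a.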

Author: Tony Roczz

Updated on November 28, 2020

Comments

  • Tony Roczz, over 3 years

    I have a list of dicts where a particular value is repeated multiple times, and I would like to remove the duplicate values.

    My list:

    te = [
          {
            "Name": "Bala",
            "phone": "None"
          },
          {
            "Name": "Bala",
            "phone": "None"
          },
          {
            "Name": "Bala",
            "phone": "None"
          },
          {
            "Name": "Bala",
            "phone": "None"
          }
        ]
    

    Function to remove duplicate values:

    def removeduplicate(it):
        seen = set()
        for x in it:
            if x not in seen:
                yield x
                seen.add(x)
    

    When I call this function I get a generator object.

    <generator object removeduplicate at 0x0170B6E8>
    

    When I try to iterate over the generator I get TypeError: unhashable type: 'dict'.

    Is there a way to remove the duplicate values or to iterate over the generator?

  • Thomas Guyot-Sionnest, over 8 years
    Really nice, I'll keep that in my back pocket. OTOH, please note this is not exactly the same as the OP's function: he's checking the full dict, whereas yours discards any dict that has the same Name, whether the rest differs or not.
  • Thomas Guyot-Sionnest, over 8 years
    Actually, after testing, this would be more like it: unique = { repr(each): each for each in te }.values()
  • mhawke, over 8 years
    The OP has accepted it, but I am not sure that this answer is correct considering that it replaces (from list te) previous dicts with later dicts, i.e. it loses data. E.g. if te contained another dict {'Name': 'Bala', 'phone': '1234'}, only the last item in te with name Bala will be retained.
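
The repr-based key suggested in the comments compares whole dicts, but it is still sensitive to key insertion order. Since the data here is JSON-like, one alternative sketch is to serialize each dict with json.dumps(..., sort_keys=True), which also copes with nested lists and dicts. This assumes every value is JSON-serializable; the function name unique_by_json and the sample data are hypothetical.

```python
import json


def unique_by_json(items):
    """Drop dicts whose canonical JSON form has already been seen."""
    seen = set()
    result = []
    for item in items:
        # sort_keys=True makes the string independent of key order
        key = json.dumps(item, sort_keys=True)
        if key not in seen:
            seen.add(key)
            result.append(item)
    return result


data = [
    {"Name": "Bala", "tags": ["a", "b"]},
    {"tags": ["a", "b"], "Name": "Bala"},  # same content, different key order
    {"Name": "Bala", "tags": ["b", "a"]},  # list order differs: not a duplicate
]

result = unique_by_json(data)
```

Unlike the Name-keyed comprehension, this keeps the first occurrence of each distinct dict and preserves the original order, so no data is overwritten.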