Pandas - group by id and drop duplicate with threshold


Solution 1

You can use duplicated to determine the row level duplicates, then perform a groupby on 'userid' to determine 'userid' level duplicates, then drop accordingly.

To drop without a threshold:

df = df[~df.duplicated(['userid', 'itemid']).groupby(df['userid']).transform('any')]

To drop with a threshold, use keep=False in duplicated so that every copy of a duplicate is marked, then sum the Boolean column per user and compare against your threshold. For example, with a threshold of 3:

df = df[~df.duplicated(['userid', 'itemid'], keep=False).groupby(df['userid']).transform('sum').ge(3)]
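A quick sketch of both variants against the sample frame reconstructed from the question's setup code:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})

# No threshold: drop every user with any duplicated (userid, itemid) pair.
no_dup = df[~df.duplicated(['userid', 'itemid'])
              .groupby(df['userid']).transform('any')]

# Threshold of 3: keep=False marks all copies of a duplicate, so the
# per-user sum counts how many of that user's rows are involved in
# duplication.
thresh = df[~df.duplicated(['userid', 'itemid'], keep=False)
              .groupby(df['userid']).transform('sum').ge(3)]
```

Here no_dup keeps only userid 2, while thresh keeps everyone, since userid 1 has only two duplicated rows, below the threshold of 3.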

The resulting output for no threshold:

   userid  itemid
4       2       1
5       2       2
6       2       3

Solution 2

filter

was made for this. You pass it a function that returns a boolean, and that boolean determines whether each group passes the filter.

filter and value_counts
Most generalizable and intuitive

df.groupby('userid').filter(lambda x: x.itemid.value_counts().max() < 2)

filter and is_unique
A special case when looking for n < 2, i.e. users with no repeated items

df.groupby('userid').filter(lambda x: x.itemid.is_unique)

   userid  itemid
4       2       1
5       2       2
6       2       3
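The filter approach generalizes to any threshold by parameterizing the cutoff; the function name drop_heavy_viewers below is illustrative, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})

def drop_heavy_viewers(frame, n=2):
    # Keep a user only if no single item was viewed n or more times.
    return frame.groupby('userid').filter(
        lambda g: g.itemid.value_counts().max() < n)

result = drop_heavy_viewers(df, n=2)
```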

Solution 3

Group the dataframe by users and items:

views = df.groupby(['userid','itemid'])['itemid'].count()
#userid  itemid
#1       1         2 <=== The offending row
#        3         1
#        4         1
#2       1         1
#        2         1
#        3         1
#Name: itemid, dtype: int64

Find out who saw any item only once:

THRESHOLD = 2
viewed = ~(views.unstack() >= THRESHOLD).any(axis=1)
#userid
#1    False
#2     True
#dtype: bool

Combine the results and keep the 'good' rows:

combined = df.merge(pd.DataFrame(viewed).reset_index())
combined[combined[0]][['userid','itemid']]
#   userid  itemid
#4       2       1
#5       2       2
#6       2       3
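The steps above, assembled end to end. Naming the boolean Series before the merge (the column name 'keep' is illustrative) avoids having to index the merged frame by the integer column label 0:

```python
import pandas as pd

df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})

THRESHOLD = 2
views = df.groupby(['userid', 'itemid'])['itemid'].count()

# Users for whom no single item reaches the threshold.
viewed = ~(views.unstack() >= THRESHOLD).any(axis=1)

# Name the boolean column before merging so it can be selected by name.
combined = df.merge(viewed.rename('keep').reset_index())
result = combined[combined['keep']][['userid', 'itemid']]
```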

Solution 4

# group userid and itemid and get a count
df2 = df.groupby(by=['userid','itemid']).apply(lambda x: len(x)).reset_index()
# Extract rows where the max userid-itemid count is less than 2.
df2 = df2[~df2.userid.isin(df2[df2.iloc[:,-1]>1]['userid'])][df.columns]
print(df2)
   userid  itemid
3       2       1
4       2       2
5       2       3

If you want to drop at a different threshold, just change the comparison to

df2.iloc[:,-1] > threshold
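In current pandas, .ix has been removed and groupby(...).size() is the idiomatic way to count rows; a modern sketch of the same idea, with the threshold made explicit (the column name 'n' is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})

threshold = 1
# Count the rows for each (userid, itemid) pair.
counts = df.groupby(['userid', 'itemid']).size().reset_index(name='n')

# Users with any pair counted above the threshold get dropped entirely.
bad_users = counts.loc[counts['n'] > threshold, 'userid']
result = df[~df['userid'].isin(bad_users)]
```

Unlike the original answer, this keeps the surviving users' original rows rather than their deduplicated (userid, itemid) pairs.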
Author: Mansumen

Updated on June 11, 2022

Comments

  • Mansumen
    Mansumen almost 2 years

    I have the following data:

    userid itemid
      1       1
      1       1
      1       3
      1       4
      2       1
      2       2
      2       3
    

    I want to drop userids who have viewed the same itemid twice or more. For example, userid=1 has viewed itemid=1 twice, so I want to drop all records for userid=1. However, since userid=2 hasn't viewed any item twice, userid=2 should stay as it is.

    So I want my data to be like the following:

    userid itemid
      2       1
      2       2
      2       3
    

    Can someone help me?

    import pandas as pd    
    df = pd.DataFrame({'userid':[1,1,1,1, 2,2,2],
                       'itemid':[1,1,3,4, 1,2,3] })
    
  • Mansumen
    Mansumen about 7 years
    What if I want to drop regarding a threshold? That is, what if I want to drop userids with more than 3 duplicates?
  • smci
    smci about 7 years
    This is a special-case trick for THRESHOLD = 2. For larger THRESHOLD, you need @DYZ's more general code.
  • smci
    smci about 7 years
    for user in df['userid'].drop_duplicates().tolist() can always be replaced by df.groupby('userid')