Pandas - group by id and drop duplicate with threshold


Solution 1

You can use duplicated to determine the row level duplicates, then perform a groupby on 'userid' to determine 'userid' level duplicates, then drop accordingly.

To drop without a threshold:

df = df[~df.duplicated(['userid', 'itemid']).groupby(df['userid']).transform('any')]

To drop with a threshold, use keep=False in duplicated so that every copy of a duplicate is marked, then sum the Boolean column per user and compare against your threshold. For example, with a threshold of 3:

df = df[~df.duplicated(['userid', 'itemid'], keep=False).groupby(df['userid']).transform('sum').ge(3)]
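A quick sketch of both variants against the sample frame reconstructed from the question's setup code:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})

# No threshold: drop every user with any duplicated (userid, itemid) pair.
no_dup = df[~df.duplicated(['userid', 'itemid'])
              .groupby(df['userid']).transform('any')]

# Threshold of 3: keep=False marks all copies of a duplicate, so the
# per-user sum counts how many of that user's rows are involved in
# duplication.
thresh = df[~df.duplicated(['userid', 'itemid'], keep=False)
              .groupby(df['userid']).transform('sum').ge(3)]
```

Here no_dup keeps only userid 2, while thresh keeps everyone, since userid 1 has only two duplicated rows, below the threshold of 3.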

The resulting output for no threshold:

   userid  itemid
4       2       1
5       2       2
6       2       3

Solution 2

filter

was made for this. You pass it a function that returns a boolean, and that boolean determines whether each group passes the filter.

filter and value_counts
Most generalizable and intuitive

df.groupby('userid').filter(lambda x: x.itemid.value_counts().max() < 2)

filter and is_unique
A special case when looking for n < 2, i.e. users with no repeated items

df.groupby('userid').filter(lambda x: x.itemid.is_unique)

   userid  itemid
4       2       1
5       2       2
6       2       3
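The filter approach generalizes to any threshold by parameterizing the cutoff; the function name drop_heavy_viewers below is illustrative, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})

def drop_heavy_viewers(frame, n=2):
    # Keep a user only if no single item was viewed n or more times.
    return frame.groupby('userid').filter(
        lambda g: g.itemid.value_counts().max() < n)

result = drop_heavy_viewers(df, n=2)
```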

Solution 3

Group the dataframe by users and items:

views = df.groupby(['userid','itemid'])['itemid'].count()
#userid  itemid
#1       1         2 <=== The offending row
#        3         1
#        4         1
#2       1         1
#        2         1
#        3         1
#Name: itemid, dtype: int64

Find out who saw any item only once:

THRESHOLD = 2
viewed = ~(views.unstack() >= THRESHOLD).any(axis=1)
#userid
#1    False
#2     True
#dtype: bool

Combine the results and keep the 'good' rows:

combined = df.merge(pd.DataFrame(viewed).reset_index())
combined[combined[0]][['userid','itemid']]
#   userid  itemid
#4       2       1
#5       2       2
#6       2       3
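The steps above, assembled end to end. Naming the boolean Series before the merge (the column name 'keep' is illustrative) avoids having to index the merged frame by the integer column label 0:

```python
import pandas as pd

df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})

THRESHOLD = 2
views = df.groupby(['userid', 'itemid'])['itemid'].count()

# Users for whom no single item reaches the threshold.
viewed = ~(views.unstack() >= THRESHOLD).any(axis=1)

# Name the boolean column before merging so it can be selected by name.
combined = df.merge(viewed.rename('keep').reset_index())
result = combined[combined['keep']][['userid', 'itemid']]
```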

Solution 4

# group userid and itemid and get a count
df2 = df.groupby(by=['userid','itemid']).apply(lambda x: len(x)).reset_index()
# Extract rows where the max userid-itemid count is less than 2.
df2 = df2[~df2.userid.isin(df2[df2.iloc[:,-1]>1]['userid'])][df.columns]
print(df2)
   userid  itemid
3       2       1
4       2       2
5       2       3

If you want to drop at a different threshold, just change the comparison to

df2.iloc[:,-1] > threshold
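In current pandas, .ix has been removed and groupby(...).size() is the idiomatic way to count rows; a modern sketch of the same idea, with the threshold made explicit (the column name 'n' is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})

threshold = 1
# Count the rows for each (userid, itemid) pair.
counts = df.groupby(['userid', 'itemid']).size().reset_index(name='n')

# Users with any pair counted above the threshold get dropped entirely.
bad_users = counts.loc[counts['n'] > threshold, 'userid']
result = df[~df['userid'].isin(bad_users)]
```

Unlike the original answer, this keeps the surviving users' original rows rather than their deduplicated (userid, itemid) pairs.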
Author: Mansumen

Updated on June 11, 2022

Comments

  • Mansumen
    Mansumen almost 2 years

    I have the following data:

    userid itemid
      1       1
      1       1
      1       3
      1       4
      2       1
      2       2
      2       3
    

    I want to drop userids who have viewed the same itemid twice or more. For example, userid=1 has viewed itemid=1 twice, so I want to drop all records for userid=1. However, since userid=2 hasn't viewed any item twice, userid=2 should stay as it is.

    So I want my data to be like the following:

    userid itemid
      2       1
      2       2
      2       3
    

    Can someone help me?

    import pandas as pd    
    df = pd.DataFrame({'userid':[1,1,1,1, 2,2,2],
                       'itemid':[1,1,3,4, 1,2,3] })
    
  • Mansumen
    Mansumen about 7 years
    What if I want to drop regarding a threshold? That is, what if I want to drop userids with more than 3 duplicates?
  • smci
    smci about 7 years
    This is a special-case trick for THRESHOLD = 2. For larger THRESHOLD, you need @DYZ's more general code.
  • smci
    smci about 7 years
    for user in df['userid'].drop_duplicates().tolist() can always be replaced by df.groupby('userid')