Pandas groupby and filter


I think groupby is not necessary here; use plain boolean indexing if you only need all rows where V is 0:

print(df[df.V == 0])
    C  ID  V  YEAR
0   0   1  0  2011
3  33   2  0  2013
5  55   3  0  2014
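
All snippets above and below work on the sample DataFrame from the question; for completeness, a minimal setup:

import pandas as pd

# Sample data taken from the question (column order can differ in older pandas versions)
df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
                   'YEAR': [2011, 2012, 2012, 2013, 2013, 2014],
                   'V': [0, 1, 1, 0, 1, 0],
                   'C': [0, 11, 22, 33, 44, 55]})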

But if you need to return all groups where at least one value of column V equals 0, add any(), because filter needs a single True or False to keep or drop all rows of a group:

print(df.groupby(['ID']).filter(lambda x: (x['V'] == 0).any())) 
    C  ID  V  YEAR
0   0   1  0  2011
1  11   1  1  2012
2  22   2  1  2012
3  33   2  0  2013
4  44   3  1  2013
5  55   3  0  2014
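
To see the scalar that filter receives from the lambda for each group, one way (a small sketch, not part of the original answer) is GroupBy.apply on the V column of the df defined above:

# Each ID group contains at least one V == 0, so every group returns True and is kept
print(df.groupby('ID')['V'].apply(lambda x: (x == 0).any()))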

For testing it is better to change the column used for groupby - the rows for YEAR 2012 are filtered out because that group has no V == 0:

print(df.groupby(['YEAR']).filter(lambda x: (x['V'] == 0).any())) 
    C  ID  V  YEAR
0   0   1  0  2011
3  33   2  0  2013
4  44   3  1  2013
5  55   3  0  2014

If performance is important, use GroupBy.transform with boolean indexing:

print(df[(df['V'] == 0).groupby(df['YEAR']).transform('any')]) 
   ID  YEAR  V   C
0   1  2011  0   0
3   2  2013  0  33
4   3  2013  1  44
5   3  2014  0  55
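
To check the performance claim on your own data, a rough timing sketch (the frame size and value ranges here are arbitrary assumptions):

import numpy as np
import pandas as pd
from timeit import timeit

# Hypothetical larger frame: 100k rows, ~20 YEAR groups
big = pd.DataFrame({'YEAR': np.random.randint(2000, 2020, 100_000),
                    'V': np.random.randint(0, 3, 100_000)})

# filter calls the lambda once per group
print(timeit(lambda: big.groupby('YEAR').filter(lambda x: (x['V'] == 0).any()), number=10))
# transform works on one vectorised boolean mask
print(timeit(lambda: big[(big['V'] == 0).groupby(big['YEAR']).transform('any')], number=10))

On data like this the transform version usually wins, because the comparison is vectorised once while filter evaluates the lambda group by group.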

Detail:

print((df['V'] == 0).groupby(df['YEAR']).transform('any')) 
0     True
1    False
2    False
3     True
4     True
5     True
Name: V, dtype: bool

Comments

  • iwbabn (almost 2 years ago)

    I have dataframe:

    df = pd.DataFrame({'ID':[1,1,2,2,3,3], 
                       'YEAR' : [2011,2012,2012,2013,2013,2014], 
                       'V': [0,1,1,0,1,0],
                       'C':[00,11,22,33,44,55]})
    

    I would like to group by ID, and select the row with V = 0 within each group.

    This doesn't seem to work:

    print(df.groupby(['ID']).filter(lambda x: x['V'] == 0)) 
    

    Got an error:

    TypeError: filter function returned a Series, but expected a scalar bool

    How can I use filter to achieve the goal? Thank you.

    EDIT: The condition on V may vary for each group, e.g. it could be V==0 for ID 1 and V==1 for ID 2, and this info is available through another DF:

    df = pd.DataFrame({'ID':[1,2,3], 
                       'V': [0,1,0]})
    

    So how can I do row filtering within each group? (A merge-based sketch for this case is included after the comments below.)

  • jezrael (over 7 years ago)
    Can you create a new question with a reference to this one? Do you mean that at least one value in each group has to equal the V value from another dataframe df = pd.DataFrame({'ID':[1,2,3], 'V': [0,1,0]})? If you change it to df = pd.DataFrame({'ID':[1,2,3], 'V': [0,1,2]}), it doesn't return the last group, so the output is {'V': [0, 1, 1, 0], 'ID': [1, 1, 2, 2], 'C': [0, 11, 22, 33], 'YEAR': [2011, 2012, 2012, 2013]}?
  • pythonRcpp (almost 7 years ago)
    @jezrael What if I had 2 strings to check, e.g. print(df.groupby(['YEAR']).filter(lambda x: (x['V'] == "abc" or x['V'] == "xyz").any()))?
  • jezrael (almost 7 years ago)
    I think you need | instead of or (it compares whole arrays) and to add parentheses - print(df.groupby(['YEAR']).filter(lambda x: ((x['V'] == 0) | (x['V'] == 1)).any()))
  • jezrael (almost 7 years ago)
    Another solution is print(df.groupby(['YEAR']).filter(lambda x: (x['V'] == 0).any() or (x['V'] == 1).any())) (not sure if it gives the same output), but here or compares scalars
  • pythonRcpp (almost 7 years ago)
    I tried dfnew = df.groupby('OrderID').filter(lambda x: ((x['ResponseType']=='MODIFY_ORDER_REJECT') | (x['ResponseType']=='CANCEL_ORDER_REJECT')).any()); basically my intent is to remove all OrderID groups that contained MODIFY_ORDER_REJECT or CANCEL_ORDER_REJECT anywhere in the csv (an isin-based sketch for this is shown after these comments). Can we talk on chat for a minute maybe? Thanks
  • aerin (about 6 years ago)
    this answer is great!
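
For the EDIT in the question (a different target value of V per ID, supplied by a second DataFrame), one possible approach, not taken from the answer above, is to merge the per-ID target onto each row and compare row-wise; cond and V_target are hypothetical names used only for this sketch:

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
                   'YEAR': [2011, 2012, 2012, 2013, 2013, 2014],
                   'V': [0, 1, 1, 0, 1, 0],
                   'C': [0, 11, 22, 33, 44, 55]})
# Per-ID target values for V (the question's second DataFrame)
cond = pd.DataFrame({'ID': [1, 2, 3], 'V': [0, 1, 0]})

# Bring the target value onto each row, then keep the rows where V matches it
merged = df.merge(cond.rename(columns={'V': 'V_target'}), on='ID')
print(merged.loc[merged['V'] == merged['V_target'], df.columns])
# keeps the rows (ID=1, V=0), (ID=2, V=1) and (ID=3, V=0)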
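
For the follow-up comments about matching two strings and dropping every OrderID group that contains either of them, a sketch built on Series.isin; the column names and values come from the comments, and the data here is made up:

import pandas as pd

# Hypothetical data shaped like the comment's csv
orders = pd.DataFrame({'OrderID': [1, 1, 2, 2],
                       'ResponseType': ['NEW_ORDER', 'MODIFY_ORDER_REJECT',
                                        'NEW_ORDER', 'FILL']})

bad = ['MODIFY_ORDER_REJECT', 'CANCEL_ORDER_REJECT']
# Keep only the groups whose ResponseType never appears in the bad list
dfnew = orders.groupby('OrderID').filter(lambda x: not x['ResponseType'].isin(bad).any())
print(dfnew)   # only OrderID 2 remains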