Pandas groupby and filter
I think groupby is not necessary here. If you only need the rows where V is 0, use boolean indexing:
print(df[df.V == 0])
C ID V YEAR
0 0 1 0 2011
3 33 2 0 2013
5 55 3 0 2014
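The snippet above relies on the df defined in the question. A minimal self-contained sketch, assuming the same sample data, looks like this; the query spelling is an alternative added here for comparison and is not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 3],
    'YEAR': [2011, 2012, 2012, 2013, 2013, 2014],
    'V': [0, 1, 1, 0, 1, 0],
    'C': [0, 11, 22, 33, 44, 55],
})

# Boolean mask: True for every row where V equals 0
mask = df['V'] == 0
rows = df[mask]
print(rows)

# Same selection written with DataFrame.query (alternative spelling)
rows_query = df.query('V == 0')
```

Both spellings select rows by position in the mask, so the original index labels (0, 3, 5) are preserved.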
But if you need to return all groups that contain at least one row where column V equals 0, add any, because filter needs a single True or False per group to decide whether to keep all rows of that group:
print(df.groupby(['ID']).filter(lambda x: (x['V'] == 0).any()))
C ID V YEAR
0 0 1 0 2011
1 11 1 1 2012
2 22 2 1 2012
3 33 2 0 2013
4 44 3 1 2013
5 55 3 0 2014
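To see why any is needed, it can help to look at what the lambda receives for a single group. The sketch below assumes the sample data from the question; the variable names are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 3],
    'YEAR': [2011, 2012, 2012, 2013, 2013, 2014],
    'V': [0, 1, 1, 0, 1, 0],
    'C': [0, 11, 22, 33, 44, 55],
})

# Inspect one group the way filter would see it
group = df.groupby('ID').get_group(1)

per_row = group['V'] == 0    # a boolean Series, one value per row
per_group = per_row.any()    # reduced to one boolean for the whole group

print(type(per_row).__name__)
print(bool(per_group))

# filter keeps or drops whole groups based on that single boolean
kept = df.groupby('ID').filter(lambda g: (g['V'] == 0).any())
```

Returning per_row directly is what triggers the TypeError quoted in the question, because filter cannot decide about a whole group from a Series.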
For testing, it is better to change the groupby column to YEAR. The rows with YEAR 2012 are filtered out because that group has no V == 0:
print(df.groupby(['YEAR']).filter(lambda x: (x['V'] == 0).any()))
C ID V YEAR
0 0 1 0 2011
3 33 2 0 2013
4 44 3 1 2013
5 55 3 0 2014
If performance is important, use GroupBy.transform with boolean indexing:
print(df[(df['V'] == 0).groupby(df['YEAR']).transform('any')])
ID YEAR V C
0 1 2011 0 0
3 2 2013 0 33
4 3 2013 1 44
5 3 2014 0 55
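A quick way to check that the filter and transform versions select the same rows is to compare them directly. The sketch below assumes the sample data from the question; the isin variant is an extra alternative added here, not something the original answer proposes.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 3],
    'YEAR': [2011, 2012, 2012, 2013, 2013, 2014],
    'V': [0, 1, 1, 0, 1, 0],
    'C': [0, 11, 22, 33, 44, 55],
})

# Version 1: filter with a per-group lambda
by_filter = df.groupby('YEAR').filter(lambda g: (g['V'] == 0).any())

# Version 2: broadcast the per-group result back to every row, then index
mask = (df['V'] == 0).groupby(df['YEAR']).transform('any')
by_transform = df[mask]

# Alternative without groupby: collect the years that have a zero, then match
years_with_zero = df.loc[df['V'] == 0, 'YEAR']
by_isin = df[df['YEAR'].isin(years_with_zero)]

print(by_filter.equals(by_transform))
print(by_transform.equals(by_isin))
```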
Detail:
print((df['V'] == 0).groupby(df['YEAR']).transform('any'))
0 True
1 False
2 False
3 True
4 True
5 True
Name: V, dtype: bool
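The question's EDIT asks for a different V condition per ID, supplied by a second dataframe. One possible sketch, assuming that second frame is stored under the hypothetical name cond (the question reuses the name df for it), is to look up the required V for each row and compare:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 3],
    'YEAR': [2011, 2012, 2012, 2013, 2013, 2014],
    'V': [0, 1, 1, 0, 1, 0],
    'C': [0, 11, 22, 33, 44, 55],
})

# Per-ID condition from the question's EDIT; the name cond is an assumption
cond = pd.DataFrame({'ID': [1, 2, 3], 'V': [0, 1, 0]})

# Look up the required V for each row's ID, then keep rows that match it
required_v = df['ID'].map(cond.set_index('ID')['V'])
selected = df[df['V'] == required_v]
print(selected)

# An inner merge on both columns selects the same rows but resets the index
merged = df.merge(cond, on=['ID', 'V'], how='inner')
```

The map version keeps the original index labels, which is why it is shown first; the merge version is shorter if the index does not matter.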
iwbabn
Updated on August 02, 2022
Comments
-
iwbabn almost 2 years
I have a dataframe:
df = pd.DataFrame({'ID':[1,1,2,2,3,3], 'YEAR' : [2011,2012,2012,2013,2013,2014], 'V': [0,1,1,0,1,0], 'C':[00,11,22,33,44,55]})
I would like to group by ID, and select the row with V = 0 within each group.
This doesn't seem to work:
print(df.groupby(['ID']).filter(lambda x: x['V'] == 0))
Got an error:
TypeError: filter function returned a Series, but expected a scalar bool
How can I use filter to achieve the goal? Thank you.
EDIT: The condition on V may vary for each group, e.g., it could be V==0 for ID 1, V==1 for ID 2, and this info can be available through another DF:
df = pd.DataFrame({'ID':[1,2,3], 'V': [0,1,0]})
So how to do row filtering within each group?
-
jezrael over 7 years
Can you create a new question with a reference to this one? Do you mean that at least one value of V in the group must equal the value given by the other dataframe
df = pd.DataFrame({'ID':[1,2,3], 'V': [0,1,0]})
? If it is changed to
df = pd.DataFrame({'ID':[1,2,3], 'V': [0,1,2]})
the last group is not returned, so the output is
{'V': [0, 1, 1, 0], 'ID': [1, 1, 2, 2], 'C': [0, 11, 22, 33], 'YEAR': [2011, 2012, 2012, 2013]}
?
-
pythonRcpp almost 7 years
@jezrael What if I had 2 strings to check?
print(df.groupby(['YEAR']).filter(lambda x: (x['V'] == "abc" or x['V'] == "xyz").any()))
-
jezrael almost 7 years
I think you need | instead of or (to compare arrays element-wise), and add parentheses:
print(df.groupby(['YEAR']).filter(lambda x: ((x['V'] == 0) | (x['V'] == 1)).any()))
-
jezrael almost 7 years
Another solution:
print(df.groupby(['YEAR']).filter(lambda x: (x['V'] == 0).any() or (x['V'] == 1).any()))
(not sure if it gives the same output), but here scalars are compared with or.
-
pythonRcpp almost 7 years
I tried
dfnew = df.groupby('OrderID').filter(lambda x: ((x['ResponseType']=='MODIFY_ORDER_REJECT') | x['ResponseType']=='CANCEL_ORDER_REJECT')).any() )
Basically, my intent is to remove every OrderID that contains MODIFY_ORDER_REJECT or CANCEL_ORDER_REJECT anywhere in the CSV. Can we talk on chat for a minute, maybe? Thanks
-
aerin about 6 years
This answer is great!
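For the two-value check raised in the comments, one possible sketch uses isin together with the transform approach from the answer. The column names OrderID and ResponseType come from the comment; the rows below are made up for illustration, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical order log: column names follow the comment, the rows are invented
orders = pd.DataFrame({
    'OrderID': [1, 1, 2, 2, 3],
    'ResponseType': [
        'NEW_ORDER', 'MODIFY_ORDER_REJECT',
        'NEW_ORDER', 'FILLED',
        'CANCEL_ORDER_REJECT',
    ],
})
rejects = ['MODIFY_ORDER_REJECT', 'CANCEL_ORDER_REJECT']

# True for rows whose ResponseType is one of the reject values
is_reject = orders['ResponseType'].isin(rejects)

# True for every row of an order that has at least one reject row
has_reject = is_reject.groupby(orders['OrderID']).transform('any')

with_reject = orders[has_reject]       # orders containing a reject
without_reject = orders[~has_reject]   # orders with no reject at all
print(without_reject)
```

Using isin avoids chaining several == comparisons with |, and negating the mask with ~ covers the stated intent of removing the affected orders.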