Pandas DataFrame find the max after Groupby two columns and get counts
10,861
Solution 1
You could work on aggregated data.
In [387]: dff = df.groupby(['userId', 'tag'], as_index=False)['pageId'].count()
In [388]: dff
Out[388]:
userId tag pageId
0 1234123 1 1
1 1234123 4 2
2 3122471 2 1
3 3122471 6 2
4 3122471 15 2
5 3122471 18 3
In [389]: dff.groupby('userId').apply(lambda x: x.tag[x.pageId.idxmax()])
Out[389]:
userId
1234123 4
3122471 18
dtype: int64
Solution 2
group the original dataframe by userid
df.groupby('userId').max()['tag']
or
df.groupby('userId', as_index=False)['tag'].max()
Note that the second solution is a factor of two faster
%timeit df.groupby('userId').max()['tag']
# 100 loops, best of 3: 5.69 ms per loop
%timeit df.groupby('userId', as_index=False)['tag'].max()
# 100 loops, best of 3: 2.43 ms per loop
Author by
weigod
Updated on June 09, 2022Comments
-
weigod almost 2 years
I have a dataframe df as following:
userId pageId tag 0 3122471 e852 18 1 3122471 f3e2 18 2 3122471 7e93 18 3 3122471 2768 6 4 3122471 53d9 6 5 3122471 06d7 15 6 3122471 e31c 15 7 3122471 c6f3 2 8 1234123 fjwe 1 9 1234123 eiae 4 10 1234123 ieha 4
After using
df.groupby(['userId', 'tag'])['pageId'].count()
to group the data by userId and tag . I will get:userId tag 3122471 2 1 6 2 15 2 18 3 1234123 1 1 4 2
Now I want to find the tag that each user has the most. Just as following:
userId tag 3122471 18 1234123 4
(Note: if there are multiple tags that has the same count, I want to use a function
my_rule
to determine which to show)