Pandas DataFrame find the max after Groupby two columns and get counts

10,861

Solution 1

You could work on aggregated data.

In [387]: dff = df.groupby(['userId', 'tag'], as_index=False)['pageId'].count()

In [388]: dff
Out[388]:
    userId  tag  pageId
0  1234123    1       1
1  1234123    4       2
2  3122471    2       1
3  3122471    6       2
4  3122471   15       2
5  3122471   18       3

In [389]: dff.groupby('userId').apply(lambda x: x.tag[x.pageId.idxmax()])
Out[389]:
userId
1234123     4
3122471    18
dtype: int64

Solution 2

group the original dataframe by userid

 df.groupby('userId').max()['tag']

or

 df.groupby('userId', as_index=False)['tag'].max()

Note that the second solution is a factor of two faster

%timeit df.groupby('userId').max()['tag']
# 100 loops, best of 3: 5.69 ms per loop
%timeit df.groupby('userId', as_index=False)['tag'].max()
# 100 loops, best of 3: 2.43 ms per loop
Share:
10,861
weigod
Author by

weigod

Updated on June 09, 2022

Comments

  • weigod
    weigod almost 2 years

    I have a dataframe df as following:

       userId  pageId  tag
    0  3122471  e852   18
    1  3122471  f3e2   18
    2  3122471  7e93   18
    3  3122471  2768    6
    4  3122471  53d9    6
    5  3122471  06d7   15
    6  3122471  e31c   15
    7  3122471  c6f3    2
    8  1234123  fjwe    1
    9  1234123  eiae    4
    10 1234123  ieha    4
    

    After using df.groupby(['userId', 'tag'])['pageId'].count() to group the data by userId and tag . I will get:

    userId   tag
    3122471  2      1
             6      2
             15     2
             18     3
    1234123   1     1
              4     2
    

    Now I want to find the tag that each user has the most. Just as following:

    userId   tag
    3122471  18
    1234123   4
    

    (Note: if there are multiple tags that has the same count, I want to use a function my_rule to determine which to show)