Pandas GroupBy : How to get top n values based on a column

13,663

IIUC you can use function nlargest.

I try your sample data and get top 2 rows by column C:

print df
   A    B        C
0  x   12      ere
1  x   34     bfhg
2  z    6      bgn
3  z    8      rty
4  y  567  hmmu,,u
5  x  545   fghfgj
6  x   44    zxcbv

dcf = df.groupby(['A'],as_index=False).count()
print dcf
   A  B  C
0  x  4  4
1  y  1  1
2  z  2  2

#get 2 largest rows by column C
print dcf.nlargest(2,'C')
   A  B  C
0  x  4  4
2  z  2  2
Share:
13,663
AbtPst
Author by

AbtPst

Interested in learning about Machine Learning and NLP.

Updated on June 05, 2022

Comments

  • AbtPst
    AbtPst almost 2 years

    forgive me if this is a basic question but i am new to pandas. I have a dataframe with with a column A and i would like to get the top n rows based on the count in Column A. For instance the raw data looks like

    A  B  C
    x 12  ere
    x 34  bfhg
    z 6   bgn
    z 8   rty
    y 567 hmmu,,u
    x 545 fghfgj
    x 44  zxcbv
    

    Note that this is just a small sample of the data that i am actually working with.

    So if we look at Column A, value x appears 4 times,y appears 2 times and z appears 1 time. How can i get the top n values for Column A based on this count?

    print df.groupby(['A']).sum()
    

    this gives me

    A      B
    
    x      6792117
    

    but when i do

    print len(df.groupby(['A']).get_group('x'))
    

    i get

    21
    

    furthermore

    len(df.index) 
    

    gives me

    23657
    

    so how can the count of 'A' == 'x' be 6792117 as seen in the result of group by? what am i missing?

    Update

    consider

    print df.groupby(['A']).describe()
    

    gives me

         Tags           DocID
    
    x    count      21.000000
         mean   323434.142857
         std     35677.410292
         min    266631.000000
         25%    292054.000000
         50%    325575.000000
         75%    347450.000000
         max    380286.000000
    

    which makes sense. i just want to get the row which has the max count as per column A.

    Update2

    i did

    print df.groupby(['A'],as_index=False).count()
    

    i get

             A       B      C
    0        x       21     21
    1        y       11     11
    2        z        8      8
    

    so basically, for Column A, tag x has 21 entries in Column B and 21 in Column C. ColumnsB and C are unique in my case. which is good. now how do i get the top n rows with respect to column C?

    Update3

    So i tried

    import heapq
    print heapq.nlargest(3,df.groupby(['A'],as_index=False).count()['C'])
    

    and i get

    [151, 85, 72]
    

    so i know that for Column A, i have the above counts as the top 3 counts. But i still dont know which value of Column A do these counts refer to? For example which value in Column A has a count of 151? Is there any way to link this information?