Selecting the first row of a sorted group from pandas data frame

14,729

Solution 1

IIUC you can do it this way:

In [83]: df.groupby('campaignname', as_index=False) \
           .apply(lambda x: x.nlargest(1, columns=['amount'])) \
           .reset_index(level=1, drop=1)
Out[83]:
  campaignname category_type  amount
0            A       cat_A_2     4.0
1            B       cat_B_0     3.0
2            C       cat_C_1     2.0

or:

In [76]: df.sort_values('amount', ascending=False).groupby('campaignname').head(1)
Out[76]:
  campaignname category_type  amount
4            A       cat_A_2     4.0
5            B       cat_B_0     3.0
7            C       cat_C_1     2.0

Solution 2

My preferred way to do this is with idxmax. It returns the index of the maximum value. I subsequently use that index to slice df

df.loc[df.groupby('campaignname').amount.idxmax()]

  campaignname category_type  amount
4            A       cat_A_2     4.0
5            B       cat_B_0     3.0
7            C       cat_C_1     2.0
Share:
14,729
user1330974
Author by

user1330974

Updated on June 14, 2022

Comments

  • user1330974
    user1330974 almost 2 years

    Suppose, I have a dataframe in pandas like below:

    campaignname    category_type    amount
    A               cat_A_0            2.0
    A               cat_A_1            1.0
    A               cat_A_2            3.0
    A               cat_A_2            3.0
    A               cat_A_2            4.0
    B               cat_B_0            3.0
    C               cat_C_0            1.0
    C               cat_C_1            2.0
    

    I am using the following code to group the above dataframe (say it's assigned variable name df) by different columns as follows:

    for name, gp in df.groupby('campaignname'):
        sorted_gp = gp.groupby(['campaignname', 'category_type']).sum().sort_values(['amount'], ascending=False)
        # I'd like to know how to select this in a cleaner/more concise way
        first_row = [sorted_gp.iloc[0].name[0], sorted_gp.iloc[0].name[1], sorted_gp.iloc[0].values.tolist()[0]]
    

    The purpose of the above code is to first groupby the raw data on campaignname column, then in each of the resulting group, I'd like to group again by both campaignname and category_type, and finally, sort by amount column to choose the first row that comes up (the one with the highest amount in each group. Specifically for the above example, I'd like to get results like this:

    first_row = ['A', 'cat_A_2', 4.0] # for the first group
    first_row = ['B', 'cat_B_0', 3.0] # for the second group
    first_row = ['C', 'cat_C_1', 2.0] # for the third group
    

    etc.

    As you can see, I'm using a rather (in my opinion) 'ugly' way to retrieve the first row of each sorted group, but since I'm new to pandas, I don't know a better/cleaner way to accomplish this. If anyone could let me know a way to select the first row in a sorted group from a pandas dataframe, I'd greatly appreciate it. Thank you in advance for your answers/suggestions!

  • user1330974
    user1330974 about 7 years
    Thank you for the suggestion. But when I run the code you suggested, it still returns the whole dataframe, but grouped-by and sorted. What I'd like is to iterate over the groups, sort within each group based on amount, select the top row of the sorted result. I guess my question is too vague and maybe I should modify it...
  • user1330974
    user1330974 about 7 years
    Thank you for the suggestion. But when I run the code you suggested, it still returns the whole dataframe, but grouped-by and sorted. What I'd like is to iterate over the groups, sort within each group based on amount, select only the top row of the sorted result in each group iteratively. I guess my question is too vague and maybe I should modify it...
  • piRSquared
    piRSquared about 7 years
    @user1330974 this should return the same thing you asked for. The first row after having sorted within each group. What I've posted shows what you would get, which is what you asked for.
  • user1330974
    user1330974 about 7 years
    after running the df.loc[df.groupby('campaignname').amount.idxmax()], I still see all rows in the data frame... Thank you for your help!