Selecting the first row of a sorted group from pandas data frame
Solution 1
IIUC you can do it this way:
In [83]: df.groupby('campaignname', as_index=False) \
.apply(lambda x: x.nlargest(1, columns=['amount'])) \
.reset_index(level=1, drop=1)
Out[83]:
campaignname category_type amount
0 A cat_A_2 4.0
1 B cat_B_0 3.0
2 C cat_C_1 2.0
or:
In [76]: df.sort_values('amount', ascending=False).groupby('campaignname').head(1)
Out[76]:
campaignname category_type amount
4 A cat_A_2 4.0
5 B cat_B_0 3.0
7 C cat_C_1 2.0
Solution 2
My preferred way to do this is with idxmax
. It returns the index of the maximum value. I subsequently use that index to slice df
df.loc[df.groupby('campaignname').amount.idxmax()]
campaignname category_type amount
4 A cat_A_2 4.0
5 B cat_B_0 3.0
7 C cat_C_1 2.0
user1330974
Updated on June 14, 2022Comments
-
user1330974 almost 2 years
Suppose, I have a dataframe in pandas like below:
campaignname category_type amount A cat_A_0 2.0 A cat_A_1 1.0 A cat_A_2 3.0 A cat_A_2 3.0 A cat_A_2 4.0 B cat_B_0 3.0 C cat_C_0 1.0 C cat_C_1 2.0
I am using the following code to group the above dataframe (say it's assigned variable name
df
) by different columns as follows:for name, gp in df.groupby('campaignname'): sorted_gp = gp.groupby(['campaignname', 'category_type']).sum().sort_values(['amount'], ascending=False) # I'd like to know how to select this in a cleaner/more concise way first_row = [sorted_gp.iloc[0].name[0], sorted_gp.iloc[0].name[1], sorted_gp.iloc[0].values.tolist()[0]]
The purpose of the above code is to first
groupby
the raw data oncampaignname
column, then in each of the resulting group, I'd like to group again by bothcampaignname
andcategory_type
, and finally, sort byamount
column to choose the first row that comes up (the one with the highestamount
in each group. Specifically for the above example, I'd like to get results like this:first_row = ['A', 'cat_A_2', 4.0] # for the first group first_row = ['B', 'cat_B_0', 3.0] # for the second group first_row = ['C', 'cat_C_1', 2.0] # for the third group
etc.
As you can see, I'm using a rather (in my opinion) 'ugly' way to retrieve the first row of each sorted group, but since I'm new to pandas, I don't know a better/cleaner way to accomplish this. If anyone could let me know a way to select the first row in a sorted group from a pandas dataframe, I'd greatly appreciate it. Thank you in advance for your answers/suggestions!
-
user1330974 about 7 yearsThank you for the suggestion. But when I run the code you suggested, it still returns the whole dataframe, but grouped-by and sorted. What I'd like is to iterate over the groups, sort within each group based on
amount
, select the top row of the sorted result. I guess my question is too vague and maybe I should modify it... -
user1330974 about 7 yearsThank you for the suggestion. But when I run the code you suggested, it still returns the whole dataframe, but grouped-by and sorted. What I'd like is to iterate over the groups, sort within each group based on amount, select only the top row of the sorted result in each group iteratively. I guess my question is too vague and maybe I should modify it...
-
piRSquared about 7 years@user1330974 this should return the same thing you asked for. The first row after having sorted within each group. What I've posted shows what you would get, which is what you asked for.
-
user1330974 about 7 yearsafter running the
df.loc[df.groupby('campaignname').amount.idxmax()]
, I still see all rows in the data frame... Thank you for your help!