Multiple aggregations of the same column using pandas GroupBy.agg()
Solution 1
As of 2022-06-20, the below is the accepted practice for aggregations:
df.groupby('dummy').agg(
Mean=('returns', np.mean),
Sum=('returns', np.sum))
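For reference, here is a minimal self-contained sketch of the same call. The sample frame is an assumption modelled on the one in the question, with illustrative random values. String aliases stand in for the NumPy functions, which Solution 2 shows are also accepted.

```python
import numpy as np
import pandas as pd

# Illustrative frame shaped like the one in the question (values are made up).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "returns": 0.05 * rng.standard_normal(10),
    "dummy": np.repeat(1, 10),
})

# Named aggregation: keyword = output column name,
# value = (source column, aggregation to apply).
result = df.groupby("dummy").agg(
    Mean=("returns", "mean"),
    Sum=("returns", "sum"),
)
print(result)
```

Both output columns come from the same source column, which is the case the question asks about.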
The rest of this answer is kept for historical versions of pandas.
You can simply pass the functions as a list:
In [20]: df.groupby("dummy").agg({"returns": [np.mean, np.sum]})
Out[20]:
mean sum
dummy
1 0.036901 0.369012
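or as a dictionary (this nested form has been deprecated since pandas 0.20, and commenters report that pandas 1.2.2 and 1.3.2 reject it with "SpecificationError: nested renamer is not supported"):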
In [21]: df.groupby('dummy').agg({'returns':
{'Mean': np.mean, 'Sum': np.sum}})
Out[21]:
returns
Mean Sum
dummy
1 0.036901 0.369012
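To see how the two historical forms behave on whatever pandas version is installed, a small check like the following may help. It is a sketch on a made-up three-row frame. The flattening line borrows the `to_flat_index()` idea from the comments, and the broad `except` is deliberate because the import path of `SpecificationError` differs between pandas versions.

```python
import pandas as pd

# Tiny illustrative frame (not the data from the question).
df = pd.DataFrame({"dummy": [1, 1, 1], "returns": [0.1, -0.2, 0.3]})

# List form: yields a two-level column index (source column, function name).
by_list = df.groupby("dummy").agg({"returns": ["mean", "sum"]})
print(by_list.columns.tolist())

# Optional: collapse the two levels into flat names such as "returns_mean".
flat = by_list.copy()
flat.columns = ["_".join(col) for col in by_list.columns.to_flat_index()]
print(flat.columns.tolist())

# Nested-dict form: record what this pandas version does with it.
try:
    df.groupby("dummy").agg({"returns": {"Mean": "mean", "Sum": "sum"}})
    outcome = "accepted"
except Exception as exc:  # broad on purpose, see the note above
    outcome = type(exc).__name__
print(outcome)
```

If `outcome` is an error name rather than "accepted", use the list form or named aggregation instead.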
Solution 2
TL;DR: Pandas groupby.agg has a new, easier syntax for specifying (1) aggregations on multiple columns, and (2) multiple aggregations on a column. So, to do this for pandas >= 0.25, use
df.groupby('dummy').agg(Mean=('returns', 'mean'), Sum=('returns', 'sum'))
Mean Sum
dummy
1 0.036901 0.369012
OR
df.groupby('dummy')['returns'].agg(Mean='mean', Sum='sum')
Mean Sum
dummy
1 0.036901 0.369012
Pandas >= 0.25: Named Aggregation
Pandas has changed the behavior of GroupBy.agg in favour of a more intuitive syntax for specifying named aggregations. See the 0.25 docs section on Enhancements as well as relevant GitHub issues GH18366 and GH26512.
From the documentation,
To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where
- The keywords are the output column names
- The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.
You can now pass a tuple via keyword arguments. The tuples follow the format of (<colName>, <aggFunc>).
import pandas as pd
pd.__version__
# '0.25.0.dev0+840.g989f912ee'
# Setup
df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                   'height': [9.1, 6.0, 9.5, 34.0],
                   'weight': [7.9, 7.5, 9.9, 198.0]})
df.groupby('kind').agg(
    max_height=('height', 'max'),
    min_weight=('weight', 'min'),
)
max_height min_weight
kind
cat 9.5 7.9
dog 34.0 7.5
Alternatively, you can use pd.NamedAgg (essentially a namedtuple) which makes things more explicit.
df.groupby('kind').agg(
max_height=pd.NamedAgg(column='height', aggfunc='max'),
min_weight=pd.NamedAgg(column='weight', aggfunc='min')
)
max_height min_weight
kind
cat 9.5 7.9
dog 34.0 7.5
It is even simpler for a Series: pass each aggfunc as a keyword argument, with the keyword as the output column name.
df.groupby('kind')['height'].agg(max_height='max', min_height='min')
max_height min_height
kind
cat 9.5 9.1
dog 34.0 6.0
Lastly, if your column names aren't valid Python identifiers, use a dictionary with unpacking:
df.groupby('kind')['height'].agg(**{'max height': 'max', ...})
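A fuller sketch of the unpacking approach, using the cat/dog frame from above. The names `spec` and `generated` are hypothetical and only illustrate that the mapping can be written by hand or built programmatically before it is unpacked into `agg()`.

```python
import pandas as pd

df = pd.DataFrame({
    "kind": ["cat", "dog", "cat", "dog"],
    "height": [9.1, 6.0, 9.5, 34.0],
    "weight": [7.9, 7.5, 9.9, 198.0],
})

# Output names containing spaces cannot be written as keywords, so collect
# them in a dict of name -> (column, aggregation) and unpack it.
spec = {
    "max height": ("height", "max"),
    "min height": ("height", "min"),
    "min weight": ("weight", "min"),
}
result = df.groupby("kind").agg(**spec)
print(result)

# The same mapping can be generated instead of typed out.
generated = {
    f"{func} {col}": (col, func)
    for col in ("height", "weight")
    for func in ("min", "max")
}
result_generated = df.groupby("kind").agg(**generated)
print(result_generated.columns.tolist())
```

Generating the mapping keeps the output names in sync with the functions actually applied, which matters when names are derived at an intermediate stage of a calculation.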
Pandas < 0.25
In more recent versions of pandas leading up to 0.24, if you use a dictionary to specify column names for the aggregation output, you will get a FutureWarning:
df.groupby('dummy').agg({'returns': {'Mean': 'mean', 'Sum': 'sum'}})
# FutureWarning: using a dict with renaming is deprecated and will be removed
# in a future version
Using a dictionary for renaming columns has been deprecated since v0.20. On more recent versions of pandas, the renaming can be specified more simply by passing a list of tuples. If you specify the functions this way, every function for that column must be given as a (name, function) tuple.
df.groupby("dummy").agg({'returns': [('op1', 'sum'), ('op2', 'mean')]})
returns
op1 op2
dummy
1 0.328953 0.032895
Or,
df.groupby("dummy")['returns'].agg([('op1', 'sum'), ('op2', 'mean')])
op1 op2
dummy
1 0.328953 0.032895
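A self-contained sketch of the list-of-tuples form on a Series selection, using a made-up frame with two groups so the per-group results are easy to check by hand:

```python
import pandas as pd

# Illustrative data: group 1 has two rows, group 2 has one.
df = pd.DataFrame({"dummy": [1, 1, 2], "returns": [0.1, 0.2, 0.4]})

# Each tuple is (output column name, aggregation).
result = df.groupby("dummy")["returns"].agg([("op1", "sum"), ("op2", "mean")])
print(result)
```

Here `op1` holds the per-group sum and `op2` the per-group mean of `returns`.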
Solution 3
Would something like this work?
In [7]: df.groupby('dummy').returns.agg({'func1' : lambda x: x.sum(), 'func2' : lambda x: x.prod()})
Out[7]:
func2 func1
dummy
1 -4.263768e-16 -0.188565
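On pandas >= 0.25 the same idea can be written with keyword arguments instead of a renaming dict, following the lambda pattern suggested in the comments. This is a sketch on made-up values, not the original poster's data.

```python
import pandas as pd

# Illustrative values chosen so the sum and product are exact.
df = pd.DataFrame({"dummy": [1, 1, 1], "returns": [0.5, 2.0, 3.0]})

# Keyword = output column name, value = callable applied to each group.
result = df.groupby("dummy")["returns"].agg(
    func1=lambda x: x.sum(),
    func2=lambda x: x.prod(),
)
print(result)
```

The output columns appear in the order the keywords are given.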
Comments
-
ely almost 2 years
Is there a pandas built-in way to apply two different aggregating functions f1, f2 to the same column df["returns"], without having to call agg() multiple times?
Example dataframe:
import pandas as pd
import datetime as dt
import numpy as np

# pd.np was removed in pandas 2.0, so seed NumPy directly.
np.random.seed(0)
df = pd.DataFrame({
    "date": [dt.date(2012, x, 1) for x in range(1, 11)],
    "returns": 0.05 * np.random.randn(10),
    "dummy": np.repeat(1, 10),
})
The intuitively right way, which does not work, would be:
# Assume `f1` and `f2` are defined for aggregating.
df.groupby("dummy").agg({"returns": f1, "returns": f2})
A Python dict cannot hold duplicate keys, so the second entry silently replaces the first. Is there any other way to express the input to agg()? Perhaps a list of tuples [(column, function)] would work better, to allow multiple functions applied to the same column? But agg() seems to accept only a dictionary.
Is there a workaround for this besides defining an auxiliary function that just applies both of the functions inside of it? (How would this work with aggregation anyway?)
-
jezrael over 5 years
Related - Aggregation in pandas
-
cs95 almost 5 years
From 0.25 onwards, pandas provides a more intuitive syntax for multiple aggregations, as well as renaming output columns. See the documentation on Named Aggregations.
-
smci over 4 years
FYI this question was asked way back on pandas 0.8.x in 9/2012
-
cs95 over 4 years
FYI the accepted answer is also deprecated - don't pass agg() a dict of dicts.
-
smci over 4 years
@cs95: I know it's deprecated, I'm saying SO is becoming littered with old stale solutions from old versions. SO doesn't have a way of marking that - other than comments.
-
ely almost 4 years
@cs95 the first part of the accepted answer, using a dict with a list for the value, is still the best way to solve this, even in the latest editions of pandas. I do not agree that the accepted answer is wrong / deprecated or needs to be unaccepted. The second part with a nested dict does lead to a deprecation warning, but the warning given does not make any sense and does not offer a solution for getting multiple newly-named columns derived from aggregates of the same source column.
-
ely almost 4 years
The deprecation warning at the core says,
For column-specific groupby renaming, use named aggregation
>>> df.groupby(...).agg(name=('column', aggfunc))
but in spending the last 10 minutes trying to use this syntax to achieve the same multi-aggregate-with-renaming operation as in the accepted answer here, I couldn't get it to work.
-
cs95 over 3 years
Late to respond, the first part is indeed the generally accepted method for multi aggregations of the same col but does not include guidance on renaming output columns (although admittedly that wasn't an explicit requirement at the OP). As to my previous comment, I should have been more specific, yes the second half of the answer is deprecated. Also, the new syntax allows you a way to do what you are attempting, if you're having trouble please feel free to open a new question and I'd be happy to look into it.
-
cs95 over 3 years
Shouldn't have called for unacceptance, err, removed that comment. Someone else called for it first, though.
-
-
ely over 11 years
No, this does not work. If you look at the doc string for aggregate it explicitly says that when a dict is passed, the keys must be column names. So either your example is something you typed in without checking for this error, or else Pandas breaks its own docs here.
-
ely over 11 years
N/M I didn't see the extra call to returns in there. So this is the Series version of aggregate? I'm looking to do the DataFrame version of aggregate, and I want to apply several different aggregations to each column all at once.
-
Chang She over 11 years
Try this: df.groupby('dummy').agg({'returns': {'func1' : lambda x: x.sum(), 'func2' : lambda x: x.mean()}})
-
ely over 11 years
It gives an assertion error with no message. From the looks of the code (pandas.core.internals.py, lines 406-408, version 0.7.3) it looks like it does a check at the end to make sure it's not returning more columns than there are keys within the first layer of the aggregation dictionary.
-
Chang She over 11 years
Works fine on master. You want to try updating?
-
ely over 11 years
Can't: it's a network-maintained Python install. I can only use the packages in our network version, which is going to remain at 0.7.3 for a while.
-
Ben over 8 years
Is there a way to specify the result column names?
-
Stewbaca over 8 years
@Ben I think you must use a rename afterwards. example by Tom Augspurger (see cell 25)
-
bmu about 8 years
@Ben: I added an example
-
sparc_spread about 7 years
Upvoted this yesterday, because this is such a common use case, and yet it is not in the pandas agg() documentation at all! Excellent solution.
-
joelostblom almost 7 years
@sparc_spread Passing multiple functions as a list is well described in the pandas documentation. Renaming and passing multiple functions as a dictionary will be deprecated in a future version of pandas. Details are in the 0.20 change log, which I also summarized elsewhere on SO.
-
sparc_spread almost 7 years
Thanks - will check it out
-
user3226167 over 6 years
@ben flatten column index: stackoverflow.com/questions/14507794/…
-
ShikharDua about 6 years
What if we have to use a list of lambda functions?
-
ad_s over 5 years
As for the 'as a dictionary' example: using a dict with renaming is deprecated and will be removed in a future version
-
cs95 over 5 years
It has already been said, but using dictionaries for renaming output columns from agg is deprecated. You can instead specify a list of tuples. See this answer.
-
NKSHELL over 4 years
This should be the top answer because it is a clearer, cleaner solution that uses the newer version of the interface.
-
victorlin over 4 years
The examples used for named aggregation don't solve the original problem of using multiple aggregations on the same column. For example, can you aggregate by both min and max for height without first subsetting for df.groupby('kind')['height']?
-
cs95 over 4 years
@victor I added a TLDR at the top of the answer that directly addresses the question. And the answer to your second question is yes, please take a look at the edit on my answer.
-
Onur Ece about 4 years
A more generic version of the last example of your >=0.25 answer, one that handles aggregating multiple columns like this, would've been great.
df.groupby("kind").agg(**{
    'max height': pd.NamedAgg(column='height', aggfunc=max),
    'min weight': pd.NamedAgg(column='weight', aggfunc=min)
})
-
ely almost 4 years
@cs95 The deprecation warning that appears in the dictionary-renaming case is unfortunately not good enough to help users. For example, it is not intuitive that the name keyword used in that warning message should be the new name to use, and it seems like a confusing reorganization of this much clearer dict-based approach. For example, what if the new column names need to be derived programmatically at an intermediate stage of the calculation, and thus won't be available to use this way as kwargs without constructing another dict?
-
jabberwocky over 3 years
Great answer! How do you do this with lambda functions?
-
cs95 over 3 years
@jabberwocky For the first example, do something like
df.groupby('dummy').agg(Mean=('returns', lambda x: x.mean()), Sum=('returns', lambda x: x.sum()))
-
bers about 3 years
I am getting "pandas.core.base.SpecificationError: nested renamer is not supported" with your second example on pandas 1.2.2.
-
P D over 2 years
pandas 1.3.2, getting pandas.core.base.SpecificationError: nested renamer is not supported for the second example as well.
-
Ufos over 2 years
Use df.columns = ['_'.join(a) for a in df.columns.to_flat_index()] to autorename the result
-
Ufos over 2 years
I think this post is missing a hint on how to do auto-renaming. E.g. dfa = df.groupby('gr1').agg({'height':['min', 'max']}).reset_index() and then dfa.columns = ['_'.join(a) for a in dfa.columns.to_flat_index()] -- now I don't need to type max_height=pd.NamedAgg...
-
Ufos over 2 years
I constantly and casually aggregate multiple columns on various levels. Typing a name for each of them is a huge waste of time, and a potential source of naming errors (first did 'v1_mean', later decided to use the median, but didn't rename the column).