pandas: GroupBy .pipe() vs .apply()
What pipe does is let you pass a callable, with the expectation that the object that called pipe is the object that gets passed to the callable.
With apply, we assume that the object that calls apply has subcomponents that will each get passed to the callable that was passed to apply. In the context of a groupby, the subcomponents are slices of the dataframe that called groupby, where each slice is a dataframe itself. The same holds for a series groupby, where each slice is a series.
The main difference in a groupby context is that with pipe, the callable has the entire groupby object available to it. With apply, the callable only knows about the local slice.
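To make the distinction concrete, here is a minimal sketch that records the type of object each method hands to the callable. The helper names (record_pipe, record_apply) and the two lists are made up for illustration; the small frame matches the one used in the Setup section.

```python
import pandas as pd

df = pd.DataFrame({"A": list("XXXXYYYYYY"), "B": range(10)})
grouped = df.groupby("A")["B"]

seen_by_pipe = []   # type names of the objects pipe passes to the callable
seen_by_apply = []  # type names of the objects apply passes to the callable

def record_pipe(obj):
    # Hypothetical helper: note what pipe passed in, then aggregate.
    seen_by_pipe.append(type(obj).__name__)
    return obj.sum()

def record_apply(obj):
    # Hypothetical helper: note what apply passed in, then aggregate.
    seen_by_apply.append(type(obj).__name__)
    return obj.sum()

piped = grouped.pipe(record_pipe)      # called once, with the groupby object
applied = grouped.apply(record_apply)  # called per group, with a Series slice

print(seen_by_pipe)
print(set(seen_by_apply))
```

Both calls return the per-group sums here, but record_pipe sees the whole groupby object while record_apply only ever sees one group's Series at a time.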
Setup
Consider df:
import pandas as pd

df = pd.DataFrame(dict(
    A=list('XXXXYYYYYY'),
    B=range(10)
))
A B
0 X 0
1 X 1
2 X 2
3 X 3
4 Y 4
5 Y 5
6 Y 6
7 Y 7
8 Y 8
9 Y 9
Example 1
Make the entire 'B' column sum to 1 while each sub-group sums to the same amount. This requires that the calculation be aware of how many groups exist. This is something we can't do with apply because apply wouldn't know how many groups exist.
s = df.groupby('A').B.pipe(lambda g: df.B / g.transform('sum') / g.ngroups)
s
0 0.000000
1 0.083333
2 0.166667
3 0.250000
4 0.051282
5 0.064103
6 0.076923
7 0.089744
8 0.102564
9 0.115385
Name: B, dtype: float64
Note:
s.sum()
0.99999999999999989
And:
s.groupby(df.A).sum()
A
X 0.5
Y 0.5
Name: B, dtype: float64
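As a variation on Example 1, the sketch below writes the same normalization as a named function that relies only on the groupby object it receives, rather than reaching back to df from inside the lambda. The function name normalize_across_groups is an assumption for illustration, not something from pandas.

```python
import pandas as pd

df = pd.DataFrame({"A": list("XXXXYYYYYY"), "B": range(10)})

def normalize_across_groups(g):
    # Hypothetical helper: g is the whole SeriesGroupBy object.
    # Step 1: scale each group so that it sums to 1.
    within_group = g.transform(lambda s: s / s.sum())
    # Step 2: divide by the number of groups, which only the
    # groupby object knows, so that every group sums to 1 / ngroups.
    return within_group / g.ngroups

s = df.groupby("A")["B"].pipe(normalize_across_groups)
print(s.sum())
print(s.groupby(df["A"]).sum())
```

The second step is the part that needs group-level knowledge: a callable handed a single slice has no direct way to see ngroups.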
Example 2
Subtract the mean of one group from the values of another. Again, this can't be done with apply because apply doesn't know about other groups.
df.groupby('A').B.pipe(
    # Series.append was removed in pandas 2.0; pd.concat does the same job
    lambda g: pd.concat([
        g.get_group('X') - g.get_group('Y').mean(),
        g.get_group('Y') - g.get_group('X').mean(),
    ])
)
0 -6.5
1 -5.5
2 -4.5
3 -3.5
4 2.5
5 3.5
6 4.5
7 5.5
8 6.5
9 7.5
Name: B, dtype: float64
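Example 2 can also be written without get_group, as in the rough sketch below. It assumes there are exactly two groups, so that the other group's mean equals the sum of both group means minus a row's own group mean. The function name subtract_other_group_mean is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": list("XXXXYYYYYY"), "B": range(10)})

def subtract_other_group_mean(g):
    # Hypothetical helper: g is the whole SeriesGroupBy object.
    group_means = g.mean()          # one mean per group
    own_mean = g.transform("mean")  # each row's own group mean
    # Valid only with exactly two groups: what remains after removing
    # a row's own group mean is the mean of the other group.
    other_mean = group_means.sum() - own_mean
    return df["B"] - other_mean

result = df.groupby("A")["B"].pipe(subtract_other_group_mean)
print(result)
```

Because transform keeps the original row index, the result stays aligned with df and needs no concatenation step.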
foglerit
Updated on June 05, 2022
Comments
foglerit almost 2 years
In the example from the pandas documentation about the new .pipe() method for GroupBy objects, an .apply() method accepting the same lambda would return the same results.
In [195]: import numpy as np
In [196]: n = 1000
In [197]: df = pd.DataFrame({'Store': np.random.choice(['Store_1', 'Store_2'], n),
   .....:                    'Product': np.random.choice(['Product_1', 'Product_2', 'Product_3'], n),
   .....:                    'Revenue': (np.random.random(n)*50+10).round(2),
   .....:                    'Quantity': np.random.randint(1, 10, size=n)})
In [199]: (df.groupby(['Store', 'Product'])
   .....:    .pipe(lambda grp: grp.Revenue.sum()/grp.Quantity.sum())
   .....:    .unstack().round(2))
Out[199]:
Product  Product_1  Product_2  Product_3
Store
Store_1       6.93       6.82       7.15
Store_2       6.69       6.64       6.77
I can see how the pipe functionality differs from apply for DataFrame objects, but not for GroupBy objects. Does anyone have an explanation or examples of what can be done with pipe but not with apply for a GroupBy?