Pandas groupby mean() not ignoring NaNs
Solution 1
By default, pandas skips NaN values. You can make it include NaN by specifying skipna=False:
In [215]: c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Out[215]:
a
b
1 1.5
2 NaN
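The snippet above assumes a frame c that is only defined further down the page. A minimal self-contained sketch, reusing the sample data from the question, puts the default behaviour next to the skipna=False variant (the names default_mean and strict_mean are illustrative only):

```python
import numpy as np
import pandas as pd

# Same sample frame as in the question.
c = pd.DataFrame({'a': [1, np.nan, 2, 3], 'b': [1, 2, 1, 2]})

# Default: NaN is skipped inside each group, so group 2 averages only the 3.
default_mean = c.groupby('b')['a'].mean()

# Series.mean(skipna=False) propagates NaN, so group 2 becomes NaN.
strict_mean = c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})

print(default_mean)
print(strict_mean)
```

The lambda runs once per group in Python, so this is expected to be slower than the built-in groupby mean on large frames.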
Solution 2
There is mean(skipna=False), but it's not working
GroupBy aggregation methods (min, max, mean, median, etc.) have the skipna parameter, which is meant for this exact task, but it seems that currently (May 2020) there is a bug (issue opened in March 2020) that prevents it from working correctly.
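Because the behaviour of skipna on GroupBy methods depends on the installed pandas version, one option is to try the built-in call and fall back to a per-group lambda. This is only a sketch: strict_group_mean is a hypothetical helper, and it assumes that a pandas version without support rejects the keyword with a TypeError or ValueError rather than silently ignoring it.

```python
import numpy as np
import pandas as pd


def strict_group_mean(df, key, col):
    """Hypothetical helper: per-group mean of `col` that propagates NaN."""
    grouped = df.groupby(key)[col]
    try:
        # Assumption: this only succeeds if the installed pandas
        # accepts skipna on the groupby mean.
        return grouped.mean(skipna=False)
    except (TypeError, ValueError):
        # Fallback: Series.mean does accept skipna, at the cost of
        # calling a Python function once per group.
        return grouped.agg(lambda s: s.mean(skipna=False))


c = pd.DataFrame({'a': [1, np.nan, 2, 3], 'b': [1, 2, 1, 2]})
result = strict_group_mean(c, 'b', 'a')
print(result)
```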
Quick workaround
Complete working example based on these comments: @Serge Ballesta, @RoelAdriaans
>>> import pandas as pd
>>> import numpy as np
>>> c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
>>> c.fillna(np.inf).groupby('b').mean().replace(np.inf, np.nan)
a
b
1 1.5
2 NaN
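The sentinel trick is only unambiguous when the data contains no real infinities. A hedged sketch of the same idea wrapped in a function (group_mean_keep_nan is a hypothetical name, and it assumes every non-key column is numeric) checks for that first and leaves the key column untouched:

```python
import numpy as np
import pandas as pd


def group_mean_keep_nan(df, key):
    """Hypothetical helper: group mean that is NaN for groups containing NaN.

    Assumes all columns other than `key` are numeric.
    """
    values = df.drop(columns=key)
    # The inf sentinel is ambiguous if the data already holds infinities.
    if np.isinf(values.to_numpy(dtype=float)).any():
        raise ValueError("data already contains inf; sentinel would be ambiguous")
    # Fill only the value columns, then group by the original key column.
    means = values.fillna(np.inf).groupby(df[key]).mean()
    # Any group that absorbed a sentinel is turned back into NaN.
    return means.replace([np.inf, -np.inf], np.nan)


c = pd.DataFrame({'a': [1, np.nan, 2, 3], 'b': [1, 2, 1, 2]})
result = group_mean_keep_nan(c, 'b')
print(result)
```

Replacing both np.inf and -np.inf follows the suggestion in the comment by @RoelAdriaans.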
For additional information and updates follow the link above.
Solution 3
Use the skipna option:
c.groupby('b').apply(lambda g: g.mean(skipna=False))
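The call above hands each whole sub-frame to the lambda, so, depending on the pandas version, the grouping column b may be averaged along with a. A small sketch that selects the value column first avoids that (the name strict is illustrative only):

```python
import numpy as np
import pandas as pd

c = pd.DataFrame({'a': [1, np.nan, 2, 3], 'b': [1, 2, 1, 2]})

# Select column 'a' first so the lambda receives one Series per group
# and only that column is averaged.
strict = c.groupby('b')['a'].apply(lambda s: s.mean(skipna=False))
print(strict)
```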
Solution 4
Another approach would be to use a value that is not ignored by default, for example np.inf:
>>> c = pd.DataFrame({'a':[1,np.inf,2,3],'b':[1,2,1,2]})
>>> c.groupby('b').mean()
a
b
1 1.500000
2 inf
Solution 5
There are three different methods for it:
- slowest:
c.groupby('b').apply(lambda g: g.mean(skipna=False))
- faster than apply, but slower than the built-in mean:
c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
- Fastest, but needs more code:
method3 = c.groupby('b').mean()
# group keys of every row where 'a' is NaN
nan_groups = c.loc[c['a'].isna(), 'b'].unique()
method3.loc[method3.index.isin(nan_groups), 'a'] = np.nan
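To check that a vectorised approach gives the same answer as the lambda-based one, the two can be compared directly. This is a sketch on the sample frame from the question, not a benchmark; via_agg, has_nan and via_mask are illustrative names.

```python
import numpy as np
import pandas as pd

c = pd.DataFrame({'a': [1, np.nan, 2, 3], 'b': [1, 2, 1, 2]})

# Lambda-based variant: one Python call per group.
via_agg = c.groupby('b')['a'].agg(lambda x: x.mean(skipna=False))

# Vectorised variant: built-in mean, then blank out every group
# that contains at least one NaN in column 'a'.
means = c.groupby('b')['a'].mean()
has_nan = c['a'].isna().groupby(c['b']).any()
via_mask = means.mask(has_nan)

# Raises AssertionError if the two results differ.
pd.testing.assert_series_equal(via_agg, via_mask)
print(via_mask)
```

Timing both variants on realistically sized data (for example with timeit) would be needed before relying on the speed ordering given in the list above.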
Admin
Updated on July 09, 2022
Comments
-
Admin almost 2 years
If I calculate the mean of a groupby object and one of the groups contains NaNs, the NaNs are ignored. Even when applying np.mean, it still returns just the mean of all valid numbers. I would expect it to return NaN as soon as one NaN is within the group. Here is a simplified example of the behaviour:
import pandas as pd
import numpy as np
c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
c.groupby('b').mean()
a
b
1 1.5
2 3.0
c.groupby('b').agg(np.mean)
a
b
1 1.5
2 3.0
I want to receive the following result:
a
b
1 1.5
2 NaN
I am aware that I can replace NaNs beforehand and that I can probably write my own aggregation function to return NaN as soon as a NaN is within the group. This function wouldn't be optimized though.
Do you know of an argument to achieve the desired behaviour with the optimized functions?
Btw, I think the desired behaviour was implemented in a previous version of pandas.
-
RoelAdriaans over 4 years
Before the mean calculation you can use fillna(np.inf), and after the mean you can use .replace([np.inf, -np.inf], np.nan) to restore the NaN values.