Pandas aggregation ignoring NaN's
Solution 1
Use numpy's nansum and nanmean:
from numpy import nansum
from numpy import nanmean
data.groupby(groupbyvars).agg({'amount': [ nansum, nanmean]}).reset_index()
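To make the call above concrete, here is a minimal sketch on a toy frame. The sample values and the contents of `groupbyvars` are assumptions made up for illustration; only the `agg` call itself comes from the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one NaN in the first group.
data = pd.DataFrame({
    "origin": ["A", "A", "B", "B"],
    "type": ["x", "x", "y", "y"],
    "amount": [1.0, np.nan, 3.0, 5.0],
})
groupbyvars = ["origin", "type"]  # assumed grouping columns

# nansum / nanmean skip the NaN instead of propagating it.
result = (
    data.groupby(groupbyvars)
    .agg({"amount": [np.nansum, np.nanmean]})
    .reset_index()
)
print(result)
```

In this sketch the group ("A", "x") aggregates over its single non-NaN value rather than returning NaN.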
As a workaround for older versions of numpy, and also a way to fix your last try:
When you write pd.Series.sum(skipna=True), you actually call the method instead of passing it. To use it this way, define a partial. So if you don't have nanmean, let's define s_na_mean and use that:
from functools import partial
s_na_mean = partial(pd.Series.mean, skipna = True)
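The question reports that the partial fails with "'functools.partial' object has no attribute '__name__'", presumably because that pandas version uses the function name to label the output column. One possible way around it, swapped in here in place of the partial, is a plain named function. The helper names `s_na_sum` and `s_na_mean` and the sample data below are hypothetical.

```python
import numpy as np
import pandas as pd

def s_na_sum(s):
    # Named wrapper: has a __name__, unlike a functools.partial object.
    return s.sum(skipna=True)

def s_na_mean(s):
    return s.mean(skipna=True)

# Hypothetical sample data for illustration.
data = pd.DataFrame({
    "origin": ["A", "A", "B", "B"],
    "type": ["x", "x", "y", "y"],
    "amount": [1.0, np.nan, 3.0, 5.0],
})
groupbyvars = ["origin", "type"]  # assumed grouping columns

result = (
    data.groupby(groupbyvars)
    .agg({"amount": [s_na_sum, s_na_mean]})
    .reset_index()
)
print(result)
```

Since skipna=True is already the default, these wrappers behave like the plain methods; they only make the intent explicit and give pandas a name to use.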
Solution 2
It might be too late, but it might still be useful for others.
Try the apply function:
import numpy as np
import pandas as pd
def nan_agg(x):
    # Keep only the non-NaN amounts; `not` on a Series raises, so use notnull().
    amount = x.loc[x['amount'].notnull(), 'amount']
    res = {}
    res['nansum'] = amount.sum()
    res['nanmean'] = amount.mean()
    return pd.Series(res, index=['nansum', 'nanmean'])
result = data.groupby(groupbyvars).apply(nan_agg).reset_index()
Zhubarb
Updated on July 18, 2022
Comments
-
Zhubarb almost 2 years
I aggregate my Pandas dataframe: data. Specifically, I want to get the average and sum amounts by tuples of [origin and type]. For averaging and summing I tried the numpy functions below:
import numpy as np
import pandas as pd
result = data.groupby(groupbyvars).agg({'amount': [ pd.Series.sum, pd.Series.mean]}).reset_index()
My issue is that the amount column includes NaNs, which causes the result of the above code to have a lot of NaN average and sums.
I know both pd.Series.sum and pd.Series.mean have skipna=True by default, so why am I still getting NaNs here?
I also tried this, which obviously did not work:
data.groupby(groupbyvars).agg({'amount': [ pd.Series.sum(skipna=True), pd.Series.mean(skipna=True)]}).reset_index()
EDIT: Upon @Korem's suggestion, I also tried to use a partial as below:
s_na_mean = partial(pd.Series.mean, skipna = True)
data.groupby(groupbyvars).agg({'amount': [ np.nansum, s_na_mean ]}).reset_index()
but get this error:
error: 'functools.partial' object has no attribute '__name__'
-
Zhubarb over 9 years
Thank you, I use numpy-1.7.1-py2.7-win32.egg, and it does not like nanmean, throwing the error: 'module' object has no attribute 'nanmean'. (I just checked, nanmean is new in version 1.8.0.)
-
Zhubarb over 9 years
But np.nansum seems to be added in version 1.8.0 as well. It is curious that I do not get the same error for that...
-
Zhubarb over 9 years
Thanks Korem, I tried this but it did not work, I edited my question, giving the error. Also, isn't skipna=True for pd.Series.mean by default anyways?
-
Korem over 9 years
@Zhubarb it is on by default, which suggests that the problem you're seeing is not where you think it is.
-
Zhubarb over 9 years
You are right, I tried this, which ran:
data.groupby(groupbyvars).agg({'amount': [ np.nansum, lambda x: pd.Series.mean(x,skipna=True)]}).reset_index()
but still get NaNs. I will investigate further. Maybe those are the cases for which all I have is NaN..
-
user3226167 over 6 years
pandas doc: "skipna : boolean, default True", "Exclude NA/null values. If an entire row/column is NA, the result will be NA"
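The last two comments point at groups that contain nothing but NaN. A small sketch of that case, on made-up data, might look like the following. The behaviour noted for `sum` is an assumption about reasonably recent pandas releases, where an all-NaN sum defaults to 0 unless `min_count` is set; older releases returned NaN there.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group "B" contains only NaN amounts.
data = pd.DataFrame({
    "origin": ["A", "A", "B", "B"],
    "amount": [1.0, np.nan, np.nan, np.nan],
})
grouped = data.groupby("origin")["amount"]

means = grouped.mean()               # NaN is skipped; an all-NaN group stays NaN
sums = grouped.sum()                 # all-NaN group sums to 0 in recent pandas
sums_min = grouped.sum(min_count=1)  # require at least one valid value, else NaN

print(means, sums, sums_min, sep="\n\n")
```

If NaN means survive even with skipna=True, checking for all-NaN groups, for example with `grouped.count() == 0`, is a quick way to confirm this explanation.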