NumPy: calculate averages with NaNs removed
Solution 1
I think what you want is a masked array:
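import numpy as np
dat = np.array([[1, 2, 3], [4, 5, np.nan], [np.nan, 6, np.nan], [np.nan, np.nan, np.nan]])
mdat = np.ma.masked_array(dat, np.isnan(dat))  # mask every NaN entry
mm = np.mean(mdat, axis=1)  # row means over the unmasked values only
print(mm.filled(np.nan))  # the desired answer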
Edit: Combining all of the timing data
from timeit import Timer
setupstr="""
import numpy as np
from scipy.stats.stats import nanmean  # removed from later SciPy releases; np.nanmean is the modern equivalent
dat = np.random.normal(size=(1000,1000))
ii = np.ix_(np.random.randint(0,99,size=50),np.random.randint(0,99,size=50))
dat[ii] = np.nan
"""
method1="""
mdat = np.ma.masked_array(dat,np.isnan(dat))
mm = np.mean(mdat,axis=1)
mm.filled(np.nan)
"""
N = 2
t1 = Timer(method1, setupstr).timeit(N)
t2 = Timer("[np.mean([l for l in d if not np.isnan(l)]) for d in dat]", setupstr).timeit(N)
t3 = Timer("np.array([r[np.isfinite(r)].mean() for r in dat])", setupstr).timeit(N)
t4 = Timer("np.ma.masked_invalid(dat).mean(axis=1)", setupstr).timeit(N)
t5 = Timer("nanmean(dat,axis=1)", setupstr).timeit(N)
print('Time: %f\tRatio: %f' % (t1, t1/t1))
print('Time: %f\tRatio: %f' % (t2, t2/t1))
print('Time: %f\tRatio: %f' % (t3, t3/t1))
print('Time: %f\tRatio: %f' % (t4, t4/t1))
print('Time: %f\tRatio: %f' % (t5, t5/t1))
Returns:
Time: 0.045454 Ratio: 1.000000
Time: 8.179479 Ratio: 179.950595
Time: 0.060988 Ratio: 1.341755
Time: 0.070955 Ratio: 1.561029
Time: 0.065152 Ratio: 1.433364
Solution 2
If performance matters, you should use bottleneck.nanmean() instead:
http://pypi.python.org/pypi/Bottleneck
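As a rough sketch of how this might look on the question's data: the snippet below assumes Bottleneck is installed and importable as `bottleneck`. Falling back to `np.nanmean` when it is missing is an addition for illustration and is not part of the original answer.

```python
import warnings

import numpy as np

try:
    import bottleneck as bn  # optional third-party package
    nanmean = bn.nanmean
except ImportError:
    # Assumption for illustration: fall back to NumPy if Bottleneck is absent.
    nanmean = np.nanmean

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

with warnings.catch_warnings():
    # The NumPy fallback warns about the all-NaN row; silence that here.
    warnings.simplefilter("ignore", category=RuntimeWarning)
    row_means = nanmean(dat, axis=1)

print(row_means)
```

Both functions take the same `axis` argument, so they can be swapped without changing the call.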
Solution 3
Assuming you've also got SciPy installed:
http://www.scipy.org/doc/api_docs/SciPy.stats.stats.html#nanmean
Solution 4
From NumPy 1.8 (released 2013-10-30) onwards, nanmean does precisely what you need:
>>> import numpy as np
>>> np.nanmean(np.array([1.5, 3.5, np.nan]))
2.5
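Applied to the 2-D array from the question, a minimal sketch could look like the following. Passing `axis=1` gives one mean per row. The warning filter is an optional addition: a row that contains only NaNs still yields nan, but NumPy emits a RuntimeWarning for it.

```python
import warnings

import numpy as np

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

with warnings.catch_warnings():
    # The last row is all NaN, which triggers "Mean of empty slice".
    warnings.simplefilter("ignore", category=RuntimeWarning)
    row_means = np.nanmean(dat, axis=1)

print(row_means)
```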
Solution 5
A masked array with the nans filtered out can also be created on the fly:
print(np.ma.masked_invalid(dat).mean(1))
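A short sketch of what that one-liner returns, using the array from the question. The result is itself a masked array, and the all-NaN row comes back masked rather than as nan, so `filled(np.nan)` is used here to get a plain ndarray. Note that `masked_invalid` also masks infinities, not only NaNs.

```python
import numpy as np

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

# Mask NaN (and inf) entries, then average the remaining values per row.
masked_means = np.ma.masked_invalid(dat).mean(axis=1)

# Convert back to a regular array, writing nan where a row had no valid data.
row_means = masked_means.filled(np.nan)

print(masked_means)
print(row_means)
```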
Mike T
Updated on July 09, 2022
Comments
-
Mike T almost 2 years
How can I calculate mean values along an axis of a matrix while excluding nan values from the calculation? (For R people, think na.rm = TRUE.)
Here is my [non-]working example:
import numpy as np
dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])
print(dat)
print(dat.mean(1))  # [ 2. nan nan nan]
With NaNs removed, my expected output would be:
array([ 2., 4.5, 6., nan])
-
JoshAdel about 13 years
I hadn't thought to use this. It's a nice one-liner, but it's still ~1.5-2x slower than my solution in my tests. Still +1 for exposing me to an np.ma method that I hadn't looked at before.
-
JoshAdel about 13 years
Just for completeness, since I've timed all of the other code: stats.stats.nanmean is ~1.5x slower than the np.ma solution.
-
mathtick over 11 years
I think scipy.nanmean should be the first thing you try. I wonder if it is still slow?
-
JoshAdel over 11 years
@mathtick There are a variety of ways of accomplishing what the OP asked. I offered one such method that is a bit more verbose, but is faster than all of the other suggested ones that are benchmarked above, at least on my machine (this still holds true now with updated versions of scipy and numpy).
-
JoshAdel over 11 years
@mathtick Furthermore, there is no scipy.nanmean method in scipy 0.10 or 0.11 as far as I can tell. There are scipy.stats.stats.nanmean and scipy.stats.nanmean, which are equivalent and which I tested above.
-
mathtick over 11 years
Sorry, that should be scipy.stats.nanmean ... and I'm running scipy.__version__ '0.10.1'.
-
denis over 11 years
scipy.stats.nanmean and .nanstd do axis= too (with default axis=0, not None)
-
Dr. Jan-Philip Gehrcke about 11 years
I tested this in one dimension and np.nansum(dat) / np.sum(~np.isnan(dat)) is slightly faster than np.mean(np.ma.masked_array(dat, np.isnan(dat))). However, as pointed out earlier, bottleneck is 10x faster.
-
Sklavit over 8 years
It seems np.nansum(dat) is the best.
Python 2.7.11 |Anaconda 2.4.1 (64-bit)
IPython 4.0.1
In[190]: %timeit method1()
100 loops, best of 3: 7.09 ms per loop
In[191]: %timeit [np.mean([l for l in d if not np.isnan(l)]) for d in dat]
1 loops, best of 3: 1.04 s per loop
In[192]: %timeit np.array([r[np.isfinite(r)].mean() for r in dat])
10 loops, best of 3: 19.6 ms per loop
In[193]: %timeit np.ma.masked_invalid(dat).mean(axis=1)
100 loops, best of 3: 11.8 ms per loop
In[194]: %timeit nanmean(dat,axis=1)
100 loops, best of 3: 6.36 ms per loop
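The nansum-over-count idea from the comments was only tested in one dimension. A possible row-wise version is sketched below. The `axis=1` arguments and the `np.errstate` guard are additions for illustration: they keep the 0/0 division for the all-NaN row from raising a warning.

```python
import numpy as np

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

totals = np.nansum(dat, axis=1)          # per-row sum, ignoring NaNs
counts = np.sum(~np.isnan(dat), axis=1)  # per-row count of non-NaN values

with np.errstate(invalid="ignore", divide="ignore"):
    # A row with no valid values has a count of 0, so its mean becomes nan.
    row_means = totals / counts

print(row_means)
```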