NumPy: calculate averages with NaNs removed

python numpy nan

46,788

Solution 1

I think what you want is a masked array:

import numpy as np

dat = np.array([[1,2,3], [4,5,np.nan], [np.nan,6,np.nan], [np.nan,np.nan,np.nan]])
mdat = np.ma.masked_array(dat,np.isnan(dat))  # mask every NaN entry
mm = np.mean(mdat,axis=1)  # row means over the unmasked values only
print(mm.filled(np.nan)) # the desired answer

Edit: Combining all of the timing data

from timeit import Timer

setupstr = """
import numpy as np
# scipy.stats.nanmean has since been removed from SciPy; np.nanmean replaces it
from scipy.stats.stats import nanmean
dat = np.random.normal(size=(1000,1000))
ii = np.ix_(np.random.randint(0,99,size=50),np.random.randint(0,99,size=50))
dat[ii] = np.nan
"""

method1 = """
mdat = np.ma.masked_array(dat,np.isnan(dat))
mm = np.mean(mdat,axis=1)
mm.filled(np.nan)
"""

N = 2
t1 = Timer(method1, setupstr).timeit(N)
t2 = Timer("[np.mean([l for l in d if not np.isnan(l)]) for d in dat]", setupstr).timeit(N)
t3 = Timer("np.array([r[np.isfinite(r)].mean() for r in dat])", setupstr).timeit(N)
t4 = Timer("np.ma.masked_invalid(dat).mean(axis=1)", setupstr).timeit(N)
t5 = Timer("nanmean(dat,axis=1)", setupstr).timeit(N)

# Ratios are relative to method1 (the masked-array approach)
print('Time: %f\tRatio: %f' % (t1, t1/t1))
print('Time: %f\tRatio: %f' % (t2, t2/t1))
print('Time: %f\tRatio: %f' % (t3, t3/t1))
print('Time: %f\tRatio: %f' % (t4, t4/t1))
print('Time: %f\tRatio: %f' % (t5, t5/t1))

Returns:

Time: 0.045454  Ratio: 1.000000
Time: 8.179479  Ratio: 179.950595
Time: 0.060988  Ratio: 1.341755
Time: 0.070955  Ratio: 1.561029
Time: 0.065152  Ratio: 1.433364

Solution 2

If performance matters, you should use bottleneck.nanmean() instead:

http://pypi.python.org/pypi/Bottleneck
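
A minimal sketch of how this might look on the array from the question, assuming Bottleneck is installed and importable as bn. The fallback to np.nanmean (NumPy 1.8+) and the name row_means are additions of this sketch, not part of the original answer.

```python
import warnings

import numpy as np

try:
    import bottleneck as bn  # optional dependency, assumed importable as `bn`
    nanmean = bn.nanmean
except ImportError:
    nanmean = np.nanmean  # fallback added for this sketch (NumPy >= 1.8)

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

with warnings.catch_warnings():
    # The NumPy fallback warns about the all-NaN row; silence that here
    warnings.simplefilter("ignore", category=RuntimeWarning)
    row_means = nanmean(dat, axis=1)

print(row_means)  # values: 2.0, 4.5, 6.0, nan
```

Either function takes the same axis argument here, so swapping one for the other should not change the calling code.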

Solution 3

Assuming you also have SciPy installed, you can use its nanmean function:

http://www.scipy.org/doc/api_docs/SciPy.stats.stats.html#nanmean

Solution 4

From NumPy 1.8 (released 2013-10-30) onwards, np.nanmean does precisely what you need:

>>> import numpy as np
>>> np.nanmean(np.array([1.5, 3.5, np.nan]))
2.5
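
Applied to the 2-D array from the question, this could look roughly like the sketch below. It assumes NumPy 1.8 or later; row_means is just an illustrative name. A row that contains only NaNs yields nan and triggers a RuntimeWarning, which the sketch chooses to silence.

```python
import warnings

import numpy as np

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

with warnings.catch_warnings():
    # The last row has no valid values: the result is nan plus a RuntimeWarning
    warnings.simplefilter("ignore", category=RuntimeWarning)
    row_means = np.nanmean(dat, axis=1)

print(row_means)  # values: 2.0, 4.5, 6.0, nan
```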

Solution 5

A masked array with the nans filtered out can also be created on the fly:

print(np.ma.masked_invalid(dat).mean(1))
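
Note that the result is itself a masked array: a row with no valid entries comes back masked rather than as nan. A small sketch of converting it to a plain array, assuming the dat array from the question (the names masked_means and row_means are illustrative):

```python
import numpy as np

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

# masked_invalid masks NaN (and inf) entries before taking the row means
masked_means = np.ma.masked_invalid(dat).mean(axis=1)

# The all-NaN row has nothing to average, so its mean is masked
print(np.ma.getmaskarray(masked_means))

# Replace masked results with nan to get an ordinary ndarray
row_means = masked_means.filled(np.nan)
print(row_means)  # values: 2.0, 4.5, 6.0, nan
```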

Author by

Mike T

Hydrogeologist, numerical modeller and GIS professional. The main programming languages I use are Python, R and SQL. I dabble with Fortran and C/C++/C# on occasion. Thanks to anyone who has helped me!

Updated on July 09, 2022

Comments

Mike T almost 2 years

How can I calculate mean values along an axis of a matrix while leaving nan values out of the calculation? (For R people, think na.rm = TRUE).

Here is my [non-]working example:

import numpy as np
dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])
print(dat)
print(dat.mean(1))  # [  2.  nan  nan  nan]

With NaNs removed, my expected output would be:

array([ 2.,  4.5,  6.,  nan])

JoshAdel about 13 years

I hadn't thought to use this. It's a nice one-liner, but it's still ~1.5-2x slower than my solution in my tests. Still +1 for exposing me to a np.ma method that I hadn't looked at before.
JoshAdel about 13 years

Just for completeness since I've timed all of the other code - stats.stats.nanmean is ~1.5x slower than the np.ma solution.
mathtick over 11 years

I think scipy.nanmean should be the first thing you try. I wonder if it is still slow?
JoshAdel over 11 years

@mathtick There are a variety of ways of accomplishing what the OP asked. I offered one such method that is a bit more verbose, but is faster than all of the other suggested ones that are benchmarked above, at least on my machine (this still holds true now with updated versions of scipy and numpy).
JoshAdel over 11 years

@mathtick Furthermore, there is no scipy.nanmean method in scipy 0.10 or 0.11 as far as I can tell. There is scipy.stats.stats.nanmean and scipy.stats.nanmean, which are equivalent and I tested above.
mathtick over 11 years

Sorry, that should be scipy.stats.nanmean ... and I'm running scipy.__version__ '0.10.1'.
denis over 11 years

scipy.stats.nanmean and .nanstd do axis= too (with default axis=0 not None)
Dr. Jan-Philip Gehrcke about 11 years

I tested this in one dimension and np.nansum(dat) / np.sum(~np.isnan(dat)) is slightly faster than np.mean(np.ma.masked_array(dat, np.isnan(dat))). However, as pointed out earlier, bottleneck is 10x faster.
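
A rough sketch of how the nansum approach from this comment might extend to row means of the question's 2-D array. The axis argument, the np.errstate guard, and the names valid_counts and row_means are assumptions of this sketch and were not part of the comment.

```python
import numpy as np

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

# Number of non-NaN entries in each row
valid_counts = np.sum(~np.isnan(dat), axis=1)

# An all-NaN row gives 0.0 / 0, which evaluates to nan; suppress the warning
with np.errstate(invalid="ignore", divide="ignore"):
    row_means = np.nansum(dat, axis=1) / valid_counts

print(row_means)  # values: 2.0, 4.5, 6.0, nan
```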
Sklavit over 8 years

It seems np.nansum(dat) is the best.

Python 2.7.11 |Anaconda 2.4.1 (64-bit) IPython 4.0.1

In[190]: %timeit method1()
100 loops, best of 3: 7.09 ms per loop
In[191]: %timeit [np.mean([l for l in d if not np.isnan(l)]) for d in dat]
1 loops, best of 3: 1.04 s per loop
In[192]: %timeit np.array([r[np.isfinite(r)].mean() for r in dat])
10 loops, best of 3: 19.6 ms per loop
In[193]: %timeit np.ma.masked_invalid(dat).mean(axis=1)
100 loops, best of 3: 11.8 ms per loop
In[194]: %timeit nanmean(dat,axis=1)
100 loops, best of 3: 6.36 ms per loop