Numpy mean of nonzero values


Solution 1

Count the non-zero entries in each row and divide each row's sum by that count. The implementation would look something like this -

np.true_divide(matrix.sum(1),(matrix!=0).sum(1))

If you are on an older version of NumPy where np.true_divide is not available, you can convert the count to float instead, like so -

matrix.sum(1)/(matrix!=0).sum(1).astype(float)

Sample run -

In [160]: matrix
Out[160]: 
array([[0, 0, 1, 0, 2],
       [1, 0, 0, 2, 0],
       [0, 1, 1, 0, 0],
       [0, 2, 2, 2, 2]])

In [161]: np.true_divide(matrix.sum(1),(matrix!=0).sum(1))
Out[161]: array([ 1.5,  1.5,  1. ,  2. ])
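
One case the expression above does not cover is a row that contains only zeros: the count is then 0 and the division produces a warning. A minimal sketch of one way to guard against that, assuming NaN is an acceptable result for such rows (the helper name nonzero_row_mean and the fill parameter are made up for illustration, not part of NumPy):

```python
import numpy as np

def nonzero_row_mean(matrix, fill=np.nan):
    """Row-wise mean over non-zero entries; all-zero rows get `fill`.

    Hypothetical helper for illustration only.
    """
    matrix = np.asarray(matrix)
    sums = matrix.sum(axis=1)
    counts = (matrix != 0).sum(axis=1)
    # Pre-fill the output, then divide only where the count is positive.
    out = np.full(sums.shape, fill, dtype=float)
    np.divide(sums, counts, out=out, where=counts > 0)
    return out

sample = np.array([[0, 0, 1, 0, 2],
                   [1, 0, 0, 2, 0],
                   [0, 1, 1, 0, 0],
                   [0, 2, 2, 2, 2],
                   [0, 0, 0, 0, 0]])  # last row has no non-zero entries
result = nonzero_row_mean(sample)
print(result)
```

The first four entries match the sample run above; the last one is whatever fill was set to.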

Another way to solve the problem would be to replace zeros with NaNs and then use np.nanmean, which would ignore those NaNs and in effect those original zeros, like so -

np.nanmean(np.where(matrix!=0,matrix,np.nan),1)

From a performance point of view, I would recommend the first approach.
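
As a quick sanity check, the two approaches can be compared on the sample matrix. This is only a sketch; the variable names by_count and by_nanmean are chosen here for illustration:

```python
import numpy as np

matrix = np.array([[0, 0, 1, 0, 2],
                   [1, 0, 0, 2, 0],
                   [0, 1, 1, 0, 0],
                   [0, 2, 2, 2, 2]])

# Approach 1: row sum divided by the number of non-zero entries.
by_count = np.true_divide(matrix.sum(1), (matrix != 0).sum(1))

# Approach 2: replace zeros with NaN, then let nanmean skip them.
by_nanmean = np.nanmean(np.where(matrix != 0, matrix, np.nan), 1)

agree = np.allclose(by_count, by_nanmean)
print(agree)  # True
```

Note that np.where with np.nan produces a float copy of the whole matrix, which is one plausible reason the second approach does more work.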

Solution 2

I will detail here the more general solution that uses a masked array. To illustrate the details, let's create a lower triangular matrix with only ones:

matrix = np.tril(np.ones((5, 5)), 0)

If the terminology above is not clear, this matrix looks like this:

  [[ 1.,  0.,  0.,  0.,  0.],
   [ 1.,  1.,  0.,  0.,  0.],
   [ 1.,  1.,  1.,  0.,  0.],
   [ 1.,  1.,  1.,  1.,  0.],
   [ 1.,  1.,  1.,  1.,  1.]]

Now, we want our function to return an average of 1 for each of the rows. In other words, the mean over axis 1 should equal a vector of five ones. To achieve this, we create a masked matrix in which the entries whose values are zero are considered invalid. This can be done with np.ma.masked_equal:

masked = np.ma.masked_equal(matrix, 0)

Finally, we perform NumPy operations on this array, which will systematically ignore the masked elements (the 0's). With this in mind, we obtain the desired result by:

masked.mean(axis=1)

This should produce a vector whose entries are only ones.


In more detail, the output of np.ma.masked_equal(matrix, 0) should look like this:

masked_array(data =
 [[1.0 -- -- -- --]
 [1.0 1.0 -- -- --]
 [1.0 1.0 1.0 -- --]
 [1.0 1.0 1.0 1.0 --]
 [1.0 1.0 1.0 1.0 1.0]],
             mask =
 [[False  True  True  True  True]
 [False False  True  True  True]
 [False False False  True  True]
 [False False False False  True]
 [False False False False False]],
       fill_value = 0.0)

This indicates that the values shown as -- are considered invalid. The mask attribute of the masked array shows the same thing: True marks an invalid element, which should therefore be ignored.

Finally, the output of the mean operation on this array is:

masked_array(data = [1.0 1.0 1.0 1.0 1.0],
             mask = [False False False False False],
       fill_value = 1e+20)
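
The result of the mean is itself a masked array. If a plain ndarray is needed downstream, one option is the masked array's filled method. The sketch below assumes NaN is a suitable placeholder for rows with no valid entries, and adds an all-zero row purely for illustration:

```python
import numpy as np

matrix = np.tril(np.ones((5, 5)), 0)
# Extra all-zero row, added here only to show the fully masked case.
matrix = np.vstack([matrix, np.zeros(5)])

masked = np.ma.masked_equal(matrix, 0)   # zeros become invalid entries
row_means = masked.mean(axis=1)          # still a masked array

# Convert to a regular ndarray; masked results are replaced by NaN.
plain = row_means.filled(np.nan)
print(plain)
```

The first five entries are the ones expected from the triangular matrix; the entry for the all-zero row comes out masked and is therefore filled with NaN.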

Author: HimanAB

Updated on January 08, 2021

Comments

  • HimanAB
    HimanAB over 3 years

    I have a matrix of size N*M and I want to find the mean value for each row. The values range from 1 to 5, and entries that do not have any value are set to 0. However, when I compute the mean using the following method, it gives me the wrong mean because it also counts the entries that have a value of 0.

    matrix_row_mean= matrix.mean(axis=1)
    

    How can I get the mean of only nonzero values?

  • HimanAB
    HimanAB almost 8 years
    np has no attribute true_divide
  • hpaulj
    hpaulj almost 8 years
    The masked array approach is compact (but not necessarily faster): np.ma.masked_equal(matrix, 0).mean(axis=1)
  • David Alvarez
    David Alvarez about 5 years
    really clear explanation with great simple examples. .. thanks !