Most efficient way to forward-fill NaN values in numpy array
Solution 1
Here's one approach -
mask = np.isnan(arr)
idx = np.where(~mask,np.arange(mask.shape[1]),0)
np.maximum.accumulate(idx,axis=1, out=idx)
out = arr[np.arange(idx.shape[0])[:,None], idx]
If you don't want to create another array and just fill the NaNs in arr
itself, replace the last step with this -
arr[mask] = arr[np.nonzero(mask)[0], idx[mask]]
Sample input, output -
In [179]: arr
Out[179]:
array([[ 5., nan, nan, 7., 2., 6., 5.],
[ 3., nan, 1., 8., nan, 5., nan],
[ 4., 9., 6., nan, nan, nan, 7.]])
In [180]: out
Out[180]:
array([[ 5., 5., 5., 7., 2., 6., 5.],
[ 3., 3., 1., 8., 8., 5., 5.],
[ 4., 9., 6., 6., 6., 6., 7.]])
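To make the indexing trick easier to follow, here is a minimal, self-contained sketch that prints the intermediate index array for the sample input above. The variable names mirror the snippet; the inline comments are an interpretation of each step, not part of the original answer.

```python
import numpy as np

arr = np.array([[5, np.nan, np.nan, 7, 2, 6, 5],
                [3, np.nan, 1, 8, np.nan, 5, np.nan],
                [4, 9, 6, np.nan, np.nan, np.nan, 7]])

mask = np.isnan(arr)

# Valid elements keep their own column index; NaN positions get 0 for now.
idx = np.where(~mask, np.arange(mask.shape[1]), 0)

# A running maximum along each row replaces every placeholder with the
# column index of the most recent valid element to its left.
np.maximum.accumulate(idx, axis=1, out=idx)
print(idx)

# Fancy indexing: position (i, j) of the result reads arr[i, idx[i, j]].
out = arr[np.arange(idx.shape[0])[:, None], idx]
print(out)
```

Note that a NaN in column 0 keeps index 0 and therefore points at itself, so leading NaNs stay NaN.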
Solution 2
Update: As pointed out by financial_physician in the comments, my initially proposed solution can simply be replaced by applying ffill to the reversed array and then reversing the result. There is no relevant performance loss; my initial solution seems to be 2% or 3% faster according to %timeit. I updated the code example below but left my initial text as it was.
For those who came here looking for a backward-fill of NaN values, I modified the solution provided by Divakar above to do exactly that. The trick is that you have to do the accumulation on the reversed array, using the minimum instead of the maximum.
Here is the code:
import numpy as np

# ffill along axis 1, as provided in the answer by Divakar
def ffill(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:,None], idx]
    return out

# Simple solution for bfill provided by financial_physician in comment below
def bfill(arr):
    return ffill(arr[:, ::-1])[:, ::-1]

# My outdated modification of Divakar's answer to do a backward-fill
def bfill_old(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), mask.shape[1] - 1)
    idx = np.minimum.accumulate(idx[:, ::-1], axis=1)[:, ::-1]
    out = arr[np.arange(idx.shape[0])[:,None], idx]
    return out

# Test both functions
arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])
print('Array:')
print(arr)
print('\nffill')
print(ffill(arr))
print('\nbfill')
print(bfill(arr))
Output:
Array:
[[ 5. nan nan 7. 2.]
[ 3. nan 1. 8. nan]
[ 4. 9. 6. nan nan]]
ffill
[[5. 5. 5. 7. 2.]
[3. 3. 1. 8. 8.]
[4. 9. 6. 6. 6.]]
bfill
[[ 5. 7. 7. 7. 2.]
[ 3. 1. 1. 8. nan]
[ 4. 9. 6. nan nan]]
Edit: Update according to comment of MS_
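As a rough illustration of the "minimum on the reversed array" idea, the sketch below traces the index array for a single row taken from the test input. It is only meant to show why a trailing NaN stays NaN; the variable names `row` and `filled` are chosen here for illustration.

```python
import numpy as np

row = np.array([[3, np.nan, 1, 8, np.nan]])
mask = np.isnan(row)
n = mask.shape[1]

# Valid elements keep their column index; NaN positions get the last index.
idx = np.where(~mask, np.arange(n), n - 1)
print(idx)

# Running minimum from the right: each NaN position ends up with the column
# index of the nearest valid element to its right.
idx = np.minimum.accumulate(idx[:, ::-1], axis=1)[:, ::-1]
print(idx)

filled = row[np.arange(row.shape[0])[:, None], idx]
print(filled)
```

The last position keeps index `n - 1` and so points at itself, which is why a trailing NaN is left untouched by the backward fill.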
Solution 3
Use Numba. This should give a significant speedup:
import numba

@numba.jit
def loops_fill(arr):
    ...
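The function body is elided in the answer above. For reference, the timing script in the question contains a Numba-decorated loop attributed to shx2, and the sketch below follows that version. The name `numba_ffill` and the import fallback are assumptions added here so the example also runs (slowly) when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumed fallback: without Numba the plain Python loop is used as-is.
    def njit(func):
        return func

@njit
def numba_ffill(arr):
    # Work on a copy so the input array is left unchanged.
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        # Start at column 1: a NaN in column 0 has no left neighbour.
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])
print(numba_ffill(arr))
```

The first call includes compilation time, so any timing should be taken from later calls.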
Solution 4
I liked Divakar's answer on pure numpy. Here's a generalized function for n-dimensional arrays:
def np_ffill(arr, axis):
    idx_shape = tuple([slice(None)] + [np.newaxis] * (len(arr.shape) - axis - 1))
    idx = np.where(~np.isnan(arr), np.arange(arr.shape[axis])[idx_shape], 0)
    np.maximum.accumulate(idx, axis=axis, out=idx)
    slc = [np.arange(k)[tuple([slice(None) if dim==i else np.newaxis
        for dim in range(len(arr.shape))])]
        for i, k in enumerate(arr.shape)]
    slc[axis] = idx
    return arr[tuple(slc)]
AFAIK pandas can only work with two dimensions, despite having multi-index to make up for it. The only way to accomplish this would be to flatten a DataFrame, unstack the desired level, restack, and finally reshape to the original shape. This unstacking/restacking/reshaping, with the pandas sorting involved, is just unnecessary overhead to achieve the same result.
Testing:
def random_array(shape):
    choices = [1, 2, 3, 4, np.nan]
    out = np.random.choice(choices, size=shape)
    return out
ra = random_array((2, 4, 8))
print('arr')
print(ra)
print('\nffull')
print(np_ffill(ra, 1))
raise SystemExit
Output:
arr
[[[ 3. nan 4. 1. 4. 2. 2. 3.]
[ 2. nan 1. 3. nan 4. 4. 3.]
[ 3. 2. nan 4. nan nan 3. 4.]
[ 2. 2. 2. nan 1. 1. nan 2.]]
[[ 2. 3. 2. nan 3. 3. 3. 3.]
[ 3. 3. 1. 4. 1. 4. 1. nan]
[ 4. 2. nan 4. 4. 3. nan 4.]
[ 2. 4. 2. 1. 4. 1. 3. nan]]]
ffull
[[[ 3. nan 4. 1. 4. 2. 2. 3.]
[ 2. nan 1. 3. 4. 4. 4. 3.]
[ 3. 2. 1. 4. 4. 4. 3. 4.]
[ 2. 2. 2. 4. 1. 1. 3. 2.]]
[[ 2. 3. 2. nan 3. 3. 3. 3.]
[ 3. 3. 1. 4. 1. 4. 1. 3.]
[ 4. 2. 1. 4. 4. 3. 1. 4.]
[ 2. 4. 2. 1. 4. 1. 3. 4.]]]
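As a possible alternative to building the index tuple by hand, the final gather can probably be delegated to `np.take_along_axis`. This is a swapped-in technique rather than the answer's own method, and the function name `ffill_along_axis` is made up for this sketch.

```python
import numpy as np

def ffill_along_axis(arr, axis):
    # Position indices 0..n-1 laid out along `axis`, broadcastable to arr.
    shape = [1] * arr.ndim
    shape[axis] = arr.shape[axis]
    positions = np.arange(arr.shape[axis]).reshape(shape)

    # Same idea as above: valid elements keep their position, NaNs get 0,
    # and a running maximum propagates the last valid position forward.
    idx = np.where(~np.isnan(arr), positions, 0)
    np.maximum.accumulate(idx, axis=axis, out=idx)

    # Gather the values at those positions along the chosen axis.
    return np.take_along_axis(arr, idx, axis=axis)

ra = np.array([[[1.0, np.nan],
                [np.nan, 2.0],
                [3.0, np.nan]]])  # shape (1, 3, 2)
print(ffill_along_axis(ra, 1))
```

As with the original function, a NaN at the start of the chosen axis has nothing before it and stays NaN.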
Solution 5
I like Divakar's answer, but it doesn't work for an edge case where a row starts with np.nan, like the arr below
arr = np.array([[9, np.nan, 4, np.nan, 6, 6, 7, 2, 3, np.nan],
[ np.nan, 5, 5, 6, 5, 3, 2, 1, np.nan, 10]])
The output using Divakar's code would be:
[[ 9. 9. 4. 4. 6. 6. 7. 2. 3. 3.]
[nan 5. 5. 6. 5. 3. 2. 1. 1. 10.]]
Divakar's code can be simplified a bit. The simplified version also fills a leading np.nan, because the column index -1 wraps around to the last element of the same row:
arr[np.isnan(arr)] = arr[np.nonzero(np.isnan(arr))[0], np.nonzero(np.isnan(arr))[1]-1]
In case of several np.nans in a row (either at the beginning or in the middle), just repeat this operation several times. For instance, if the array has 5 consecutive np.nans, the following code will "forward fill" all of them with the number before these np.nans:
for i in range(0, 5):
    arr[np.isnan(arr)] = arr[np.nonzero(np.isnan(arr))[0], np.nonzero(np.isnan(arr))[1]-1]
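Instead of hard-coding the number of passes, one could repeat the shift until no NaN is left, capped at the number of columns so that an all-NaN row cannot loop forever. The sketch below is an assumption-laden variant of the idea above: the name `shift_fill` is made up here, and it works on a copy rather than in place.

```python
import numpy as np

def shift_fill(arr):
    out = arr.copy()
    # At most one pass per column is ever needed.
    for _ in range(out.shape[1]):
        mask = np.isnan(out)
        if not mask.any():
            break
        rows, cols = np.nonzero(mask)
        # cols - 1 is -1 for column 0, which wraps to the row's last element.
        out[rows, cols] = out[rows, cols - 1]
    return out

arr = np.array([[9, np.nan, 4, np.nan, 6, 6, 7, 2, 3, np.nan],
                [np.nan, 5, 5, 6, 5, 3, 2, 1, np.nan, 10]])
print(shift_fill(arr))
```

Because of the wraparound, the leading NaN in the second row is filled with that row's last value (10), which is not a forward fill in the strict sense; whether that is acceptable depends on the data.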
Xukrao
Updated on January 13, 2022
Comments
-
Xukrao over 2 years
Example Problem
As a simple example, consider the numpy array arr as defined below:
import numpy as np
arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])
where arr looks like this in console output:
array([[ 5., nan, nan, 7., 2.],
       [ 3., nan, 1., 8., nan],
       [ 4., 9., 6., nan, nan]])
I would now like to row-wise 'forward-fill' the nan values in array arr. By that I mean replacing each nan value with the nearest valid value from the left. The desired result would look like this:
array([[ 5., 5., 5., 7., 2.],
       [ 3., 3., 1., 8., 8.],
       [ 4., 9., 6., 6., 6.]])
Tried thus far
I've tried using for-loops:
for row_idx in range(arr.shape[0]):
    for col_idx in range(arr.shape[1]):
        if np.isnan(arr[row_idx][col_idx]):
            arr[row_idx][col_idx] = arr[row_idx][col_idx - 1]
I've also tried using a pandas dataframe as an intermediate step (since pandas dataframes have a very neat built-in method for forward-filling):
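import pandas as pd
df = pd.DataFrame(arr)
# Originally df.fillna(method='ffill', axis=1, inplace=True); the method argument is deprecated in recent pandas
df = df.ffill(axis=1)
# Originally df.as_matrix(), which was removed in pandas 1.0
arr = df.to_numpy()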
Both of the above strategies produce the desired result, but I keep on wondering: wouldn't a strategy that uses only numpy vectorized operations be the most efficient one?
Summary
Is there another, more efficient way to 'forward-fill' nan values in numpy arrays (e.g. by using numpy vectorized operations)?
Update: Solutions Comparison
I've tried to time all solutions thus far. This was my setup script:
import numba as nb
import numpy as np
import pandas as pd

def random_array():
    choices = [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]
    out = np.random.choice(choices, size=(1000, 10))
    return out

def loops_fill(arr):
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

@nb.jit
def numba_loops_fill(arr):
    '''Numba decorator solution provided by shx2.'''
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

def pandas_fill(arr):
    df = pd.DataFrame(arr)
    # Originally df.fillna(method='ffill', axis=1, inplace=True)
    df = df.ffill(axis=1)
    # Originally df.as_matrix(), which was removed in pandas 1.0
    out = df.to_numpy()
    return out

def numpy_fill(arr):
    '''Solution provided by Divakar.'''
    mask = np.isnan(arr)
    idx = np.where(~mask,np.arange(mask.shape[1]),0)
    np.maximum.accumulate(idx,axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:,None], idx]
    return out
followed by this console input:
%timeit -n 1000 loops_fill(random_array())
%timeit -n 1000 numba_loops_fill(random_array())
%timeit -n 1000 pandas_fill(random_array())
%timeit -n 1000 numpy_fill(random_array())
resulting in this console output:
1000 loops, best of 3: 9.64 ms per loop
1000 loops, best of 3: 377 µs per loop
1000 loops, best of 3: 455 µs per loop
1000 loops, best of 3: 351 µs per loop
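The `%timeit` calls above also time `random_array()` itself. Outside IPython, a comparison that generates the data once and times only the fill could look roughly like the sketch below. It covers just the loop-based and numpy-based functions, uses a seeded generator for repeatability, and stores timings in a `results` dict; all of these are choices made for this sketch, and the numbers will differ by machine.

```python
import timeit
import numpy as np

def loops_fill(arr):
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

def numpy_fill(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]

# Generate the test data once so only the fill itself is timed.
rng = np.random.default_rng(0)
choices = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan])
data = rng.choice(choices, size=(1000, 10))

results = {}
for func in (loops_fill, numpy_fill):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(lambda: func(data), number=10, repeat=3)) / 10
    results[func.__name__] = best
    print(f"{func.__name__}: {best * 1e6:.1f} µs per call")
```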
-
Xukrao over 7 years
A vectorized numpy-only solution, nice. Thanks! This solution indeed appears to be faster than the loop-based and pandas-based solutions (see timings in updated question).
-
Divakar over 7 years
@Xukrao Yeah I just saw those, thanks for adding in those timing results! Good to see some speedups there!
-
Xukrao over 7 years
Would Numba only speed up the loops-based solution? Or would it speed up the other solutions as well?
-
shx2 over 7 years
It is good for loops. It would not speed up functions implemented in numpy/pandas.
-
Xukrao over 7 years
Thanks! I've included this solution in the timing comparison (see updated question). It looks like the addition of the numba decorator to the loop-based solution reduces its runtime by one order of magnitude.
-
Xukrao over 5 years
I'm not sure I understand the purpose of this code. What exactly do you mean by 'problem of having leading np.nan after forward-filling'?
-
christian_bock over 5 years
In the example array at the beginning of the thread, each row begins with a non-NaN value. Some people might find themselves dealing with a data set that requires backward filling, because forward filling leaves leading NaN entries untouched. So I thought it might be useful to present a solution in this thread.
-
MS_ almost 5 years
idx = np.where(~mask, np.arange(mask.shape[1]), mask.shape[0] + 1) in bfill should be idx = np.where(~mask, np.arange(mask.shape[1]), mask.shape[1] - 1)
-
user189035 over 4 years
How do you adapt this solution to the case where arr is a one-dimensional numpy array, like numpy.array([0.83, 0.83, 0.83, 0.83, nan, nan, nan])?
-
C8H10N4O2 almost 4 years
@user189035 replace mask.shape[1] with mask.size, remove axis=1, and replace the last line with out = arr[idx]
-
financial_physician almost 3 years
I had a case where I built a second matrix for what I wanted to forward fill with. On the last line I just replaced arr with fillMatrix. My case was reducing resolution on time-series data, so I forward filled with the most recent entry
-
LearnToGrow over 2 years
@Xukrao this does not work for a lot of cases. See my answer, I put an example where it fails!
-
financial_physician over 2 years
Isn't flipping O(n), and you're doing it twice, so wouldn't flipping, using forward fill, and then unflipping be just as fast as your bfill method with the original array?
-
cchwala over 2 years
Thanks! This is indeed a very good point. I did time your solution and mine using %%timeit and there is only a negligible but consistent difference, 10.3 µs (your solution) vs 9.95 µs (my solution). I will update my response accordingly.
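The one-dimensional adaptation described in C8H10N4O2's comment above can be sketched as follows. The function name `ffill_1d` and the sample values are illustrative assumptions; the three changes themselves (using `mask.size`, dropping `axis=1`, indexing with `arr[idx]`) are taken from the comment.

```python
import numpy as np

def ffill_1d(arr):
    mask = np.isnan(arr)
    # mask.size replaces mask.shape[1] for a 1-D array.
    idx = np.where(~mask, np.arange(mask.size), 0)
    # No axis argument is needed: there is only one axis.
    np.maximum.accumulate(idx, out=idx)
    # Plain integer-array indexing replaces the 2-D row/column lookup.
    return arr[idx]

values = np.array([0.83, np.nan, np.nan, 1.5, np.nan])
print(ffill_1d(values))
```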