Pandas Dataframe: Replacing NaN with row average

python pandas dataframe missing-data

22,341

Solution 1

As commented the axis argument to fillna is NotImplemented.

df.fillna(df.mean(axis=1), axis=1)

Note: this would be critical here as you don't want to fill in your nth columns with the nth row average.

For now you'll need to iterate through:

m = df.mean(axis=1)
for i, col in enumerate(df):
    # using i allows for duplicate columns
    # inplace *may* not always work here, so IMO the next line is preferred
    # df.iloc[:, i].fillna(m, inplace=True)
    df.iloc[:, i] = df.iloc[:, i].fillna(m)

print(df)

   c1  c2   c3
0   1   4  7.0
1   2   5  3.5
2   3   6  9.0

An alternative is to fillna the transpose and then transpose, which may be more efficient...

df.T.fillna(df.mean(axis=1)).T

Solution 2

As an alternative, you could also use an apply with a lambda expression like this:

df.apply(lambda row: row.fillna(row.mean()), axis=1)

yielding also

    c1   c2   c3
0  1.0  4.0  7.0
1  2.0  5.0  3.5
2  3.0  6.0  9.0

Solution 3

I'll propose an alternative that involves casting into numpy arrays. Performance wise, I think this is more efficient and probably scales better than the other proposed solutions so far.

The idea being to use an indicator matrix (df.isna().values which is 1 if the element is N/A, 0 otherwise) and broadcast-multiplying that to the row averages. Thus, we end up with a matrix (exactly the same shape as the original df), which contains the row-average value if the original element was N/A, and 0 otherwise.

We add this matrix to the original df, making sure to fillna with 0 so that, in effect, we have filled the N/A's with the respective row averages.

# setup code
df = pd.DataFrame()
df['c1'] = [1, 2, 3]
df['c2'] = [4, 5, 6]
df['c3'] = [7, np.nan, 9]

# fillna row-wise
row_avgs = df.mean(axis=1).values.reshape(-1,1)
df = df.fillna(0) + df.isna().values * row_avgs
df

giving

    c1   c2   c3
0   1.0  4.0  7.0
1   2.0  5.0  3.5
2   3.0  6.0  9.0

Solution 4

For an efficient solution, use `DataFrame.where`:

We could use where on axis=0:

df.where(df.notna(), df.mean(axis=1), axis=0)

or mask on axis=0:

df.mask(df.isna(), df.mean(axis=1), axis=0)

By using axis=0, we can fill in the missing values in each column with the row averages.

These methods perform very similarly (where does slightly better on large DataFrames (300_000, 20)) and is ~35-50% faster than the numpy methods posted here and is 110x faster than the double transpose method.

Some benchmarks:

df = creator()

>>> %timeit df.where(df.notna(), df.mean(axis=1), axis=0)
542 ms ± 3.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df.mask(df.isna(), df.mean(axis=1), axis=0)
555 ms ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df.fillna(0) + df.isna().values * df.mean(axis=1).values.reshape(-1,1)
751 ms ± 22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit fill = pd.DataFrame(np.broadcast_to(df.mean(1).to_numpy()[:, None], df.shape), columns=df.columns, index=df.index); df.update(fill, overwrite=False)
848 ms ± 22.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df.apply(lambda row: row.fillna(row.mean()), axis=1)
1min 4s ± 5.32 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit df.T.fillna(df.mean(axis=1)).T
1min 5s ± 2.4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

def creator():
    A = np.random.rand(300_000, 20)
    A.ravel()[np.random.choice(A.size, 300_000, replace=False)] = np.nan
    return pd.DataFrame(A)

View more solutions

22,341

Aenaon

Updated on July 09, 2022

Comments

Aenaon almost 2 years
I am trying to learn pandas but I have been puzzled with the following. I want to replace NaNs in a DataFrame with the row average. Hence something like df.fillna(df.mean(axis=1)) should work but for some reason it fails for me. Am I missing anything, is there something wrong with what I'm doing? Is it because its not implemented? see link here
```
import pandas as pd
import numpy as np

pd.__version__
Out[44]:
'0.15.2'

In [45]:
df = pd.DataFrame()
df['c1'] = [1, 2, 3]
df['c2'] = [4, 5, 6]
df['c3'] = [7, np.nan, 9]
df

Out[45]:
    c1  c2  c3
0   1   4   7
1   2   5   NaN
2   3   6   9

In [46]:  
df.fillna(df.mean(axis=1)) 

Out[46]:
    c1  c2  c3
0   1   4   7
1   2   5   NaN
2   3   6   9
```
However something like this looks to work fine
```
df.fillna(df.mean(axis=0)) 

Out[47]:
    c1  c2  c3
0   1   4   7
1   2   5   8
2   3   6   9
```
- Alex Riley over 8 years
  
  That functionality is still not implemented - the issue remains open.
Cleb over 3 years

Isn't that the same as the accepted solution, just that you use .transpose instead of .T?
Sara over 2 years

Great answer Cleb! It just worked for me. An interesting efficient short code for the whole dataframe. Thank you.