Pandas Dataframe: Replacing NaN with row average

22,341

Solution 1

As commented the axis argument to fillna is NotImplemented.

df.fillna(df.mean(axis=1), axis=1)

Note: this would be critical here as you don't want to fill in your nth columns with the nth row average.

For now you'll need to iterate through:

m = df.mean(axis=1)
for i, col in enumerate(df):
    # using i allows for duplicate columns
    # inplace *may* not always work here, so IMO the next line is preferred
    # df.iloc[:, i].fillna(m, inplace=True)
    df.iloc[:, i] = df.iloc[:, i].fillna(m)

print(df)

   c1  c2   c3
0   1   4  7.0
1   2   5  3.5
2   3   6  9.0

An alternative is to fillna the transpose and then transpose, which may be more efficient...

df.T.fillna(df.mean(axis=1)).T

Solution 2

As an alternative, you could also use an apply with a lambda expression like this:

df.apply(lambda row: row.fillna(row.mean()), axis=1)

yielding also

    c1   c2   c3
0  1.0  4.0  7.0
1  2.0  5.0  3.5
2  3.0  6.0  9.0

Solution 3

I'll propose an alternative that involves casting into numpy arrays. Performance wise, I think this is more efficient and probably scales better than the other proposed solutions so far.

The idea being to use an indicator matrix (df.isna().values which is 1 if the element is N/A, 0 otherwise) and broadcast-multiplying that to the row averages. Thus, we end up with a matrix (exactly the same shape as the original df), which contains the row-average value if the original element was N/A, and 0 otherwise.

We add this matrix to the original df, making sure to fillna with 0 so that, in effect, we have filled the N/A's with the respective row averages.

# setup code
df = pd.DataFrame()
df['c1'] = [1, 2, 3]
df['c2'] = [4, 5, 6]
df['c3'] = [7, np.nan, 9]

# fillna row-wise
row_avgs = df.mean(axis=1).values.reshape(-1,1)
df = df.fillna(0) + df.isna().values * row_avgs
df

giving

    c1   c2   c3
0   1.0  4.0  7.0
1   2.0  5.0  3.5
2   3.0  6.0  9.0

Solution 4

For an efficient solution, use DataFrame.where:

We could use where on axis=0:

df.where(df.notna(), df.mean(axis=1), axis=0)

or mask on axis=0:

df.mask(df.isna(), df.mean(axis=1), axis=0)

By using axis=0, we can fill in the missing values in each column with the row averages.

These methods perform very similarly (where does slightly better on large DataFrames (300_000, 20)) and is ~35-50% faster than the numpy methods posted here and is 110x faster than the double transpose method.

Some benchmarks:

df = creator()

>>> %timeit df.where(df.notna(), df.mean(axis=1), axis=0)
542 ms ± 3.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df.mask(df.isna(), df.mean(axis=1), axis=0)
555 ms ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df.fillna(0) + df.isna().values * df.mean(axis=1).values.reshape(-1,1)
751 ms ± 22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit fill = pd.DataFrame(np.broadcast_to(df.mean(1).to_numpy()[:, None], df.shape), columns=df.columns, index=df.index); df.update(fill, overwrite=False)
848 ms ± 22.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df.apply(lambda row: row.fillna(row.mean()), axis=1)
1min 4s ± 5.32 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit df.T.fillna(df.mean(axis=1)).T
1min 5s ± 2.4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

def creator():
    A = np.random.rand(300_000, 20)
    A.ravel()[np.random.choice(A.size, 300_000, replace=False)] = np.nan
    return pd.DataFrame(A)
Share:
22,341

Related videos on Youtube

Aenaon
Author by

Aenaon

Updated on July 09, 2022

Comments

  • Aenaon
    Aenaon almost 2 years

    I am trying to learn pandas but I have been puzzled with the following. I want to replace NaNs in a DataFrame with the row average. Hence something like df.fillna(df.mean(axis=1)) should work but for some reason it fails for me. Am I missing anything, is there something wrong with what I'm doing? Is it because its not implemented? see link here

    import pandas as pd
    import numpy as np
    ​
    pd.__version__
    Out[44]:
    '0.15.2'
    
    In [45]:
    df = pd.DataFrame()
    df['c1'] = [1, 2, 3]
    df['c2'] = [4, 5, 6]
    df['c3'] = [7, np.nan, 9]
    df
    
    Out[45]:
        c1  c2  c3
    0   1   4   7
    1   2   5   NaN
    2   3   6   9
    
    In [46]:  
    df.fillna(df.mean(axis=1)) 
    
    Out[46]:
        c1  c2  c3
    0   1   4   7
    1   2   5   NaN
    2   3   6   9
    

    However something like this looks to work fine

    df.fillna(df.mean(axis=0)) 
    
    Out[47]:
        c1  c2  c3
    0   1   4   7
    1   2   5   8
    2   3   6   9
    
    • Alex Riley
      Alex Riley over 8 years
      That functionality is still not implemented - the issue remains open.
  • Cleb
    Cleb over 3 years
    Isn't that the same as the accepted solution, just that you use .transpose instead of .T?
  • Sara
    Sara over 2 years
    Great answer Cleb! It just worked for me. An interesting efficient short code for the whole dataframe. Thank you.