Pandas: Impute NaN's
Disclaimer: I'm not really interested in the fastest solution but the most pandorable.
Here, I think that would be something like:
>>> df["amount"].fillna(df.groupby("id")["amount"].transform("mean"), inplace=True)
>>> df["amount"].fillna(df["amount"].mean(), inplace=True)
which produces
>>> df
    id   type  amount
0    1    one   345.0
1    2    one   928.0
2    3    two   942.0
3    2  three   645.0
4    2    two   113.0
5    3  three   942.0
6    1    one   442.0
7    1    two   539.0
8    1    one   442.0
9    2  three   814.0
10   4    one   615.2

[11 rows x 3 columns]
There are lots of obvious tweaks depending upon exactly how you want the chained imputation process to go.
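For instance, written out without inplace=True (assigning back avoids the chained-assignment pitfalls of calling inplace on a selected column in newer pandas versions), the whole example runs as:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [1, 2, 3, 2, 2, 3, 1, 1, 1, 2, 4],
    'type': ['one', 'one', 'two', 'three', 'two', 'three',
             'one', 'two', 'one', 'three', 'one'],
    'amount': [345, 928, np.nan, 645, 113, 942,
               np.nan, 539, np.nan, 814, np.nan],
})

# Step 1: fill each NaN with the mean amount of its id group.
df['amount'] = df['amount'].fillna(df.groupby('id')['amount'].transform('mean'))

# Step 2: any NaN still left (an id whose amounts are all NaN, like id=4)
# falls back to the overall mean of the now partially filled column.
df['amount'] = df['amount'].fillna(df['amount'].mean())
```

Note that the fallback mean in step 2 is computed after the group means have been filled in, which is why id=4 ends up with 615.2 rather than the mean of the original non-NaN values.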
Author
Zhubarb
Updated on July 17, 2022

Comments
-
Zhubarb almost 2 years
I have an incomplete dataframe, incomplete_df, as below. I want to impute the missing amount values with the average amount of the corresponding id. If the average for that specific id is itself NaN (see id=4), I want to use the overall average. Below are the example data and my highly inefficient solution:
import pandas as pd
import numpy as np

incomplete_df = pd.DataFrame({'id': [1, 2, 3, 2, 2, 3, 1, 1, 1, 2, 4],
                              'type': ['one', 'one', 'two', 'three', 'two', 'three',
                                       'one', 'two', 'one', 'three', 'one'],
                              'amount': [345, 928, np.nan, 645, 113, 942,
                                         np.nan, 539, np.nan, 814, np.nan]},
                             columns=['id', 'type', 'amount'])

# 'means' is used below but never defined in the original snippet; a plausible
# definition is the per-id mean, dropping ids whose amounts are all NaN:
means = incomplete_df.groupby('id')[['amount']].mean().dropna()

# Forrest Gump Solution
for idx in incomplete_df.index[np.isnan(incomplete_df.amount)]:  # loop through all rows with amount = NaN
    cur_id = incomplete_df.loc[idx, 'id']
    if cur_id in means.index:
        incomplete_df.loc[idx, 'amount'] = means.loc[cur_id]['amount']  # average amount of that specific id
    else:
        incomplete_df.loc[idx, 'amount'] = np.mean(means.amount)  # average amount across all id's
What is the fastest and most pythonic/pandorable way to achieve this?
-
Zhubarb over 10 years
Thank you DSM. If I had a placeholder, e.g. 0, in place of the NaN's, would the fastest way be to replace all 0's with NaN's and then follow your solution? (I know putting 0's in place of NaN's is a stupid idea - alas, what I have to work with is what I've got.)
-
DSM over 10 years
That's probably what I'd do. The use of NaN to represent missing data runs pretty deep in pandas, so the simplest native way to do something usually requires getting your data aligned to that. I don't know about fastest -- you can use timeit to compare. Swapping 0's for NaN's will be both linear and vectorized anyway, so it wouldn't add much to the runtime.
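A minimal sketch of that placeholder round-trip (with made-up data), assuming 0 never occurs as a legitimate amount:

```python
import pandas as pd
import numpy as np

# Made-up data where 0 stands in for "missing".
df = pd.DataFrame({'id': [1, 1, 2], 'amount': [345.0, 0.0, 0.0]})

# Turn the 0 placeholders back into NaN so pandas' missing-data
# machinery (fillna, NaN-skipping mean, etc.) applies.
df['amount'] = df['amount'].replace(0, np.nan)

# Then proceed exactly as in the answer above.
df['amount'] = df['amount'].fillna(df.groupby('id')['amount'].transform('mean'))
df['amount'] = df['amount'].fillna(df['amount'].mean())
```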
-
3pitt over 6 years
Just to confirm: those are two alternatives, and not both necessary to complete the single operation?
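A minimal sketch (with made-up data) of how the two lines interact: the first fillna only helps ids that have at least one observed amount, while the second catches groups like id=4 whose group mean is itself NaN:

```python
import pandas as pd
import numpy as np

# Made-up data: id=1 has one observed amount, id=4 has none.
df = pd.DataFrame({'id': [1, 1, 4], 'amount': [300.0, np.nan, np.nan]})

# First fillna alone: id=1's NaN becomes the group mean (300.0), but
# id=4's group mean is NaN, so its row is still missing.
step1 = df['amount'].fillna(df.groupby('id')['amount'].transform('mean'))

# Second fillna catches the leftover NaN with the overall mean of step1.
step2 = step1.fillna(step1.mean())
```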