Pandas: Impute NaN's

Disclaimer: I'm not really interested in the fastest solution but the most pandorable.

Here, I think that would be something like:

>>> df["amount"].fillna(df.groupby("id")["amount"].transform("mean"), inplace=True)
>>> df["amount"].fillna(df["amount"].mean(), inplace=True)

which produces

>>> df
    id   type  amount
0    1    one   345.0
1    2    one   928.0
2    3    two   942.0
3    2  three   645.0
4    2    two   113.0
5    3  three   942.0
6    1    one   442.0
7    1    two   539.0
8    1    one   442.0
9    2  three   814.0
10   4    one   615.2

[11 rows x 3 columns]

There are lots of obvious tweaks depending upon exactly how you want the chained imputation process to go.
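For example, newer pandas releases may warn about calling fillna with inplace=True on a column selection, so the same two-step imputation can also be written without inplace. A minimal sketch, assuming the same df as above:

# Step 1: fill each NaN with the mean amount of its id.
filled = df["amount"].fillna(df.groupby("id")["amount"].transform("mean"))

# Step 2: any NaN still left (an id whose amounts are all NaN, like id=4)
# gets the overall mean of the group-filled column, matching the output above.
df["amount"] = filled.fillna(filled.mean())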

Comments

  • Zhubarb almost 2 years

    I have an incomplete dataframe, incomplete_df, as below. I want to impute the missing amounts with the average amount of the corresponding id. If the average for that specific id is itself NaN (see id=4), I want to use the overall average.

    Below are the example data and my highly inefficient solution:

    import pandas as pd
    import numpy as np

    incomplete_df = pd.DataFrame({'id': [1, 2, 3, 2, 2, 3, 1, 1, 1, 2, 4],
                                  'type': ['one', 'one', 'two', 'three', 'two', 'three', 'one', 'two', 'one', 'three', 'one'],
                                  'amount': [345, 928, np.nan, 645, 113, 942, np.nan, 539, np.nan, 814, np.nan]
                                  }, columns=['id', 'type', 'amount'])

    # Per-id mean amounts, computed from the rows with a known amount
    # (this definition was missing from the snippet); an id with no known
    # amounts at all, such as id=4, does not appear in means.index.
    means = incomplete_df.dropna(subset=['amount']).groupby('id')[['amount']].mean()

    # Forrest Gump Solution
    for idx in incomplete_df.index[np.isnan(incomplete_df.amount)]:  # loop through all rows with amount = NaN
        cur_id = incomplete_df.loc[idx, 'id']
        if cur_id in means.index:
            incomplete_df.loc[idx, 'amount'] = means.loc[cur_id, 'amount']  # average amount of that specific id
        else:
            incomplete_df.loc[idx, 'amount'] = np.mean(means.amount)  # average amount across all id's
    

    What is the fastest and the most pythonic/pandonic way to achieve this?

  • Zhubarb over 10 years
    Thank you DSM. If I had a placeholder, e.g. 0, in place of the NaN's, would the fastest way be to replace all the 0's with NaN's and then follow your solution (as in the sketch after the comments)? (I know putting 0's in place of NaN's is a stupid idea; alas, what I have to work with is what I've got.)
  • DSM over 10 years
    That's probably what I'd do. The use of NaN to represent missing data runs pretty deep in pandas, and so the simplest native way to do something usually requires getting your data aligned to that. I don't know about fastest; you can use timeit to compare. Swapping 0s for NaNs will be both linear and vectorized anyway, so it wouldn't add much to the runtime.
  • 3pitt over 6 years
    Just to confirm, those are two alternatives, and not both necessary to complete the single operation?
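
A minimal sketch of the swap discussed in the last comments, assuming 0 is used purely as a placeholder and never appears as a legitimate amount:

import numpy as np

# Turn the placeholder zeros back into NaN so pandas' NaN-aware fillna
# machinery applies, then run the same chained imputation as in the answer.
df["amount"] = df["amount"].replace(0, np.nan)
filled = df["amount"].fillna(df.groupby("id")["amount"].transform("mean"))
df["amount"] = filled.fillna(filled.mean())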