Pandas: Impute NaN's
Disclaimer: I'm not really interested in the fastest solution but the most pandorable.
Here, I think that would be something like:
>>> df["amount"].fillna(df.groupby("id")["amount"].transform("mean"), inplace=True)
>>> df["amount"].fillna(df["amount"].mean(), inplace=True)
which produces
>>> df
    id   type  amount
0    1    one   345.0
1    2    one   928.0
2    3    two   942.0
3    2  three   645.0
4    2    two   113.0
5    3  three   942.0
6    1    one   442.0
7    1    two   539.0
8    1    one   442.0
9    2  three   814.0
10   4    one   615.2

[11 rows x 3 columns]
There are lots of obvious tweaks depending upon exactly how you want the chained imputation process to go.
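For instance, written out without inplace=True (assigning back avoids the chained-assignment pitfalls of calling inplace on a selected column in newer pandas versions), the whole example runs as:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [1, 2, 3, 2, 2, 3, 1, 1, 1, 2, 4],
    'type': ['one', 'one', 'two', 'three', 'two', 'three',
             'one', 'two', 'one', 'three', 'one'],
    'amount': [345, 928, np.nan, 645, 113, 942,
               np.nan, 539, np.nan, 814, np.nan],
})

# Step 1: fill each NaN with the mean amount of its id group.
df['amount'] = df['amount'].fillna(df.groupby('id')['amount'].transform('mean'))

# Step 2: any NaN still left (an id whose amounts are all NaN, like id=4)
# falls back to the overall mean of the now partially filled column.
df['amount'] = df['amount'].fillna(df['amount'].mean())
```

Note that the fallback mean in step 2 is computed after the group means have been filled in, which is why id=4 ends up with 615.2 rather than the mean of the original non-NaN values.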
Author
Zhubarb
Updated on July 17, 2022

Comments
-
Zhubarb almost 2 years
I have an incomplete dataframe, incomplete_df, as below. I want to impute the missing amount values with the average amount of the corresponding id. If the average for that specific id is itself NaN (see id=4), I want to use the overall average. Below are the example data and my highly inefficient solution:
import pandas as pd
import numpy as np

incomplete_df = pd.DataFrame({'id': [1, 2, 3, 2, 2, 3, 1, 1, 1, 2, 4],
                              'type': ['one', 'one', 'two', 'three', 'two', 'three',
                                       'one', 'two', 'one', 'three', 'one'],
                              'amount': [345, 928, np.nan, 645, 113, 942,
                                         np.nan, 539, np.nan, 814, np.nan]},
                             columns=['id', 'type', 'amount'])

# 'means' is used below but never defined in the original snippet; a plausible
# definition is the per-id mean, dropping ids whose amounts are all NaN:
means = incomplete_df.groupby('id')[['amount']].mean().dropna()

# Forrest Gump Solution
for idx in incomplete_df.index[np.isnan(incomplete_df.amount)]:  # loop through all rows with amount = NaN
    cur_id = incomplete_df.loc[idx, 'id']
    if cur_id in means.index:
        incomplete_df.loc[idx, 'amount'] = means.loc[cur_id]['amount']  # average amount of that specific id
    else:
        incomplete_df.loc[idx, 'amount'] = np.mean(means.amount)  # average amount across all id's
What is the fastest and most pythonic/pandorable way to achieve this?
-
Zhubarb over 10 years
Thank you DSM. If I had a placeholder, e.g. 0, in place of the NaN's, would the fastest way be to replace all 0's with NaN's and then follow your solution? (I know putting 0's in place of NaN's is a stupid idea - alas, what I have to work with is what I've got.)
-
DSM over 10 years
That's probably what I'd do. The use of NaN to represent missing data runs pretty deep in pandas, so the simplest native way to do something usually requires getting your data aligned to that. I don't know about fastest -- you can use timeit to compare. Swapping 0's for NaN's will be both linear and vectorized anyway, so it wouldn't add much to the runtime.
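A minimal sketch of that placeholder round-trip (with made-up data), assuming 0 never occurs as a legitimate amount:

```python
import pandas as pd
import numpy as np

# Made-up data where 0 stands in for "missing".
df = pd.DataFrame({'id': [1, 1, 2], 'amount': [345.0, 0.0, 0.0]})

# Turn the 0 placeholders back into NaN so pandas' missing-data
# machinery (fillna, NaN-skipping mean, etc.) applies.
df['amount'] = df['amount'].replace(0, np.nan)

# Then proceed exactly as in the answer above.
df['amount'] = df['amount'].fillna(df.groupby('id')['amount'].transform('mean'))
df['amount'] = df['amount'].fillna(df['amount'].mean())
```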
-
3pitt over 6 years
Just to confirm: those are two alternatives, and not both necessary to complete the single operation?
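A minimal sketch (with made-up data) of how the two lines interact: the first fillna only helps ids that have at least one observed amount, while the second catches groups like id=4 whose group mean is itself NaN:

```python
import pandas as pd
import numpy as np

# Made-up data: id=1 has one observed amount, id=4 has none.
df = pd.DataFrame({'id': [1, 1, 4], 'amount': [300.0, np.nan, np.nan]})

# First fillna alone: id=1's NaN becomes the group mean (300.0), but
# id=4's group mean is NaN, so its row is still missing.
step1 = df['amount'].fillna(df.groupby('id')['amount'].transform('mean'))

# Second fillna catches the leftover NaN with the overall mean of step1.
step2 = step1.fillna(step1.mean())
```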