Replace NaN in a dataframe with random values

python pandas

17,672

Solution 1

You can use the pandas DataFrame.update method, like this:

1) Generate a random DataFrame with the same columns and index as the original one:

import numpy as np
import pandas as pd
M = len(df.index)
N = len(df.columns)
ran = pd.DataFrame(np.random.randn(M,N), columns=df.columns, index=df.index)

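2) Then use update with overwrite=False, so that only the NaN values in df are replaced by the generated random values. With the default overwrite=True, update would replace every value in df, not just the NaNs.

df.update(ran, overwrite=False)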
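
The two steps can be checked on a small frame. This is a minimal sketch: the column names, the values, and the seed are made up for illustration. It passes `overwrite=False` so that `update` only touches cells that are NaN in `df`; with the default `overwrite=True`, every cell would be replaced.

```python
import numpy as np
import pandas as pd

# Hypothetical example data (not from the original question)
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 5.0, np.nan]})
original = df.copy()

# Random frame with the same index and columns as df
rng = np.random.default_rng(0)
ran = pd.DataFrame(rng.standard_normal(df.shape),
                   columns=df.columns, index=df.index)

# Fill only the cells that are NaN in df; existing values are kept
df.update(ran, overwrite=False)

print(df)
```

After the call, the cells that held numbers in `original` are unchanged and the former NaN cells hold draws from `ran`.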

In the above example I used values from a standard normal, but you can also use values randomly picked from the original DataFrame:

import numpy as np
import pandas as pd

M = len(df.index)
N = len(df.columns)

val = np.ravel(df.values)
val = val[~np.isnan(val)]
val = np.random.choice(val, size=(M,N))
ran = pd.DataFrame(val, columns=df.columns, index=df.index)

df.update(ran, overwrite=False)
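
An alternative way to apply the same idea, sketched here under the assumption that all columns are numeric, is to pass the random frame to `DataFrame.fillna`, which aligns on index and columns and only fills missing cells. The example data and seed are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"A": [1.0, np.nan, 3.0, 4.0],
                   "B": [np.nan, 5.0, np.nan, 8.0]})

rng = np.random.default_rng(42)

# Pool of observed (non-NaN) values across the whole frame
observed = df.to_numpy().ravel()
observed = observed[~np.isnan(observed)]

# One random pick from the pool for every cell; only NaN cells will use it
ran = pd.DataFrame(rng.choice(observed, size=df.shape),
                   columns=df.columns, index=df.index)

# fillna returns a new frame and leaves df itself unchanged
filled = df.fillna(ran)
print(filled)
```

Unlike `update`, `fillna` returns a new DataFrame, which can be convenient when the original should stay untouched.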

Solution 2

If you call fillna with a single random number, for example fillna(random.random()), the random generator runs only once and every NaN is filled with the same number.

So make sure that a new random number is generated for each value. For a dataframe like this:

          Date         A       B
0   2015-01-01       NaN     NaN
1   2015-01-02       NaN     NaN
2   2015-01-03       NaN     NaN
3   2015-01-04       NaN     NaN
4   2015-01-05       NaN     NaN
5   2015-01-06       NaN     NaN
6   2015-01-07       NaN     NaN
7   2015-01-08       NaN     NaN
8   2015-01-09       NaN     NaN
9   2015-01-10       NaN     NaN
10  2015-01-11       NaN     NaN
11  2015-01-12       NaN     NaN
12  2015-01-13       NaN     NaN
13  2015-01-14       NaN     NaN
14  2015-01-15       NaN     NaN
15  2015-01-16       NaN     NaN

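I used the following code to fill the NaNs in column A. The lambda keeps existing values and draws a new random number for each NaN:

import random
import pandas as pd

x['A'] = x['A'].apply(lambda v: random.random() * 1000 if pd.isna(v) else v)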

Which will give us something like:

          Date           A       B
0   2015-01-01   96.538211     NaN
1   2015-01-02  404.683392     NaN
2   2015-01-03  849.614253     NaN
3   2015-01-04  590.030660     NaN
4   2015-01-05  203.167519     NaN
5   2015-01-06  980.508258     NaN
6   2015-01-07  221.088002     NaN
7   2015-01-08  285.013762     NaN
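
The same per-value idea can also be written without `apply`. The sketch below is one possible variant, not the answer's own method: it builds a boolean mask of the missing cells and assigns one uniform draw per masked cell. The frame contents and the seed are made up; the last lines illustrate the single-value behaviour of `fillna` with a scalar.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a mix of known and missing values in column A
x = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=6),
    "A": [np.nan, 12.5, np.nan, np.nan, 40.0, np.nan],
    "B": [np.nan] * 6,
})

rng = np.random.default_rng(1)

# True where column A is missing
mask = x["A"].isna()

# One independent draw from [0, 1000) per missing cell
x.loc[mask, "A"] = rng.random(int(mask.sum())) * 1000

# For contrast: a scalar passed to fillna is evaluated once and repeated
constant_fill = pd.Series([np.nan, np.nan, np.nan]).fillna(rng.random())

print(x)
print(constant_fill)
```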

Solution 3

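If you want to replace all NaNs in the DataFrame with random values from a list, you can do something like this. applymap returns a new DataFrame, so assign the result; from pandas 2.1 on, applymap is deprecated in favor of DataFrame.map. pd.isna is used instead of np.isnan so that non-numeric cells do not raise an error.

import numpy as np
import pandas as pd

df = df.applymap(lambda l: np.random.choice([1, 3]) if pd.isna(l) else l)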
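
For non-numeric data such as the y/n frame in the question, a vectorized variant of this idea might look like the sketch below. The column names (`party`, `vote1`, `vote2`), the fill values, and the seed are assumptions for illustration. It draws a full frame of random choices and uses `DataFrame.where` to keep every cell that is already present.

```python
import numpy as np
import pandas as pd

# Hypothetical frame modelled on the sample in the question
data_train = pd.DataFrame({
    "party": ["republican", "republican", "democrat", "democrat"],
    "vote1": ["n", "n", np.nan, "n"],
    "vote2": ["y", np.nan, "n", "y"],
})

rng = np.random.default_rng(7)
choices = np.array(["y", "n"])  # assumed set of allowed fill values

# A random choice for every cell, aligned with data_train
random_fill = pd.DataFrame(rng.choice(choices, size=data_train.shape),
                           columns=data_train.columns,
                           index=data_train.index)

# Keep existing cells; take the random choice only where a value is missing
filled = data_train.where(data_train.notna(), random_fill)
print(filled)
```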

Solution 4

If you want to replace the NaNs in each column using the hot deck technique (filling each missing value with a value observed elsewhere in the same column), I can propose something like this:

import random
import numpy as np

def hot_deck(dataframe):
    dataframe = dataframe.copy()  # do not modify the caller's frame
    for col in dataframe.columns:
        assert dataframe[col].dtype in (np.float64, np.int64)
        # Donor pool: the distinct observed (non-NaN) values of this column
        liste_sample = dataframe[col].dropna().unique().tolist()
        dataframe[col] = dataframe[col].apply(lambda v: random.choice(liste_sample) if np.isnan(v) else v)
    return dataframe

If you would rather replace each NaN with a new random integer, you can do something like the following. You just have to choose the maximum value for the random choices; max_value must be at least the number of NaNs in a column, because random.sample draws without replacement.

import random
import numpy as np

def hot_deck(dataframe, max_value):
    dataframe = dataframe.copy()  # do not modify the caller's frame
    for col in dataframe.columns:
        assert dataframe[col].dtype in (np.float64, np.int64)
        # Count the NaNs in this column, then draw that many distinct integers from [0, max_value)
        n_missing = int(dataframe[col].isna().sum())
        liste_sample = random.sample(range(max_value), n_missing)
        dataframe[col] = dataframe[col].apply(lambda v: random.choice(liste_sample) if np.isnan(v) else v)
    return dataframe
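
A per-column hot deck fill can also be sketched with NumPy's `Generator.choice` and a boolean mask instead of a row-wise `apply`. This is only one possible variant; the function name `hot_deck_columns`, the `seed` parameter, and the example data are hypothetical.

```python
import numpy as np
import pandas as pd

def hot_deck_columns(frame, seed=None):
    """Fill NaNs in each column with random draws from that column's observed values."""
    rng = np.random.default_rng(seed)
    out = frame.copy()  # leave the input frame untouched
    for col in out.columns:
        missing = out[col].isna()
        donors = out.loc[~missing, col].to_numpy()
        # Skip columns with nothing to fill or no donor values to draw from
        if missing.any() and len(donors) > 0:
            out.loc[missing, col] = rng.choice(donors, size=int(missing.sum()))
    return out

# Hypothetical example data
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, np.nan, 10.0]})
result = hot_deck_columns(df, seed=0)
print(result)
```

Sampling from all observed values (rather than the distinct ones) keeps the fill roughly proportional to how often each value occurs in the column.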


Author by

Sam

Updated on June 23, 2022

Comments

Sam almost 2 years

I have a data frame (data_train) with NaN values. A sample is given below:

republican                n                          y   
republican                n                          NaN   
democrat                 NaN                         n
democrat                  n                          y

I want to replace all the NaNs with some random values, like this:

republican                n                           y   
republican                n                          rnd2
democrat                 rnd1                         n
democrat                  n                           y

How do I do it?

I tried the following, but had no luck:

df_rand = pd.DataFrame(np.random.randn(data_train.shape[0],data_train.shape[1]))
data_train[pd.isnull(data_train)] = df_rand[pd.isnull(data_train)]

When I run the above on a dataframe of random numerical data, the script works fine.
