ValueError: Input contains NaN, infinity or a value too large for dtype('float64') while preprocessing Data


Solution 1

I went through the dataset again after posting the question and found more columns with NaNs. I can't believe I wasted so much time on this when I could simply have used Pandas to list the columns that contain NaN. Using the following code, I found that I had missed three columns; I had been searching for the NaNs visually instead of just using this check. After handling these remaining NaNs, the code worked properly.

pd.isnull(train_data).sum() > 0

Result

portfolio_id      False
desk_id           False
office_id         False
pf_category       False
start_date        False
sold               True
country_code      False
euribor_rate      False
currency          False
libor_rate         True
bought             True
creation_date     False
indicator_code    False
sell_date         False
type              False
hedge_value       False
status            False
return            False
dtype: bool
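
For reference, a short sketch of how those remaining NaN columns can be listed and filled in one go. Here train_data is the DataFrame from the question, and filling with the column mean is just one option (the question used sentinel values instead); it assumes the affected columns are numeric.

import pandas as pd

train_data = pd.read_csv('train.csv')

# Columns that still contain at least one NaN
nan_cols = train_data.columns[train_data.isnull().any()]
print(nan_cols.tolist())  # in this dataset: sold, libor_rate, bought

# Fill them (here with the column mean; a sentinel value works too,
# as long as the columns are numeric)
for col in nan_cols:
    train_data[col] = train_data[col].fillna(train_data[col].mean())

# Confirm nothing is left
assert not train_data.isnull().values.any()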

Solution 2

The error is in the other features, the ones you are treating as non-categorical.

Columns like 'hedge_value' and 'indicator_code' contain mixed-type data: TRUE/FALSE from the original CSV and 2.0 from your fillna() call. The OneHotEncoder cannot process them.

As mentioned in OneHotEncoder fit() documentation:

fit(X, y=None)

    Fit OneHotEncoder to X.

    Parameters:

    X : array-like, shape [n_samples, n_feature]
        Input array of type int.

You can see that it requires X to be entirely numeric (int, although float works too).

As a workaround, you can do this to encode your categorical features:

X_train_categorical = x_train[:, [0,1,2,3,6,8,14]]
onehotencoder = OneHotEncoder()
X_train_categorical = onehotencoder.fit_transform(X_train_categorical).toarray()

And then concatenate this with your non-categorical features.
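
For completeness, a minimal sketch of that concatenation, reusing x_train and the column indices from the question. Using np.delete and np.hstack is just one way to stitch the two parts back together, and it assumes the remaining columns are purely numeric with no NaNs left.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

categorical_idx = [0, 1, 2, 3, 6, 8, 14]

# One-hot encode only the label-encoded categorical columns
onehotencoder = OneHotEncoder()
X_train_categorical = onehotencoder.fit_transform(x_train[:, categorical_idx]).toarray()

# Keep everything else as plain float features
X_train_numeric = np.delete(x_train, categorical_idx, axis=1).astype(np.float64)

# Final feature matrix: encoded categoricals followed by the numeric features
x_train = np.hstack([X_train_categorical, X_train_numeric])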

Solution 3

To use this in production, the best practice is to use an Imputer and then save it in a pkl file together with the model.

This is a workaround:

import numpy as np

df[df == np.inf] = np.nan           # treat infinite values as missing
df.fillna(df.mean(), inplace=True)  # fill remaining NaNs with each column's mean

Better to use the Imputer approach.
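
A minimal sketch of that Imputer-plus-pickle approach, assuming the current sklearn API (SimpleImputer replaces the old Imputer class) and using the standard-library pickle for serialization; the file names and the mean strategy here are illustrative, not taken from the original answer.

import pickle
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv('train.csv')
numeric_cols = df.select_dtypes(include=[np.number]).columns

# Fit the imputer on the training data only, then transform it
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# Persist the fitted imputer alongside the model so the exact same
# imputation can be replayed on new data in production
with open('imputer.pkl', 'wb') as f:
    pickle.dump(imputer, f)

At prediction time the saved imputer is loaded again and its transform() is applied to the incoming data before the model is called.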


Comments

  • Parthapratim Neog almost 2 years

    I have two CSV files (a training set and a test set). There are visible NaN values in a few of the columns (status, hedge_value, indicator_code, portfolio_id, desk_id, office_id).

    I start by replacing the NaN values with a large placeholder value appropriate to each column. Then I use LabelEncoder to convert the text data into numerical data. Now, when I try to apply OneHotEncoder to the categorical data, I get the error below. I tried passing the columns one by one to the OneHotEncoder, but I get the same error for every column.

    Basically, my end goal is to predict the return values, but I am stuck in the data preprocessing part because of this. How do I solve this issue?

    I am using Python 3.6 with Pandas and Sklearn for data processing.

    Code

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    
    test_data = pd.read_csv('test.csv')
    train_data = pd.read_csv('train.csv')
    
    # Replacing Nan values here
    train_data['status']=train_data['status'].fillna(2.0)
    train_data['hedge_value']=train_data['hedge_value'].fillna(2.0)
    train_data['indicator_code']=train_data['indicator_code'].fillna(2.0)
    train_data['portfolio_id']=train_data['portfolio_id'].fillna('PF99999999')
    train_data['desk_id']=train_data['desk_id'].fillna('DSK99999999')
    train_data['office_id']=train_data['office_id'].fillna('OFF99999999')
    
    x_train = train_data.iloc[:, :-1].values
    y_train = train_data.iloc[:, 17].values
    
    # =============================================================================
    # from sklearn.preprocessing import Imputer
    # imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
    # imputer.fit(x_train[:, 15:17])
    # x_train[:, 15:17] = imputer.fit_transform(x_train[:, 15:17])
    # 
    # imputer.fit(x_train[:, 12:13])
    # x_train[:, 12:13] = imputer.fit_transform(x_train[:, 12:13])
    # =============================================================================
    
    
    # Encoding categorical data, i.e. Text data, since calculation happens on numbers only, so having text like 
    # Country name, Purchased status will give trouble
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    labelencoder_X = LabelEncoder()
    x_train[:, 0] = labelencoder_X.fit_transform(x_train[:, 0])
    x_train[:, 1] = labelencoder_X.fit_transform(x_train[:, 1])
    x_train[:, 2] = labelencoder_X.fit_transform(x_train[:, 2])
    x_train[:, 3] = labelencoder_X.fit_transform(x_train[:, 3])
    x_train[:, 6] = labelencoder_X.fit_transform(x_train[:, 6])
    x_train[:, 8] = labelencoder_X.fit_transform(x_train[:, 8])
    x_train[:, 14] = labelencoder_X.fit_transform(x_train[:, 14])
    
    
    # =============================================================================
    # import numpy as np
    # x_train[:, 3] = x_train[:, 3].reshape(x_train[:, 3].size,1)
    # x_train[:, 3] = x_train[:, 3].astype(np.float64, copy=False)
    # np.isnan(x_train[:, 3]).any()
    # =============================================================================
    
    
    # =============================================================================
    # from sklearn.preprocessing import StandardScaler
    # sc_X = StandardScaler()
    # x_train = sc_X.fit_transform(x_train)
    # =============================================================================
    
    onehotencoder = OneHotEncoder(categorical_features=[0,1,2,3,6,8,14])
    x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.
    

    Error

    Traceback (most recent call last):
    
      File "<ipython-input-4-4992bf3d00b8>", line 58, in <module>
        x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.
    
      File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 2019, in fit_transform
        self.categorical_features, copy=True)
    
      File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 1809, in _transform_selected
        X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
    
      File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 453, in check_array
        _assert_all_finite(array)
    
      File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 44, in _assert_all_finite
        " or a value too large for %r." % X.dtype)
    
    ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
    
  • Parthapratim Neog over 6 years
    Hi, actually this was not the issue; somehow the True and False values had already been converted into 1s and 0s. I found what my issue was: there were three other columns with NaNs in them that I had missed. Just handling those NaNs fixed my issue. Can we connect over mail? I am a newbie in ML and would love to discuss further issues.
  • Abhishek Sharma over 6 years
    You can use df.columns[df.isnull().sum()>0] to print only the columns having null values
  • Lucas Lago almost 6 years
    You can use df = df.dropna(how='any',axis=0) to delete rows with NaN values