ValueError: Input contains NaN, infinity or a value too large for dtype('float64') while preprocessing Data
Solution 1
I was going through the dataset again after posting the question and found another column with a NaN. I can't believe I wasted so much time on this when I could have just used the pandas function that lists the columns containing NaN. Using the following code, I found that I had missed three columns. I was visually searching for NaN when I could have just used this function. After handling these new NaNs, the code worked properly.
pd.isnull(train_data).sum() > 0
Result
portfolio_id      False
desk_id           False
office_id         False
pf_category       False
start_date        False
sold               True
country_code      False
euribor_rate      False
currency          False
libor_rate         True
bought             True
creation_date     False
indicator_code    False
sell_date         False
type              False
hedge_value       False
status            False
return            False
dtype: bool
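Since the error message also mentions infinity, the same kind of check can be extended to infinite values. The following is a minimal sketch on a made-up frame: the column names are borrowed from the listing above, but the data and the variable names `columns_with_nan` and `columns_with_inf` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data; the values are invented for illustration.
train_data = pd.DataFrame({
    "sold": [1.0, np.nan, 3.0],
    "bought": [2.0, 5.0, np.inf],
    "currency": ["USD", "EUR", "CHF"],
})

# Number of missing values per column.
nan_counts = train_data.isnull().sum()

# Names of the columns that contain at least one NaN.
columns_with_nan = train_data.columns[train_data.isnull().any()].tolist()

# np.isinf only works on numeric data, so restrict the check to numeric columns.
numeric = train_data.select_dtypes(include=[np.number])
columns_with_inf = [col for col in numeric.columns if np.isinf(numeric[col]).any()]

print(nan_counts)
print(columns_with_nan)
print(columns_with_inf)
```

Checking for both cases up front avoids hunting for the offending column by eye.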
Solution 2
The error is in the other features, the ones you are treating as non-categorical.
Columns like 'hedge_value', 'indicator_code', etc. contain mixed-type data: TRUE and FALSE from the original CSV, and 2.0 from your fillna() call. The OneHotEncoder is not able to process them.
As mentioned in the OneHotEncoder fit() documentation:
fit(X, y=None)
Fit OneHotEncoder to X.
Parameters:
X : array-like, shape [n_samples, n_feature]
Input array of type int.
You can see that it requires every value in X to be numeric (int, though float will also do).
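To check whether a column really holds mixed Python types before encoding, something like the sketch below may help. It assumes a toy frame whose 'hedge_value' column already looks the way it would after fillna(2.0); the helper name `python_types_per_column` is hypothetical and not part of pandas.

```python
import pandas as pd

# Toy frame: 'hedge_value' mimics a boolean column after fillna(2.0),
# 'libor_rate' is a plain float column. The values are invented.
df = pd.DataFrame({
    "hedge_value": pd.Series([True, False, 2.0], dtype=object),
    "libor_rate": [0.5, 1.2, 0.7],
})

def python_types_per_column(frame):
    """Return, for each column, the sorted names of the Python types found in it."""
    return {
        col: sorted({type(value).__name__ for value in frame[col]})
        for col in frame.columns
    }

types_found = python_types_per_column(df)

# A column with more than one Python type is a candidate for cleanup.
mixed_columns = [col for col, names in types_found.items() if len(names) > 1]

print(types_found)
print(mixed_columns)
```

Any column reported as mixed can then be converted to a single type before it reaches the encoder.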
As a workaround you can do this to encode your categorical features:
X_train_categorical = x_train[:, [0,1,2,3,6,8,14]]
onehotencoder = OneHotEncoder()
X_train_categorical = onehotencoder.fit_transform(X_train_categorical).toarray()
And then concatenate this with your non-categorical features.
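The concatenation step could look roughly like this. It is a sketch on a small made-up matrix, not the asker's data: the index list `categorical_idx`, the derived `numeric_idx`, and the use of `np.hstack` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy feature matrix: columns 0 and 2 hold label-encoded categories,
# column 1 is an ordinary numeric feature.
x_train = np.array([
    [0, 1.5, 2],
    [1, 0.3, 0],
    [0, 2.7, 1],
    [2, 4.1, 2],
])

categorical_idx = [0, 2]
numeric_idx = [i for i in range(x_train.shape[1]) if i not in categorical_idx]

# One-hot encode only the categorical columns; the default output is sparse,
# so convert it to a dense array.
onehotencoder = OneHotEncoder()
X_categorical = onehotencoder.fit_transform(x_train[:, categorical_idx]).toarray()

# Keep the remaining columns as floats and glue both parts together column-wise.
X_numeric = x_train[:, numeric_idx].astype(float)
X_combined = np.hstack([X_categorical, X_numeric])

print(X_combined.shape)
```

Encoding the categorical slice separately keeps the non-numeric or mixed columns away from the encoder's input validation.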
Solution 3
For production use, the best practice is to fit an Imputer and save it in a .pkl file together with the model.
This is a workaround:
df = df.replace([np.inf, -np.inf], np.nan)  # also catches -inf, which df==np.inf misses
df.fillna(df.mean(), inplace=True)
Better to use this
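A minimal sketch of the imputer-plus-pickle approach mentioned above, assuming a current scikit-learn, where `SimpleImputer` from `sklearn.impute` takes the place of the older `Imputer` class. The data is invented, and the imputer is pickled to an in-memory byte string here; in practice it would be written to a .pkl file next to the model.

```python
import pickle

import numpy as np
from sklearn.impute import SimpleImputer

# Toy training matrix with one NaN and one infinite value.
X_train = np.array([
    [1.0, np.inf],
    [np.nan, 4.0],
    [3.0, 8.0],
])

# Treat infinities as missing so the imputer can handle them too.
X_train[np.isinf(X_train)] = np.nan

# Learn the column means on the training data and fill the gaps with them.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_train_filled = imputer.fit_transform(X_train)

# Serialize the fitted imputer so the same means are reused at prediction time.
blob = pickle.dumps(imputer)
restored = pickle.loads(blob)

# New data is filled with the means learned from the training set.
X_new = restored.transform(np.array([[np.nan, np.nan]]))

print(X_train_filled)
print(X_new)
```

Persisting the fitted imputer means new data is filled with the training-set statistics rather than with values recomputed from whatever arrives in production.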
Comments
-
Parthapratim Neog almost 2 years
I have two CSV files (training set and test set). Since there are visible NaN values in a few of the columns (status, hedge_value, indicator_code, portfolio_id, desk_id, office_id), I start the process by replacing the NaN values with some huge value corresponding to the column. Then I do LabelEncoding to remove the text data and convert it into numerical data. Now, when I try to do OneHotEncoding on the categorical data, I get the error. I tried giving the input one column at a time to the OneHotEncoding constructor, but I get the same error for every column.
Basically, my end goal is to predict the return values, but I am stuck in the data preprocessing part because of this. How do I solve this issue?
I am using Python3.6 with Pandas and Sklearn for data processing.
Code
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

test_data = pd.read_csv('test.csv')
train_data = pd.read_csv('train.csv')

# Replacing Nan values here
train_data['status']=train_data['status'].fillna(2.0)
train_data['hedge_value']=train_data['hedge_value'].fillna(2.0)
train_data['indicator_code']=train_data['indicator_code'].fillna(2.0)
train_data['portfolio_id']=train_data['portfolio_id'].fillna('PF99999999')
train_data['desk_id']=train_data['desk_id'].fillna('DSK99999999')
train_data['office_id']=train_data['office_id'].fillna('OFF99999999')

x_train = train_data.iloc[:, :-1].values
y_train = train_data.iloc[:, 17].values

# =============================================================================
# from sklearn.preprocessing import Imputer
# imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
# imputer.fit(x_train[:, 15:17])
# x_train[:, 15:17] = imputer.fit_transform(x_train[:, 15:17])
#
# imputer.fit(x_train[:, 12:13])
# x_train[:, 12:13] = imputer.fit_transform(x_train[:, 12:13])
# =============================================================================

# Encoding categorical data, i.e. Text data, since calculation happens on numbers only, so having text like
# Country name, Purchased status will give trouble
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
x_train[:, 0] = labelencoder_X.fit_transform(x_train[:, 0])
x_train[:, 1] = labelencoder_X.fit_transform(x_train[:, 1])
x_train[:, 2] = labelencoder_X.fit_transform(x_train[:, 2])
x_train[:, 3] = labelencoder_X.fit_transform(x_train[:, 3])
x_train[:, 6] = labelencoder_X.fit_transform(x_train[:, 6])
x_train[:, 8] = labelencoder_X.fit_transform(x_train[:, 8])
x_train[:, 14] = labelencoder_X.fit_transform(x_train[:, 14])

# =============================================================================
# import numpy as np
# x_train[:, 3] = x_train[:, 3].reshape(x_train[:, 3].size,1)
# x_train[:, 3] = x_train[:, 3].astype(np.float64, copy=False)
# np.isnan(x_train[:, 3]).any()
# =============================================================================

# =============================================================================
# from sklearn.preprocessing import StandardScaler
# sc_X = StandardScaler()
# x_train = sc_X.fit_transform(x_train)
# =============================================================================

onehotencoder = OneHotEncoder(categorical_features=[0,1,2,3,6,8,14])
x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.
Error
Traceback (most recent call last):
  File "<ipython-input-4-4992bf3d00b8>", line 58, in <module>
    x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 2019, in fit_transform
    self.categorical_features, copy=True)
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 1809, in _transform_selected
    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 453, in check_array
    _assert_all_finite(array)
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 44, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
-
Parthapratim Neog over 6 years
Hi, actually this was not the issue; somehow the True and False values are already converted into 1s and 0s. I found what my issue was: there were three other columns with NaNs in them, which I had missed. Just handling those NaNs fixed my issue. Can we connect over mail? I am a newbie in ML, and would love to discuss further issues.
-
Abhishek Sharma over 6 years
You can use df.columns[df.isnull().sum()>0] to print only the columns having null values
-
Lucas Lago almost 6 years
You can use df = df.dropna(how='any',axis=0) to delete rows with NaN values
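A short sketch of the row-dropping suggestion, on an invented two-column frame (the data and the name `cleaned` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with a single missing value in the second row.
df = pd.DataFrame({
    "sold": [1.0, np.nan, 3.0],
    "bought": [2.0, 5.0, 6.0],
})

# Drop every row that contains at least one NaN.
cleaned = df.dropna(how="any", axis=0)

print(len(df), len(cleaned))
```

Dropping rows is the simplest fix, but it discards the other values in those rows, so it suits datasets where only a few rows are affected.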