ValueError: Input contains NaN, infinity or a value too large for dtype('float64') while preprocessing Data


Solution 1

I went through the dataset again after posting the question and found more columns with NaNs. I can't believe I wasted so much time on this when I could simply have used Pandas to list the columns that contain NaN. Using the following code, I found that I had missed three columns; I had been searching for the NaNs visually instead of just using this check. After handling these remaining NaNs, the code worked properly.

pd.isnull(train_data).sum() > 0

Result

portfolio_id      False
desk_id           False
office_id         False
pf_category       False
start_date        False
sold               True
country_code      False
euribor_rate      False
currency          False
libor_rate         True
bought             True
creation_date     False
indicator_code    False
sell_date         False
type              False
hedge_value       False
status            False
return            False
dtype: bool
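
For reference, a short sketch of how those remaining NaN columns can be listed and filled in one go. Here train_data is the DataFrame from the question, and filling with the column mean is just one option (the question used sentinel values instead); it assumes the affected columns are numeric.

import pandas as pd

train_data = pd.read_csv('train.csv')

# Columns that still contain at least one NaN
nan_cols = train_data.columns[train_data.isnull().any()]
print(nan_cols.tolist())  # in this dataset: sold, libor_rate, bought

# Fill them (here with the column mean; a sentinel value works too,
# as long as the columns are numeric)
for col in nan_cols:
    train_data[col] = train_data[col].fillna(train_data[col].mean())

# Confirm nothing is left
assert not train_data.isnull().values.any()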

Solution 2

The error is in the other features, the ones you are treating as non-categorical.

Columns like 'hedge_value' and 'indicator_code' contain mixed-type data: TRUE/FALSE from the original CSV and 2.0 from your fillna() call. The OneHotEncoder cannot process them.

As mentioned in OneHotEncoder fit() documentation:

fit(X, y=None)

    Fit OneHotEncoder to X.

    Parameters:

    X : array-like, shape [n_samples, n_feature]
        Input array of type int.

You can see that it requires X to be entirely numeric (int, although float works too).

As a workaround, you can do this to encode your categorical features:

X_train_categorical = x_train[:, [0,1,2,3,6,8,14]]
onehotencoder = OneHotEncoder()
X_train_categorical = onehotencoder.fit_transform(X_train_categorical).toarray()

And then concatenate this with your non-categorical features.
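
For completeness, a minimal sketch of that concatenation, reusing x_train and the column indices from the question. Using np.delete and np.hstack is just one way to stitch the two parts back together, and it assumes the remaining columns are purely numeric with no NaNs left.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

categorical_idx = [0, 1, 2, 3, 6, 8, 14]

# One-hot encode only the label-encoded categorical columns
onehotencoder = OneHotEncoder()
X_train_categorical = onehotencoder.fit_transform(x_train[:, categorical_idx]).toarray()

# Keep everything else as plain float features
X_train_numeric = np.delete(x_train, categorical_idx, axis=1).astype(np.float64)

# Final feature matrix: encoded categoricals followed by the numeric features
x_train = np.hstack([X_train_categorical, X_train_numeric])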

Solution 3

To use this in production, the best practice is to use an Imputer and then save it in a pkl file together with the model.

This is a workaround:

import numpy as np

df[df == np.inf] = np.nan           # treat infinite values as missing
df.fillna(df.mean(), inplace=True)  # fill remaining NaNs with each column's mean

Better to use the Imputer approach.
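
A minimal sketch of that Imputer-plus-pickle approach, assuming the current sklearn API (SimpleImputer replaces the old Imputer class) and using the standard-library pickle for serialization; the file names and the mean strategy here are illustrative, not taken from the original answer.

import pickle
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv('train.csv')
numeric_cols = df.select_dtypes(include=[np.number]).columns

# Fit the imputer on the training data only, then transform it
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# Persist the fitted imputer alongside the model so the exact same
# imputation can be replayed on new data in production
with open('imputer.pkl', 'wb') as f:
    pickle.dump(imputer, f)

At prediction time the saved imputer is loaded again and its transform() is applied to the incoming data before the model is called.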


Comments

  • Parthapratim Neog almost 2 years

    I have two CSV files (a training set and a test set). There are visible NaN values in a few of the columns (status, hedge_value, indicator_code, portfolio_id, desk_id, office_id).

    I start by replacing the NaN values with a large placeholder value appropriate to each column. Then I use LabelEncoder to convert the text data into numerical data. Now, when I try to apply OneHotEncoder to the categorical data, I get the error below. I tried passing the columns one by one to the OneHotEncoder, but I get the same error for every column.

    Basically, my end goal is to predict the return values, but I am stuck in the data preprocessing part because of this. How do I solve this issue?

    I am using Python 3.6 with Pandas and Sklearn for data processing.

    Code

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    
    test_data = pd.read_csv('test.csv')
    train_data = pd.read_csv('train.csv')
    
    # Replacing Nan values here
    train_data['status']=train_data['status'].fillna(2.0)
    train_data['hedge_value']=train_data['hedge_value'].fillna(2.0)
    train_data['indicator_code']=train_data['indicator_code'].fillna(2.0)
    train_data['portfolio_id']=train_data['portfolio_id'].fillna('PF99999999')
    train_data['desk_id']=train_data['desk_id'].fillna('DSK99999999')
    train_data['office_id']=train_data['office_id'].fillna('OFF99999999')
    
    x_train = train_data.iloc[:, :-1].values
    y_train = train_data.iloc[:, 17].values
    
    # =============================================================================
    # from sklearn.preprocessing import Imputer
    # imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
    # imputer.fit(x_train[:, 15:17])
    # x_train[:, 15:17] = imputer.fit_transform(x_train[:, 15:17])
    # 
    # imputer.fit(x_train[:, 12:13])
    # x_train[:, 12:13] = imputer.fit_transform(x_train[:, 12:13])
    # =============================================================================
    
    
    # Encoding categorical data, i.e. Text data, since calculation happens on numbers only, so having text like 
    # Country name, Purchased status will give trouble
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    labelencoder_X = LabelEncoder()
    x_train[:, 0] = labelencoder_X.fit_transform(x_train[:, 0])
    x_train[:, 1] = labelencoder_X.fit_transform(x_train[:, 1])
    x_train[:, 2] = labelencoder_X.fit_transform(x_train[:, 2])
    x_train[:, 3] = labelencoder_X.fit_transform(x_train[:, 3])
    x_train[:, 6] = labelencoder_X.fit_transform(x_train[:, 6])
    x_train[:, 8] = labelencoder_X.fit_transform(x_train[:, 8])
    x_train[:, 14] = labelencoder_X.fit_transform(x_train[:, 14])
    
    
    # =============================================================================
    # import numpy as np
    # x_train[:, 3] = x_train[:, 3].reshape(x_train[:, 3].size,1)
    # x_train[:, 3] = x_train[:, 3].astype(np.float64, copy=False)
    # np.isnan(x_train[:, 3]).any()
    # =============================================================================
    
    
    # =============================================================================
    # from sklearn.preprocessing import StandardScaler
    # sc_X = StandardScaler()
    # x_train = sc_X.fit_transform(x_train)
    # =============================================================================
    
    onehotencoder = OneHotEncoder(categorical_features=[0,1,2,3,6,8,14])
    x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.
    

    Error

    Traceback (most recent call last):
    
      File "<ipython-input-4-4992bf3d00b8>", line 58, in <module>
        x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.
    
      File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 2019, in fit_transform
        self.categorical_features, copy=True)
    
      File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 1809, in _transform_selected
        X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
    
      File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 453, in check_array
        _assert_all_finite(array)
    
      File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 44, in _assert_all_finite
        " or a value too large for %r." % X.dtype)
    
    ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
    
  • Parthapratim Neog over 6 years
    Hi, actually this was not the issue; somehow the True and False values had already been converted into 1s and 0s. I found what my issue was: there were three other columns with NaNs in them that I had missed. Just handling those NaNs fixed my issue. Can we connect over mail? I am a newbie in ML and would love to discuss further issues.
  • Abhishek Sharma over 6 years
    You can use df.columns[df.isnull().sum()>0] to print only the columns having null values
  • Lucas Lago almost 6 years
    You can use df = df.dropna(how='any',axis=0) to delete rows with NaN values