Scikit-learn - ValueError: Input contains NaN, infinity or a value too large for dtype('float32') with Random Forest


Solution 1

Since I fixed the problem described in the edit to my question, the error no longer occurs. I simply replaced the values like 3.0x10^-314 with zeros.
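One possible explanation for those near-zero values, which the thread itself does not confirm, is uninitialized memory. The question's code allocates `features` with `np.empty` (which does not zero the array) and fills it in a loop over `range(0, 14)`, which never reaches index 14. The sketch below shows that only nine of the ten columns would be visited, and how selecting the columns directly avoids the issue. The names `WANTED`, `select_features` and `demo` are hypothetical and used only for illustration.

```python
import numpy as np

# Column indices the question's loop intends to copy (10 of them).
WANTED = [1, 4, 5, 6, 8, 9, 10, 11, 13, 14]

# range(0, 14) stops at 13, so index 14 is never visited.
visited = [i for i in range(0, 14) if i in WANTED]
print(len(visited))  # 9: only nine of the ten columns get filled

def select_features(input_data):
    """Select the wanted columns directly, leaving nothing uninitialized."""
    return np.asarray(input_data[:, WANTED], dtype=np.float64)

# Hypothetical stand-in for InputData: 5 rows, 15 columns.
demo = np.arange(75, dtype=np.float64).reshape(5, 15)
features = select_features(demo)
print(features.shape)               # (5, 10)
print(np.isfinite(features).all())  # True
```

If the loop is kept, allocating with `np.zeros` instead of `np.empty` would at least make any unfilled column deterministic.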

Solution 2

I would presume that somewhere in your dataframe you sometimes have NaN values.

These can simply be removed using

dataframe1 = dataframe1.dropna()

However, with this approach you could lose some valuable training data, so it may be worth looking into .fillna() or an imputer (sklearn.preprocessing.Imputer at the time of writing; it was removed in scikit-learn 0.22 in favor of sklearn.impute.SimpleImputer) to fill in values for the NaN cells in the dataframe.

Without seeing the source of dataframe1 it is hard to give a complete answer, but it is possible that some sort of train/test split is going on, so that the dataframe being passed contains NaN values only some of the time.
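As a minimal sketch of the options mentioned in this answer, the snippet below counts the problematic cells and then either drops or fills them. The toy `dataframe1` and the choice of the column mean as the fill value are assumptions for illustration, not something taken from the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for dataframe1.
dataframe1 = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, np.inf, 40.0],
})

# Treat infinities as missing too, since scikit-learn rejects both.
cleaned = dataframe1.replace([np.inf, -np.inf], np.nan)

# How many problematic cells does each column have?
print(cleaned.isna().sum())

# Option 1: drop every row that has a missing value (loses data).
dropped = cleaned.dropna()

# Option 2: keep all rows and fill the gaps with the column mean.
filled = cleaned.fillna(cleaned.mean())
```

Dropping is the simplest choice when few rows are affected; filling keeps the row count intact at the cost of introducing estimated values.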

Author by

Thomas

Data scientist/Data Engineer

Updated on June 27, 2022

Comments

  • Thomas
    Thomas almost 2 years

    First, I have checked the different posts concerning this error and none of them can solve my issue.

    So I am using RandomForest and I am able to generate the forest and to do a prediction but sometimes during the generation of the forest, I get the following error.

    ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

    This error occurs with the same dataset. Sometimes the dataset creates an error during the training and most of the time not. The error sometimes occurs at the start and sometimes in the middle of the training.

    Here's my code:

    import pandas as pd
    from sklearn import ensemble
    import numpy as np
    
    def azureml_main(dataframe1 = None, dataframe2 = None):
    
        # Execution logic goes here
    
        Input = dataframe1.values[:,:]
        InputData = Input[:,:15]
        InputTarget = Input[:,16:]
    
        limitTrain = 2175
    
        clf = ensemble.RandomForestClassifier(n_estimators = 10000, n_jobs = 4 );
    
        features=np.empty([len(InputData),10])
        j=0
        for i in range (0,14):
            if (i == 1 or i == 4 or i == 5 or i == 6 or i == 8 or i == 9 or  i == 10 or i == 11 or i == 13 or i == 14):
                features[:,j] = (InputData[:, i])
                j += 1     
            
        clf.fit(features[:limitTrain,:],np.asarray(InputTarget[:limitTrain,1],dtype = np.float32))
    
        res = clf.predict_proba(features[limitTrain+1:,:])
    
        listreu = np.empty([len(res),5])
        for i in range(len(res)):
            if(res[i,0] > 0.5):
                listreu[i,4] = 0;
            elif(res[i,1] > 0.5):
                listreu[i,4] = 1;
            elif(res[i,2] > 0.5):
                listreu[i,4] = 2;
            else:
                listreu[i,4] = 3;
        
    
        listreu[:,0] = features[limitTrain+1:,0]
        listreu[:,1] = InputData[limitTrain+1:,2]
        listreu[:,2] = InputData[limitTrain+1:,3]
        listreu[:,3] = features[limitTrain+1:,1]
    
    
    
        # Return value must be of a sequence of pandas.DataFrame
        return pd.DataFrame(listreu),
    

    I ran my code locally and on Azure ML Studio and the error occurs in both cases.

    I am sure that it is not due to my dataset since most of the time I don't get the error and I am generating the dataset myself from different input.

    This is a part of the dataset I use

    EDIT: I think I found the cause: some values that looked like 0 were not actually 0. They were values like

    3.0x10^-314

  • Thomas
    Thomas almost 6 years
    Since I am generating my own dataset I know that it is impossible that there are NaN values in the dataset.
  • Kieran Lavelle
    Kieran Lavelle almost 6 years
    Have you tried the above to verify that? Something somewhere is likely being cast to a nan without you knowing.
  • Thomas
    Thomas almost 6 years
    I am trying it, but I cannot tell you yet whether it works, since the error does not occur every time.
  • Kieran Lavelle
    Kieran Lavelle almost 6 years
    @ThomasR that's fine, just reply once you have tested it for a reasonable number of attempts.
  • Kieran Lavelle
    Kieran Lavelle almost 6 years
    In that case try features=np.empty([len(InputData),10]).astype(np.float64)
  • Thomas
    Thomas almost 6 years
  • Swarit Agarwal
    Swarit Agarwal over 4 years
    This isn't the case
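
To check the claim debated in the comments above, that the data contains no NaN values, one option is to inspect the arrays right before they are passed to `clf.fit`. The sketch below is only an illustration under stated assumptions: `find_non_finite` is a hypothetical helper, not part of scikit-learn, and it mimics the cast to float32 named in the error message.

```python
import numpy as np

def find_non_finite(array):
    """Return (row, column) positions that are NaN or infinite once cast to float32."""
    # Casting huge float64 values to float32 overflows to inf; silence that warning.
    with np.errstate(over="ignore"):
        as_float32 = np.asarray(array, dtype=np.float64).astype(np.float32)
    return np.argwhere(~np.isfinite(as_float32))

# Hypothetical data: one NaN, one value too large for float32, one tiny value.
demo = np.array([
    [1.0, 2.0],
    [np.nan, 1e300],
    [3.0, 3e-314],
])

bad = find_non_finite(demo)
print(bad)  # positions (1, 0) and (1, 1); 3e-314 underflows to 0.0 and is not flagged
```

Calling such a check on both the feature array and the target array on every run would show whether the intermittent failures coincide with non-finite entries.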