R randomForest for classification


Solution 1

Did you try regression on the same data? If not, check your data for Inf values and remove any you find, after removing NAs and NaNs. The question linked below has useful information on finding Inf values:

R is there a way to find Inf/-Inf values?
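As a rough sketch of that check, base R's is.finite() can count the non-finite entries in each numeric column. The data frame here is a small made-up one (its name and values are assumptions for illustration, not the asker's data):

```r
# Hypothetical data frame with one Inf, one NaN and one NA
data <- data.frame(
  Class = c(1, 1, 2, 2),
  V1    = c(11, 11, 67, 44),
  V2    = c(Inf, 123, 14, NaN),
  V9    = c(0.21, NA, 0.11, 0.20)
)

# Count non-finite entries (NA, NaN, Inf, -Inf) in each numeric column
numeric_cols <- sapply(data, is.numeric)
bad_counts <- sapply(data[numeric_cols], function(col) sum(!is.finite(col)))
print(bad_counts)

# Row indices that contain at least one non-finite value
not_finite <- !sapply(data[numeric_cols], is.finite)
bad_rows <- which(rowSums(not_finite) > 0)
print(bad_rows)
```

is.finite() returns FALSE for NA, NaN, Inf and -Inf alike, so a single pass covers every value the error message complains about.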

Example:

Class  V1  V2      V3  V4   V5  V6  V7  V8  V9
1      11  Inf     4   232  23  2   2   34  0.205567767
1      11  123     4   232  23  1   2   34  0.162357601
1      13  123     4   232  23  1   2   34  -0.002739357
1      13  123     4   232  23  1   2   34  0.186989878
2      67  14      4   232  67  1   2   34  0.109398677
2      67  14      4   232  67  2   2   34  0.18491187
2      67  14      4   232  34  2   2   34  0.098728256
2      44  769.03  4   21   34  2   2   34  0.204405869
2      44  34      4   11   34  1   2   34  0.218426408

# When classification is performed, the following error appears:
rf_model <- randomForest(as.factor(Class) ~ ., data = data, importance = TRUE, proximity = TRUE)
Error in randomForest.default(m, y, ...) : 
  NA/NaN/Inf in foreign function call (arg 1)

# When regression is performed, the same error appears:
rf_model <- randomForest(Class ~ ., data = data, importance = TRUE, proximity = TRUE)
Error in randomForest.default(m, y, ...) : 
  NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In randomForest.default(m, y, ...) :
  The response has five or fewer unique values.  Are you sure you want to do regression?

So, please check your data very carefully.

Solution 2

Apart from the obvious causes such as NAs, this error is almost always caused by character-type features in the data set. To see why, consider what random forest really does: it partitions the data set feature by feature. If one of the features is a character vector, how would you partition on it? You need categories to partition the data, for example how many 'male' vs. 'female'.

For numeric features such as age or price, you can create categories by bucketing: greater than a certain age, less than a certain price, and so on. You cannot do that with pure character features, so you need them as factors in your data set.
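A minimal sketch of that conversion, assuming a made-up data frame with one character column (the column names are hypothetical). Since R 4.0.0, data.frame() and read.csv() no longer convert strings to factors by default, so character columns are easy to end up with:

```r
# Hypothetical data: 'gender' is stored as character, not factor
df <- data.frame(
  gender = c("male", "female", "female", "male"),
  age    = c(23, 35, 41, 29),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each one to a factor
char_cols <- sapply(df, is.character)
df[char_cols] <- lapply(df[char_cols], as.factor)

str(df)  # 'gender' is now a factor with 2 levels; 'age' stays numeric
```

Only the character columns are touched, so numeric features keep their type and can still be split on thresholds.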

Solution 3

In general, there are 2 main reasons you get this error message:

  1. The data frame contains a character vector column instead of a factor. Just convert the character column to a factor.

  2. The data contains bad values (NA, NaN, Inf or -Inf). Applying random forest to such data also generates this error, and head() will not necessarily display the bad values. For example:

library(randomForest)

# 44 rows: a 0/1 predictor and a response whose last 4 values are Inf
x = rep(x = sample(c(0, 1)), times = 22)
y = c(sample.int(n = 50, size = 40), rep(Inf, 4))
df = data.frame(col1 = x, col2 = y)

head(df)
  col1 col2
1    1   26
2    0   33
3    1   23
4    0   21
5    1   45
6    0   27

Now applying randomForest on df will cause the same error:

model = randomForest(data = df , col2 ~ col1 , ntree = 10)

Error in randomForest.default(m, y, ...) : NA/NaN/Inf in foreign function call (arg 2)

Solution: let's identify the bad values in df. The is.finite() function checks, element by element, whether the values of a vector are finite (that is, not NA, NaN, Inf or -Inf). For example:

is.finite(c(5,6,1000000,NaN,Inf))
[1]  TRUE  TRUE  TRUE FALSE FALSE

Now let's count the bad values in each column of our data frame:

sum(!is.finite(df$col2))
[1] 4
sum(!is.finite(df$col1))
[1] 0

Let's drop these records and keep just the good ones:

df1 = df[is.finite(df$col1) & is.finite(df$col2), ]

And run the randomForest once again:

model1 = randomForest(data = df1 , col2 ~ col1 , ntree = 10)
Call:
randomForest(formula = col2 ~ col1, data = df1, ntree = 10)
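
With more than two columns, the same filter can be wrapped in a small helper. This is only a sketch: drop_non_finite() is a hypothetical name, not part of base R or randomForest, and it assumes that dropping whole rows is acceptable for your data:

```r
# Hypothetical helper: keep only rows whose numeric columns are all finite
drop_non_finite <- function(d) {
  numeric_cols <- names(d)[sapply(d, is.numeric)]
  keep <- rep(TRUE, nrow(d))
  for (col in numeric_cols) {
    keep <- keep & is.finite(d[[col]])
  }
  d[keep, , drop = FALSE]
}

# Toy data with one Inf and one NaN in col2
toy <- data.frame(col1 = c(1, 0, 1, 0), col2 = c(26, Inf, NaN, 21))
clean <- drop_non_finite(toy)
print(clean)
```

Non-numeric columns (factors, for instance) are left out of the check, since is.finite() is only meaningful for numeric values.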

Solution 4

You can avoid this error simply by converting all columns to factors. I was facing the same error, and one column in particular was not being converted to a factor. I called as.factor() explicitly on that column, and my code finally worked.
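
One way to spot such a column is to list the class of every column before fitting. A small sketch, using a made-up data frame whose column names are assumptions:

```r
# Hypothetical data frame where one column is still character
df <- data.frame(
  outcome = factor(c("yes", "no", "yes")),
  region  = c("north", "south", "north"),
  stringsAsFactors = FALSE
)

# List the class of every column to spot the one that is not a factor
col_classes <- sapply(df, function(col) class(col)[1])
print(col_classes)

# Convert the remaining character column explicitly
df$region <- as.factor(df$region)
```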

Author: user1799242

Updated on June 11, 2022

Comments

  • user1799242 almost 2 years

    I am trying to do classification with randomForest, but I am repeatedly getting an error message for which there seems to be no apparent solution (randomForest has worked well for me doing regression in the past). I have pasted my code below. 'success' is a factor, and all of the independent variables are numbers. Any suggestions as to how to run this classification properly?

    > rf_model<-randomForest(success~.,data=data.train,xtest=data.test[,2:9],ytest=data.test[,1],importance=TRUE,proximity=TRUE)
    
    Error in randomForest.default(m, y, ...) : 
      NA/NaN/Inf in foreign function call (arg 1)
    

    also, here is a sample of the dataset:

    head(data)

    success duration  goal reward_count updates_count comments_count backers_count min_reward_level max_reward_level
    True    20.00000  1500           10            14              2            68                1             1000
    True    30.00000  3000           10             4              3            48                5             1000
    True    24.40323 14000           23             6             10           540                5             1250
    True    31.95833 30000            9            17              7           173                1            10000
    True    28.13211  4000           10            23             97          2936               10              550
    True    30.00000  6000           16            16            130          2043               25              500