Warning message: "missing values in resampled performance measures" in caret train() using rpart

Solution 1

It's hard to say definitively without more data.

If this is regression, the most likely cause is that the tree did not find a good split and used the average of the outcome as the prediction. That's fine, but you cannot calculate R^2 because the variance of the predictions is zero.
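A minimal sketch of the regression case, using hypothetical toy data: with fewer rows than rpart's default minsplit of 20, the tree cannot split, so every prediction is just mean(y) and R^2 is undefined.

```r
library(rpart)
# Hypothetical toy data: only 10 rows, below rpart's default minsplit = 20,
# so the tree stays a single root node and predicts mean(y) everywhere.
set.seed(1)
d <- data.frame(x = runif(10), y = rnorm(10))
fit <- rpart(y ~ x, data = d)
p <- predict(fit, d)
var(p)                           # 0: the predictions are constant
suppressWarnings(cor(p, d$y)^2)  # NA: R^2 is undefined when var(p) = 0
```

caret hits exactly this NA when it tries to compute Rsquared on such a resample, which triggers the warning.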

If it is classification, it's hard to say. You could have a resample where one of the outcome classes has zero samples, so sensitivity or specificity is undefined and thus NA.
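A minimal sketch of the classification case, computing sensitivity by hand on a hypothetical held-out fold where the "yes" class never occurs:

```r
# Hypothetical fold in which the "yes" class is absent from the held-out data:
obs  <- factor(c("no", "no", "no"),  levels = c("yes", "no"))
pred <- factor(c("no", "yes", "no"), levels = c("yes", "no"))
tp <- sum(pred == "yes" & obs == "yes")  # true positives:  0
fn <- sum(pred == "no"  & obs == "yes")  # false negatives: 0
tp / (tp + fn)  # NaN: sensitivity is 0/0 when the class has zero samples
```

When a resample produces such an undefined metric, caret records it as missing, hence "missing values in resampled performance measures".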

Solution 2

The Problem

The problem is that caret is running a tree-based algorithm under the hood, and randomForest can only handle a limited number of levels in a factor predictor. So you may have a variable that has been set to a factor with more than 53 categories:

> rf.1 <- randomForest(x = rf.train.2, 
+                      y = rf.label, 
+                      ntree = 1000)
Error in randomForest.default(x = rf.train.2, y = rf.label, ntree = 1000) : 
Can not handle categorical predictors with more than 53 categories.

At the base of your problem, caret is running that function, so make sure you fix up your categorical variables with more than 53 levels.

Here is where my problem lay before (notice zipcode coming in as a factor):

# ------------------------------- #
# RANDOM FOREST WITH CV 10 FOLDS  #
# ------------------------------- #
rf.train.2 <- df_train[, c("v1",
                           "v2",
                           "v3",
                           "v4",
                           "v5",
                           "v6",
                           "v7",
                           "v8",
                           "zipcode",
                           "price",
                           "made_purchase")]
rf.train.2 <- data.frame(v1 = as.factor(rf.train.2$v1),
                         v2 = as.factor(rf.train.2$v2),
                         v3 = as.factor(rf.train.2$v3),
                         v4 = as.factor(rf.train.2$v4),
                         v5 = as.factor(rf.train.2$v5),
                         v6 = as.factor(rf.train.2$v6),
                         v7 = as.factor(rf.train.2$v7),
                         v8 = as.factor(rf.train.2$v8),
                         zipcode = as.factor(rf.train.2$zipcode),
                         price = rf.train.2$price,
                         made_purchase = as.factor(rf.train.2$made_purchase))
rf.label <- rf.train.2[, "made_purchase"]

The Solution

Remove all categorical variables that have more than 53 levels.
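A quick way to spot the offending columns, sketched on a small hypothetical frame standing in for rf.train.2:

```r
# Hypothetical frame: zipcode has 60 levels, well past randomForest's limit of 53.
d <- data.frame(zipcode = factor(sprintf("%05d", 1:60)),
                price   = runif(60),
                v1      = factor(rep(c("a", "b"), 30)))
# Flag factor columns with more than 53 levels.
over_limit <- names(d)[sapply(d, function(col) is.factor(col) && nlevels(col) > 53)]
over_limit  # "zipcode"
```

Any column this flags must be dropped, re-binned, or coerced to numeric before handing the data to randomForest.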

Here is my fixed-up code, with the categorical variable zipcode adjusted. You could even have wrapped it in a numeric coercion, like this: as.numeric(rf.train.2$zipcode).

# ------------------------------- #
# RANDOM FOREST WITH CV 10 FOLDS  #
# ------------------------------- #
rf.train.2 <- df_train[, c("v1",
                           "v2",
                           "v3",
                           "v4",
                           "v5",
                           "v6",
                           "v7",
                           "v8",
                           "zipcode",
                           "price",
                           "made_purchase")]
rf.train.2 <- data.frame(v1 = as.factor(rf.train.2$v1),
                         v2 = as.factor(rf.train.2$v2),
                         v3 = as.factor(rf.train.2$v3),
                         v4 = as.factor(rf.train.2$v4),
                         v5 = as.factor(rf.train.2$v5),
                         v6 = as.factor(rf.train.2$v6),
                         v7 = as.factor(rf.train.2$v7),
                         v8 = as.factor(rf.train.2$v8),
                         zipcode = rf.train.2$zipcode,
                         price = rf.train.2$price,
                         made_purchase = as.factor(rf.train.2$made_purchase))
rf.label <- rf.train.2[, "made_purchase"]

Solution 3

This warning can appear when the model fails to converge in some cross-validation folds: the predictions then have zero variance, so metrics like RMSE or Rsquared can't be calculated and become NAs. Some models expose parameters you can tune for better convergence; for example, the neuralnet package lets you increase threshold, which almost always leads to convergence. I'm not sure about the rpart package, though.

Another reason this happens is that you already have NAs in your training data. Then the obvious cure is to remove them before training, e.g. train(..., data = na.omit(training.data)).
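A minimal illustration of the na.omit() cure on hypothetical data:

```r
# Hypothetical training frame with an NA in one predictor row.
d <- data.frame(x = c(1, 2, NA, 4),
                y = c(1.1, 1.9, 3.2, 4.1))
clean <- na.omit(d)  # drops every row containing an NA
nrow(clean)          # 3: the incomplete row is gone
```

The cleaned frame can then be passed to train() so no fold ever sees a missing value.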

Hope that enlightens a bit.

Author: USER_1 (data scientist and computational biologist)

Updated on January 18, 2022

Comments

  • USER_1
    USER_1 over 2 years

    I am using the caret package to train a model with "rpart" package;

    tr = train(y ~ ., data = trainingDATA, method = "rpart")
    

    Data has no missing values or NA's, but when running the command a warning message comes up;

        Warning message:
    In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
      There were missing values in resampled performance measures.
    

Does anyone know (or could point me to where to find an answer) what this warning means? I know it is telling me that there were missing values in resampled performance measures, but what does that mean exactly, and how can such a situation arise? BTW, the predict() function works fine with the fitted model, so this is just my curiosity.

  • USER_1
    USER_1 over 9 years
Thanks @topepo. It is regression, so no good split is a plausible reason. BTW, do you know of any good book explaining linear regression with random forest?
  • Samuel-Rosa
    Samuel-Rosa over 7 years
    @topepo, I have been experiencing the same problem with rpart and nnet. For the latter I simply had to set linout = TRUE to get rid of the warning message and obtain proper cross-validation predictions. However, I could not find a solution for rpart yet: cross-validation predictions were perfectly fine. I have the feeling that rpart is expecting some argument which we cannot pass using train such as method = "anova". The help page of rpart says that "it is wisest to specify the method directly".
  • Jørgen K. Kanters
    Jørgen K. Kanters over 6 years
I have only two levels (male/female) and got the same error message.
  • Brad Davis
    Brad Davis over 6 years
    Perhaps I'm wrong, but it's unclear to me that it would be wise to convert something like zip code into an integer. If it's an integer, then the algorithm is going to treat it like a covariate instead of a factor, so the zipcode 55105 is one unit greater than 55104, when really they don't have that kind of relationship. I think you'd be better to reduce the precision of the zipcode down perhaps just to the first two digits. I realize this discussion is kind of stale, but I thought it was worth discussing anyway.
  • yPennylane
    yPennylane over 5 years
I have the same problem with random forest (method = "rf"), but only if the number of rows in the data set is too small. With a bigger data set (same structure as the smaller one) the warning doesn't occur.
  • StupidWolf
    StupidWolf over 4 years
This could be the problem, but the OP said the data has no missing values or NAs.