R caret package (rpart): constructing a classification tree

23,511

Solution 1

As far as I can tell, there are two problems:

  • R can't find the appropriate predict function for tree1$finalModel. It should be predict.rpart, since tree1$finalModel is of class rpart. I get that error too and, unfortunately, don't know the underlying reason. This is also why R does not accept type = "class", which predict.rpart would accept.
  • Supplying the train function with a formula instead of x and y objects makes it expand the factors into dummy variables, so variables like sect_isodev1 can't be found in the test set later on.
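To make the second point concrete, here is a minimal base R sketch of how a formula interface turns a factor into dummy columns. The data frame toy and its columns amount and sector are made up for illustration (sector stands in for a factor such as sect_isodev), and the assumption is that train's formula method expands factors in a comparable way, which would explain why the fitted tree refers to names that do not exist in the raw test set.

```r
# Hypothetical toy data; 'sector' stands in for a factor such as sect_isodev.
toy <- data.frame(
  amount = c(10, 20, 30, 40, 50, 60),
  sector = factor(c("1", "2", "3", "1", "2", "3"))
)

# Expand the factor into indicator (dummy) columns, as a formula interface does.
expanded <- model.matrix(~ amount + sector, data = toy)

# Column names before and after the expansion.
print(colnames(toy))
print(colnames(expanded))

# Names that only exist after the expansion; none of them is a column of 'toy'.
dummy_names <- setdiff(colnames(expanded), c("(Intercept)", colnames(toy)))
print(dummy_names)
print(dummy_names %in% colnames(toy))
```

A model fitted on the expanded matrix looks for the dummy names at prediction time, so handing it the original data frame fails with an "object not found" error like the one in the question.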

After reproducing your error with random data (resembling your str() output), two changes worked for me: using x and y objects, and calling predict.rpart explicitly from the rpart namespace:

tree1 = train(y = training$def,
              x = training[, -which(colnames(training) == "def")],
              method = "rpart",
              tuneLength = 20,
              metric = "ROC",
              trControl = fitControl)
summary(tree1$finalModel)
# This still results in Error: could not find function "predict.rpart":
model.tree1 <- predict.rpart(tree1$finalModel, newdata = testing)
# Explicitly calling predict.rpart from the rpart namespace works:
rpart:::predict.rpart(object = tree1$finalModel,
                      newdata = testing,
                      type = "class")
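One possible explanation for the "could not find function" error, offered as a sketch rather than a confirmed diagnosis: a method can be registered for S3 dispatch without its name being exported from the package, in which case only the generic predict() or the ::: operator can reach it. The check below uses the built-in iris data as a stand-in for the real data and prints what your installed rpart version reports.

```r
library(rpart)

# Stand-in model on a built-in data set (not the data from the question).
fit <- rpart(Species ~ ., data = iris, method = "class")

# Is a predict method registered for class "rpart"?
method_registered <- !is.null(getS3method("predict", "rpart", optional = TRUE))
print(method_registered)

# Is the name predict.rpart among the exported names of the rpart namespace?
method_exported <- "predict.rpart" %in% getNamespaceExports("rpart")
print(method_exported)

# The generic dispatches on class(fit), so no explicit method name is needed.
classes <- predict(fit, newdata = iris, type = "class")
print(head(classes))
```

If the method is registered but not exported, calling predict() on the rpart object is the idiomatic route, and rpart:::predict.rpart is only needed when the method must be named explicitly.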

By the way, predict(tree1, testing), which calls predict.train on the train object, also works and returns the predicted classes. Edit: As Max pointed out, it is usually better to use this approach than to make a different predict function work.
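A minimal sketch of that approach on simulated data, assuming caret (with its rpart and pROC dependencies) is installed. The data frame toy, its columns x1 and sector, and the smaller resampling settings are made up to keep the example quick; they are not the settings from the question.

```r
library(caret)

set.seed(1234)

# Simulated data with one numeric and one factor predictor (hypothetical names).
toy <- data.frame(
  x1     = rnorm(200),
  sector = factor(sample(c("1", "2", "3"), 200, replace = TRUE))
)
signal  <- toy$x1 + (toy$sector == "3") + rnorm(200, sd = 0.5)
toy$def <- factor(ifelse(signal > 0.5, "Yes", "No"))

in_train <- createDataPartition(y = toy$def, p = 0.75, list = FALSE)
training <- toy[in_train, ]
testing  <- toy[-in_train, ]

ctrl <- trainControl(method = "cv",
                     number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# Formula interface, as in the question.
tree_fit <- train(def ~ .,
                  data = training,
                  method = "rpart",
                  tuneLength = 5,
                  metric = "ROC",
                  trControl = ctrl)

# Predict with the train object itself, not with tree_fit$finalModel.
pred_class <- predict(tree_fit, newdata = testing)
pred_prob  <- predict(tree_fit, newdata = testing, type = "prob")

print(head(pred_class))
print(head(pred_prob))
```

Because the train object is passed to predict(), the raw test set with its factor column can be used as it is.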

Solution 2

Don't use predict.rpart with train$finalModel unless you have a really good reason. The rpart object doesn't know about anything that train did, including pre-processing, so it may not give you the correct answer. After all, you might be using train precisely to avoid the minutiae, so let predict.train do the work.

Max

EDIT -

About the type = "class" and type = "prob" part:

For a classification tree, predict.rpart defaults to producing class probabilities. Although rpart is one of the earliest packages, that is atypical; most packages produce classes by default.

predict.train produces the classes by default and you have to use type = "prob" to get probabilities.
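The rpart side of this can be checked with a short sketch on the built-in iris data, used here as a stand-in for the data in the question:

```r
library(rpart)

# Classification tree on a built-in data set.
fit <- rpart(Species ~ ., data = iris, method = "class")

# No 'type' argument: a matrix with one probability column per class.
probs <- predict(fit, newdata = iris)
print(head(probs))

# type = "class": a factor of predicted class labels.
labels <- predict(fit, newdata = iris, type = "class")
print(head(labels))
```

With a train object the defaults are reversed: predict() returns the classes, and type = "prob" has to be requested explicitly.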

Author: lorelai

Updated on July 05, 2022

Comments

  • lorelai
    lorelai almost 2 years

    I have been struggling for several days to build a classification tree with the caret package. The problem is my factor variables. I can generate the tree, but when I try to use the best model to make predictions on the test sample, it fails: the train function creates dummy variables for my factors, and the predict function then cannot find these newly created dummies in the test set. How should I deal with this problem?

    My code is as follows:

    install.packages("caret", dependencies = c("Depends", "Suggests"))      
    library(caret)                                      
    db=read.csv("db.csv", header=TRUE, sep=";", na.strings="?")
    fix(db)
    db$defaillance=factor(db$defaillance)
    db$def=ifelse(db$defaillance==0,"No","Yes") 
    db$def=factor(db$def)
    db$defaillance=NULL
    db$canal=factor(db$canal)
    db$sect_isodev=factor(db$sect_isodev)
    db$sect_risq=factor(db$sect_risq)       
    
    # Remove near-zero-variance predictors (skip the step if none are flagged)
    nzv <- nearZeroVar(db[,-78])
    db_new <- if (length(nzv) > 0) db[,-nzv] else db
    
    inTrain <- createDataPartition(y = db_new$def, p = .75, list = FALSE)                               
    training <- db_new[inTrain,]
    testing <- db_new[-inTrain,]
    str(training)
    str(testing)
    dim(training)
    dim(testing)
    

    A sample of the str() output for training/testing is shown below:

     $ FDR        : num  1305 211 162 131 143 ...
     $ FCYC       : num  0.269 0.18 0.154 0.119 0.139 ...
     $ BFDR       : num  803 164 108 72 76 63 100 152 188 80 ...
     $ TRES       : num  502 47 54 59 67 49 53 -7 -103 -109 ...
     $ sect_isodev: Factor w/ 9 levels "1","2","3","4",..: 4 3 3 3 3 3 3 3 3 3 ...
     $ sect_risq  : Factor w/ 6 levels "0","1","2","3",..: 6 6 6 6 6 6 6 6 6 6 ...
     $ def        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
    > dim(training)
    [1] 14553    42
    > dim(testing)
    [1] 4850   42
    

    Then my code goes like this:

    fitControl <- trainControl(method = "repeatedcv",
                               number = 10,
                               repeats = 10,
                               classProbs = TRUE,
                               summaryFunction = twoClassSummary)
    
    #CART1
    set.seed(1234)
    tree1 = train(def ~ .,
                  data = training,
                  method = "rpart",
                  tuneLength = 20,
                  metric = "ROC",
                  trControl = fitControl)
    

    A sample of

    summary(tree1$finalModel)
    

    is here

    RNTB          38.397731
    sect_isodev1   6.742289
    sect_isodev3   4.005016
    sect_isodev8   2.520850
    sect_risq3     9.909127
    sect_risq4     6.737908
    sect_risq5     3.085714
    SOLV          73.067539
    TRES          47.906884
    sect_isodev2   0.000000
    sect_isodev4   0.000000
    sect_isodev5   0.000000
    sect_isodev6   0.000000
    sect_isodev7   0.000000
    sect_isodev9   0.000000
    sect_risq0     0.000000
    sect_risq1     0.000000
    sect_risq2     0.000000
    

    And here is the error:

    model.tree1 <- predict(tree1$finalModel,testing)
    Error in eval(expr, envir, enclos) : object 'sect_isodev1' not found

    I am also curious about another thing. In Max Kuhn's "Predictive Modelling with R" I found the following syntax:

    predict(rpartTune$finalModel, newdata, type = "class")
    

    where rpartTune$finalModel is a classification tree just like mine. However, R doesn't accept type="class" for my model, only type="prob", and that troubles me.

    Thank you in advance for your responses.