Selecting CP value for decision tree pruning using rpart

Solution 1

Generally, a cptable like the one you have is a warning that the tree is probably of no use at all and probably unable to generalise to future data. So the answer is not to find another way to choose cp, but rather to create a useful tree if you can, or to admit defeat and say that, based on the examples and features we have, we cannot create a model that is predictive of kyphosis.

In your case, all is not necessarily lost. The data set is very small, and the cross-validation that gives rise to the xerror column is very volatile. If you set your seed to 2 or to 3 you will see very different answers in that column (some even worse).

So one interesting thing to try on this data is to increase the number of cross-validation folds to the number of observations, so that you get leave-one-out cross-validation (LOOCV). If you do this:

library(rpart)  # provides rpart() and the kyphosis data set

myFormula <- Kyphosis ~ Age + Number + Start
rpart_1 <- rpart(myFormula, data = kyphosis,
                 method = "class",
                 control = rpart.control(minsplit = 20,
                                         xval = 81,  # one fold per observation = LOOCV
                                         cp = 0.01))
rpart_1$cptable

you will find a CP table that you like better! (Note that setting a seed is no longer necessary, since the folds are the same each time.)
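
If you then want to select the CP programmatically from that LOOCV table, here is a minimal follow-up sketch reusing the which.min idiom from the question (pruned_1 is an illustrative name, not part of the original answer):

# take the CP with the lowest cross-validated error from the LOOCV table
best.cp <- rpart_1$cptable[which.min(rpart_1$cptable[, "xerror"]), "CP"]
pruned_1 <- prune(rpart_1, cp = best.cp)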

Solution 2

In general (and considering parsimony) you should prefer the smallest tree among those tied for the minimum xerror value, that is, any tree whose xerror falls within [min(xerror) - xstd; min(xerror) + xstd].

According to the rpart vignette: "Any risk within one standard error of the achieved minimum is marked as being equivalent to the minimum (i.e. considered to be part of the flat plateau). Then the simplest model, among all those “tied” on the plateau, is chosen."

See: https://stackoverflow.com/a/15318542/2052738

You can select the most appropriate cp value (to prune your initial tree your.tree, grown and overfitted with rpart) with an ad-hoc function such as:

cp.select <- function(big.tree) {
  # cptable columns: 1 = CP, 4 = xerror, 5 = xstd
  min.x <- which.min(big.tree$cptable[, 4])  # row with the minimum xerror
  # return the CP of the first (i.e. simplest) tree whose xerror falls
  # within one standard error of the minimum (the one-standard-error rule)
  for (i in 1:nrow(big.tree$cptable)) {
    if (big.tree$cptable[i, 4] < big.tree$cptable[min.x, 4] + big.tree$cptable[min.x, 5]) {
      return(big.tree$cptable[i, 1])
    }
  }
}

pruned.tree <- prune(your.tree, cp = cp.select(your.tree))

[In your particular example all trees are equivalent, so the tree of size 1 (no splits) is to be preferred, as the accepted answer already explained.]
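
To tie Solution 2 back to the data from the question, here is a hedged end-to-end sketch (big.fit and pruned.fit are illustrative names; the exact cptable depends on the cross-validation seed):

library(rpart)

set.seed(1)  # the xval = 10 folds are random, so fix the seed for reproducibility
big.fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                 method = "class",
                 control = rpart.control(minsplit = 20, xval = 10, cp = 0.01))

# apply the one-standard-error rule defined above, then prune
pruned.fit <- prune(big.fit, cp = cp.select(big.fit))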

Comments

  • Ivan (almost 2 years ago):

    I understand that the common practice for selecting the CP value is to choose the lowest level with the minimum xerror value. However, in my case below, using cp <- fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"] gives me 0.17647059, which results in no split (just the root) after pruning with that value.

    > myFormula <- Kyphosis~Age+Number+Start
    > set.seed(1)
    > fit <- rpart(myFormula,data=data,method="class",control=rpart.control(minsplit=20,xval=10,cp=0.01))
    > fit$cptable
              CP nsplit rel error   xerror      xstd
    1 0.17647059      0 1.0000000 1.000000 0.2155872
    2 0.01960784      1 0.8235294 1.000000 0.2155872
    3 0.01000000      4 0.7647059 1.058824 0.2200975
    

    Is there any alternative or good practice for selecting the CP value?

  • FairMiles (over 5 years ago):
    If you have computing time to spare, control = rpart.control(xval = [data.length], minsplit = 2, minbucket = 1, cp = 0) will give you the most overfitted sequence of trees with the most informative k-fold cross-validation. With plotcp(model) and printcp(model) you can then explore the whole range of possible trees.
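
    For reference, a runnable sketch of the suggestion above on the kyphosis data, assuming [data.length] means the number of observations (full.fit is an illustrative name):

    library(rpart)

    # grow the most overfitted tree possible, with one cross-validation fold per row (LOOCV)
    full.fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                      method = "class",
                      control = rpart.control(xval = nrow(kyphosis),
                                              minsplit = 2, minbucket = 1, cp = 0))

    printcp(full.fit)  # print the full CP table
    plotcp(full.fit)   # plot xerror against cp / tree size to inspect the plateau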