Understanding the minbucket parameter in a CART model using R


From the documentation for the rpart package:

minbucket

the minimum number of observations in any terminal node. If only one of minbucket or minsplit is specified, the code either sets minsplit to minbucket*3 or minbucket to minsplit/3, as appropriate.
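The derivation rule quoted above can be checked directly with rpart.control(), which is the function rpart uses to collect these settings. This is only a minimal sketch; the variable names ctrl_a and ctrl_b are made up for illustration.

```r
library(rpart)

# Only minbucket is given, so minsplit is derived as minbucket * 3.
ctrl_a <- rpart.control(minbucket = 4)
ctrl_a$minsplit    # 12

# Only minsplit is given, so minbucket is derived from minsplit / 3.
ctrl_b <- rpart.control(minsplit = 9)
ctrl_b$minbucket   # 3
```

Note that the derived minsplit matters in practice: a node with fewer observations than minsplit is never considered for splitting, regardless of minbucket.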

Setting minbucket to 1 imposes no real constraint, since every leaf node will (by definition) contain at least one observation. If you set it to a higher value, say 3, then every leaf node must contain at least 3 observations.

The smaller the value of minbucket, the more closely your CART model can fit the training data. By setting minbucket to too small a value, such as 1, you run the risk of overfitting your model.
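As a rough illustration of the leaf-size guarantee, the sketch below fits two trees on the kyphosis data set that ships with rpart and inspects the number of observations in each leaf. The data set, the two minbucket values, and the helper name leaf_sizes are choices made here for illustration, not part of the original answer.

```r
library(rpart)

# Return the number of observations in each terminal node of a tree
# fitted with the given minbucket (hypothetical helper for this sketch).
leaf_sizes <- function(minbucket) {
  fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
               method = "class", minbucket = minbucket)
  fit$frame$n[fit$frame$var == "<leaf>"]
}

sizes_small <- leaf_sizes(2)    # permissive: leaves may be tiny
sizes_large <- leaf_sizes(15)   # strict: every leaf needs 15+ observations

# Every leaf respects the requested minimum.
all(sizes_small >= 2)
all(sizes_large >= 15)

# Compare how many leaves each setting produced.
length(sizes_small)
length(sizes_large)
```

Comparing the two leaf counts gives a feel for how a larger minbucket restricts the tree, although the exact numbers depend on the data and on the other control parameters such as cp.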

Author by

GBOT

Updated on June 07, 2022

Comments

  • GBOT
    GBOT about 2 years

    Assume the training data is "fruit", which I am going to use for prediction with a CART model in R.

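    library(rpart)       # provides rpart()
    library(rpart.plot)  # provides prp()

    fruit = data.frame(
      color = c("red", "red", "red", "yellow", "red", "yellow",
                "orange", "green", "pink", "red", "red"),
      isApple = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE,
                  FALSE, FALSE, FALSE, FALSE, TRUE))

    # Fit a classification tree; minbucket = 1 allows leaves of any size
    mod = rpart(isApple ~ color, data = fruit, method = "class", minbucket = 1)

    # Plot the fitted tree
    prp(mod)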
    

    Could anyone explain exactly what role minbucket plays in the plotted CART tree for this example if we use minbucket = 2, 3, 4, or 5?

    I have 2 variables, color and isApple. The color variable takes the values green, yellow, pink, orange, and red. The isApple variable is TRUE or FALSE. In the example above, red has five TRUE and one FALSE mapped to it, so red appears six times. If I give minbucket = 1, 2, or 3, the tree splits. If I give minbucket = 4 or 5, no split occurs even though red appears six times.

  • GBOT
    GBOT about 9 years
    fruit=data.frame(color=c("red","red","red","yellow","red","yellow","orange","green","pink","red","red"), isApple=c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE)) Say this is my data frame, and we are finding whether the outcome is an apple or not. We have 5 red apples and 1 tomato here, so whatever is red need not be an apple. But if I give minbucket = 5 or 4 here, there is no split at all. For minbucket 1 to 3 there is a split; beyond 3 there is no split. But I have more than 3 observations in my leaf node. Please upvote my question, thanks. @tim-biegeleisen
  • GBOT
    GBOT about 9 years
    stackoverflow.com/users/3710546/pascal. I have edited the original question. Is it understandable now?
  • Tim Biegeleisen
    Tim Biegeleisen about 9 years
    Could you ask a new question?
  • GBOT
    GBOT about 9 years
    stackoverflow.com/users/1863229/tim-biegeleisen. I have 2 variables, color and isApple. The color variable takes the values green, yellow, pink, orange, and red. The isApple variable is TRUE or FALSE. In the example, red has five TRUE and one FALSE mapped to it, so red appears six times. If I give minbucket = 1, 2, or 3, the tree splits. If I give minbucket = 4 or 5, no split occurs even though red appears six times. Sorry, I could not attach a screenshot; I need 10 reputation to attach one. :( :(
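One plausible explanation for the behaviour described in these comments follows from the documentation quoted in the answer: when only minbucket is given, minsplit is set to minbucket*3, and a node with fewer observations than minsplit is not split. The fruit data has 11 rows, so minbucket = 4 implies minsplit = 12 and the root node can no longer be split at all. The sketch below tries to separate the two effects. The helper n_leaves, the explicit factor() conversion, and xval = 0 (which skips cross-validation and should not change the fitted tree) are assumptions added here, not part of the original question.

```r
library(rpart)

# Data copied from the question; columns converted to factors explicitly.
fruit <- data.frame(
  color = factor(c("red", "red", "red", "yellow", "red", "yellow",
                   "orange", "green", "pink", "red", "red")),
  isApple = factor(c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE,
                     FALSE, FALSE, FALSE, FALSE, TRUE))
)

# Cross-tabulate colour against the outcome: red covers 6 of the 11 rows.
table(fruit$color, fruit$isApple)

# Count the terminal nodes of a tree fitted with the given control settings
# (hypothetical helper for this sketch).
n_leaves <- function(...) {
  fit <- rpart(isApple ~ color, data = fruit, method = "class", xval = 0, ...)
  sum(fit$frame$var == "<leaf>")
}

# Case 1: only minbucket is given, so minsplit becomes minbucket * 3.
derived <- sapply(1:5, function(mb) n_leaves(minbucket = mb))

# Case 2: minsplit is pinned low, so only the leaf-size rule applies.
pinned <- sapply(1:5, function(mb) n_leaves(minbucket = mb, minsplit = 2))

data.frame(minbucket = 1:5, minsplit_derived = (1:5) * 3,
           leaves_derived = derived, leaves_pinned = pinned)
```

Under these assumptions, the first case should give two leaves for minbucket 1 to 3 and a single unsplit root for 4 and 5, while the second case should still split for all five values, because the red/other split leaves 6 and 5 observations in its two children. If that holds, the missing split at minbucket = 4 or 5 comes from the implied minsplit rather than from the leaf sizes themselves.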