Applying k-fold Cross Validation model using caret package


Solution 1

When you perform k-fold cross-validation, you are already making a prediction for each sample, just over 10 different models (presuming k = 10). There is no need to make a prediction on the complete data, as you already have predictions from the k different models.

What you can do is the following:

train_control <- trainControl(method = "cv", number = 10, savePredictions = TRUE)

Then

model <- train(resp ~ ., data = mydat, trControl = train_control, method = "rpart")

If you want to see the observed values and the predictions in a nice format, you simply type:

model$pred
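
For example (a minimal sketch, assuming a classification outcome as in the question), you can summarise those held-out predictions yourself. Note that model$pred holds one row per held-out sample and per candidate tuning value, so it helps to keep only the rows matching the selected parameters:

head(model$pred)                            # held-out prediction, observed value, fold, cp

best <- merge(model$pred, model$bestTune)   # keep only rows for the finally selected cp
confusionMatrix(best$pred, best$obs)        # cross-validated confusion matrix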

Also, for the second part of your question, caret should handle all of the parameter tuning for you. You can manually specify tuning parameters if you desire.
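
As a quick check (a small sketch, not part of the original answer), you can also ask caret which tuning parameters it handles for a given method; for rpart it is only the complexity parameter cp, and tuneLength controls how many candidate values are tried:

modelLookup("rpart")   # lists cp (Complexity Parameter) as the only tuned parameter for rpart

model <- train(resp ~ ., data = mydat, trControl = train_control,
               method = "rpart", tuneLength = 10)   # evaluate 10 candidate cp values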

Solution 2

An important thing to note here is not to confuse model selection with model error estimation.

You can use cross-validation to estimate the model hyper-parameters (the regularization parameter, for example).

Usually that is done with 10-fold cross-validation, because it is a good choice for the bias-variance trade-off (2-fold could produce models with high bias; leave-one-out CV can produce models with high variance/over-fitting).

After that, if you don't have an independent test set, you could estimate an empirical distribution of some performance metric using cross-validation: once you have found the best hyper-parameters, you can use them to estimate the CV error.

Note that in this step the hyper-parameters are fixed, but the model parameters may differ across the cross-validation models.
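
A minimal sketch of that two-step idea with caret and rpart (reusing the question's mydat and resp; the cp values in the grid are arbitrary placeholders):

library(caret)
library(rpart)

# step 1: 10-fold CV to select the hyper-parameter (cp for rpart)
ctrl  <- trainControl(method = "cv", number = 10)
tuned <- train(resp ~ ., data = mydat, method = "rpart", trControl = ctrl,
               tuneGrid = data.frame(cp = c(0.001, 0.01, 0.05, 0.1)))
best_cp <- tuned$bestTune$cp

# step 2: with cp fixed, run CV again and treat the per-fold performance
# as an empirical distribution of the CV error
fixed <- train(resp ~ ., data = mydat, method = "rpart",
               trControl = trainControl(method = "cv", number = 10),
               tuneGrid = data.frame(cp = best_cp))
fixed$resample                    # per-fold Accuracy and Kappa
summary(fixed$resample$Accuracy)  # spread of the cross-validated accuracy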

Solution 3

On the first page of the short introduction document for the caret package, it is mentioned that the optimal model is chosen across the parameters. As a starting point, one must understand that cross-validation is a procedure for selecting the best modeling approach rather than the model itself (see CV - Final model selection). caret provides a grid-search option via tuneGrid, where you can supply a list of parameter values to test. The final model will have the optimized parameters after training is done.
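
For instance (an illustrative sketch, again assuming the question's mydat and resp, with an arbitrary cp grid), a grid search via tuneGrid looks like this, and the returned object exposes both the resampled results and the final refit tree:

grid <- expand.grid(cp = seq(0.001, 0.1, by = 0.01))

model <- train(resp ~ ., data = mydat, method = "rpart",
               trControl = trainControl(method = "cv", number = 10),
               tuneGrid = grid)

model$results      # resampled performance for every cp in the grid
model$bestTune     # the cp value selected by cross-validation
model$finalModel   # the rpart tree refit on all of mydat with the best cp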

Author: pmanDS, updated on October 15, 2020

Comments

  • pmanDS (over 3 years ago)

    Let me start by saying that I have read many posts on cross-validation, and it seems there is much confusion out there. My understanding is simply this:

    1. Perform k-fold cross-validation (e.g. 10 folds) to understand the average error across the folds.
    2. If that error is acceptable, then train the model on the complete data set.

    I am attempting to build a decision tree using rpart in R and taking advantage of the caret package. Below is the code I am using.

    # load libraries
    library(caret)
    library(rpart)
    
    # define training control
    train_control<- trainControl(method="cv", number=10)
    
    # train the model 
    model<- train(resp~., data=mydat, trControl=train_control, method="rpart")
    
    # make predictions
    predictions<- predict(model,mydat)
    
    # append predictions
    mydat<- cbind(mydat,predictions)
    
    # summarize results
    confusionMatrix<- confusionMatrix(mydat$predictions,mydat$resp)
    

    I have one question regarding the caret train application. I have read the train section of A Short Introduction to the caret Package, which states that during the resampling process the "optimal parameter set" is determined.

    In my example, have I coded it up correctly? Do I need to define the rpart parameters within my code, or is my code sufficient?

  • skan (about 2 years ago)

    If you get 10 different models, one per fold, how do you get the final overall model from them? By averaging them all?