Plotting the ROC curve of K-fold Cross Validation

The problem is that I do not clearly understand cross-validation. In the for loop range, I have passed the training sets of X and y variables. Does cross-validation work like this?

Leaving SMOTE and the imbalance issue aside, which are not included in your code, your procedure looks correct.

In more detail, for each one of your n_splits=10:

  • you create train and test folds

  • you fit the model using the train fold:

      classifier.fit(X_train_res[train], y_train_res[train])
    
  • and then you predict probabilities using the test fold:

       predict_proba(X_train_res[test])
    

This is exactly the idea behind cross-validation.
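
To make the train/test fold mechanics concrete, here is a minimal sketch of what `cv.split` hands back on each iteration. The toy arrays `X` and `y` are made up for illustration and are not the data from the question; the point is only that `train` and `test` are disjoint index arrays, and that every row ends up in a test fold exactly once.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (made up for illustration): 20 samples, 4 of them positive.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

cv = StratifiedKFold(n_splits=4)
seen_in_test = []
for fold, (train, test) in enumerate(cv.split(X, y)):
    # train and test are disjoint arrays of row indices into X and y.
    assert len(np.intersect1d(train, test)) == 0
    seen_in_test.extend(test)
    print(f"fold {fold}: {len(train)} train rows, {len(test)} test rows, "
          f"{y[test].sum()} positive in test")

# Across all folds, every row lands in a test fold exactly once.
covered_once = sorted(seen_in_test) == list(range(len(y)))
print("each row tested exactly once:", covered_once)
```

Because the split is stratified, each test fold keeps roughly the class ratio of the full set, which matters when one class is rare.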

So, since you have n_splits=10, you get 10 ROC curves and respective AUC values (and their average), exactly as expected.
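
The averaging step can be sketched in isolation. Each fold's ROC curve has its own set of FPR points, so the curves are first interpolated onto a shared FPR grid and then averaged pointwise. The two curves below are invented for illustration, and `trapezoid_area` is a hypothetical helper standing in for `sklearn.metrics.auc`, so that the sketch needs only NumPy.

```python
import numpy as np


def trapezoid_area(x, y):
    """Area under a piecewise-linear curve, via the trapezoidal rule."""
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))


# Two made-up per-fold ROC curves with different numbers of points.
fold_curves = [
    (np.array([0.0, 0.5, 1.0]), np.array([0.0, 1.0, 1.0])),
    (np.array([0.0, 1.0]), np.array([0.0, 1.0])),
]

mean_fpr = np.linspace(0, 1, 101)  # shared FPR grid
tprs = []
aucs = []
for fpr, tpr in fold_curves:
    # Resample this fold's TPR values onto the shared grid.
    interp_tpr = np.interp(mean_fpr, fpr, tpr)
    interp_tpr[0] = 0.0  # force the curve to start at (0, 0)
    tprs.append(interp_tpr)
    aucs.append(trapezoid_area(fpr, tpr))

mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0  # force the curve to end at (1, 1)
mean_auc = trapezoid_area(mean_fpr, mean_tpr)
std_auc = float(np.std(aucs))
print(f"per-fold AUCs: {aucs}")
print(f"AUC of mean curve: {mean_auc:.3f} +/- {std_auc:.3f}")
```

Note that the plot in the question reports the AUC of the averaged curve together with the standard deviation of the per-fold AUCs, which is what this sketch reproduces.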

However:

The need for (SMOTE) upsampling due to the class imbalance changes the correct procedure and makes your overall process incorrect: you should not upsample your initial dataset before splitting it into folds; instead, you need to incorporate the upsampling step into the CV process itself.

So, the correct procedure here for each one of your n_splits becomes (notice that starting with a stratified CV split, as you have done, becomes essential in class imbalance cases):

  • create train and test folds
  • upsample your train fold with SMOTE
  • fit the model using the upsampled train fold
  • predict probabilities using the test fold (not upsampled)
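
The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the code from the question: the data is synthetic, and `oversample_minority` is a hypothetical stand-in that simply duplicates random minority rows, so that the sketch depends only on NumPy and scikit-learn. It is not SMOTE; with imbalanced-learn installed, the marked line would call `SMOTE().fit_resample` on the train fold instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC


def oversample_minority(X, y, rng):
    """Stand-in for SMOTE: duplicate random minority rows until classes balance."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]


# Synthetic imbalanced data (roughly 10% positives), for illustration only.
X, y = make_classification(n_samples=300, n_features=5,
                           weights=[0.9, 0.1], random_state=0)

rng = np.random.default_rng(0)
cv = StratifiedKFold(n_splits=5)
aucs = []
for train, test in cv.split(X, y):
    # Upsample the train fold only. With imbalanced-learn this would be:
    #   X_fit, y_fit = SMOTE(random_state=0).fit_resample(X[train], y[train])
    X_fit, y_fit = oversample_minority(X[train], y[train], rng)

    # Fit on the upsampled train fold.
    clf = SVC(kernel="sigmoid", probability=True, random_state=0)
    clf.fit(X_fit, y_fit)

    # Score on the untouched test fold, which keeps its original imbalance.
    probas = clf.predict_proba(X[test])[:, 1]
    aucs.append(roc_auc_score(y[test], probas))

print("per-fold AUC:", np.round(aucs, 3))
print("mean AUC: %.3f" % np.mean(aucs))
```

The design point is that no resampled row can ever appear in a test fold, so the reported AUC reflects performance on data with the real class distribution.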

For details regarding the rationale, please see my own answer in the Data Science SE thread "Why you shouldn't upsample before cross validation".

Author: Dipto

Updated on June 04, 2022

Comments

  • Dipto, almost 2 years ago

    I am working with an imbalanced dataset. I applied the SMOTE algorithm to balance the dataset after splitting it into training and test sets, before applying ML models. I want to apply cross-validation and plot the ROC curve of each fold, showing the AUC of each fold, and also display the mean of the AUCs in the plot. I named the resampled training set variables X_train_res and y_train_res, and the code is as follows:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import auc, roc_curve
    from sklearn.model_selection import StratifiedKFold
    from sklearn.svm import SVC
    
    # X_train_res and y_train_res are the SMOTE-resampled training data,
    # assumed here to be NumPy arrays so that they can be indexed by fold.
    cv = StratifiedKFold(n_splits=10)
    classifier = SVC(kernel='sigmoid', probability=True, random_state=0)
    
    tprs = []
    aucs = []
    mean_fpr = np.linspace(0, 1, 100)
    plt.figure(figsize=(10, 10))
    i = 0
    for train, test in cv.split(X_train_res, y_train_res):
        # Fit on the train fold, predict class probabilities on the test fold
        probas_ = classifier.fit(X_train_res[train], y_train_res[train]).predict_proba(X_train_res[test])
        # Compute the ROC curve and the area under the curve
        fpr, tpr, thresholds = roc_curve(y_train_res[test], probas_[:, 1])
        # Interpolate onto the shared FPR grid (np.interp replaces scipy's removed interp)
        tprs.append(np.interp(mean_fpr, fpr, tpr))
        tprs[-1][0] = 0.0
        roc_auc = auc(fpr, tpr)
        aucs.append(roc_auc)
        plt.plot(fpr, tpr, lw=1, alpha=0.3,
                 label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
    
        i += 1
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
             label='Chance', alpha=.8)
    
    # Mean ROC curve across folds
    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    std_auc = np.std(aucs)
    plt.plot(mean_fpr, mean_tpr, color='b',
             label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
             lw=2, alpha=.8)
    
    # Shaded band of one standard deviation around the mean curve
    std_tpr = np.std(tprs, axis=0)
    tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
    tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
    plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                     label=r'$\pm$ 1 std. dev.')
    
    plt.xlim([-0.01, 1.01])
    plt.ylim([-0.01, 1.01])
    plt.xlabel('False Positive Rate', fontsize=18)
    plt.ylabel('True Positive Rate', fontsize=18)
    plt.title('Cross-Validation ROC of SVM', fontsize=18)
    plt.legend(loc="lower right", prop={'size': 15})
    plt.show()
    

    The following is the output:

    [Figure: ROC plot produced by the code above]

    Please tell me whether the code is correct for plotting ROC curve for the cross-validation or not.

    • Xela Vi, about 2 years ago
      Hi, would you mind sharing the complete corrected code (as a solution to your post)? I am facing some issues with SMOTE + ROC curves.