confusion matrix of bstTree predictions, Error: 'The data must contain some levels that overlap the reference.'

Solution 1

max(pred_bstTree)
[1] 1.03385
min(pred_bstTree)
[1] 1.011738

These values and the error message tell the whole story. Plotting a ROC curve simply checks the effect of different threshold points. Predictions are rounded according to the threshold: with a threshold of 0.5, for example, 0.7 is converted to 1 (TRUE class) and 0.3 to 0 (FALSE class). Threshold values lie in the range (0, 1).

In your case, regardless of the threshold, every observation ends up in the TRUE class, because even the minimum prediction is greater than 1. (That's why @phiver was wondering whether you are doing regression instead of classification.) With no zero among the predictions, there is no level in the predictions that coincides with the 0 level in adverse_effects, hence this error.
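The reasoning above can be checked with a minimal base R sketch. The vectors `scores` and `reference` are made-up stand-ins for pred_bstTree and testSplit$adverse_effects, not the real data, and the overlap check only approximates what the error message describes.

```r
# Hypothetical stand-ins for the objects in the question (not the real data).
reference <- factor(c(0, 1, 0, 0, 1), levels = c(0, 1))
scores    <- c(1.012, 1.034, 1.020, 1.015, 1.030)  # every score is above 1

# Any threshold in (0, 1) sends all observations to class 1.
thresholds <- c(0.1, 0.5, 0.9)
all_class_one <- sapply(thresholds, function(t) all(scores > t))
print(all_class_one)

# Treated as labels, the raw scores share nothing with the reference levels.
raw_overlap <- intersect(as.character(scores), levels(reference))
print(length(raw_overlap))
```

If the overlap is empty, the predictions are most likely raw numeric scores rather than class labels, which is worth confirming with class(pred_bstTree) before calling confusionMatrix.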

PS: It is difficult to pinpoint the root cause of the error without you posting your data.

Solution 2

I had a similar problem that produced this same error. I used the confusionMatrix function:

confusionMatrix(actual, predicted, cutoff = 0.5)

And I got the following error: Error in confusionMatrix.default(actual, predicted, cutoff = 0.5) : The data must contain some levels that overlap the reference.

I checked a couple of things:

class(actual) -> numeric

class(predicted) -> integer

unique(actual) -> plenty values, since it is probability

unique(predicted) -> 2 levels: 0 and 1

I concluded that the problem lay in how the function applied the cutoff, so I applied it myself beforehand:

predicted <- ifelse(predicted > 0.5, 1, 0)

and then ran the confusionMatrix function, which now works just fine:

cm <- confusionMatrix(actual, predicted)
cm$table

which produced the correct result.
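The workaround can be sketched in base R as follows. The names `prob`, `observed`, `pred_f` and `obs_f` and their values are invented for illustration. The sketch also gives both vectors the same explicit factor levels, which is an extra step not in the answer above, so that the two sides overlap even if one class never occurs.

```r
# Hypothetical example data; names and values are illustrative only.
observed <- c(0, 1, 1, 0, 1, 0)
prob     <- c(0.10, 0.80, 0.40, 0.30, 0.90, 0.60)

# Apply the cutoff by hand instead of relying on a cutoff argument.
pred_class <- ifelse(prob > 0.5, 1, 0)

# Give both vectors the same explicit levels so they are guaranteed to overlap.
lv     <- c(0, 1)
pred_f <- factor(pred_class, levels = lv)
obs_f  <- factor(observed, levels = lv)

# Base R cross-tabulation: rows are predictions, columns are observed classes.
cm_table <- table(Prediction = pred_f, Reference = obs_f)
print(cm_table)

# Assuming caret is installed, the same factors could then be passed on:
# caret::confusionMatrix(data = pred_f, reference = obs_f)
```

Setting the levels explicitly keeps an empty row or column in the table for a class that never appears, instead of dropping it.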

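One takeaway for your case, which might help interpretation once your code works: check the argument order against the documentation of the package whose confusionMatrix you are calling. The error mentions confusionMatrix.default, which comes from caret, and caret's confusionMatrix(data, reference) expects the predictions first and the observed classes second. Your call already follows that order:

conf_bstTree= confusionMatrix(pred_bstTree,testSplit$adverse_effects)

The reversed order,

conf_bstTree= confusionMatrix(testSplit$adverse_effects,pred_bstTree)

only fits functions that take the actual values first (the actual, predicted, cutoff form I used above belongs to such a function, for example the one in ModelMetrics). With caret it would swap the Prediction and Reference axes of the table.

Either way, getting the order right will help you interpret the confusion matrix once you figure out how to make it work.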

Hope it helps.

Author by

SaikiHanee

Updated on July 18, 2022

Comments

  • SaikiHanee
    SaikiHanee almost 2 years

    I am trying to train a model using the bstTree method and print out the confusion matrix. adverse_effects is my class attribute.

    set.seed(1234)
    splitIndex <- createDataPartition(attended_num_new_bstTree$adverse_effects, p = .80, list = FALSE, times = 1)
    trainSplit <- attended_num_new_bstTree[ splitIndex,]
    testSplit <- attended_num_new_bstTree[-splitIndex,]
    
    ctrl <- trainControl(method = "cv", number = 5)
    model_bstTree <- train(adverse_effects ~ ., data = trainSplit, method = "bstTree", trControl = ctrl)
    
    
    predictors <- names(trainSplit)[names(trainSplit) != 'adverse_effects']
    pred_bstTree <- predict(model_bstTree$finalModel, testSplit[,predictors])
    
    
    plot.roc(auc_bstTree)
    
    conf_bstTree= confusionMatrix(pred_bstTree,testSplit$adverse_effects)
    

    But I get the error 'Error in confusionMatrix.default(pred_bstTree, testSplit$adverse_effects) : The data must contain some levels that overlap the reference.'

     max(pred_bstTree)
    [1] 1.03385
     min(pred_bstTree)
    [1] 1.011738
    
    > unique(trainSplit$adverse_effects)
    [1] 0 1
    Levels: 0 1
    

    How can I fix this issue?

    > head(trainSplit)
       type New_missed Therapytypename New_Diesease gender adverse_effects change_in_exposure other_reasons other_medication
    5     2          1              14           13      2               0                  0             0                0
    7     2          0              14           13      2               0                  0             0                0
    8     2          0              14           13      2               0                  0             0                0
    9     2          0              14           13      2               1                  0             0                0
    11    2          1              14           13      2               0                  0             0                0
    12    2          0              14           13      2               0                  0             0                0
       uvb_puva_type missed_prev_dose skintypeA skintypeB Age DoseB DoseA
    5              5                1         1         1  22 3.000     0
    7              5                0         1         1  22 4.320     0
    8              5                0         1         1  22 4.752     0
    9              5                0         1         1  22 5.000     0
    11             5                1         1         1  22 5.000     0
    12             5                0         1         1  22 5.000     0
    
    • phiver
      phiver over 7 years
      It looks like you are doing regression, not classification. Check whether adverse_effects is set as a factor in your data.
    • SaikiHanee
      SaikiHanee over 7 years
      Yes, phiver, it is a factor containing 0 and 1. Even when I predict after converting it to numeric, I get the same error.
    • phiver
      phiver over 7 years
      Try adding a sample of your data. It is difficult to see where the problem is.
  • SaikiHanee
    SaikiHanee over 7 years
    abhiieor, the data set contains nearly 40000 records but 88% of the data belongs to class 0 and the rest belongs to class 1.
  • abhiieor
    abhiieor over 7 years
    The data you have given is too little to replicate the problem. I hope that when making adverse_effects a factor you used either model_bstTree <- train(as.factor(adverse_effects) ~ ., data = trainSplit, method = "bstTree", trControl = ctrl) or attended_num_new_bstTree$adverse_effects <- as.factor(attended_num_new_bstTree$adverse_effects). If so, I would suggest trying another classification method, say logistic regression, random forest or GBM, to see whether you get the same behavior. Ideally you will not.