confusionMatrix for logistic regression in R


I think the problem is in the call to predict: you did not supply the new data, so it returns fitted values for the training set rather than predictions for the test set. Also, you can use the function confusionMatrix from the caret package to compute and display confusion matrices, and you don't need to tabulate your results before that call.

Here, I created a toy dataset with a representative binary target variable and then fit a model similar to yours.

train <- data.frame(LoanStatus_B = as.numeric(rnorm(100)>0.5), b= rnorm(100), c = rnorm(100), d = rnorm(100))
logitMod <- glm(LoanStatus_B ~ ., data=train, family=binomial(link="logit"))

Now you can generate predictions (for example, on your training set) and pass them to confusionMatrix(), which takes two arguments:

  • your predictions
  • the observed classes

library(caret)
# Use your model to make predictions, in this example newdata = training set, but replace with your test set    
pdata <- predict(logitMod, newdata = train, type = "response")

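# use caret to compute a confusion matrix
# (recent caret versions require data and reference to be factors with the same levels)
confusionMatrix(data = factor(as.numeric(pdata > 0.5), levels = c(0, 1)),
                reference = factor(train$LoanStatus_B, levels = c(0, 1)))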

Here are the results

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 66 33
         1  0  1

               Accuracy : 0.67            
                 95% CI : (0.5688, 0.7608)
    No Information Rate : 0.66            
    P-Value [Acc > NIR] : 0.4625          
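
To make the "replace with your test set" step concrete, here is a minimal sketch of the same workflow on a held-out set. The `make_toy` helper, the simulated `test` data frame, the seed, and the `threshold` variable are assumptions added for illustration; they are not part of the original question.

```r
library(caret)

set.seed(42)  # assumption: seed chosen only to make the sketch reproducible

# Hypothetical helper that simulates data shaped like the toy training set
make_toy <- function(n) {
  data.frame(LoanStatus_B = as.numeric(rnorm(n) > 0.5),
             b = rnorm(n), c = rnorm(n), d = rnorm(n))
}
train <- make_toy(100)
test  <- make_toy(40)   # hypothetical held-out set

logitMod <- glm(LoanStatus_B ~ ., data = train, family = binomial(link = "logit"))

# Without newdata, predict() returns fitted values for the training rows
length(predict(logitMod, type = "response"))   # 100

# With newdata, it returns one probability per row of the new data
p_test <- predict(logitMod, newdata = test, type = "response")
length(p_test)                                 # 40

# Discretize with a threshold and convert both vectors to factors
threshold <- 0.5
pred_test <- factor(as.numeric(p_test > threshold), levels = c(0, 1))
obs_test  <- factor(test$LoanStatus_B, levels = c(0, 1))

cm_test <- confusionMatrix(data = pred_test, reference = obs_test)
cm_test$table
```

Because the predictions and the observed classes now both come from `test`, the two vectors have the same length and the "all arguments must have the same length" error does not occur.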
Author: Pumpkin C

Updated on September 04, 2020

Comments

  • Pumpkin C
    Pumpkin C almost 4 years

    I want to calculate two confusion matrices for my logistic regression, one using my training data and one using my testing data:

    logitMod <- glm(LoanStatus_B ~ ., data=train, family=binomial(link="logit"))
    

    I set the threshold of predicted probability at 0.5:

    confusionMatrix(table(predict(logitMod, type="response") >= 0.5,
                          train$LoanStatus_B == 1))
    

    The code above works well for my training set. However, when I use the test set:

    confusionMatrix(table(predict(logitMod, type="response") >= 0.5,
                          test$LoanStatus_B == 1))
    

    it gives me this error:

    Error in table(predict(logitMod, type = "response") >= 0.5, test$LoanStatus_B == : all arguments must have the same length
    

    Why is this? How can I fix this? Thank you!

    • user20650
      user20650 almost 7 years
      You need to pass the test dataset to the predict function; otherwise it will make predictions on the training dataset. That is: predict(logitMod, newdata=test, type="response")
    • Pumpkin C
      Pumpkin C almost 7 years
      Thx it works!..
  • Pumpkin C
    Pumpkin C almost 7 years
    What is this line doing: data = as.numeric(pdata>0.5)?
  • Damiano Fantini
    Damiano Fantini almost 7 years
    Your target variable is either 0 or 1, but the prediction returns a probability in the range 0 to 1. Therefore you need to convert it to a binary value (discretization). For example, you test whether a value is greater than 0.5. TRUE is then converted to 1 (and FALSE to 0) using as.numeric.
  • Pumpkin C
    Pumpkin C almost 7 years
    So it is the threshold, right? I can change it to any number between 0 and 1 that I want?
  • Pumpkin C
    Pumpkin C almost 7 years
    The last line in the result is "'Positive' Class : 0", but in my case I want the positive class to be 1, which is default. Can I do that?
  • Damiano Fantini
    Damiano Fantini almost 7 years
    Yes, 0.5 is the threshold. You are supposed to use the value that best fits your data; 0.5 is a common starting point.
  • Damiano Fantini
    Damiano Fantini almost 7 years
    Sure, you can. The function has an argument for that; please check ?confusionMatrix. For example: confusionMatrix(data = factor(as.numeric(pdata > 0.5), levels = c(0, 1)), reference = factor(train$LoanStatus_B, levels = c(0, 1)), positive = "1")
  • Pumpkin C
    Pumpkin C almost 7 years
    Okay, but here the 1 is a numeric instead of a string, right?
  • Damiano Fantini
    Damiano Fantini almost 7 years
    In this case, "1" corresponds to your numeric 1s. However, the positive argument must be provided as a character string. If you only care about accuracy, the choice doesn't matter, but it is important for computing sensitivity/specificity, because you need to know which cases are the true/false positives. For example, try: confusionMatrix(data = as.factor(c("A","B", "B", "B", "A", "A", "A", "A", "B", "B")), reference = as.factor(c("A","A", "A", "B", "A", "A", "A", "A", "B", "A")), positive = "A") and the same line with positive = "B". I hope this was useful. If so, please accept my answer. Thanks.
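
To see why the choice of positive class changes sensitivity and specificity but not accuracy, the A/B example from the last comment can be worked through by hand in base R. This is only a sketch: `class_metrics` is a hypothetical helper written for illustration, not a caret function.

```r
# Vectors taken from the comment above
pred <- factor(c("A", "B", "B", "B", "A", "A", "A", "A", "B", "B"))
obs  <- factor(c("A", "A", "A", "B", "A", "A", "A", "A", "B", "A"))

# Hypothetical helper: accuracy, sensitivity and specificity
# for a chosen positive class
class_metrics <- function(pred, obs, positive) {
  tp <- sum(pred == positive & obs == positive)  # true positives
  fn <- sum(pred != positive & obs == positive)  # false negatives
  tn <- sum(pred != positive & obs != positive)  # true negatives
  fp <- sum(pred == positive & obs != positive)  # false positives
  c(accuracy    = (tp + tn) / length(obs),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

class_metrics(pred, obs, positive = "A")  # accuracy 0.7, sensitivity 0.625, specificity 1
class_metrics(pred, obs, positive = "B")  # accuracy 0.7, sensitivity 1, specificity 0.625
```

Switching the positive class swaps sensitivity and specificity, while accuracy stays at 0.7.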