Prediction and Confidence intervals for Logistic Regression

10,137

I am not sure if you are asking for the straight up prediction interval, but if you are you can calculate it simply.

You can extract a traditional confidence interval for the model as such:

confint(model)

And then once you run a prediction, you can calculate a prediction interval based on the prediction like so:

upper = predAll$fit + 1.96 * predAll$se.fit
lower = predAll$fit - 1.96 * predAll$se.fit

You are simply taking the prediction (at any given point if you use a single set of predictor variables) and adding and subtracting 1.96 * absolute value of the standard error. (1.96 se includes 97.5% of the normal distribution and represents the 95% interval as it does for the standard deviation in the normal distribution)

This is the same formula that you would use for a traditional confidence interval except that using the standard error (as opposed to the standard deviation) makes the interval wider to account for the uncertainty in prediction itself.

Update:

Method for plotting prediction invervals courtesy of Rstudio!

As requested...though not done by me!

Share:
10,137
Bob Hopez
Author by

Bob Hopez

Updated on June 04, 2022

Comments

  • Bob Hopez
    Bob Hopez almost 2 years

    Below is a set of fictitious probability data, which I converted into binomial with a threshold of 0.5. I ran a glm() model on the discrete data to test if the intervals returned from glm() were 'mean prediction intervals' ("Confidence Interval") or 'point prediction intervals'("Prediction Interval"). It appears from the plot below that the returned intervals are the latter--'Point Prediction Intervals'; note, with 95% confidence, 2/20 points fall outside of the line in this sample.

    If this is indeed the case, how do I generate the the 'mean prediction interval' (i.e, "Confidence Intervals") in R for a binomial data set bound by 0 and 1 using glm()? Please show your code and plot similar to mine with the fit line, given probabilities, 'confidence intervals' and 'prediction intervals'.

    # Fictitious data
    xVal <- c(15,15,17,18,32,33,41,42,47,50,
             53,55,62,63,64,65,66,68,70,79,
             94,94,94,95,98)
    randRatio <- c(.01,.03,.05,.04,.01,.2,.1,.08,.88,.2,
                   .2,.99,.49,.88,.2,.88,.66,.87,.66,.90,
                   .98,.88,.95,.95,.95)
    # Converted to binomial
    randBinom <- ifelse(randRatio < .5, 0, 1)
    
    # Data frame for model
    binomData <- data.frame(
      randBinom = randBinom,
      xVal = xVal
    )
    
    # Model
    mode1 <- glm(randBinom~ xVal, data = binomData, family = binomial(link = "logit"))
    
    # Predict all points in xVal range
    frame <- data.frame(xVal=(0:100))
    predAll <- predict(mode1, newdata = frame,type = "link", se.fit=TRUE)
    
    # Params for intervals and plot
    confidence <- .95
    score <- qnorm((confidence / 2) + .5)
    frame <- data.frame(xVal=(0:100))
    
    #Plot
    with(binomData, plot(xVal, randBinom, type="n", ylim=c(0, 1), 
                     ylab = "Probability", xlab="xVal"))
    lines(frame$xVal, plogis(predAll$fit), col = "red", lty = 1)
    lines(frame$xVal, plogis(predAll$fit + score * predAll$se.fit), col = "red", lty = 3)
    lines(frame$xVal, plogis(predAll$fit - score * predAll$se.fit), col = "red", lty = 3)
    points(xVal, randRatio, col = "red") # Original probabilities
    points(xVal, randBinom, col = "black", lwd = 3) # Binomial Points used in glm
    

    Here's the plot, presumably with 'point prediction intervals' (i.e., "Prediction Intervals") in dashed red, and the mean fit in solid red. Black dots represent the discrete binomial data from original probabilities in randRatio:

    enter image description here

    • IRTFM
      IRTFM over 7 years
      I think your premise is incorrect. I think you are not seeing what you are calling "point prediction intervals" and what most people simply call "prediction intervals". What you are calling 'mean prediction intervals' is (probably) what most people would call "confidence intervals" and those apply to plausible locations of the estimated parameter.
    • Bob Hopez
      Bob Hopez over 7 years
      @42- I edited some of the wording to align better with your comment.
    • Bob Hopez
      Bob Hopez over 7 years
      @ZheyuanLi Please see modified question. I'm interested to see your solution, and even more so, if there is a way using glm(). Using predict() on lm() with "confidence" or "prediction" doesn't seem to be an option with glm(). See: stackoverflow.com/questions/12544090/…
    • IRTFM
      IRTFM over 7 years
      Using type = link gives you confidence intervals (on the logit scale). You are presenting them on the probability scale but they are still not prediction intervals.
    • Bob Hopez
      Bob Hopez over 7 years
      @42- The above uses type = link. I then hear you saying the above plot represents the confidence interval. Is there a type argument for prediction intervals in glm? It appears more like a prediction interval to me, given the data--and that it covers ~95% of the data (red dots).
    • IRTFM
      IRTFM over 7 years
      Think about it a bit. A "prediction" for an "Y" value in a binomial situation would need to be either 1 or 0. None of the predict.glm values are either of those numbers.
    • IRTFM
      IRTFM over 7 years
      predict.glm has no provision for interval ="prediction". If you wnat to see more from R-Core guRus consider doing a bit of searching: markmail.org/search/…
    • Bob Hopez
      Bob Hopez over 7 years
      @ZheyuanLi See bethany's comments below and link to generating the prediction interval. I'm not sure if her suggested method would work for glm binomial problem. Perhaps you're willing to attack the problem using her suggeseted approach.
  • Bob Hopez
    Bob Hopez over 7 years
    Thanks for your approach. I would kindly challenge you to create a plot with both the 'confidence interval' and 'prediction interval' along with the complete code.
  • sconfluentus
    sconfluentus over 7 years
    Why reinvent the wheel...here is a solid concise and brainy way of doing this with ggplot2:
  • sconfluentus
    sconfluentus over 7 years
    And those can be used with GLM as well.
  • Bob Hopez
    Bob Hopez over 7 years
    Thanks; link was broken, but found here. I'm not convinced that SE and STDEV calcs used in linear regression can be applied the same way in logistic regression. Challenge still stands. :)
  • Bob Hopez
    Bob Hopez over 7 years
    Will try it... Or if someone, yourself included, wants to post the answer herein; I'll give them an upvote.
  • sconfluentus
    sconfluentus over 7 years
    I will leave that test to you. I have used this method prior and it works.
  • Bob Hopez
    Bob Hopez over 7 years
    Did you use the binary residuals or the ratio ('probability') residuals? If the sigmoid changes at the beginning or end of the x-range then it will generate lots of skew with either residuals.
  • Bob Hopez
    Bob Hopez over 7 years