Predict.glm not predicting missing values in response


Solution 1

When glm fits the model, it uses only the cases that have no missing values. You can still get predictions for the cases where y is missing by constructing a data frame and passing it to predict.glm as newdata.

predict(m, newdata=data.frame(y, x))
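A minimal sketch of that approach, reusing the simulated data from the question. The seed, the data frame name d, and the names p_fit and p_all are assumptions added here for illustration, and the lengths described assume R's factory-default na.action (na.omit). The response column is ignored when predicting from newdata, so NAs in y do not matter there.

```r
set.seed(1)  # assumed seed, only for reproducibility
y <- c(round(runif(50)), rep(NA, 50))
x <- rnorm(100)
d <- data.frame(y = y, x = x)  # hypothetical name for the combined data

m <- glm(y ~ x, family = binomial(link = "logit"), data = d)

# Without newdata, predict() returns values only for the rows used in the fit.
p_fit <- predict(m)

# With newdata, predict() returns one value per row of d, including rows where y is NA.
p_all <- predict(m, newdata = d, type = "response")

length(p_fit)
length(p_all)
```

Here p_fit should have one entry per complete case and p_all one entry per row of d.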

Solution 2

The issue is with your call to glm: its na.action argument resolves to na.omit by default.

The rows with missing values are therefore omitted from the fit (and when predict.glm is called without newdata, they are still omitted).

From ?glm

na.action

a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The ‘factory-fresh’ default is na.omit. Another possible value is NULL, no action. Value na.exclude can be useful.

From ?na.exclude (the general help page for NA actions)

na.exclude differs from na.omit only in the class of the "na.action" attribute of the result, which is "exclude". This gives different behaviour in functions making use of naresid and napredict: when na.exclude is used the residuals and predictions are padded to the correct length by inserting NAs for cases omitted by na.exclude.
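A short sketch of what this looks like in practice, again with the question's simulated data (the seed and the names m_ex and p_ex are assumptions for illustration). Note that na.exclude pads the result with NA for the omitted rows; it does not compute predictions for them, so predictions for rows with a missing response still require newdata.

```r
set.seed(1)  # assumed seed, only for reproducibility
y <- c(round(runif(50)), rep(NA, 50))
x <- rnorm(100)

# Same model as in the question, but with na.exclude instead of the default na.omit.
m_ex <- glm(y ~ x, family = binomial(link = "logit"), na.action = na.exclude)

# predict() now returns a vector as long as the original data ...
p_ex <- predict(m_ex)
length(p_ex)

# ... in which the rows dropped from the fit are filled with NA.
sum(is.na(p_ex))
```

The padded vector lines up row by row with the original data, which avoids the silent recycling that a shorter vector can cause.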

Author by

generic_user

Updated on June 14, 2022

Comments

  • generic_user
    generic_user almost 2 years

    For some reason, when I fit glm models (and lm models too, it turns out), R does not predict for rows where the response is missing. Here is an example:

    y = round(runif(50))
    y = c(y,rep(NA,50))
    x = rnorm(100)
    m = glm(y~x, family=binomial(link="logit"))
    p = predict(m,na.action=na.pass)
    length(p)
    
    y = round(runif(50))
    y = c(y,rep(NA,50))
    x = rnorm(100)
    m = lm(y~x)
    p = predict(m)
    length(p)
    

    The length of p should be 100, but it's 50. The weird thing is that I have other predict calls in the same script that do predict from missing data.

    EDIT: It turns out that those other predicts were quite wrong -- I was doing imputed.value = rnorm(N,mean.from.predict,var.of.prediction.interval). This recycled the mean and sd vectors from the lm predict or glm predict functions when length(predict)<N, which was quite different from what I was seeking.

    So my question is what about my example code is stopping glm and lm from predicting missing values?

    Thanks!

  • generic_user
    generic_user about 11 years
    I am indeed constructing imputations. What I want is $X'\hat\beta$ for the $X$ values where $Y$ is missing. Edit: sorry, is there no latex in this forum? I mean fitted values with prediction intervals for new data. I suppose I could do so manually, but I expected predict to, well, predict. Whether I use them for imputations or whatever should be up to me.
  • generic_user
    generic_user about 11 years
    Thanks -- this works, but is odd. I guess the "original" data when newdata is left to default is the model matrix, rather than the variables fed to glm.
  • generic_user
    generic_user about 11 years
    Just to add extra appreciation for this answer: you helped me to find a potentially MASSIVE error in my code. Really grateful.
  • IRTFM
    IRTFM about 11 years
    Downvoting for correct advice about software behavior that doesn't meet one's fantasies is childish.
  • generic_user
    generic_user about 11 years
    I downvoted your comment because it wasn't constructive. You make unfounded assumptions in a mildly hostile tone. What's more, your answer is not in fact an answer, but rather a comment/a request for further information.
  • IRTFM
    IRTFM over 8 years
    Another downvote to this? I suppose leaving this answer ... and it is an answer (since R does NOT impute for missing values in data given to glm even with other values for na.action)... will continue to annoy people who have difficulty accepting reality. If you want to impute data then you need to use a package that provides that facility.
  • generic_user
    generic_user over 8 years
    Two years ago when I was learning R, you provided a technically correct, if rude and useless answer. A useful answer would have been to explain that predict when applied to a fitted model defaults to the model frame stored in the (g)lm object, which in turn omits observations with missing values. It is a bit dense in fact to assume that anyone would seek imputation from a predict function. If you had actually looked at what I was asking at the time, you would have seen that I wanted predictions of y where x is observed. Imputation usually refers to efforts to account for missing covs
  • IRTFM
    IRTFM over 8 years
    @generic_user: How is it rude to say that R does not return a value from predict.glm for cases that have missing values? You are the only one who has suggested that I am "dense". I just suggested that you appeared to have gotten an incorrect idea. You said you had observed something different and I suggested that you needed to provide a demonstration in code. In a technical forum that's not being "rude", it's being accurate.