Error in model.frame.default for Predict() - "Factor has new levels" - For a Char Variable


The person who answered the question in the post you linked to has already indicated why myCharVar is still considered by the model. When you use z ~ . - y, the formula effectively expands to z ~ (x + y) - y, so y is still part of the formula even though it is subtracted again.
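To look at this expansion directly, the sketch below builds the terms object for a formula of the form z ~ . - y on a small made-up data frame. The data frame d and its values are assumptions for illustration only, not the asker's data. The subtracted variable drops out of the term labels, yet it is still mentioned in the formula, which is consistent with it being looked at again at prediction time.

```r
# Toy data; the names d, x, y, z mirror the z ~ . - y example and are hypothetical.
d <- data.frame(
  x = c(1.2, 0.4, 2.1, 1.7),
  y = c("a", "b", "a", "b"),
  z = c(0.5, 1.1, 0.9, 1.4),
  stringsAsFactors = FALSE
)

# Build the terms object, expanding the dot against the columns of d.
tt <- terms(z ~ . - y, data = d)

# Terms that actually enter the fit: y has been subtracted out.
term_labels <- attr(tt, "term.labels")
print(term_labels)

# Variables mentioned anywhere in the formula, including the subtracted y.
formula_vars <- all.vars(tt)
print(formula_vars)
```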

Now, to answer your other question: Consider the following quote from the predict() documentation: "For factor variables having numeric levels, you can specify the numeric values in newdata without first converting the variables to factors. These numeric values are checked to make sure they match a level, then the variable is converted internally to a factor".

I think we can assume that the same kind of behaviour occurs for myCharVar: its values are first checked against the corresponding levels recorded in the model, and this is where it goes wrong. The test set contains values of myCharVar that were never encountered while training the model. (Note that the glm function itself also performs factor conversion, and throws a warning when conversion needs to take place.) In summary, the error means that the model cannot make predictions for levels in the test data that it never encountered during training.
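One way to confirm this diagnosis is to list the values that occur in the test set but not in the training set. This is a minimal sketch with made-up train and test data frames; only the column name myCharVar comes from the question, the values are assumptions.

```r
# Hypothetical stand-ins for the asker's train and test sets.
train <- data.frame(
  myCharVar = c("a", "b", "a", "b"),
  stringsAsFactors = FALSE
)
test <- data.frame(
  myCharVar = c("a", "c", "b", "d"),
  stringsAsFactors = FALSE
)

# Values present in test but never seen in train.
unseen <- setdiff(unique(test$myCharVar), unique(train$myCharVar))
print(unseen)

# Row positions in test that carry an unseen value.
unseen_rows <- which(test$myCharVar %in% unseen)
print(unseen_rows)
```

If unseen comes back non-empty, those are the values the error message is complaining about.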

In this post there is another clarification given on the issue.

Author: Max Power

Updated on July 18, 2022

Comments

  • Max Power almost 2 years

    I have a dataset I split into test/train datasets. Immediately following that split I produced a logistic model with:

    logModel1 = glm(Y ~ . -var1 -var2 -var3, data=train, family=binomial)
    

    If I use that model to make predictions on the same train set, I get no error (though of course that is not a very useful test of my model). So I used the code below to predict on my test set:

    predictLog1 <- predict(logModel1, type="response", newdata=test)
    

    But I get the following error:

    Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor myCharVar has new levels This is an observation of myCharVar, This is another...

    Here's what's got me particularly confused:

    • myCharVar is a character variable in both my train and test sets. I've confirmed this with str(test$myCharVar) and str(train$myCharVar)
    • My model does not even use myCharVar as part of the prediction.

    I found an explanation for bullet 2 at this SO link: "Factor has new levels" error for variable I'm not using

    And the suggestion there to remove the character variables altogether from my train and test sets has provided me a workaround so at least I'm not held up. But that seems pretty inelegant, as opposed to just removing them from the model with "-myCharVar". If anyone understands why a character variable in my test set would throw a "factor has new levels" error I'd certainly be interested.
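A possible middle ground between the two approaches mentioned in the question is to leave train and test untouched and hand glm() a view of train without the unwanted columns, so that the dot never expands to them. This is only a sketch on simulated data: x1, drop_cols, logModel, and the toy values are assumptions, not part of the original dataset.

```r
set.seed(42)

# Simulated stand-ins for the asker's data; test has a level ("c") unseen in train.
train <- data.frame(
  Y = rbinom(40, 1, 0.5),
  x1 = rnorm(40),
  myCharVar = sample(c("a", "b"), 40, replace = TRUE),
  stringsAsFactors = FALSE
)
test <- data.frame(
  Y = rbinom(10, 1, 0.5),
  x1 = rnorm(10),
  myCharVar = c(rep("a", 9), "c"),
  stringsAsFactors = FALSE
)

# Columns to keep out of the model (hypothetical helper variable).
drop_cols <- "myCharVar"
keep <- setdiff(names(train), drop_cols)

# The dot now expands only to the kept columns; train itself is not modified.
logModel <- glm(Y ~ ., data = train[, keep, drop = FALSE], family = binomial)

# test still contains myCharVar, but the fitted formula never refers to it.
pred <- predict(logModel, newdata = test, type = "response")
print(head(pred))
```

The subsetting happens only inside the glm() call, so both data frames keep their character columns for any later use.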