Missing values in scikits machine learning


Solution 1

Missing values are simply not supported in scikit-learn. There has been discussion on the mailing list about this before, but no attempt to actually write code to handle them.

Whatever you do, don't use NaN to encode missing values, since many of the algorithms refuse to handle samples containing NaNs.

The above answer is outdated; more recent releases of scikit-learn include an Imputer class (renamed SimpleImputer in sklearn.impute as of scikit-learn 0.20) that does simple, per-feature missing-value imputation. You can feed it arrays containing NaNs to have those replaced by the mean, median or mode of the corresponding feature.
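A minimal sketch using the modern SimpleImputer API (the toy array X is made up for illustration):

import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: np.nan marks the missing entries.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# strategy may also be "median" or "most_frequent".
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))  # the NaN in column 0 becomes (1.0 + 7.0) / 2 = 4.0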

Solution 2

I wish I could provide a simple example, but I have found that RandomForestRegressor does not handle NaNs gracefully. Performance gets steadily worse as you add features with increasing percentages of NaNs. Features that have "too many" NaNs are completely ignored, even when the NaNs carry very useful information.

This is because the algorithm never creates a split on the decision "is NaN" or "is missing". The algorithm will ignore a feature at a particular level of the tree if that feature has a single NaN in that subset of samples. But at lower levels of the tree, where sample sizes are smaller, it becomes more likely that a subset of samples won't have a NaN in a particular feature's values, so a split can occur on that feature.

I have tried various imputation techniques to deal with the problem (replace with mean/median, predict missing values using a different model, etc.), but the results were mixed.

Instead, this is my solution: replace NaNs with a single, obviously out-of-range value (like -1.0). This enables the tree to split on the criterion "unknown value vs. known value". However, there is a strange side effect of using such out-of-range values: known values near the out-of-range value can get lumped together with it when the algorithm tries to find a good place to split. For example, known 0s could get lumped with the -1s used to replace the NaNs. So your model can change depending on whether your out-of-range value is less than the minimum or greater than the maximum (it gets lumped in with the minimum or maximum value, respectively). This may or may not help the generalization of the technique; the outcome depends on how similar in behavior the minimum- or maximum-value samples are to the NaN-value samples.
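A short sketch of that sentinel trick; the toy array X is illustrative, and the per-feature "one below the minimum" sentinel is one way to pick an out-of-range value, not the author's exact recipe:

import numpy as np

# Toy feature matrix; np.nan marks the missing entries.
X = np.array([[0.0, 5.0],
              [np.nan, 3.0],
              [2.0, np.nan]])

# Per-feature sentinel just below the observed minimum, so each tree can
# learn an "unknown vs. known" split on that feature.
sentinel = np.nanmin(X, axis=0) - 1.0
X_filled = np.where(np.isnan(X), sentinel, X)
print(X_filled)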

Solution 3

Replacing a missing value with a mean/median/other statistic may not solve the problem, as the fact that the value is missing may itself be significant. For example, in a survey on physical characteristics, a respondent may not enter their height if they are embarrassed about being abnormally tall or short. This would imply that missing values indicate the respondent was unusually tall or short - the opposite of the median value.

What is necessary is a model that has a separate rule for missing values; any attempt to guess the missing value will likely reduce the predictive power of the model.

For example, a missing-value flag column plus a median fill (this assumes df is an existing pandas DataFrame with a column named xvariable):

import numpy as np

# Flag the rows where xvariable was missing, then impute the median.
df['xvariable_missing'] = np.where(df.xvariable.isna(), 1, 0)
df.xvariable = df.xvariable.fillna(df.xvariable.median())

Solution 4

I have come across a very similar issue when running RandomForestRegressor on data. The presence of NA values was producing "nan" for predictions. Scrolling through several discussions, I found that Breiman's documentation recommends two solutions, for continuous and categorical data respectively:

  1. Calculate the median of the data from the column (feature) and use this (continuous data).
  2. Determine the most frequently occurring category and use this (categorical data).

According to Breiman, the random nature of the algorithm and the number of trees will allow for the correction without too much effect on the accuracy of the prediction. I feel this would be the case if the NA values are sparse; a feature containing many NA values will most likely have an effect.
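Both recommendations map directly onto scikit-learn's SimpleImputer; a minimal sketch (the two toy columns are made up for illustration):

import numpy as np
from sklearn.impute import SimpleImputer

# Continuous feature: impute with the column median.
cont = np.array([[1.0], [np.nan], [5.0]])
print(SimpleImputer(strategy="median").fit_transform(cont))

# Categorical feature (object dtype, None marks missing): impute with the mode.
cat = np.array([["red"], ["blue"], ["red"], [None]], dtype=object)
print(SimpleImputer(strategy="most_frequent", missing_values=None).fit_transform(cat))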

Solution 5

Orange is another Python machine learning library that has facilities dedicated to imputation. I have not had a chance to use them, but I might soon, since the simple methods of replacing NaNs with zeros, averages, or medians all have significant problems.


Comments

  • Vladtn
    Vladtn over 3 years

    Is it possible to have missing values in scikit-learn? How should they be represented? I couldn't find any documentation about that.

  • Tobias Domhan
    Tobias Domhan about 11 years
    I just wanted to note that random forest handles NaN values well.
  • B_Miner
    B_Miner almost 10 years
    Is this still the case with GBM? Is there a reason the scikit-learn version does not mirror the coding in R, which deals gracefully with missing values? It is one of the attractive features of that algorithm, and I wish it were coded as such in scikit.
  • Fred Foo
    Fred Foo almost 10 years
    @B_Miner: it's the case for practically all scikit-learn estimators. Missing value handling is done separately from learning, but the two can be combined using a Pipeline.
  • user48956
    user48956 over 7 years
    True. Also, in many cases data may be very heavily NULL (let's say 20% of each row is filled). A DT-based method should be able to slurp this up -- and if used correctly, the model can be much better than one with imputed data.
  • arun
    arun over 7 years
    One needs to be smart about what to impute the missing values to, not just choose the mean, median or mode. In some cases 0 may make the most sense, in which case one can use df[column_name].fillna(0, inplace=True). Also, for a feature like monthsSinceLastDelinquency, imputing missing values to a value outside the valid range makes the most sense. Decision trees can make meaningful splits with this, and the information in your data is preserved.
  • Mohammad Mahdi KouchakYazdi
    Mohammad Mahdi KouchakYazdi about 7 years
    @Johnny V What do you mean by "a model with a separate rule"?
  • Johnny V
    Johnny V about 7 years
    The strategy I use is to add a missing-value flag column as well as filling in the missing value with the median or mode. What this does is get the record treated as normal by the model unless the missing value is significant, in which case the model will make a rule for the missing-data flag. For example: "if record has missing data point then classify as type A".
  • pansen
    pansen over 6 years
    Great idea to use "clearly out-of-range" values for missing values. This should work quite well for decision trees.
  • Dave Liu
    Dave Liu over 4 years
    I use -1 a lot. It seems like I need to try something more obvious like "-999" for better NaN performance.