Using LASSO in R with categorical variables

Solution 1

The other answers here point out ways to re-code your categorical factors as dummies. Depending on your application, that may not be a great solution. If all you care about is prediction, then it is probably fine, and the approach provided by Flo.P should work: LASSO will find you a useful set of variables, and your model probably won't be overfit.

However, if you're interested in interpreting your model or discussing which factors are important after the fact, you're in a weird spot. The default coding used by model.matrix gives each dummy a very specific interpretation when taken by itself. model.matrix uses what is referred to as "dummy coding" (I remember learning it as "reference coding"; see here for a summary). That means that if one of these dummies is included, your model now has a parameter whose interpretation is "the difference between one level of this factor and an arbitrarily chosen other level of that factor", and maybe none of the other dummies for that factor were selected. You may also find that if the ordering of your factor levels changes, you end up with a different model.
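
As a quick illustration of reference coding in base R (a minimal sketch; the factor f is made up for the example):

f <- factor(c("a", "b", "c", "a", "b"))
# The first level ("a") is absorbed into the intercept; each dummy
# measures a difference from that reference level.
model.matrix(~ f)
# Releveling produces different dummies with different interpretations,
# even though the underlying data are identical.
model.matrix(~ relevel(f, ref = "b"))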

There are ways to deal with this, but rather than kludge something together, I'd try the group LASSO. Building on Flo.P's code from Solution 2 below:

install.packages("gglasso")
library(gglasso)

# Simulate an unordered categorical variable with nb_lvl levels
create_factor <- function(nb_lvl, n = 100) {
  factor(sample(letters[1:nb_lvl], n, replace = TRUE))
}

df <- data.frame(var1 = create_factor(5),
                 var2 = create_factor(5),
                 var3 = create_factor(5),
                 var4 = create_factor(5),
                 var5 = rnorm(100),
                 y = rnorm(100))

y <- df$y
# Dummy-code the predictors and drop the intercept column
x <- model.matrix(~ ., dplyr::select(df, -y))[, -1]
# Each 5-level factor contributes 4 dummies; grouping them makes the
# LASSO select or drop a whole factor at once. var5 is its own group.
groups <- c(rep(1:4, each = 4), 5)
fit <- gglasso(x = x, y = y, group = groups, lambda = 1)
fit$beta

Since we didn't specify any relationship between our factors (var1, var2, etc.) and y, the group LASSO does the right thing and sets all coefficients to 0, except at the smallest amounts of regularization. You can play around with values for lambda (a tuning parameter) or just leave the option blank and the function will pick a range for you.
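
For example, here is a minimal sketch of choosing lambda by cross-validation with cv.gglasso from the same package (the loss, pred.loss, and fold settings are assumptions mirroring the package's least-squares example, not part of the answer above):

# Cross-validate over an automatically chosen lambda sequence
cv_fit <- cv.gglasso(x = x, y = y, group = groups,
                     loss = "ls", pred.loss = "L2", nfolds = 5)
cv_fit$lambda.min               # lambda with the smallest CV error
coef(cv_fit, s = "lambda.min")  # coefficients at that lambda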

Solution 2

You can make dummy variables from your factors using model.matrix.

I create a data.frame. y is the target variable.

create_factor <- function(nb_lvl, n = 100) {
  factor(sample(letters[1:nb_lvl], n, replace = TRUE))
}

df <- data.frame(var1 = create_factor(5),
                 var2 = create_factor(5),
                 var3 = create_factor(5),
                 var4 = create_factor(5),
                 var5 = rnorm(100),
                 y = create_factor(2))


# var1 var2 var3 var4        var5   y
# 1    a    c    c    b -0.58655607 b
# 2    d    a    e    a  0.52151994 a
# 3    a    b    d    a -0.04792142 b
# 4    d    a    a    d -0.41754957 b
# 5    a    d    e    e -0.29887004 a

Select all the factor variables. I use dplyr::select_if, then parse the variable names to build a formula string like y ~ var1 + var2 + var3 + var4.

library(dplyr)
library(stringr)
library(glmnet)
# Keep only the factor predictors and join their names with "+"
vars_name <- df %>%
  select(-y) %>%
  select_if(is.factor) %>%
  colnames() %>%
  str_c(collapse = "+")

model_string <- paste("y ~", vars_name)
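
For this data frame, the resulting string is:

model_string
# [1] "y ~ var1+var2+var3+var4"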

Create dummy variables with model.matrix. Don't forget as.formula to coerce the character string to a formula.

x_train <- model.matrix(as.formula(model_string), df)
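
If you inspect the columns, you'll see that model.matrix keeps an intercept column and drops the first level of each factor (var1a, var2a, and so on); that is the reference coding discussed in Solution 1:

colnames(x_train)
# [1] "(Intercept)" "var1b" "var1c" "var1d" "var1e" "var2b" ... "var4e"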

Fit your model.

lasso_model <- cv.glmnet(x = x_train, y = df$y, family = "binomial", alpha = 1, nfolds = 10)
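
Once it has run, the usual glmnet accessors apply. A small sketch of inspecting the fit (assuming the call above succeeded):

lasso_model$lambda.min                  # lambda with the lowest cross-validated deviance
coef(lasso_model, s = "lambda.min")     # sparse coefficients at that lambda
# In-sample class predictions at the same lambda
predict(lasso_model, newx = x_train, s = "lambda.min", type = "class")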

The code could be simplified, but the idea is there.


Comments

  • Admin almost 2 years

    I've got a dataset with 1000 observations and 76 variables, about twenty of which are categorical. I want to use LASSO on this entire data set. I know that having factor variables doesn't really work in LASSO through either lars or glmnet, but the variables are too many and there are too many different, unordered values they can take on to reasonably recode them numerically.

    Can LASSO be used in this situation? How do I do this? Creating a matrix of the predictors yields this response:

    hdy<-as.numeric(housingData2[,75])
    hdx<-as.matrix(housingData2[,-75])
    model.lasso <- lars(hdx, hdy)
    Error in one %*% x : requires numeric/complex matrix/vector arguments
    

    I realize that other methods may be easier or more appropriate, but the challenge is actually to do this using lars or glmnet, so if it's possible, I would appreciate any ideas or feedback.

    Thank you,

  • Admin over 6 years
    So this all works up until the last part. When I do that, I get the error "Error in glmnet(x, y, weights = weights, offset = offset, lambda = lambda, : number of observations in y (1000) not equal to the number of rows of x (0)" which makes sense when I look at it, because x_train appears to be a matrix of num[0,1:128]. Is that right?
  • Flo.P over 6 years
    Ok, so all your rows have at least one NA. You need to handle your missing values by imputing them. Maybe you have some columns with a lot of NAs that you can remove. When you have a dataset with enough complete rows, it may work with: lasso_model <- cv.glmnet(x = x_train, y = na.omit(df$y), family = "binomial", alpha = 1, nfolds = 10) (I added na.omit around df$y).
  • HoneyBuddha over 2 years
    Can you help explain why var1a, var2a, etc. are missing? This produces some strange answers. For example, I set some of the vars to have a higher mean and still get all 0 coefficients. So, modify the y systematically (e.g. add 15 to each value) for var1 == "a". Your coefficient estimates don't change from 0. This does not seem right at all. Is there a bug in this code?