One-Hot Encoding in [R] | Categorical to Dummy Variables

dd <- read.table(text="
   RACE        AGE.BELOW.21     CLASS
   HISPANIC          0          A
   ASIAN             1          A
   HISPANIC          1          D
   CAUCASIAN         1          B",
  header=TRUE)


with(dd,
     data.frame(model.matrix(~RACE-1,dd),
                AGE.BELOW.21,CLASS))
 ##   RACEASIAN RACECAUCASIAN RACEHISPANIC AGE.BELOW.21 CLASS
 ## 1         0             0            1            0     A
 ## 2         1             0            0            1     A
 ## 3         0             0            1            1     D
 ## 4         0             1            0            1     B

The formula ~RACE-1 tells R to create dummy variables from the RACE variable while suppressing the intercept, so that each column indicates whether an observation belongs to a given category. Without the -1, the default is to make the first column an intercept term (all ones) and to omit the dummy variable for the baseline level (the first level of the factor) from the model matrix.
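
To make the difference concrete, here is a minimal sketch that builds both model matrices from the example data and compares their column names. The stringsAsFactors=TRUE argument and the names no_int and with_int are my additions for illustration; they are not part of the original answer.

```r
dd <- read.table(text="
   RACE        AGE.BELOW.21     CLASS
   HISPANIC          0          A
   ASIAN             1          A
   HISPANIC          1          D
   CAUCASIAN         1          B",
  header=TRUE, stringsAsFactors=TRUE)

## intercept suppressed: one indicator column per level of RACE
no_int <- model.matrix(~RACE-1, dd)
colnames(no_int)
## [1] "RACEASIAN"     "RACECAUCASIAN" "RACEHISPANIC"

## default: an intercept column, and no column for the baseline level (ASIAN)
with_int <- model.matrix(~RACE, dd)
colnames(with_int)
## [1] "(Intercept)"   "RACECAUCASIAN" "RACEHISPANIC"
```

ASIAN is the baseline here only because factor levels are sorted alphabetically by default.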

More generally, you might want something like

 dd0 <- subset(dd,select=-CLASS)
 data.frame(model.matrix(~.-1,dd0),CLASS=dd$CLASS)

Note that when you have multiple categorical variables you will have to do something a little bit tricky if you want a full set of dummy variables for each one: with ~.-1, only the first factor gets a column for every level, and each remaining factor still drops its baseline level. I would think of cbind()ing together separate model matrices, but I think there's also some trick for doing this all at once that I forget ...
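
One possible way to do this all at once (not necessarily the trick alluded to above) is to pass identity contrast matrices through the contrasts.arg argument of model.matrix(), so that no factor drops a level. The sketch below treats CLASS as a second categorical predictor purely for illustration; stringsAsFactors=TRUE and the names is_fac, full_contr, mm and mm2 are my own assumptions.

```r
dd <- read.table(text="
   RACE        AGE.BELOW.21     CLASS
   HISPANIC          0          A
   ASIAN             1          A
   HISPANIC          1          D
   CAUCASIAN         1          B",
  header=TRUE, stringsAsFactors=TRUE)

## which columns are factors (here RACE and CLASS)
is_fac <- sapply(dd, is.factor)

## contrasts(f, contrasts=FALSE) returns an identity matrix,
## i.e. one indicator column for every level of f
full_contr <- lapply(dd[is_fac], contrasts, contrasts=FALSE)

## all at once: full dummy sets for every factor, numeric columns untouched
mm <- model.matrix(~.-1, dd, contrasts.arg=full_contr)
colnames(mm)

## the cbind() alternative, one model matrix per factor
mm2 <- cbind(model.matrix(~RACE-1, dd),
             model.matrix(~CLASS-1, dd))
```

The resulting matrix is deliberately rank-deficient (each factor's indicators sum to one), which is fine for one-hot encoding but should be kept in mind before feeding it to an unregularized linear model.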


Author: EFL (Researcher)

Updated on September 20, 2020

Comments

  • EFL
    EFL over 3 years

    I need to create a new data frame nDF that binarizes all categorical variables and at the same time retains all other variables in a data frame DF. For example, I have the following feature variables: RACE (4 types) and AGE, and an output variable called CLASS.

    DF =

                  RACE     AGE (BELOW 21)      CLASS
    Case 1    HISPANIC                  0          A
    Case 2       ASIAN                  1          A
    Case 3    HISPANIC                  1          D
    Case 4   CAUCASIAN                  1          B
    

    I want to convert this into nDF with five (5) variables, or even four (4):

              RACE.1    RACE.2    RACE.3      AGE (BELOW 21)     CLASS
    Case 1         0         0         0                   0         A
    Case 2         0         0         1                   1         A
    Case 3         0         0         0                   1         D
    Case 4         0         1         0                   1         B
    

    I am familiar with the treatment contrast to the variable DF$RACE. However, if I implement

    contrasts(DF$RACE) = contr.treatment(4)
    

    what I get is still a DF of three variables, but with variable DF$RACE having the attribute "contrasts."

    What I ultimately want though is a new data frame nDF as illustrated above, but which can be very tedious to evaluate if one has around 50 feature variables, with more than five (5) of them being categorical variables.

    • Ben
      Ben over 6 years
      If you're open to using the data.table package, you can use the one_hot() method from mltools.
  • EFL
    EFL almost 10 years
    I'll definitely try the one you suggested here and explore some more with cbind(). This is truly helpful. I would have voted your answer up if I had more reputation.
  • kravi
    kravi almost 9 years
    I am not able to understand the meaning of ~RACE-1?
  • Ben Bolker
    Ben Bolker almost 9 years
    RACE says to translate the categorical variable into dummy variables according to treatment contrasts; -1 says to omit the intercept term
  • dynamo
    dynamo about 8 years
    Note that numerically encoded categorical columns must be stored as character or factor; otherwise model.matrix will leave them as they are, as ordinary numeric columns.
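
The last point can be checked with a small sketch. The data frame d and its REGION column are hypothetical, made up only to show the behaviour:

```r
## a categorical variable stored as numeric codes (hypothetical data)
d <- data.frame(REGION = c(1, 2, 3, 2))

## numeric: model.matrix keeps a single column holding the codes
num_mm <- model.matrix(~REGION-1, d)
colnames(num_mm)
## [1] "REGION"

## factor: one indicator column per code
d$REGION <- factor(d$REGION)
fac_mm <- model.matrix(~REGION-1, d)
colnames(fac_mm)
## [1] "REGION1" "REGION2" "REGION3"
```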