How to create a factor interaction variable in R? Why can't I just multiply?

Solution 1

I think it's very much an R question. When you do this:

NE = location[location==NE]

You might have thought that you were creating a logical variable that could be multiplied by other variables to create an interaction term. Not so. Because the logical comparison was done within the "[" (Extract) operator, it selected only the values of location that equaled the value of the symbol NE (which might or might not have been the value "NE"). The result is therefore shorter than location, which is why you got the warning about different lengths.
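As a minimal sketch of that difference, using a made-up location vector (the data and the names subset_NE and logical_NE are assumptions, not from the question):

```r
# Hypothetical data for illustration only
location <- c("NE", "S", "W", "NE", "MW")

subset_NE  <- location[location == "NE"]  # extraction: keeps only the matching values
logical_NE <- location == "NE"            # comparison: one TRUE/FALSE per element

length(subset_NE)   # 2
length(logical_NE)  # 5, the same as length(location)
```

Only the second form lines up element by element with the other variables in the data.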

If it is true that NE == "NE"... and that the location variable had some "NE"'s in it, then you could have just done this:

 NE <- location == NE

That would have replaced the presumably length-1 value of NE with a vector of the same length as location with a bunch of TRUE's and FALSE's. You can multiply other vectors by logicals and will get numeric results where TRUE is converted to 1 and FALSE is 0. Standard Boolean arithmetic does succeed in R. And you can use such variables created in that manner in R's regression functions. It's not the usual way to do that, but it does deliver sensible results.
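A small illustration of that coercion, assuming a hypothetical numeric covariate called income (not part of the original question):

```r
# Hypothetical data for illustration only
location <- c("NE", "S", "NE", "W")
income   <- c(10, 20, 30, 40)

NE <- location == "NE"     # logical vector, same length as location
NE_income <- NE * income   # TRUE counts as 1, FALSE as 0
NE_income                  # 10 0 30 0
```

Note that this only works when the other operand is numeric or logical; a character vector such as race would first have to be turned into a 0/1 indicator, for example with race == "minority".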

On the other hand, the formula-method for representing interactions is much more compact and Maxim.K's comment hit the nail on the head. If you built an NE variable as above you could just do something like this:

  lm ( outcome ~ race * NE, data=dfrm) 

The "*" means something quite different in that context. Within a formula it is not doing multiplication (just as "^" is not a power operator); race * NE expands to the main effects plus their interaction, race + NE + race:NE. Another slightly clunky method would be:
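One way to see what the formula operators expand to is to inspect the term labels. This is only a sketch; outcome, race, NE, a and b are placeholder names, and no data is needed because terms() works on the formula alone:

```r
# "*" in a formula expands to main effects plus the interaction
star_terms <- attr(terms(outcome ~ race * NE), "term.labels")
star_terms   # "race" "NE" "race:NE"

# "^" is also a crossing operator, not a power: all terms up to 2-way interactions
hat_terms <- attr(terms(outcome ~ (a + b)^2), "term.labels")
hat_terms    # "a" "b" "a:b"
```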

 lm ( outcome ~ race * I(location=="NE"), data=dfrm)

The I function makes R evaluate the expression inside it as ordinary R code and returns the resulting logical vector. (This assumes that the unstated values of location include some "NE"s.) While we're on the topic of constructing interactions, you may want to look at the %in% operator, which allows easy construction of set-membership tests. Many newbies fail in their efforts to construct proper tests of set membership by doing things like:

  NE.SE <- location == c("NE", "SE")  # wrong: c("NE", "SE") is recycled and compared position by position

... when they should do this:

 NE.SE <- location %in% c("NE", "SE")
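To make the difference concrete, here is a small sketch on a made-up vector (the data and the names wrong and right are assumptions):

```r
# Hypothetical data for illustration only
location <- c("SE", "NE", "W", "NE")

# "==" recycles c("NE", "SE") to NE, SE, NE, SE and compares position by position
wrong <- location == c("NE", "SE")
wrong   # FALSE FALSE FALSE FALSE

# "%in%" asks, for each element, whether it belongs to the set
right <- location %in% c("NE", "SE")
right   # TRUE TRUE FALSE TRUE
```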

Solution 2

Creating interactions (and other effects) is well explained in M. Kéry, Introduction to WinBUGS for Ecologists. It is also an excellent introduction to simulation techniques. Recommended.

The interaction between race and location does not make much sense to me.

I interpreted your question as: "How can I create interaction effects using factors?". The code answering this question is:

  N <- 400 # population size
  n <- 400 # sample size
  race <- sample(as.factor(c(rep("white", .8*N), rep("minority", .2*N))), n, replace=TRUE)
  location <- sample(as.factor(c(rep("A", .25*N), rep("B", .25*N), rep("C", .25*N), rep("D", .25*N))), n, replace=TRUE)
  (X <- model.matrix(~race*location)) # take a look: ncol(X) columns -> ncol(X) effects
  colnames(X) # show effect names
  # Choose effects (their order must match colnames(X))
  int <- 12 # Intercept
  race.effects <- c(5) # 1 df -> one effect
  location.effects <- c(3,4,5) # 3 df -> three effects
  interaction.effects <- c(15, 20, 4) # 1*3 df -> three interaction effects, not necessarily multiplicative
  all.effects <- c(int, race.effects, location.effects, interaction.effects)
  sigma <- 3
  res <- rnorm(n, 0, sigma) # Residuals
  y <- as.numeric(X %*% all.effects + res) # design matrix times effects, plus noise
  lm1 <- lm(y ~ race*location)
  summary(lm1)

The sampling is with replacement (replace=TRUE), i.e. from an effectively infinite population. You may want to use more complex sampling from various defined populations. The more unbalanced the samples, the more problematic the re-estimation of the parameters becomes.

So, yes, you can multiply, but it is a matrix multiplication (using %*%).
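As a rough check of how the design matrix relates to plain multiplication, the sketch below uses a tiny made-up data set (not the simulation above) and compares one interaction column with the element-wise product of the two dummy columns it is built from:

```r
# Hypothetical balanced toy data for illustration only
race     <- factor(c("white", "minority", "white", "minority", "white", "minority"))
location <- factor(c("A", "A", "B", "B", "C", "C"))

X <- model.matrix(~ race * location)
colnames(X)  # includes "racewhite", "locationB" and "racewhite:locationB"

# The interaction column is the product of the corresponding 0/1 dummy columns
manual <- X[, "racewhite"] * X[, "locationB"]
same <- all(X[, "racewhite:locationB"] == manual)
same  # TRUE
```

So the multiplication does happen, but on the 0/1 dummy columns that R derives from the factors, not on the factors themselves.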

Author by

Hutchins

Updated on June 04, 2022

Comments

  • Hutchins
    Hutchins almost 2 years

    I'm doing an OLS regression, and I'm trying to create an interaction variable. To do this, as far as I know, I just multiply two variables together. However, that is not working.

    Let's say I have variables race (white, minority) and location (NE, S, W, MW). I want to create an interaction effect between all those in the NE and race. So I do:

    >NE = location[location==NE]
    >race_NE = NE*race
    Error in race * NE : non-numeric argument to binary operator
    

    Didn't work. Why?

    I then found the function interaction(). I'm not sure what it does, but it seems to give me something:

    > cat = interaction(NE, race)
    Warning message:
    In ans * length(l) + if1 :
      longer object length is not a multiple of shorter object length
    > freq(cat)
    cat 
                      Frequency Percent
    2 NE.0 white       246    44.4
    2 NE.1 minority       308    55.6
    Total                   554   100.0
    

    I'm not sure if that did what I need it to do so I can use an interaction variable in an lm() model?

    I'm kinda lost here. This may be more a stat than an R question. Please help, thanks