Specifying formula in R with glm without explicit declaration of each covariate
Solution 1
Your creative use of . to build a formula containing all or almost all variables is a good, clean approach. Another option that is sometimes useful is to build the formula programmatically as a string and then convert it to a formula with as.formula:
vars <- paste("Var",1:10,sep="")
fla <- paste("y ~", paste(vars, collapse="+"))
as.formula(fla)
Of course, you can make the fla object way more complicated.
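As a hedged alternative to pasting strings by hand, base R's reformulate() (in the stats package) builds the same kind of formula from a character vector of term labels. The sketch below assumes covariates literally named Var1 to Var10 and a response named y, as in the snippet above. The data frame dat and its values are made up purely for illustration.

```r
# Illustrative data only: the names Var1..Var10 and y follow the snippet above,
# the values are simulated for this sketch.
set.seed(1)
n <- 200
vars <- paste0("Var", 1:10)
dat <- as.data.frame(matrix(rnorm(n * 10), ncol = 10, dimnames = list(NULL, vars)))
dat$y <- rbinom(n, 1, 0.5)

# String-based construction, as in the answer
fla_string <- as.formula(paste("y ~", paste(vars, collapse = "+")))

# The same formula via reformulate(); extra terms are simply more labels
fla_reform <- reformulate(vars, response = "y")
fla_extra  <- reformulate(c(vars, "I(Var2^2)"), response = "y")

fit <- glm(fla_extra, family = binomial, data = dat)
```

Either route yields an ordinary formula object, so the choice is mostly a matter of taste. reformulate() just avoids managing the "+" separators yourself.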
Solution 2
Aniko answered your question. To extend it a bit:
You can also exclude variables using -:
glm(Y~.-W1+A*I(W2^2), family=binomial, data=samp)
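To check what such a formula expands to before fitting, one option is to inspect its term labels with terms(), passing the data so that . can be resolved. This is only a sketch: the samp below is a simplified simulated stand-in with the same column names (W1, W2, A, Y), not the data from the question.

```r
# Simplified stand-in for samp: same column names, made-up values
set.seed(39)
n <- 200
samp <- data.frame(W1 = runif(n), W2 = runif(n, 0, 5))
samp$A <- rbinom(n, 1, 0.5)
samp$Y <- rbinom(n, 1, plogis(samp$A + samp$W2 - 2))

f <- Y ~ . - W1 + A * I(W2^2)

# Expand "." against the data and list the terms that will be fitted
labs <- attr(terms(f, data = samp), "term.labels")
print(labs)

fit <- glm(f, family = binomial, data = samp)
print(names(coef(fit)))
```

The printed labels should contain W2, A, the squared term, and the interaction of A with the squared term, but not W1.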
For large groups of variables, I often build a data frame that groups the variables, which lets you do something like:
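vars <- data.frame(
  names = names(samp),                      # W1, W2, A, Y
  main = c(TRUE, FALSE, TRUE, FALSE),       # enter as main effects
  quadratic = c(FALSE, TRUE, TRUE, FALSE),  # enter as squared terms
  main2 = c(TRUE, TRUE, FALSE, FALSE),      # an alternative grouping (unused below)
  stringsAsFactors = FALSE
)
regform <- paste(
  "Y ~",
  paste(
    paste(vars[vars$main, 1], collapse = "+"),
    paste(vars[1, 1], paste("*I(", vars[vars$quadratic, 1], "^2)"), collapse = "+"),
    sep = "+"
  )
)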
> regform
[1] "Y ~ W1+A+W1 *I( W2 ^2)+W1 *I( A ^2)"
> glm(as.formula(regform),data=samp,family=binomial)
Filling the data frame using all kinds of conditions (on name, on structure, whatever) lets me quickly select groups of variables in large datasets.
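The same idea can be sketched without a hand-filled lookup table, by computing the groups from a condition on the columns and passing the result to base R's reformulate(). Everything here is an assumption for illustration: the simulated stand-in for samp, and the rule "more than two distinct values" used to decide which columns get a squared term.

```r
# Simplified stand-in for samp: same column names, made-up values
set.seed(39)
n <- 200
samp <- data.frame(W1 = runif(n), W2 = runif(n, 0, 5))
samp$A <- rbinom(n, 1, 0.5)
samp$Y <- rbinom(n, 1, plogis(samp$A + samp$W2 - 2))

# Every column except the response is a candidate main term
covars <- setdiff(names(samp), "Y")

# Condition on structure: numeric columns with more than two distinct values
is_cont <- vapply(samp[covars],
                  function(x) is.numeric(x) && length(unique(x)) > 2,
                  logical(1))
quad <- paste0("I(", covars[is_cont], "^2)")

# Main terms plus squared terms for the continuous columns
regform <- reformulate(c(covars, quad), response = "Y")
fit <- glm(regform, family = binomial, data = samp)
```

With a few hundred columns, the condition can just as well be a grep() on the names. The point is only that the groups are computed rather than typed.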
S.R.
Updated on July 09, 2022
Comments
S.R. almost 2 years
I would like to force specific variables into glm regressions without fully specifying each one. My real data set has ~200 variables. I haven't been able to find examples of this in my online searching thus far.
For example (with just 3 variables):
n=200
set.seed(39)
samp = data.frame(W1 = runif(n, min = 0, max = 1), W2=runif(n, min = 0, max = 5))
samp = transform(samp, # add A
A = rbinom(n, 1, 1/(1+exp(-(W1^2-4*W1+1)))))
samp = transform(samp, # add Y
Y = rbinom(n, 1,1/(1+exp(-(A-sin(W1^2)+sin(W2^2)*A+10*log(W1)*A+15*log(W2)-1+rnorm(1,mean=0,sd=.25))))))
If I want to include all main terms, this has an easy shortcut:
glm(Y~., family=binomial, data=samp)
But say I want to include all main terms (W1, W2, and A) plus W2^2:
glm(Y~A+W1+W2+I(W2^2), family=binomial, data=samp)
Is there a shortcut for this?
[editing self before publishing:] This works!
glm(formula = Y ~ . + I(W2^2), family = binomial, data = samp)
Okay, so what about this one!
I want to omit one main-term variable and include only two main terms (A, W2), plus W2^2 and the W2^2:A interaction:
glm(Y~A+W2+A*I(W2^2), family=binomial, data=samp)
Obviously with just a few variables no shortcut is really needed, but I work with high dimensional data. The current data set has "only" 200 variables, but some others have thousands and thousands.