Linear Regression and group by in R

r regression linear-regression lm

103,850

Solution 1

Here's one way using the lme4 package.

 library(lme4)
 d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
                 year=rep(1:10, 2),
                 response=c(rnorm(10), rnorm(10)))

 xyplot(response ~ year, groups=state, data=d, type='l')

 fits <- lmList(response ~ year | state, data=d)
 fits
#------------
Call: lmList(formula = response ~ year | state, data = d)
Coefficients:
   (Intercept)        year
CA -1.34420990  0.17139963
NY  0.00196176 -0.01852429

Degrees of freedom: 20 total; 16 residual
Residual standard error: 0.8201316

Solution 2

Since 2009, dplyr has been released which actually provides a very nice way to do this kind of grouping, closely resembling what SAS does.

library(dplyr)

d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
                year=rep(1:10, 2),
                response=c(rnorm(10), rnorm(10)))
fitted_models = d %>% group_by(state) %>% do(model = lm(response ~ year, data = .))
# Source: local data frame [2 x 2]
# Groups: <by row>
#
#    state   model
#   (fctr)   (chr)
# 1     CA <S3:lm>
# 2     NY <S3:lm>
fitted_models$model
# [[1]]
# 
# Call:
# lm(formula = response ~ year, data = .)
# 
# Coefficients:
# (Intercept)         year  
#    -0.06354      0.02677  
#
#
# [[2]]
# 
# Call:
# lm(formula = response ~ year, data = .)
# 
# Coefficients:
# (Intercept)         year  
#    -0.35136      0.09385

To retrieve the coefficients and Rsquared/p.value, one can use the broom package. This package provides:

three S3 generics: tidy, which summarizes a model's statistical findings such as coefficients of a regression; augment, which adds columns to the original data such as predictions, residuals and cluster assignments; and glance, which provides a one-row summary of model-level statistics.

library(broom)
fitted_models %>% tidy(model)
# Source: local data frame [4 x 6]
# Groups: state [2]
# 
#    state        term    estimate  std.error  statistic   p.value
#   (fctr)       (chr)       (dbl)      (dbl)      (dbl)     (dbl)
# 1     CA (Intercept) -0.06354035 0.83863054 -0.0757668 0.9414651
# 2     CA        year  0.02677048 0.13515755  0.1980687 0.8479318
# 3     NY (Intercept) -0.35135766 0.60100314 -0.5846187 0.5749166
# 4     NY        year  0.09385309 0.09686043  0.9689519 0.3609470
fitted_models %>% glance(model)
# Source: local data frame [2 x 12]
# Groups: state [2]
# 
#    state   r.squared adj.r.squared     sigma statistic   p.value    df
#   (fctr)       (dbl)         (dbl)     (dbl)     (dbl)     (dbl) (int)
# 1     CA 0.004879969  -0.119510035 1.2276294 0.0392312 0.8479318     2
# 2     NY 0.105032068  -0.006838924 0.8797785 0.9388678 0.3609470     2
# Variables not shown: logLik (dbl), AIC (dbl), BIC (dbl), deviance (dbl),
#   df.residual (int)
fitted_models %>% augment(model)
# Source: local data frame [20 x 10]
# Groups: state [2]
# 
#     state   response  year      .fitted   .se.fit     .resid      .hat
#    (fctr)      (dbl) (int)        (dbl)     (dbl)      (dbl)     (dbl)
# 1      CA  0.4547765     1 -0.036769875 0.7215439  0.4915464 0.3454545
# 2      CA  0.1217003     2 -0.009999399 0.6119518  0.1316997 0.2484848
# 3      CA -0.6153836     3  0.016771076 0.5146646 -0.6321546 0.1757576
# 4      CA -0.9978060     4  0.043541551 0.4379605 -1.0413476 0.1272727
# 5      CA  2.1385614     5  0.070312027 0.3940486  2.0682494 0.1030303
# 6      CA -0.3924598     6  0.097082502 0.3940486 -0.4895423 0.1030303
# 7      CA -0.5918738     7  0.123852977 0.4379605 -0.7157268 0.1272727
# 8      CA  0.4671346     8  0.150623453 0.5146646  0.3165112 0.1757576
# 9      CA -1.4958726     9  0.177393928 0.6119518 -1.6732666 0.2484848
# 10     CA  1.7481956    10  0.204164404 0.7215439  1.5440312 0.3454545
# 11     NY -0.6285230     1 -0.257504572 0.5170932 -0.3710185 0.3454545
# 12     NY  1.0566099     2 -0.163651479 0.4385542  1.2202614 0.2484848
# 13     NY -0.5274693     3 -0.069798386 0.3688335 -0.4576709 0.1757576
# 14     NY  0.6097983     4  0.024054706 0.3138637  0.5857436 0.1272727
# 15     NY -1.5511940     5  0.117907799 0.2823942 -1.6691018 0.1030303
# 16     NY  0.7440243     6  0.211760892 0.2823942  0.5322634 0.1030303
# 17     NY  0.1054719     7  0.305613984 0.3138637 -0.2001421 0.1272727
# 18     NY  0.7513057     8  0.399467077 0.3688335  0.3518387 0.1757576
# 19     NY -0.1271655     9  0.493320170 0.4385542 -0.6204857 0.2484848
# 20     NY  1.2154852    10  0.587173262 0.5170932  0.6283119 0.3454545
# Variables not shown: .sigma (dbl), .cooksd (dbl), .std.resid (dbl)

Solution 3

Here's an approach using the plyr package:

d <- data.frame(
  state = rep(c('NY', 'CA'), 10),
  year = rep(1:10, 2),
  response= rnorm(20)
)

library(plyr)
# Break up d by state, then fit the specified model to each piece and
# return a list
models <- dlply(d, "state", function(df) 
  lm(response ~ year, data = df))

# Apply coef to each model and return a data frame
ldply(models, coef)

# Print the summary of each model
l_ply(models, summary, .print = TRUE)

Solution 4

In my opinion is a mixed linear model a better approach for this kind of data. The code below given in the fixed effect the overall trend. The random effects indicate how the trend for each individual state differ from the global trend. The correlation structure takes the temporal autocorrelation into account. Have a look at Pinheiro & Bates (Mixed Effects Models in S and S-Plus).

library(nlme)
lme(response ~ year, random = ~year|state, correlation = corAR1(~year))

Solution 5

A nice solution using data.table was posted here in CrossValidated by @Zach. I'd just add that it is possible to obtain iteratively also the regression coefficient r^2:

## make fake data
    library(data.table)
    set.seed(1)
    dat <- data.table(x=runif(100), y=runif(100), grp=rep(1:2,50))

##calculate the regression coefficient r^2
    dat[,summary(lm(y~x))$r.squared,by=grp]
       grp         V1
    1:   1 0.01465726
    2:   2 0.02256595

as well as all the other output from summary(lm):

dat[,list(r2=summary(lm(y~x))$r.squared , f=summary(lm(y~x))$fstatistic[1] ),by=grp]
   grp         r2        f
1:   1 0.01465726 0.714014
2:   2 0.02256595 1.108173

View more solutions

103,850

Author by

JD Long

Only slightly ashamed creator of disgusting and frustrating code. I'm a data guy not a programmer. But sometimes I have to program my data into submission.

Updated on October 02, 2020

Comments

JD Long over 3 years

I want to do a linear regression in R using the lm() function. My data is an annual time series with one field for year (22 years) and another for state (50 states). I want to fit a regression for each state so that at the end I have a vector of lm responses. I can imagine doing for loop for each state then doing the regression inside the loop and adding the results of each regression to a vector. That does not seem very R-like, however. In SAS I would do a 'by' statement and in SQL I would do a 'group by'. What's the R way of doing this?
JD Long almost 15 years

This is a really good general stats theory answer which makes me think about some things I had not considered. The application which caused me to ask the question would not be applicable to this solution, but I'm glad you brought it up. Thank you.
hadley almost 15 years

It's not a good idea to start with a mixed model - how do you know that any of the assumptions are warranted?
Thierry almost 15 years

One should check the assumption by model validation (and knowledge of the data). BTW you cannot warrant the assumption on the individual lm's either. You would have to validate all models seperately.
MikeTP over 11 years

Say you added a additional independent variable that wasn't available in all states (i.e. miles.of.ocean.shoreline) that was represented by NA in your data. Wouldn't the lm call fail? How could it be dealt with?
hadley over 11 years

Inside the function you'd need to test for that case and use a different formula
erc about 10 years

Is it possible to add the name of the subgroup to each call in the summary (last step)?
ToToRo almost 10 years

Is there a way to list R2 for both of these two models? e.g. add an R2 column after year. Also add p-value for each of the coeff?
Brian D about 9 years

if you run layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page and then l_ply(models, plot) you get each of the residuals plots too. Is it possible to label each of the plots with the group (e.g., "state" in this case)?
FraNut over 8 years

@ToToRo here you can find a feaseble solution (better late than never): Your.df[,summary(lm(Y~X))$r.squared,by=Your.factor] where: Y, X and Your.factor are your variables. Please keep in mind that Your.df must be a data.table class
pedram over 8 years

I had to do rowwise(fitted_models) %>% tidy(model) to get the broom package to work, but otherwise, great answer.
holastello over 6 years

Works great... can do this all without leaving the pipe: d %>% group_by(state) %>% do(model = lm(response ~ year, data = .)) %>% rowwise() %>% tidy(model)
randy almost 5 years

A little extension in case you want a column of fitted values or residuals: wrap the lm() call in a resid() call and then pipe everything in the last line into an unnest() call. Of course, you would want to change the variable name from "model" to something more relevant.
Chris Nolte almost 4 years

@pedram and @holastello, this no longer works, at least with R 3.6.1, broom_0.7.0, dplyr_0.8.3. d %>% group_by(state) %>% do(model=lm(response ~year, data = .)) %>% rowwise() %>% tidy(model) Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) : Calling var(x) on a factor x is defunct. Use something like 'all(duplicated(x)[-1L])' to test for a constant vector. In addition: Warning messages: 1: Data frame tidiers are deprecated and will be removed in an upcoming release of broom. ...
Jon Spring about 3 years

Now (dplyr 1.0.4, tidyverse 1.3.0), you can do: library(broom); library(tidyverse) d %>% nest(data = -state) %>% mutate(model = map(data, ~lm(response~year, data = .)), tidied = map(model, tidy)) %>% unnest(tidied)
Recology about 2 years

@JonSpring is it still working this way?
Jon Spring about 2 years

Still works for me, yes. (on dplyr 1.0.8 and tidyverse 1.3.1)
Herman Toothrot about 2 years

I can't find the function xyplot, it doesn't show up after loading lme4