Aggregate / summarize multiple variables per group (e.g. sum, mean)
Solution 1
Where is this year()
function from?
You could also use the reshape2
package for this task:
require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
# year month x1 x2
1 2000 1 -80.83405 -224.9540159
2 2000 2 -223.76331 -288.2418017
3 2000 3 -188.83930 -481.5601913
4 2000 4 -197.47797 -473.7137420
5 2000 5 -259.07928 -372.4563522
Solution 2
Yes, in your formula
, you can cbind
the numeric variables to be aggregated:
aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)
year month x1 x2
1 2000 1 7.862002 -7.469298
2 2001 1 276.758209 474.384252
3 2000 2 13.122369 -128.122613
...
23 2000 12 63.436507 449.794454
24 2001 12 999.472226 922.726589
See ?aggregate
, the formula
argument and the examples.
Solution 3
With the dplyr
package, you can use summarise_all
, summarise_at
or summarise_if
functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:
library(dplyr)
# summarising all non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)
# summarising a specific set of non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)
# summarising a specific set of non-grouping variables using select_helpers
# see ?select_helpers for more options
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(starts_with('x')), sum)
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(matches('.*[0-9]')), sum)
# summarising a specific set of non-grouping variables based on condition (class)
df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)
The result of the latter two options:
year month x1 x2
<dbl> <dbl> <dbl> <dbl>
1 2000 1 -73.58134 -92.78595
2 2000 2 -57.81334 -152.36983
3 2000 3 122.68758 153.55243
4 2000 4 450.24980 285.56374
5 2000 5 678.37867 384.42888
6 2000 6 792.68696 530.28694
7 2000 7 908.58795 452.31222
8 2000 8 710.69928 719.35225
9 2000 9 725.06079 914.93687
10 2000 10 770.60304 863.39337
# ... with 14 more rows
Note: summarise_each
is deprecated in favor of summarise_all
, summarise_at
and summarise_if
.
As mentioned in my comment above, you can also use the recast
function from the reshape2
-package:
library(reshape2)
recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
which will give you the same result.
Solution 4
Using the data.table
package, which is fast (useful for larger datasets)
https://github.com/Rdatatable/data.table/wiki
library(data.table)
df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]
setDF(df2) # convert back to dataframe
Using the plyr package
require(plyr)
df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))
Using summarize() from the Hmisc package (column headings are messy in my example though)
# need to detach plyr because plyr and Hmisc both have a summarize()
detach(package:plyr)
require(Hmisc)
df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))
Solution 5
Interestingly, base R aggregate
's data.frame
method is not showcased here, above the formula interface is used, so for completeness:
aggregate(
x = df1[c("x1", "x2")],
by = df1[c("year", "month")],
FUN = sum, na.rm = TRUE
)
More generic use of aggregate's data.frame method:
Since we are providing a
-
data.frame
asx
and - a
list
(data.frame
is also alist
) asby
, this is very useful if we need to use it in a dynamic manner, e.g. using other columns to be aggregated and to aggregate by is very simple - also with custom-made aggregation functions
For example like so:
colsToAggregate <- c("x1")
aggregateBy <- c("year", "month")
dummyaggfun <- function(v, na.rm = TRUE) {
c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))
}
aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)
Related videos on Youtube
MikeTP
Energy commodity trader learning R to better analyze big high frequency time series data.
Updated on July 21, 2022Comments
-
MikeTP almost 2 years
From a data frame, is there a easy way to aggregate (
sum
,mean
,max
et c) multiple variables simultaneously?Below are some sample data:
library(lubridate) days = 365*2 date = seq(as.Date("2000-01-01"), length = days, by = "day") year = year(date) month = month(date) x1 = cumsum(rnorm(days, 0.05)) x2 = cumsum(rnorm(days, 0.05)) df1 = data.frame(date, year, month, x1, x2)
I would like to simultaneously aggregate the
x1
andx2
variables from thedf2
data frame by year and month. The following code aggregates thex1
variable, but is it also possible to simultaneously aggregate thex2
variable?### aggregate variables by year month df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE) head(df2)
-
pdb over 8 yearsIs it possible for the cbind to use dynamic variables?
-
pdb over 8 yearsIt's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
-
Clock Slave about 8 yearswhat if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
-
A5C1D2H2I1M1N2O1R2T1 about 8 years@ClockSlave, then you need to just use
.
on the LHS.aggregate(. ~ year + month, df1, sum, na.rm = TRUE)
. In this example,sum
for "date" doesn't make sense though.... -
skan about 8 yearsWhat if I don't want two variables but two functions?. For example mean and sd.
-
Jaap almost 8 yearsThe
recast
function (also fromreshape2
) integrates themelt
anddcast
function in one go for tasks like this:recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
-
DatamineR almost 7 yearsIn the case of
NA
s this approach is really problematic. Settingna.rm = TRUE
does not affect anything and theNA
cases are ignored... -
lmo almost 7 years@andrie. The use of
.
in the formula interface mentioned recently in the comments is probably worth adding to the answer. -
theforestecologist about 6 yearsIs there a way to perform different functions (e.g.,
mean
,max
,min
,etc.) to each of the different variables incbind
? -
Bulat over 5 yearswhy not do this for data.table option:
dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month)
? -
Gregor Thomas almost 5 yearsThe question is about doing aggregations by group, but
describe
doesn't do anything by group... -
britt almost 5 years
describe.by(column, group = grouped_column)
will group the values -
Gregor Thomas almost 5 yearsWell, put that in the answer then! Don't hide it in a comment!
-
sjd almost 4 yearscan i use a vector of names in
cbind
.?cbind(x1,x2,x3)
works. Butx=c(x1,x2,x3); cbind(x)~
givesVariable lengths differ
error -
tjebo over 3 yearsthat is now on CRAN :) - however, there seem to be performance issues with using across - (which I can confirm on my data), here a related thread github.com/tidyverse/dplyr/issues/4953
-
Álvaro A. Gutiérrez-Vargas over 3 yearsFirst, this is a criminally underrated answer, thanks for it. Second, do you think there is possible to generate an
aggregate
function that computes themean
for a given set of variables andsum
for another set of variables? I am really curious about it, and it could save me a lot of time. Thx again.