Apply several summary functions on several variables by group in one call
Solution 1
You can do it all in one step and get proper labeling:
> aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) )
# id1 id2 val1.mn val1.n val2.mn val2.n
# 1 a x 1.5 2.0 6.5 2.0
# 2 b x 2.0 2.0 8.0 2.0
# 3 a y 3.5 2.0 7.0 2.0
# 4 b y 3.0 2.0 6.0 2.0
This creates a dataframe with two id columns and two matrix columns:
str( aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) )
'data.frame': 4 obs. of 4 variables:
$ id1 : Factor w/ 2 levels "a","b": 1 2 1 2
$ id2 : Factor w/ 2 levels "x","y": 1 1 2 2
$ val1: num [1:4, 1:2] 1.5 2 3.5 3 2 2 2 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "mn" "n"
$ val2: num [1:4, 1:2] 6.5 8 7 6 2 2 2 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "mn" "n"
As pointed out by @lord.garbage below, this can be converted to a dataframe with "simple" columns by using do.call(data.frame, ...)
str( do.call(data.frame, aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) )
)
'data.frame': 4 obs. of 6 variables:
$ id1 : Factor w/ 2 levels "a","b": 1 2 1 2
$ id2 : Factor w/ 2 levels "x","y": 1 1 2 2
$ val1.mn: num 1.5 2 3.5 3
$ val1.n : num 2 2 2 2
$ val2.mn: num 6.5 8 7 6
$ val2.n : num 2 2 2 2
This is the syntax for multiple variables on the LHS:
aggregate(cbind(val1, val2) ~ id1 + id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) )
Solution 2
Given this in the question :
I could use the plyr package, but my data set is quite large and plyr is very slow (almost unusable) when the size of the dataset grows.
Then in data.table
(1.9.4+
) you could try :
> DT
id1 id2 val1 val2
1: a x 1 9
2: a x 2 4
3: a y 3 5
4: a y 4 9
5: b x 1 7
6: b y 4 4
7: b x 3 9
8: b y 2 8
> DT[ , .(mean(val1), mean(val2), .N), by = .(id1, id2)] # simplest
id1 id2 V1 V2 N
1: a x 1.5 6.5 2
2: a y 3.5 7.0 2
3: b x 2.0 8.0 2
4: b y 3.0 6.0 2
> DT[ , .(val1.m = mean(val1), val2.m = mean(val2), count = .N), by = .(id1, id2)] # named
id1 id2 val1.m val2.m count
1: a x 1.5 6.5 2
2: a y 3.5 7.0 2
3: b x 2.0 8.0 2
4: b y 3.0 6.0 2
> DT[ , c(lapply(.SD, mean), count = .N), by = .(id1, id2)] # mean over all columns
id1 id2 val1 val2 count
1: a x 1.5 6.5 2
2: a y 3.5 7.0 2
3: b x 2.0 8.0 2
4: b y 3.0 6.0 2
For timings comparing aggregate
(used in question and all 3 other answers) to data.table
see
this benchmark (the agg
and agg.x
cases).
Solution 3
Using the dplyr
package you could achieve this by using summarise_all
. With this summarise-function you can apply other functions (in this case mean
and n()
) to each of the non-grouping columns:
x %>%
group_by(id1, id2) %>%
summarise_all(funs(mean, n()))
which gives:
id1 id2 val1_mean val2_mean val1_n val2_n
1 a x 1.5 6.5 2 2
2 a y 3.5 7.0 2 2
3 b x 2.0 8.0 2 2
4 b y 3.0 6.0 2 2
If you don't want to apply the function(s) to all non-grouping columns, you specify the columns to which they should be applied or by excluding the non-wanted with a minus using the summarise_at()
function:
# inclusion
x %>%
group_by(id1, id2) %>%
summarise_at(vars(val1, val2), funs(mean, n()))
# exclusion
x %>%
group_by(id1, id2) %>%
summarise_at(vars(-val2), funs(mean, n()))
Solution 4
You could add a count
column, aggregate with sum
, then scale back to get the mean
:
x$count <- 1
agg <- aggregate(. ~ id1 + id2, data = x,FUN = sum)
agg
# id1 id2 val1 val2 count
# 1 a x 3 13 2
# 2 b x 4 16 2
# 3 a y 7 14 2
# 4 b y 6 12 2
agg[c("val1", "val2")] <- agg[c("val1", "val2")] / agg$count
agg
# id1 id2 val1 val2 count
# 1 a x 1.5 6.5 2
# 2 b x 2.0 8.0 2
# 3 a y 3.5 7.0 2
# 4 b y 3.0 6.0 2
It has the advantage of preserving your column names and creating a single count
column.
Solution 5
Perhaps you want to merge?
x.mean <- aggregate(. ~ id1+id2, p, mean)
x.len <- aggregate(. ~ id1+id2, p, length)
merge(x.mean, x.len, by = c("id1", "id2"))
id1 id2 val1.x val2.x val1.y val2.y
1 a x 1.5 6.5 2 2
2 a y 3.5 7.0 2 2
3 b x 2.0 8.0 2 2
4 b y 3.0 6.0 2 2
broccoli
Statistician,Machine Learning,Data Mining,R,Perl. I maintain a blog listing out probability puzzles with the intention of spreading a simplistic understanding of the science of probability.
Updated on December 31, 2020Comments
-
broccoli over 3 years
I have the following data frame
x <- read.table(text = " id1 id2 val1 val2 1 a x 1 9 2 a x 2 4 3 a y 3 5 4 a y 4 9 5 b x 1 7 6 b y 4 4 7 b x 3 9 8 b y 2 8", header = TRUE)
I want to calculate the mean of val1 and val2 grouped by id1 and id2, and simultaneously count the number of rows for each id1-id2 combination. I can perform each calculation separately:
# calculate mean aggregate(. ~ id1 + id2, data = x, FUN = mean) # count rows aggregate(. ~ id1 + id2, data = x, FUN = length)
In order to do both calculations in one call, I tried
do.call("rbind", aggregate(. ~ id1 + id2, data = x, FUN = function(x) data.frame(m = mean(x), n = length(x))))
However, I get a garbled output along with a warning:
# m n # id1 1 2 # id2 1 1 # 1.5 2 # 2 2 # 3.5 2 # 3 2 # 6.5 2 # 8 2 # 7 2 # 6 2 # Warning message: # In rbind(id1 = c(1L, 2L, 1L, 2L), id2 = c(1L, 1L, 2L, 2L), val1 = list( : # number of columns of result is not a multiple of vector length (arg 1)
I could use the plyr package, but my data set is quite large and plyr is very slow (almost unusable) when the size of the dataset grows.
How can I use
aggregate
or other functions to perform several calculations in one call? -
broccoli almost 12 yearsThanks much. As a side note, how do I get aggregate to sum up just one column. If I have several numerical columns, I don't want it summing columns I don't want it to. I could of course throw away the columns after the aggregation is done, but the CPU cycles would already be spent then.
-
IRTFM almost 12 yearsYou only give it the factors to be grouped on and the columns to be aggregated. Possibly use negative column indexing in data or put the columns you want on the LHS of the formula. (See edit.)
-
JHowIX over 9 yearsI encountered the bug that user2659402 mentioned in his update while using RStudio 0.98.1014 on a windows 7 machine. If you output the data frame to the console as shown it appears normal, however if you save it into d, and then try to access d$val1.mn, it returns NULL. d also appears malformed if you run view(d). Using the code in the update fixed it.
-
IRTFM over 9 yearsThe reason you are having difficulty is that the "vals" are being returned as matrices with two columns each, rather than as ordinary columns. Try
d$val1[ , ""mn"]
and do look at the structure withstr
. -
lord.garbage over 9 yearsYou can bind the columns which contain matrices back into the data frame:
agg <- aggregate(cbind(val1, val2) ~ id1 + id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x)))
by usingagg_df <- do.call(data.frame, agg)
. See also here. -
BLT over 7 yearsAlso see the accepted answer here:stackoverflow.com/questions/32653428/….
-
IRTFM over 7 yearsNot sure what point this comment is delivering. Are you saying that question should be marked as a duplicate of this question?
-
BLT over 7 yearsNow that I reread, it's a duplicate of the UPDATE. I too had no success with my data until I used
object <- as.data.frame(as.list(aggregate(data.frame)))
. The explanation for why that is can be found in the link I posted. Now that I tried it with the data linked in this question, your answer works fine. Not sure what is unique about the datasets used by me and others experiencing this issue. -
IRTFM over 7 yearsThe UPDATE was wrong when it was written, and it's been wrong every time I have tested it in the years that followed. Perhaps the persons with this problem should spend some time doing better investigation and documentation with
str
anddput
. I used (implicitly)aggregate.formula
so it's possible that some issues with environments and search paths could have arisen, ... but NEED a MCVE. -
Tamas Ferenci about 6 yearsThis solution works perfectly if your functions return a single value. Things get complicated if they return a list, and you also have to extract parts of it (so that they're nicely arranged). Here is a solution for this case:
do.call( data.frame, aggregate( . ~ id1 + id2, data = x, FUN = function( x ) do.call( c, lapply( c( z.test, t.test ), function( fun ) with( fun( x, stdev = 1 ), c( p.value = p.value, cilwr = conf.int[1] ) ) ) ) ) )
. (z.test
is from the libraryTeachingDemos
.) -
Tamas Ferenci about 6 yearsI used
with
to show what is the comfortable way if we also have to extract part of a returned value. -
Tamas Ferenci about 6 yearsAnother sidenote: if you want the output formed so that results for different variables appear below each other (i.e. in long format) the best is perhaps to melt the data frame beforehand:
do.call( data.frame, aggregate( value ~ variable + id1 + id2, data = melt( x, id.vars = c( "id1", "id2" ) ), FUN = function( x ) do.call( c, lapply( c( z.test, t.test ), function( fun ) with( fun( x, stdev = 1 ), c( p.value = p.value, cilwr = conf.int[1] ) ) ) ) ) )
. -
Rafael over 5 yearsI think there is no an alternative in the answers and comments to apply a function to multiple variables, e.g. aggregate(cbind(numb, pot_id ) ~ year + project, data = dat_tot, FUN = function(x, y) sum(x)/length(y)), where numb = x and pot_id = y. Like as I wrote it the function doesn't work. Is something like that possible?