Apply several summary functions on several variables by group in one call

107,674

Solution 1

You can do it all in one step and get proper labeling:

> aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) )
#   id1 id2 val1.mn val1.n val2.mn val2.n
# 1   a   x     1.5    2.0     6.5    2.0
# 2   b   x     2.0    2.0     8.0    2.0
# 3   a   y     3.5    2.0     7.0    2.0
# 4   b   y     3.0    2.0     6.0    2.0

This creates a data frame with two id columns and two matrix columns, each matrix holding an mn and an n column:

str( aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) )
'data.frame':   4 obs. of  4 variables:
 $ id1 : Factor w/ 2 levels "a","b": 1 2 1 2
 $ id2 : Factor w/ 2 levels "x","y": 1 1 2 2
 $ val1: num [1:4, 1:2] 1.5 2 3.5 3 2 2 2 2
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr  "mn" "n"
 $ val2: num [1:4, 1:2] 6.5 8 7 6 2 2 2 2
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr  "mn" "n"

As pointed out by @lord.garbage below, this can be converted to a data frame with "simple" columns by using do.call(data.frame, ...):

str( do.call(data.frame, aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) ) )
'data.frame':   4 obs. of  6 variables:
 $ id1    : Factor w/ 2 levels "a","b": 1 2 1 2
 $ id2    : Factor w/ 2 levels "x","y": 1 1 2 2
 $ val1.mn: num  1.5 2 3.5 3
 $ val1.n : num  2 2 2 2
 $ val2.mn: num  6.5 8 7 6
 $ val2.n : num  2 2 2 2
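The difference between the two structures matters once you store the result and try to pull out a column. The following is a minimal sketch, assuming the data frame x from the question; the names agg and flat are only illustrative:

```r
# Sample data from the question
x <- read.table(text = "id1 id2 val1 val2
a x 1 9
a x 2 4
a y 3 5
a y 4 9
b x 1 7
b y 4 4
b x 3 9
b y 2 8", header = TRUE)

agg <- aggregate(. ~ id1 + id2, data = x,
                 FUN = function(v) c(mn = mean(v), n = length(v)))

# val1 is a two-column matrix stored inside the data frame,
# so the mean has to be taken from the matrix by column name
agg$val1[, "mn"]
is.null(agg$val1.mn)   # TRUE: there is no column called val1.mn yet

# After flattening, every statistic is an ordinary column
flat <- do.call(data.frame, agg)
flat$val1.mn
```

The printed output of agg looks the same as that of flat, which is probably why the matrix columns are easy to miss until str is used.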

This is the syntax for multiple variables on the LHS:

aggregate(cbind(val1, val2) ~ id1 + id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) )
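The same formula interface can restrict the aggregation to a subset of the columns. A short sketch, again assuming the question's data frame x (the helper f and the result names are hypothetical):

```r
# Sample data from the question
x <- read.table(text = "id1 id2 val1 val2
a x 1 9
a x 2 4
a y 3 5
a y 4 9
b x 1 7
b y 4 4
b x 3 9
b y 2 8", header = TRUE)

f <- function(v) c(mn = mean(v), n = length(v))

# Option 1: name only the wanted column on the LHS; val2 is never aggregated
only_val1 <- aggregate(val1 ~ id1 + id2, data = x, FUN = f)

# Option 2: keep the dot but drop the unwanted column from the data argument
without_val2 <- aggregate(. ~ id1 + id2, data = x[, names(x) != "val2"], FUN = f)

names(only_val1)
names(without_val2)
```

Both calls should return the id columns plus a single val1 matrix column, so no work is spent summarising val2.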

Solution 2

Given this in the question:

I could use the plyr package, but my data set is quite large and plyr is very slow (almost unusable) when the size of the dataset grows.

Then in data.table (1.9.4+) you could try:

> DT
   id1 id2 val1 val2
1:   a   x    1    9
2:   a   x    2    4
3:   a   y    3    5
4:   a   y    4    9
5:   b   x    1    7
6:   b   y    4    4
7:   b   x    3    9
8:   b   y    2    8

> DT[ , .(mean(val1), mean(val2), .N), by = .(id1, id2)]   # simplest
   id1 id2  V1  V2 N
1:   a   x 1.5 6.5 2
2:   a   y 3.5 7.0 2
3:   b   x 2.0 8.0 2
4:   b   y 3.0 6.0 2

> DT[ , .(val1.m = mean(val1), val2.m = mean(val2), count = .N), by = .(id1, id2)]  # named
   id1 id2 val1.m val2.m count
1:   a   x    1.5    6.5     2
2:   a   y    3.5    7.0     2
3:   b   x    2.0    8.0     2
4:   b   y    3.0    6.0     2

> DT[ , c(lapply(.SD, mean), count = .N), by = .(id1, id2)]   # mean over all columns
   id1 id2 val1 val2 count
1:   a   x  1.5  6.5     2
2:   a   y  3.5  7.0     2
3:   b   x  2.0  8.0     2
4:   b   y  3.0  6.0     2
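If you want several statistics per column with labelled output, similar to the aggregate result in Solution 1, one possible sketch is to return a named list per column and flatten it one level. This assumes the data.table package is installed; DT is rebuilt here so the block runs on its own, and res is an illustrative name:

```r
library(data.table)

# Same data as in the question
DT <- data.table(id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
                 id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
                 val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
                 val2 = c(9, 4, 5, 9, 7, 4, 9, 8))

# For each column in .SDcols return list(mn, n); unlist(recursive = FALSE)
# flattens the nested list so the names become val1.mn, val1.n, ...
res <- DT[, unlist(lapply(.SD, function(v) list(mn = mean(v), n = length(v))),
                   recursive = FALSE),
          by = .(id1, id2), .SDcols = c("val1", "val2")]
res
```

.SDcols limits the work to the listed columns, which may help on wide tables where only a few variables need summarising.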

For timings comparing aggregate (used in the question and all 3 other answers) to data.table, see this benchmark (the agg and agg.x cases).

Solution 3

With the dplyr package you can achieve this with summarise_all, which applies the given functions (here mean and n()) to each of the non-grouping columns:

x %>%
  group_by(id1, id2) %>%
  summarise_all(funs(mean, n()))

which gives:

     id1    id2 val1_mean val2_mean val1_n val2_n
1      a      x       1.5       6.5      2      2
2      a      y       3.5       7.0      2      2
3      b      x       2.0       8.0      2      2
4      b      y       3.0       6.0      2      2

If you don't want to apply the function(s) to all non-grouping columns, use summarise_at() and either name the columns to include or exclude the unwanted ones with a minus:

# inclusion
x %>%
  group_by(id1, id2) %>%
  summarise_at(vars(val1, val2), funs(mean, n()))

# exclusion
x %>%
  group_by(id1, id2) %>%
  summarise_at(vars(-val2), funs(mean, n()))
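Note that funs() has been deprecated since dplyr 0.8.0, and summarise_all()/summarise_at() were superseded by across() in dplyr 1.0.0. A rough equivalent for current versions, assuming dplyr >= 1.0.0 is installed and using the question's data frame x (res is an illustrative name), might look like this:

```r
library(dplyr)

# Sample data from the question
x <- read.table(text = "id1 id2 val1 val2
a x 1 9
a x 2 4
a y 3 5
a y 4 9
b x 1 7
b y 4 4
b x 3 9
b y 2 8", header = TRUE)

res <- x %>%
  group_by(id1, id2) %>%
  summarise(across(c(val1, val2), list(mean = mean)),  # val1_mean, val2_mean
            n = n(),                                   # one count per group
            .groups = "drop")
res
```

Computing n() once outside across() gives a single count column instead of one count per variable.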

Solution 4

You could add a count column, aggregate with sum, then scale back to get the mean:

x$count <- 1
agg <- aggregate(. ~ id1 + id2, data = x, FUN = sum)
agg
#   id1 id2 val1 val2 count
# 1   a   x    3   13     2
# 2   b   x    4   16     2
# 3   a   y    7   14     2
# 4   b   y    6   12     2

agg[c("val1", "val2")] <- agg[c("val1", "val2")] / agg$count
agg
#   id1 id2 val1 val2 count
# 1   a   x  1.5  6.5     2
# 2   b   x  2.0  8.0     2
# 3   a   y  3.5  7.0     2
# 4   b   y  3.0  6.0     2

It has the advantage of preserving your column names and creating a single count column.

Solution 5

Perhaps you want to merge?

x.mean <- aggregate(. ~ id1+id2, x, mean)
x.len  <- aggregate(. ~ id1+id2, x, length)

merge(x.mean, x.len, by = c("id1", "id2"))

  id1 id2 val1.x val2.x val1.y val2.y
1   a   x    1.5    6.5      2      2
2   a   y    3.5    7.0      2      2
3   b   x    2.0    8.0      2      2
4   b   y    3.0    6.0      2      2
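The default .x and .y suffixes do not say which statistic a column holds. A small sketch using merge's suffixes argument, assuming the question's data frame x (merged is an illustrative name):

```r
# Sample data from the question
x <- read.table(text = "id1 id2 val1 val2
a x 1 9
a x 2 4
a y 3 5
a y 4 9
b x 1 7
b y 4 4
b x 3 9
b y 2 8", header = TRUE)

x.mean <- aggregate(. ~ id1 + id2, data = x, FUN = mean)
x.len  <- aggregate(. ~ id1 + id2, data = x, FUN = length)

# suffixes replaces the default ".x" / ".y" labels on the clashing column names
merged <- merge(x.mean, x.len, by = c("id1", "id2"),
                suffixes = c(".mean", ".n"))
merged
```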
Author by

broccoli

Statistician,Machine Learning,Data Mining,R,Perl. I maintain a blog listing out probability puzzles with the intention of spreading a simplistic understanding of the science of probability.

Updated on December 31, 2020

Comments

  • broccoli
    broccoli over 3 years

    I have the following data frame

    x <- read.table(text = "  id1 id2 val1 val2
    1   a   x    1    9
    2   a   x    2    4
    3   a   y    3    5
    4   a   y    4    9
    5   b   x    1    7
    6   b   y    4    4
    7   b   x    3    9
    8   b   y    2    8", header = TRUE)
    

    I want to calculate the mean of val1 and val2 grouped by id1 and id2, and simultaneously count the number of rows for each id1-id2 combination. I can perform each calculation separately:

    # calculate mean
    aggregate(. ~ id1 + id2, data = x, FUN = mean)
    
    # count rows
    aggregate(. ~ id1 + id2, data = x, FUN = length)
    

    In order to do both calculations in one call, I tried

    do.call("rbind", aggregate(. ~ id1 + id2, data = x, FUN = function(x) data.frame(m = mean(x), n = length(x))))
    

    However, I get a garbled output along with a warning:

    #     m   n
    # id1 1   2
    # id2 1   1
    #     1.5 2
    #     2   2
    #     3.5 2
    #     3   2
    #     6.5 2
    #     8   2
    #     7   2
    #     6   2
    # Warning message:
    #   In rbind(id1 = c(1L, 2L, 1L, 2L), id2 = c(1L, 1L, 2L, 2L), val1 = list( :
    #   number of columns of result is not a multiple of vector length (arg 1)
    

    I could use the plyr package, but my data set is quite large and plyr is very slow (almost unusable) when the size of the dataset grows.

    How can I use aggregate or other functions to perform several calculations in one call?

  • broccoli
    broccoli almost 12 years
    Thanks much. As a side note, how do I get aggregate to sum up just one column? If I have several numerical columns, I don't want it summing the ones I have no use for. I could of course throw away the columns after the aggregation is done, but the CPU cycles would already be spent then.
  • IRTFM
    IRTFM almost 12 years
    You only give it the factors to be grouped on and the columns to be aggregated. Possibly use negative column indexing in data or put the columns you want on the LHS of the formula. (See edit.)
  • JHowIX
    JHowIX over 9 years
    I encountered the bug that user2659402 mentioned in his update while using RStudio 0.98.1014 on a Windows 7 machine. If you output the data frame to the console as shown it appears normal, however if you save it into d, and then try to access d$val1.mn, it returns NULL. d also appears malformed if you run View(d). Using the code in the update fixed it.
  • IRTFM
    IRTFM over 9 years
    The reason you are having difficulty is that the "vals" are being returned as matrices with two columns each, rather than as ordinary columns. Try d$val1[ , "mn"] and do look at the structure with str.
  • lord.garbage
    lord.garbage over 9 years
    You can bind the columns which contain matrices back into the data frame: agg <- aggregate(cbind(val1, val2) ~ id1 + id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x))) by using agg_df <- do.call(data.frame, agg). See also here.
  • BLT
    BLT over 7 years
    Also see the accepted answer here: stackoverflow.com/questions/32653428/….
  • IRTFM
    IRTFM over 7 years
    Not sure what point this comment is delivering. Are you saying that question should be marked as a duplicate of this question?
  • BLT
    BLT over 7 years
    Now that I reread, it's a duplicate of the UPDATE. I too had no success with my data until I used object <- as.data.frame(as.list(aggregate(data.frame))). The explanation for why that is can be found in the link I posted. Now that I tried it with the data linked in this question, your answer works fine. Not sure what is unique about the datasets used by me and others experiencing this issue.
  • IRTFM
    IRTFM over 7 years
    The UPDATE was wrong when it was written, and it's been wrong every time I have tested it in the years that followed. Perhaps the persons with this problem should spend some time doing better investigation and documentation with str and dput. I used (implicitly) aggregate.formula so it's possible that some issues with environments and search paths could have arisen, ... but NEED a MCVE.
  • Tamas Ferenci
    Tamas Ferenci about 6 years
    This solution works perfectly if your functions return a single value. Things get complicated if they return a list, and you also have to extract parts of it (so that they're nicely arranged). Here is a solution for this case: do.call( data.frame, aggregate( . ~ id1 + id2, data = x, FUN = function( x ) do.call( c, lapply( c( z.test, t.test ), function( fun ) with( fun( x, stdev = 1 ), c( p.value = p.value, cilwr = conf.int[1] ) ) ) ) ) ). (z.test is from the library TeachingDemos.)
  • Tamas Ferenci
    Tamas Ferenci about 6 years
    I used with to show what is the comfortable way if we also have to extract part of a returned value.
  • Tamas Ferenci
    Tamas Ferenci about 6 years
    Another sidenote: if you want the output formed so that results for different variables appear below each other (i.e. in long format) the best is perhaps to melt the data frame beforehand: do.call( data.frame, aggregate( value ~ variable + id1 + id2, data = melt( x, id.vars = c( "id1", "id2" ) ), FUN = function( x ) do.call( c, lapply( c( z.test, t.test ), function( fun ) with( fun( x, stdev = 1 ), c( p.value = p.value, cilwr = conf.int[1] ) ) ) ) ) ).
  • Rafael
    Rafael over 5 years
    I don't think the answers or comments offer a way to apply a function that takes several variables at once, e.g. aggregate(cbind(numb, pot_id ) ~ year + project, data = dat_tot, FUN = function(x, y) sum(x)/length(y)), where numb = x and pot_id = y. As written, the function doesn't work. Is something like that possible?
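On the last comment: aggregate calls FUN on one column at a time, so a two-argument function such as function(x, y) never receives the second column. One possible base R workaround is to split the rows by group and apply a function to each sub-data-frame. This is only a sketch on the question's data frame x, with val1 and val2 standing in for the commenter's numb and pot_id; pieces and ratio are illustrative names:

```r
# Sample data from the question
x <- read.table(text = "id1 id2 val1 val2
a x 1 9
a x 2 4
a y 3 5
a y 4 9
b x 1 7
b y 4 4
b x 3 9
b y 2 8", header = TRUE)

# One data frame per id1-id2 combination; drop = TRUE skips empty combinations
pieces <- split(x, list(x$id1, x$id2), drop = TRUE)

# Each piece carries all columns, so the function can combine several of them
ratio <- sapply(pieces, function(d) sum(d$val1) / length(d$val2))
ratio
```

The result is a named vector keyed by the group labels (for example "a.x"). The data.table and dplyr approaches above can also refer to several columns inside a single grouped expression.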