sum multiple columns by group with tapply
Solution 1
tapply
works on a vector, for a data.frame you can use by
(which is a wrapper for tapply
, take a look at the code):
> by(df.1[,c(3:5)], df.1$state, FUN=colSums)
df.1$state: AA
apples cherries plums
111 222 333
-------------------------------------------------------------------------------------
df.1$state: BB
apples cherries plums
-111 -222 -333
Solution 2
You're looking for by
. It uses the INDEX
in the way that you assumed tapply
would, by row.
by(df.1, df.1$state, function(x) colSums(x[,3:5]))
The problem with your use of tapply
is that you were indexing the data.frame
by column. (Because data.frame
is really just a list
of columns.) So, tapply
complained that your index didn't match the length of your data.frame
which is 5.
Related videos on Youtube
Mark Miller
Mostly use R, JAGS, MARK, and some C My GitHub page: https://github.com/IcterusGalbula
Updated on July 09, 2022Comments
-
Mark Miller almost 2 years
I wanted to sum individual columns by group and my first thought was to use
tapply
. However, I cannot gettapply
to work. Cantapply
be used to sum multiple columns? If not, why not?I have searched the internet extensively and found numerous similar questions posted as far back as 2008. However, none of those questions have been answered directly. Instead, the responses invariably suggest using a different function.
Below is an example data set for which I wish to sum apples by state, cherries by state and plums by state. Below that I have compiled numerous alternatives to
tapply
that do work.At the bottom I show a simple modification to the
tapply
source code that allowstapply
to perform the desired operation.Nevertheless, perhaps I am overlooking a simple way to perform the desired operation with
tapply
. I am not looking for alternative functions, although additional alternatives are welcome.Given the simplicity of my modification to the
tapply
source code I wonder why it, or something similar, has not already been implemented.Thank you for any advice. If my question is a duplicate I will be happy to post my question as an answer to that other question.
Here is the example data set:
df.1 <- read.table(text = ' state county apples cherries plums AA 1 1 2 3 AA 2 10 20 30 AA 3 100 200 300 BB 7 -1 -2 -3 BB 8 -10 -20 -30 BB 9 -100 -200 -300 ', header = TRUE, stringsAsFactors = FALSE)
This does not work:
tapply(df.1, df.1$state, function(x) {colSums(x[,3:5])})
The help pages says:
tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE) X an atomic object, typically a vector.
I was confused by the phrase
typically a vector
which made me wonder whether a data frame could be used. I have never been clear on whatatomic object
means.Here are several alternatives to
tapply
that do work. The first alternative is a work-around that combinestapply
withapply
.apply(df.1[,c(3:5)], 2, function(x) tapply(x, df.1$state, sum)) # apples cherries plums # AA 111 222 333 # BB -111 -222 -333 with(df.1, aggregate(df.1[,3:5], data.frame(state), sum)) # state apples cherries plums # 1 AA 111 222 333 # 2 BB -111 -222 -333 t(sapply(split(df.1[,3:5], df.1$state), colSums)) # apples cherries plums # AA 111 222 333 # BB -111 -222 -333 t(sapply(split(df.1[,3:5], df.1$state), function(x) apply(x, 2, sum))) # apples cherries plums # AA 111 222 333 # BB -111 -222 -333 aggregate(df.1[,3:5], by=list(df.1$state), sum) # Group.1 apples cherries plums # 1 AA 111 222 333 # 2 BB -111 -222 -333 by(df.1[,3:5], df.1$state, colSums) # df.1$state: AA # apples cherries plums # 111 222 333 # ------------------------------------------------------------ # df.1$state: BB # apples cherries plums # -111 -222 -333 with(df.1, aggregate(x = list(apples = apples, cherries = cherries, plums = plums), by = list(state = state), FUN = function(x) sum(x))) # state apples cherries plums # 1 AA 111 222 333 # 2 BB -111 -222 -333 lapply(split(df.1, df.1$state), function(x) {colSums(x[,3:5])} ) # $AA # apples cherries plums # 111 222 333 # # $BB # apples cherries plums # -111 -222 -333
Here is the source code for
tapply
except that I changed the line:nx <- length(X)
to:
nx <- ifelse(is.vector(X), length(X), dim(X)[1])
This modified version of
tapply
performs the desired operation:my.tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE) { FUN <- if (!is.null(FUN)) match.fun(FUN) if (!is.list(INDEX)) INDEX <- list(INDEX) nI <- length(INDEX) if (!nI) stop("'INDEX' is of length zero") namelist <- vector("list", nI) names(namelist) <- names(INDEX) extent <- integer(nI) nx <- ifelse(is.vector(X), length(X), dim(X)[1]) # replaces nx <- length(X) one <- 1L group <- rep.int(one, nx) #- to contain the splitting vector ngroup <- one for (i in seq_along(INDEX)) { index <- as.factor(INDEX[[i]]) if (length(index) != nx) stop("arguments must have same length") namelist[[i]] <- levels(index)#- all of them, yes ! extent[i] <- nlevels(index) group <- group + ngroup * (as.integer(index) - one) ngroup <- ngroup * nlevels(index) } if (is.null(FUN)) return(group) ans <- lapply(X = split(X, group), FUN = FUN, ...) index <- as.integer(names(ans)) if (simplify && all(unlist(lapply(ans, length)) == 1L)) { ansmat <- array(dim = extent, dimnames = namelist) ans <- unlist(ans, recursive = FALSE) } else { ansmat <- array(vector("list", prod(extent)), dim = extent, dimnames = namelist) } if(length(index)) { names(ans) <- NULL ansmat[index] <- ans } ansmat } my.tapply(df.1$apples, df.1$state, function(x) {sum(x)}) # AA BB # 111 -111 my.tapply(df.1[,3:4] , df.1$state, function(x) {colSums(x)}) # $AA # apples cherries # 111 222 # # $BB # apples cherries # -111 -222