sum multiple columns by group with tapply

29,670

Solution 1

tapply works on a vector, for a data.frame you can use by (which is a wrapper for tapply, take a look at the code):

> by(df.1[,c(3:5)], df.1$state, FUN=colSums)
df.1$state: AA
  apples cherries    plums 
     111      222      333 
------------------------------------------------------------------------------------- 
df.1$state: BB
  apples cherries    plums 
    -111     -222     -333 

Solution 2

You're looking for by. It uses the INDEX in the way that you assumed tapply would, by row.

by(df.1, df.1$state, function(x) colSums(x[,3:5]))

The problem with your use of tapply is that you were indexing the data.frame by column. (Because data.frame is really just a list of columns.) So, tapply complained that your index didn't match the length of your data.frame which is 5.

Share:
29,670

Related videos on Youtube

Mark Miller
Author by

Mark Miller

Mostly use R, JAGS, MARK, and some C My GitHub page: https://github.com/IcterusGalbula

Updated on July 09, 2022

Comments

  • Mark Miller
    Mark Miller almost 2 years

    I wanted to sum individual columns by group and my first thought was to use tapply. However, I cannot get tapply to work. Can tapply be used to sum multiple columns? If not, why not?

    I have searched the internet extensively and found numerous similar questions posted as far back as 2008. However, none of those questions have been answered directly. Instead, the responses invariably suggest using a different function.

    Below is an example data set for which I wish to sum apples by state, cherries by state and plums by state. Below that I have compiled numerous alternatives to tapply that do work.

    At the bottom I show a simple modification to the tapply source code that allows tapply to perform the desired operation.

    Nevertheless, perhaps I am overlooking a simple way to perform the desired operation with tapply. I am not looking for alternative functions, although additional alternatives are welcome.

    Given the simplicity of my modification to the tapply source code I wonder why it, or something similar, has not already been implemented.

    Thank you for any advice. If my question is a duplicate I will be happy to post my question as an answer to that other question.

    Here is the example data set:

    df.1 <- read.table(text = '
    
        state   county   apples   cherries   plums
           AA        1        1          2       3
           AA        2       10         20      30
           AA        3      100        200     300
           BB        7       -1         -2      -3
           BB        8      -10        -20     -30
           BB        9     -100       -200    -300
    
    ', header = TRUE, stringsAsFactors = FALSE)
    

    This does not work:

    tapply(df.1, df.1$state, function(x) {colSums(x[,3:5])})
    

    The help pages says:

    tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)
    
    X       an atomic object, typically a vector.
    

    I was confused by the phrase typically a vector which made me wonder whether a data frame could be used. I have never been clear on what atomic object means.

    Here are several alternatives to tapply that do work. The first alternative is a work-around that combines tapply with apply.

    apply(df.1[,c(3:5)], 2, function(x) tapply(x, df.1$state, sum))
    
    #    apples cherries plums
    # AA    111      222   333
    # BB   -111     -222  -333
    
    with(df.1, aggregate(df.1[,3:5], data.frame(state), sum))
    
    #   state apples cherries plums
    # 1    AA    111      222   333
    # 2    BB   -111     -222  -333
    
    t(sapply(split(df.1[,3:5], df.1$state), colSums))
    
    #    apples cherries plums
    # AA    111      222   333
    # BB   -111     -222  -333
    
    t(sapply(split(df.1[,3:5], df.1$state), function(x) apply(x, 2, sum)))
    
    #    apples cherries plums
    # AA    111      222   333
    # BB   -111     -222  -333
    
    aggregate(df.1[,3:5], by=list(df.1$state), sum)
    
    #   Group.1 apples cherries plums
    # 1      AA    111      222   333
    # 2      BB   -111     -222  -333
    
    by(df.1[,3:5], df.1$state, colSums)
    
    # df.1$state: AA
    #   apples cherries    plums 
    #      111      222      333 
    # ------------------------------------------------------------ 
    # df.1$state: BB
    #   apples cherries    plums 
    #     -111     -222     -333
    
    with(df.1, 
         aggregate(x = list(apples   = apples, 
                            cherries = cherries,
                            plums    = plums), 
                   by = list(state   = state), 
                   FUN = function(x) sum(x)))
    
    #   state apples cherries plums
    # 1    AA    111      222   333
    # 2    BB   -111     -222  -333
    
    lapply(split(df.1, df.1$state), function(x) {colSums(x[,3:5])} )
    
    # $AA
    #   apples cherries    plums 
    #      111      222      333 
    #
    # $BB
    #   apples cherries    plums 
    #     -111     -222     -333
    

    Here is the source code for tapply except that I changed the line:

    nx <- length(X)
    

    to:

    nx <- ifelse(is.vector(X), length(X), dim(X)[1])
    

    This modified version of tapply performs the desired operation:

    my.tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
    {
        FUN <- if (!is.null(FUN)) match.fun(FUN)
        if (!is.list(INDEX)) INDEX <- list(INDEX)
        nI <- length(INDEX)
        if (!nI) stop("'INDEX' is of length zero")
        namelist <- vector("list", nI)
        names(namelist) <- names(INDEX)
        extent <- integer(nI)
        nx     <- ifelse(is.vector(X), length(X), dim(X)[1])  # replaces nx <- length(X)
        one <- 1L
        group <- rep.int(one, nx) #- to contain the splitting vector
        ngroup <- one
        for (i in seq_along(INDEX)) {
        index <- as.factor(INDEX[[i]])
        if (length(index) != nx)
            stop("arguments must have same length")
        namelist[[i]] <- levels(index)#- all of them, yes !
        extent[i] <- nlevels(index)
        group <- group + ngroup * (as.integer(index) - one)
        ngroup <- ngroup * nlevels(index)
        }
        if (is.null(FUN)) return(group)
        ans <- lapply(X = split(X, group), FUN = FUN, ...)
        index <- as.integer(names(ans))
        if (simplify && all(unlist(lapply(ans, length)) == 1L)) {
        ansmat <- array(dim = extent, dimnames = namelist)
        ans <- unlist(ans, recursive = FALSE)
        } else {
        ansmat <- array(vector("list", prod(extent)),
                dim = extent, dimnames = namelist)
        }
        if(length(index)) {
            names(ans) <- NULL
            ansmat[index] <- ans
        }
        ansmat
    }
    
    my.tapply(df.1$apples, df.1$state, function(x) {sum(x)})
    
    #  AA   BB 
    # 111 -111
    
    my.tapply(df.1[,3:4] , df.1$state, function(x) {colSums(x)})
    
    # $AA
    #   apples cherries 
    #      111      222 
    #
    # $BB
    #   apples cherries 
    #     -111     -222
    

Related