mean and standard deviation by group for multiple variables

15,085

Solution 1

The function you will likely want to apply to your dataframe is aggregate() with either mean or sd as the function parameter.

Solution 2

assuming myDF is your original dataset:

library(data.table)
myDT <- data.table(myDF)

# Which variables to calculate  All columns but the first five? : 
variables <- tail( names(myDT), -5)

myDT[, lapply(.SD, function(x) list(mean(x), sd(x))), .SDcols=variables, by=list(trt, til)]


## OR Separately, if you prefer shorter `lapply` statements
myDT[, lapply(.SD, mean), .SDcols=variables, by=list(trt, til)]
myDT[, lapply(.SD, sd),   .SDcols=variables, by=list(trt, til)]

--

> myDT[, lapply(.SD, mean), .SDcols=variables, by=list(trt, til)]
#    trt til     silt     clay   ibd1_6  ibd9_14  ibd_ave
# 1: CTK  CT 14.66667 13.00000 1.483000 1.596000 1.539667
# 2: CTR  CT 14.00000 13.33333 1.627000 1.601333 1.614333
# 3: ZTK  ZT 16.33333 12.33333 1.480333 1.593000 1.536667
# 4: ZTR  ZT 16.66667 17.00000 1.637000 1.690667 1.663667

> myDT[, lapply(.SD, sd), .SDcols=variables, by=list(trt, til)]
#    trt til      silt      clay     ibd1_6     ibd9_14    ibd_ave
# 1: CTK  CT 0.5773503 1.7320508 0.13908271 0.004358899 0.07112196
# 2: CTR  CT 1.0000000 0.5773503 0.07562407 0.039576929 0.02514624
# 3: ZTK  ZT 0.5773503 0.5773503 0.17015973 0.041797129 0.07800214
# 4: ZTR  ZT 0.5773503 1.0000000 0.09763196 0.030892286 0.04816984

Solution 3

aggregate(g[, c("sand", "silt", "clay")],  g$trt, function(x) c(mean=mean(x), sd=sd(x) ) )

Using an anonymous function with aggregate.data.frame allows one to get both values with one call. You only want to pass in the columns to be aggregated.If you had a long list of columns and only wanted to exclude let's say the first 4 from calculations, it could be written as:

aggregate(g[, names(g)[-(1:4)],  g$trt, function(x) c(mean=mean(x), sd=sd(x) ) )
Share:
15,085
user2296772
Author by

user2296772

Updated on June 13, 2022

Comments

  • user2296772
    user2296772 almost 2 years

    I am sure this question has been answered before, but I would like to caclulate mean and sd by treatment for multiple variables (100s) all at once and cannot figure out how to do it aside from using a long winded ddply code.

    This is a portion of my dataframe (g):

       trt blk til res sand silt clay ibd1_6 ibd9_14 ibd_ave
    1  CTK   1  CT   K   74   15   11  1.323   1.593   1.458
    2  CTK   2  CT   K   71   15   14  1.575   1.601   1.588
    3  CTK   3  CT   K   72   14   14  1.551   1.594   1.573
    4  CTR   1  CT   R   72   15   13  1.560   1.647   1.604
    5  CTR   2  CT   R   73   14   13  1.612   1.580   1.596
    6  CTR   3  CT   R   73   13   14  1.709   1.577   1.643
    7  ZTK   1  ZT   K   72   16   12  1.526   1.546   1.536
    8  ZTK   2  ZT   K   71   16   13  1.292   1.626   1.459
    9  ZTK   3  ZT   K   71   17   12  1.623   1.607   1.615
    10 ZTR   1  ZT   R   66   16   18  1.719   1.709   1.714
    11 ZTR   2  ZT   R   67   17   16  1.529   1.708   1.618
    12 ZTR   3  ZT   R   66   17   17  1.663   1.655   1.659 
    

    I would like to have a function that does what ddply does, i.e ddply(g, trt, meanSand=mean(sand), sdSand=sd(sand), meanSilt=mean(silt). . . .) without having to write it all out. Any ideas? Thank you for your patience!

  • user2296772
    user2296772 about 11 years
    Thanks all! I'm new to this site and glad to see some patience from more experienced users.