mean and standard deviation by group for multiple variables
Solution 1
The function you will likely want to apply to your dataframe is aggregate()
with either mean
or sd
as the function parameter.
Solution 2
assuming myDF
is your original dataset:
library(data.table)
myDT <- data.table(myDF)
# Which variables to calculate All columns but the first five? :
variables <- tail( names(myDT), -5)
myDT[, lapply(.SD, function(x) list(mean(x), sd(x))), .SDcols=variables, by=list(trt, til)]
## OR Separately, if you prefer shorter `lapply` statements
myDT[, lapply(.SD, mean), .SDcols=variables, by=list(trt, til)]
myDT[, lapply(.SD, sd), .SDcols=variables, by=list(trt, til)]
--
> myDT[, lapply(.SD, mean), .SDcols=variables, by=list(trt, til)]
# trt til silt clay ibd1_6 ibd9_14 ibd_ave
# 1: CTK CT 14.66667 13.00000 1.483000 1.596000 1.539667
# 2: CTR CT 14.00000 13.33333 1.627000 1.601333 1.614333
# 3: ZTK ZT 16.33333 12.33333 1.480333 1.593000 1.536667
# 4: ZTR ZT 16.66667 17.00000 1.637000 1.690667 1.663667
> myDT[, lapply(.SD, sd), .SDcols=variables, by=list(trt, til)]
# trt til silt clay ibd1_6 ibd9_14 ibd_ave
# 1: CTK CT 0.5773503 1.7320508 0.13908271 0.004358899 0.07112196
# 2: CTR CT 1.0000000 0.5773503 0.07562407 0.039576929 0.02514624
# 3: ZTK ZT 0.5773503 0.5773503 0.17015973 0.041797129 0.07800214
# 4: ZTR ZT 0.5773503 1.0000000 0.09763196 0.030892286 0.04816984
Solution 3
aggregate(g[, c("sand", "silt", "clay")], g$trt, function(x) c(mean=mean(x), sd=sd(x) ) )
Using an anonymous function with aggregate.data.frame
allows one to get both values with one call. You only want to pass in the columns to be aggregated.If you had a long list of columns and only wanted to exclude let's say the first 4 from calculations, it could be written as:
aggregate(g[, names(g)[-(1:4)], g$trt, function(x) c(mean=mean(x), sd=sd(x) ) )
user2296772
Updated on June 13, 2022Comments
-
user2296772 almost 2 years
I am sure this question has been answered before, but I would like to caclulate mean and sd by treatment for multiple variables (100s) all at once and cannot figure out how to do it aside from using a long winded ddply code.
This is a portion of my dataframe (g):
trt blk til res sand silt clay ibd1_6 ibd9_14 ibd_ave 1 CTK 1 CT K 74 15 11 1.323 1.593 1.458 2 CTK 2 CT K 71 15 14 1.575 1.601 1.588 3 CTK 3 CT K 72 14 14 1.551 1.594 1.573 4 CTR 1 CT R 72 15 13 1.560 1.647 1.604 5 CTR 2 CT R 73 14 13 1.612 1.580 1.596 6 CTR 3 CT R 73 13 14 1.709 1.577 1.643 7 ZTK 1 ZT K 72 16 12 1.526 1.546 1.536 8 ZTK 2 ZT K 71 16 13 1.292 1.626 1.459 9 ZTK 3 ZT K 71 17 12 1.623 1.607 1.615 10 ZTR 1 ZT R 66 16 18 1.719 1.709 1.714 11 ZTR 2 ZT R 67 17 16 1.529 1.708 1.618 12 ZTR 3 ZT R 66 17 17 1.663 1.655 1.659
I would like to have a function that does what ddply does, i.e ddply(g, trt, meanSand=mean(sand), sdSand=sd(sand), meanSilt=mean(silt). . . .) without having to write it all out. Any ideas? Thank you for your patience!
-
user2296772 about 11 yearsThanks all! I'm new to this site and glad to see some patience from more experienced users.