How to write a loop to run the t-test of a data frame?
Solution 1
Here's a simple solution, which doesn't require additional packages:
lapply(testData[-1], function(x) t.test(x ~ testData$Label))
Here testData[-1] refers to all columns of testData except the first one (which contains the labels). Negative indexing is used to exclude columns.
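To make the negative indexing concrete, here is a minimal sketch on a made-up data frame. The object `toy` and its values are hypothetical and are not the asker's testData.

```r
# Hypothetical toy data frame, only for illustrating negative indexing
toy <- data.frame(
  Label = c("Good", "Bad"),
  F1 = c(0.6, 0.3),
  F2 = c(0.7, 0.8)
)

names(toy)                    # "Label" "F1" "F2"
names(toy[-1])                # "F1" "F2": the first column is dropped
names(toy[-c(1, ncol(toy))])  # "F1": the first and last columns are dropped
```

Because a data frame is a list of columns, lapply over toy[-1] visits each remaining column in turn.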
Solution 2
You can use the formula interface to t.test and use lapply to iterate along the column names to build the formulae:
lapply(names(testData)[-1], function(x)
  t.test(as.formula(paste(x, "Label", sep = "~")),
         data = testData))
[[1]]
Welch Two Sample t-test
data: F1 by Label
t = -3.6391, df = 13.969, p-value = 0.002691
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.4519374 -0.1167204
sample estimates:
mean in group Bad mean in group Good
0.3776753 0.6620042
[[2]]
Welch Two Sample t-test
data: F2 by Label
t = 3.7358, df = 12.121, p-value = 0.002796
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.06997617 0.26529723
sample estimates:
mean in group Bad mean in group Good
0.8008989 0.6332622
...
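As a variation on the same idea, the formulae can probably also be built with base R's reformulate() instead of paste() plus as.formula(), and naming the result list makes it easier to pull out p-values afterwards. This is only a sketch on simulated data: `toy`, `features`, `tests` and `p_values` are illustrative names, not part of the original answer.

```r
# Simulated stand-in for testData (hypothetical values)
set.seed(1)
toy <- data.frame(
  Label = factor(rep(c("Good", "Bad"), each = 10)),
  F1 = rnorm(20),
  F2 = rnorm(20)
)

# Every column except the grouping column is treated as a feature
features <- setdiff(names(toy), "Label")

# reformulate("Label", response = "F1") builds the formula F1 ~ Label
tests <- lapply(features, function(f) {
  t.test(reformulate("Label", response = f), data = toy)
})
names(tests) <- features

# One p-value per feature, as a named numeric vector
p_values <- sapply(tests, function(res) res$p.value)
p_values
```

Selecting the features by name with setdiff() also avoids relying on the label column being first.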
Solution 3
I put the data in long format using reshape2, then use your code within an lapply.
library(reshape2)
# Name the id column explicitly so melt() does not have to guess it
dat <- melt(testData, id.vars = "Label")
lapply(unique(dat$variable), function(x) {
  Good <- subset(dat, Label == 'Good' & variable == x)$value
  Bad <- subset(dat, Label == 'Bad' & variable == x)$value
  t.test(Good, Bad)
})
Solution 4
This is pretty simple using something like lapply, or ldply from the plyr package:
library(plyr)
cols_to_test <- c("F1", "F2", "F3")
results <- ldply(
  cols_to_test,
  function(colname) {
    t_val <- t.test(testData[[colname]] ~ testData$Label)$statistic
    return(data.frame(colname = colname, t_value = t_val))
  })
This packages your results up neatly in a data frame:
  colname   t_value
1      F1 -3.639136
2      F2  3.735834
3      F3  4.303688
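If you would rather not depend on plyr, roughly the same table can be assembled in base R with lapply() and do.call(rbind, ...). The sketch below runs on simulated data; `toy`, `rows` and the extra p_value column are assumptions added for illustration and are not part of the answer above.

```r
# Simulated stand-in for testData (hypothetical values)
set.seed(42)
toy <- data.frame(
  Label = factor(rep(c("Good", "Bad"), times = c(12, 8))),
  F1 = rnorm(20),
  F2 = rnorm(20)
)

cols_to_test <- c("F1", "F2")

# One single-row data frame per tested column
rows <- lapply(cols_to_test, function(colname) {
  res <- t.test(toy[[colname]] ~ toy$Label)
  data.frame(
    colname = colname,
    t_value = unname(res$statistic),
    p_value = res$p.value
  )
})

# Stack the rows into a single data frame
results <- do.call(rbind, rows)
results
```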
Solution 5
This seemed like a common enough scenario, and I was running into it constantly myself. For that purpose there is now a package called matrixTests. With it you can run a t-test on each column like so:
library(matrixTests)
goodMat <- testData[testData[,1]=="Good",-1]
badMat <- testData[testData[,1]=="Bad",-1]
result <- col_t_welch(goodMat, badMat)
Results for each column will be presented row-wise:
> result
obs.x obs.y obs.tot mean.x mean.y mean.diff var.x var.y stderr df statistic pvalue conf.low conf.high alternative mean.null conf.level
F1 15 8 23 0.6620042 0.3776753 0.2843289 0.030422051 0.032610380 0.07813088 13.969401 3.639136 0.0026907550 0.11672039 0.45193741 two.sided 0 0.95
F2 15 8 23 0.6332622 0.8008989 -0.1676367 0.007950091 0.011868380 0.04487264 12.121463 -3.735834 0.0027964901 -0.26529723 -0.06997617 two.sided 0 0.95
F3 15 8 23 0.8256733 0.9401514 -0.1144781 0.006957733 0.001949685 0.02659999 20.988353 -4.303688 0.0003146478 -0.16979764 -0.05915849 two.sided 0 0.95
F4 15 8 23 0.8742631 0.6091331 0.2651299 0.009285928 0.027017832 0.06321622 9.639523 4.194017 0.0020007742 0.12355816 0.40670172 two.sided 0 0.95
F5 15 8 23 0.8164387 0.4908705 0.3255682 0.015196701 0.054574685 0.08851525 9.132819 3.678104 0.0049648236 0.12577586 0.52536063 two.sided 0 0.95
F6 15 8 23 0.4429950 0.1678080 0.2751870 0.055993325 0.021810423 0.08036909 20.281178 3.424040 0.0026474215 0.10768889 0.44268512 two.sided 0 0.95
F7 15 8 23 0.5450866 0.5606705 -0.0155839 0.005238405 0.035530607 0.06921382 8.119018 -0.225156 0.8274218437 -0.17478492 0.14361711 two.sided 0 0.95
F8 15 8 23 0.5328120 0.3734072 0.1594048 0.023064998 0.005458074 0.04711609 20.936316 3.383236 0.0028151348 0.06140341 0.25740625 two.sided 0 0.95
F9 15 8 23 0.4797677 0.2803339 0.1994337 0.027905214 0.002845209 0.04707440 18.511452 4.236565 0.0004696924 0.10072958 0.29813785 two.sided 0 0.95
F10 15 8 23 0.4961010 0.2865410 0.2095600 0.045493711 0.023072590 0.07692196 18.972832 2.724320 0.0134746988 0.04854491 0.37057514 two.sided 0 0.95
F11 15 8 23 0.4941480 0.3147666 0.1793814 0.025996108 0.001953517 0.04446643 17.527205 4.034086 0.0008157456 0.08577994 0.27298287 two.sided 0 0.95
For p-values there is a column named pvalue:
> result$pvalue
[1] 0.0026907550 0.0027964901 0.0003146478 0.0020007742 0.0049648236 0.0026474215 0.8274218437 0.0028151348 0.0004696924 0.0134746988 0.0008157456
Samo Jerom
Updated on July 09, 2022
Comments
-
Samo Jerom almost 2 years
I ran into a problem running a t-test on some data stored in a data frame. I know how to do it one column at a time, but that is not efficient at all. How can I write a loop to do it?
For example, I have the data in testData:
testData <- dput(testData) structure(list(Label = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L ), .Label = c("Bad", "Good"), class = "factor"), F1 = c(0.647789237, 0.546087915, 0.461342005, 0.794212207, 0.569199511, 0.735685704, 0.650942066, 0.457497016, 0.808619288, 0.673100668, 0.68781739, 0.470094549, 0.958591821, 1, 0.46908343, 0.578755283, 0.289380462, 0.685117658, 0.296011479, 0.208821225, 0.461487258, 0.176144907, 0.325684001), F2 = c(0.634327378, 0.602685034, 0.70643658, 0.577336318, 0.61069332, 0.676176013, 0.685433524, 0.601847779, 0.641738937, 0.822097452, 0.549508092, 0.711380436, 0.605492874, 0.419354439, 0.654424433, 0.782191133, 0.826394651, 0.63269692, 0.835389099, 0.760279322, 0.711607982, 1, 0.858631893), F3 = c(0.881115444, 0.850553659, 0.855405201, 0.732706141, 0.816063806, 0.841134018, 0.899594853, 0.788591779, 0.767461265, 0.954481259, 0.840970764, 0.897785959, 0.789288481, 0.604922471, 0.865024811, 0.947356946, 0.96622214, 0.879623595, 0.953189022, 0.960153373, 0.868949632, 1, 0.945716439), F4 = c(0.96939781, 0.758302, 0.652984943, 0.803719964, 0.980135127, 0.945287339, 0.84045753, 0.926053105, 0.974856922, 0.829936068, 0.89662815, 0.823594767, 1, 0.886954348, 0.825638185, 0.798524271, 0.524755093, 0.844685467, 0.522120663, 0.388604114, 0.725126521, 0.46430556, 0.604943457), F5 = c(0.908895247, 0.614799496, 0.529111461, 0.726753028, 0.942601677, 0.86641298, 0.75771251, 0.88237302, 1, 0.817706498, 0.834060845, 0.813550164, 0.927107922, 0.827680764, 0.797814872, 0.768118872, 0.271122929, 0.790632558, 0.391325631, 0.257446927, 0.687042673, 0.239520504, 0.521753545 ), F6 = c(0.589651031, 0.170481902, 0.137755423, 0.24453692, 0.505348067, 0.642589538, 0.308854104, 0.286913756, 0.60756673, 0.531315171, 0.389958915, 0.236113471, 1, 0.687877983, 0.305962183, 0.40469629, 0.08012222, 0.376774451, 0.098261016, 0.046544022, 0.201513755, 0.02085411, 0.113698232), F7 = c(0.460358642, 0.629499543, 
0.598616653, 0.623674078, 0.526920757, 0.494086383, 0.504021253, 0.635105287, 0.558992452, 0.397770725, 0.543528957, 0.538542617, 0.646897446, 0.543646493, 0.47463817, 0.385081029, 0.555731206, 0.43769237, 0.501754893, 0.586155312, 0.496028109, 1, 0.522921361 ), F8 = c(0.523850222, 0.448936418, 0.339311791, 0.487421437, 0.462073661, 0.493421514, 0.464091025, 0.496938844, 0.5817454, 0.474404602, 0.720114482, 0.493098785, 1, 0.528538582, 0.478233718, 0.2695123, 0.362377901, 0.462252858, 0.287725327, 0.335584366, 0.397324649, 0.469082387, 0.403397835), F9 = c(0.481230473, 0.349419856, 0.309729777, 0.410783763, 0.465172146, 0.520935471, 0.380916463, 0.422238573, 0.572283353, 0.434705384, 0.512705279, 0.358892539, 1, 0.606926979, 0.370574926, 0.319739889, 0.249984729, 0.381053882, 0.245597953, 0.22883148, 0.314061676, 0.233511631, 0.269890359 ), F10 = c(0.592403628, 0.249811036, 0.256613757, 0.305839002, 0.497637944, 0.601946334, 0.401643991, 0.302626606, 0.623582766, 0.706254724, 0.435846561, 0.324357521, 1, 0.740362812, 0.402588813, 0.537414966, 0.216458806, 0.464852608, 0.251228269, 0.181500378, 0.31840514, 0.068594104, 0.253873772), F11 = c(0.490032261, 0.366486136, 0.336749996, 0.421899324, 0.479339762, 0.527364467, 0.398297911, 0.432190187, 0.584030586, 0.453666402, 0.526861753, 0.388880674, 1, 0.615835576, 0.39058525, 0.350811433, 0.290220147, 0.397424867, 0.288095106, 0.274852912, 0.340129804, 0.271099396, 0.305499273 )), .Names = c("Label", "F1", "F2", "F3", "F4", "F5", "F6", "F7", "F8", "F9", "F10", "F11"), class = "data.frame", row.names = c(NA, -23L))
I need to run the t-test for each column with two independent groups, i.e., "Good" vs. "Bad", for the features "F1" to "F11". I tried something like:
GoodF1 <- subset(testData, Label == 'Good', select=c("F1"))
BadF1 <- subset(testData, Label == 'Bad', select=c("F1"))
t.test(GoodF1$F1, BadF1$F1)
I would then repeat this for "F2" to "F11", which is obviously not efficient. I would really appreciate any better ideas for running it in a loop. Thanks very much.
-
Samo Jerom about 11 years
Nice method. Could you say a little more about testData[-1]? What is [-1] here? Thanks.
-
Samo Jerom about 11 years
Sorry Sven, I am quite naive and still not quite sure. testData[-1] refers to all columns of testData, but why do we use negative indexing?
-
Sven Hohenstein about 11 years
@SamoJerom The negative indexing here allows you to exclude the first column.
-
Kory over 8 years
What about for a two sample t-test?
-
Manasi Shah about 7 years
Hi @Sven, thanks for the solution! I was wondering, what if there are additional variables in the testData frame? This answer depends on Label being the only additional variable in the data frame. I have another variable, Block, which I want to use as the blocking variable for a hypothesis test, e.g. lapply(testData[-1], function(x) hypothesis.test(x ~ testData$Label | testData$Block)). The Block variable is at the end of the data frame.
-
Sven Hohenstein about 7 years
@ManasiShah You can try to exclude the last column: lapply(testData[-c(1, ncol(testData))], function(x) hypothesis.test(x ~ testData$Label | testData$Block)).
-
Seanosapien almost 6 years
Nice package. Thanks.
-
Charlotte Jelleyman almost 5 years
I have successfully used this code on my data. Can anyone tell me why the df are not (nA+nB)−2 as described here sthda.com/english/wiki/unpaired-two-samples-t-test-in-r?
-
James almost 5 years
@CharlotteJelleyman By default, var.equal=FALSE, so it uses the Welch approximation, which is the second version in the link you provided.
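To see the difference in degrees of freedom, here is a small sketch comparing the default Welch test with the pooled-variance test on simulated data. The vectors `good` and `bad` are made up; only the group sizes (15 and 8) mirror the question's data.

```r
# Simulated groups with the same sizes as in the question (hypothetical values)
set.seed(7)
good <- rnorm(15, mean = 0.6, sd = 0.2)
bad <- rnorm(8, mean = 0.4, sd = 0.1)

welch <- t.test(good, bad)                     # default: var.equal = FALSE
pooled <- t.test(good, bad, var.equal = TRUE)  # classic Student's t-test

unname(pooled$parameter)  # 21, i.e. 15 + 8 - 2
unname(welch$parameter)   # Welch-Satterthwaite df, at most 21 and usually not an integer
```

Passing var.equal = TRUE inside any of the lapply calls above should give the (nA+nB)−2 degrees of freedom, at the cost of assuming equal variances in the two groups.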