How to write a loop to run the t-test of a data frame?


Solution 1

Here's a simple solution, which doesn't require additional packages:

lapply(testData[-1], function(x) t.test(x ~ testData$Label))

Here testData[-1] refers to all columns of testData but the first one (which contains the labels). Negative indexing is used for excluding data.
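To see what the negative index does, here is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not the asker's data). It also shows one way to pull the p-values out of the resulting list.

```r
# Hypothetical toy data: one label column followed by two feature columns
toy <- data.frame(
  Label = factor(rep(c("Good", "Bad"), each = 4)),
  F1 = c(0.90, 0.80, 0.70, 0.85, 0.30, 0.40, 0.35, 0.20),
  F2 = c(0.50, 0.60, 0.55, 0.65, 0.80, 0.90, 0.85, 0.75)
)

# Negative index: every column except the first
feature_names <- names(toy[-1])

# A vector of negative indices drops several columns at once,
# here the first and the last
middle_names <- names(toy[-c(1, ncol(toy))])

# lapply() over a data frame visits one column at a time and keeps the names
tests <- lapply(toy[-1], function(x) t.test(x ~ toy$Label))

# Extract one number per test, e.g. the p-value
p_values <- sapply(tests, function(tt) tt$p.value)
```

Because `lapply` keeps the column names, individual results can be reached by name, for example `tests$F1`.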

Solution 2

You can use the formula interface to t.test and use lapply to iterate along the column names to build the formulae:

lapply(names(testData)[-1],function(x)
           t.test(as.formula(paste(x,"Label",sep="~")),
                  data=testData))

[[1]]

        Welch Two Sample t-test

data:  F1 by Label 
t = -3.6391, df = 13.969, p-value = 0.002691
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -0.4519374 -0.1167204 
sample estimates:
 mean in group Bad mean in group Good 
         0.3776753          0.6620042 


[[2]]

        Welch Two Sample t-test

data:  F2 by Label 
t = 3.7358, df = 12.121, p-value = 0.002796
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 0.06997617 0.26529723 
sample estimates:
 mean in group Bad mean in group Good 
         0.8008989          0.6332622 

...
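Since this version iterates over column names rather than columns, the returned list is unnamed and results are only reachable by position. A minimal sketch, again on a hypothetical `toy` data frame, that names each element after the column it tests:

```r
# Hypothetical toy data standing in for testData
toy <- data.frame(
  Label = factor(rep(c("Good", "Bad"), each = 4)),
  F1 = c(0.90, 0.80, 0.70, 0.85, 0.30, 0.40, 0.35, 0.20),
  F2 = c(0.50, 0.60, 0.55, 0.65, 0.80, 0.90, 0.85, 0.75)
)

feature_names <- names(toy)[-1]

# Same formula-building approach as above
tests <- lapply(feature_names, function(x)
  t.test(as.formula(paste(x, "Label", sep = "~")), data = toy))

# Attach the column names so results can be looked up by feature
names(tests) <- feature_names

p_f1 <- tests$F1$p.value
```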

Solution 3

I put the data into long format using reshape2, then ran your code inside an lapply.

library(reshape2)
# Long format: one row per (Label, variable, value)
dat <- melt(testData, id.vars = "Label")
lapply(unique(dat$variable), function(x) {
  Good <- subset(dat, Label == 'Good' & variable == x)$value
  Bad  <- subset(dat, Label == 'Bad' & variable == x)$value
  t.test(Good, Bad)
})
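The same Good-versus-Bad comparison can also be sketched without reshaping, by splitting the rows into two data frames and pairing their columns with `Map`. This is an alternative written for illustration on a hypothetical `toy` data frame, not the answerer's code:

```r
# Hypothetical toy data standing in for testData
toy <- data.frame(
  Label = factor(rep(c("Good", "Bad"), each = 4)),
  F1 = c(0.90, 0.80, 0.70, 0.85, 0.30, 0.40, 0.35, 0.20),
  F2 = c(0.50, 0.60, 0.55, 0.65, 0.80, 0.90, 0.85, 0.75)
)

# Split the feature columns by group
good <- toy[toy$Label == "Good", -1]
bad  <- toy[toy$Label == "Bad", -1]

# Map() walks both data frames column by column:
# t.test(good$F1, bad$F1), t.test(good$F2, bad$F2), ...
tests <- Map(t.test, good, bad)
```

The result is a list named after the feature columns, and each test compares Good (first argument) against Bad (second argument).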

Solution 4

This is pretty simple using something like lapply, or ldply from the plyr package:

library(plyr)
cols_to_test <- c("F1", "F2", "F3")
# ldply applies the function to each column name and row-binds the results
results <- ldply(
  cols_to_test,
  function(colname) {
    t_val <- t.test(testData[[colname]] ~ testData$Label)$statistic
    data.frame(colname = colname, t_value = t_val)
  })

This packages your results up neatly in a data frame:

  colname   t_value
1      F1 -3.639136
2      F2  3.735834
3      F3  4.303688
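If plyr is not available, a similar table can be built in base R with `lapply` and `rbind`. The sketch below uses a hypothetical `toy` data frame and adds a `p_value` column, which is an addition for illustration and not part of the answer above:

```r
# Hypothetical toy data standing in for testData
toy <- data.frame(
  Label = factor(rep(c("Good", "Bad"), each = 4)),
  F1 = c(0.90, 0.80, 0.70, 0.85, 0.30, 0.40, 0.35, 0.20),
  F2 = c(0.50, 0.60, 0.55, 0.65, 0.80, 0.90, 0.85, 0.75)
)

cols_to_test <- names(toy)[-1]

# One single-row data frame per column, then stack them
rows <- lapply(cols_to_test, function(colname) {
  tt <- t.test(toy[[colname]] ~ toy$Label)
  data.frame(
    colname = colname,
    t_value = unname(tt$statistic),  # drop the "t" name to avoid odd row names
    p_value = tt$p.value
  )
})
results <- do.call(rbind, rows)
```

With the formula interface the groups are ordered by factor level, so the statistic is for Bad minus Good, matching the signs in the table above.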

Solution 5

This seemed like a common enough scenario, and I was running into it constantly myself. For that purpose there is now a package called matrixTests. With it you can run a t-test on each column like so:

library(matrixTests)

goodMat <- testData[testData[,1]=="Good",-1]
badMat  <- testData[testData[,1]=="Bad",-1]

result <- col_t_welch(goodMat, badMat)

Results for each column will be presented row-wise:

> result
    obs.x obs.y obs.tot    mean.x    mean.y  mean.diff       var.x       var.y     stderr        df statistic       pvalue    conf.low   conf.high alternative mean.null conf.level
F1     15     8      23 0.6620042 0.3776753  0.2843289 0.030422051 0.032610380 0.07813088 13.969401  3.639136 0.0026907550  0.11672039  0.45193741   two.sided         0       0.95
F2     15     8      23 0.6332622 0.8008989 -0.1676367 0.007950091 0.011868380 0.04487264 12.121463 -3.735834 0.0027964901 -0.26529723 -0.06997617   two.sided         0       0.95
F3     15     8      23 0.8256733 0.9401514 -0.1144781 0.006957733 0.001949685 0.02659999 20.988353 -4.303688 0.0003146478 -0.16979764 -0.05915849   two.sided         0       0.95
F4     15     8      23 0.8742631 0.6091331  0.2651299 0.009285928 0.027017832 0.06321622  9.639523  4.194017 0.0020007742  0.12355816  0.40670172   two.sided         0       0.95
F5     15     8      23 0.8164387 0.4908705  0.3255682 0.015196701 0.054574685 0.08851525  9.132819  3.678104 0.0049648236  0.12577586  0.52536063   two.sided         0       0.95
F6     15     8      23 0.4429950 0.1678080  0.2751870 0.055993325 0.021810423 0.08036909 20.281178  3.424040 0.0026474215  0.10768889  0.44268512   two.sided         0       0.95
F7     15     8      23 0.5450866 0.5606705 -0.0155839 0.005238405 0.035530607 0.06921382  8.119018 -0.225156 0.8274218437 -0.17478492  0.14361711   two.sided         0       0.95
F8     15     8      23 0.5328120 0.3734072  0.1594048 0.023064998 0.005458074 0.04711609 20.936316  3.383236 0.0028151348  0.06140341  0.25740625   two.sided         0       0.95
F9     15     8      23 0.4797677 0.2803339  0.1994337 0.027905214 0.002845209 0.04707440 18.511452  4.236565 0.0004696924  0.10072958  0.29813785   two.sided         0       0.95
F10    15     8      23 0.4961010 0.2865410  0.2095600 0.045493711 0.023072590 0.07692196 18.972832  2.724320 0.0134746988  0.04854491  0.37057514   two.sided         0       0.95
F11    15     8      23 0.4941480 0.3147666  0.1793814 0.025996108 0.001953517 0.04446643 17.527205  4.034086 0.0008157456  0.08577994  0.27298287   two.sided         0       0.95

For p-values there is a column named pvalue:

> result$pvalue
[1] 0.0026907550 0.0027964901 0.0003146478 0.0020007742 0.0049648236 0.0026474215 0.8274218437 0.0028151348 0.0004696924 0.0134746988 0.0008157456
Author by

Samo Jerom

Updated on July 09, 2022

Comments

  • Samo Jerom
    Samo Jerom almost 2 years

    I ran into a problem running a t-test on data stored in a data frame. I know how to do it one column at a time, but that is not efficient at all. May I ask how to write a loop to do it?

    For example, I have got the data in the testData:

    # Output of dput(testData); run as-is to recreate the data frame
    testData <- structure(list(Label = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
    ), .Label = c("Bad", "Good"), class = "factor"), F1 = c(0.647789237, 
    0.546087915, 0.461342005, 0.794212207, 0.569199511, 0.735685704, 
    0.650942066, 0.457497016, 0.808619288, 0.673100668, 0.68781739, 
    0.470094549, 0.958591821, 1, 0.46908343, 0.578755283, 0.289380462, 
    0.685117658, 0.296011479, 0.208821225, 0.461487258, 0.176144907, 
    0.325684001), F2 = c(0.634327378, 0.602685034, 0.70643658, 0.577336318, 
    0.61069332, 0.676176013, 0.685433524, 0.601847779, 0.641738937, 
    0.822097452, 0.549508092, 0.711380436, 0.605492874, 0.419354439, 
    0.654424433, 0.782191133, 0.826394651, 0.63269692, 0.835389099, 
    0.760279322, 0.711607982, 1, 0.858631893), F3 = c(0.881115444, 
    0.850553659, 0.855405201, 0.732706141, 0.816063806, 0.841134018, 
    0.899594853, 0.788591779, 0.767461265, 0.954481259, 0.840970764, 
    0.897785959, 0.789288481, 0.604922471, 0.865024811, 0.947356946, 
    0.96622214, 0.879623595, 0.953189022, 0.960153373, 0.868949632, 
    1, 0.945716439), F4 = c(0.96939781, 0.758302, 0.652984943, 0.803719964, 
    0.980135127, 0.945287339, 0.84045753, 0.926053105, 0.974856922, 
    0.829936068, 0.89662815, 0.823594767, 1, 0.886954348, 0.825638185, 
    0.798524271, 0.524755093, 0.844685467, 0.522120663, 0.388604114, 
    0.725126521, 0.46430556, 0.604943457), F5 = c(0.908895247, 0.614799496, 
    0.529111461, 0.726753028, 0.942601677, 0.86641298, 0.75771251, 
    0.88237302, 1, 0.817706498, 0.834060845, 0.813550164, 0.927107922, 
    0.827680764, 0.797814872, 0.768118872, 0.271122929, 0.790632558, 
    0.391325631, 0.257446927, 0.687042673, 0.239520504, 0.521753545
    ), F6 = c(0.589651031, 0.170481902, 0.137755423, 0.24453692, 
    0.505348067, 0.642589538, 0.308854104, 0.286913756, 0.60756673, 
    0.531315171, 0.389958915, 0.236113471, 1, 0.687877983, 0.305962183, 
    0.40469629, 0.08012222, 0.376774451, 0.098261016, 0.046544022, 
    0.201513755, 0.02085411, 0.113698232), F7 = c(0.460358642, 0.629499543, 
    0.598616653, 0.623674078, 0.526920757, 0.494086383, 0.504021253, 
    0.635105287, 0.558992452, 0.397770725, 0.543528957, 0.538542617, 
    0.646897446, 0.543646493, 0.47463817, 0.385081029, 0.555731206, 
    0.43769237, 0.501754893, 0.586155312, 0.496028109, 1, 0.522921361
    ), F8 = c(0.523850222, 0.448936418, 0.339311791, 0.487421437, 
    0.462073661, 0.493421514, 0.464091025, 0.496938844, 0.5817454, 
    0.474404602, 0.720114482, 0.493098785, 1, 0.528538582, 0.478233718, 
    0.2695123, 0.362377901, 0.462252858, 0.287725327, 0.335584366, 
    0.397324649, 0.469082387, 0.403397835), F9 = c(0.481230473, 0.349419856, 
    0.309729777, 0.410783763, 0.465172146, 0.520935471, 0.380916463, 
    0.422238573, 0.572283353, 0.434705384, 0.512705279, 0.358892539, 
    1, 0.606926979, 0.370574926, 0.319739889, 0.249984729, 0.381053882, 
    0.245597953, 0.22883148, 0.314061676, 0.233511631, 0.269890359
    ), F10 = c(0.592403628, 0.249811036, 0.256613757, 0.305839002, 
    0.497637944, 0.601946334, 0.401643991, 0.302626606, 0.623582766, 
    0.706254724, 0.435846561, 0.324357521, 1, 0.740362812, 0.402588813, 
    0.537414966, 0.216458806, 0.464852608, 0.251228269, 0.181500378, 
    0.31840514, 0.068594104, 0.253873772), F11 = c(0.490032261, 0.366486136, 
    0.336749996, 0.421899324, 0.479339762, 0.527364467, 0.398297911, 
    0.432190187, 0.584030586, 0.453666402, 0.526861753, 0.388880674, 
    1, 0.615835576, 0.39058525, 0.350811433, 0.290220147, 0.397424867, 
    0.288095106, 0.274852912, 0.340129804, 0.271099396, 0.305499273
    )), .Names = c("Label", "F1", "F2", "F3", "F4", "F5", "F6", "F7", 
    "F8", "F9", "F10", "F11"), class = "data.frame", row.names = c(NA, 
    -23L))
    

    I need to run the t-test for each column with two independent groups, i.e., "Good" vs. "Bad" for several features "F1" to "F11". I tried to do something like:

    GoodF1 <- subset(testData, Label == 'Good', select=c("F1"))
    BadF1  <- subset(testData, Label == 'Bad', select=c("F1"))
    t.test(GoodF1$F1,BadF1$F1)
    

    I could then repeat this for "F2" to "F11", but that is obviously not efficient. I would really appreciate any better ideas for running it in a loop. Thanks very much.

  • Samo Jerom
    Samo Jerom about 11 years
    Nice method. Could you say a little bit more about testData[-1], what is [-1] here? Thanks.
  • Samo Jerom
    Samo Jerom about 11 years
    Sorry Sven, I am quite naive and still not quite sure. testData[-1] refers to all columns of testData, but why do we use negative indexing?
  • Sven Hohenstein
    Sven Hohenstein about 11 years
    @SamoJerom The negative indexing here lets you exclude the first column.
  • Kory
    Kory over 8 years
    What about for a two sample t-test?
  • Manasi Shah
    Manasi Shah about 7 years
    Hi @Sven, thanks for the solution! I was wondering, what if there are additional variables in the testData frame? This answer depends on Label being the only additional variable in the data frame. I have another variable, Block, which I want to use as the blocking variable for a hypothesis test, e.g. lapply(testData[-1], function(x) hypothesis.test(x ~ testData$Label | testData$Block)). The Block variable is at the end of the data frame.
  • Sven Hohenstein
    Sven Hohenstein about 7 years
    @ManasiShah You can try to exclude the last column as well: lapply(testData[-c(1, ncol(testData))], function(x) hypothesis.test(x ~ testData$Label | testData$Block)).
  • Seanosapien
    Seanosapien almost 6 years
    Nice package. Thanks.
  • Charlotte Jelleyman
    Charlotte Jelleyman almost 5 years
    I have successfully used this code on my data. Can anyone tell me why the df are not (nA+nB)−2 as described here sthda.com/english/wiki/unpaired-two-samples-t-test-in-r?
  • James
    James almost 5 years
    @CharlotteJelleyman By default, var.equal=FALSE so it uses the Welch approximation - the second version in the link you provided.
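
The difference in degrees of freedom discussed in the last two comments can be checked directly. A minimal sketch on a hypothetical `toy` data frame with 5 Good and 3 Bad rows, comparing the default Welch test with the pooled-variance test requested via `var.equal = TRUE`:

```r
# Hypothetical toy data with unequal group sizes (5 Good, 3 Bad)
toy <- data.frame(
  Label = factor(rep(c("Good", "Bad"), times = c(5, 3))),
  F1 = c(0.90, 0.80, 0.70, 0.85, 0.75, 0.30, 0.40, 0.20)
)

# Default: var.equal = FALSE, Welch approximation, usually fractional df
welch <- t.test(F1 ~ Label, data = toy)

# Pooled variance: classic Student t-test, df = nA + nB - 2
pooled <- t.test(F1 ~ Label, data = toy, var.equal = TRUE)

welch_df  <- unname(welch$parameter)
pooled_df <- unname(pooled$parameter)   # 5 + 3 - 2 = 6
```

The same argument can be passed through any of the looped versions above, for example inside the function given to `lapply`.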