Using shapiro.test on multiple columns in a data frame
Solution 1
Not that I think this is a sensible approach to data analysis, but the underlying issue of applying a function to the columns of a data frame is a general task that can easily be achieved using one of sapply() or lapply() (or even apply(), but for data frames, one of the first two functions would be best).
Here is an example, using some dummy data:
set.seed(42)
df <- data.frame(Gaussian = rnorm(50), Poisson = rpois(50, 2),
Uniform = runif(50))
Now apply the shapiro.test() function. Because shapiro.test() returns a list-like object (of class "htest"), we capture the results in a list, so we will use lapply().
lshap <- lapply(df, shapiro.test)
lshap[[1]] ## look at the first column results
R> lshap[[1]]
Shapiro-Wilk normality test
data: X[[1L]]
W = 0.9802, p-value = 0.5611
You will need to extract the things you want from these objects, which all have the structure:
R> str(lshap[[1]])
List of 4
$ statistic: Named num 0.98
..- attr(*, "names")= chr "W"
$ p.value : num 0.561
$ method : chr "Shapiro-Wilk normality test"
$ data.name: chr "X[[1L]]"
- attr(*, "class")= chr "htest"
If you want the statistic and p.value components of this object for all elements of lshap, use sapply() this time, which arranges the results nicely for us:
lres <- sapply(lshap, `[`, c("statistic","p.value"))
R> lres
          Gaussian Poisson Uniform
statistic 0.9802   0.9371  0.918
p.value   0.5611   0.01034 0.001998
Given that you have 100 of these columns, I'd transpose lres:
R> t(lres)
         statistic p.value
Gaussian 0.9802    0.5611
Poisson  0.9371    0.01034
Uniform  0.918     0.001998
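One caveat worth adding: sapply() with `[` gives back a matrix whose cells are list elements rather than plain numbers, which can get in the way of sorting or arithmetic later. A minimal sketch of one way around that, assuming the same dummy data as above (extract_wp and res_df are names made up for this illustration, not anything from base R):

```r
# Rebuild the example data and test results from the answer above.
set.seed(42)
df <- data.frame(Gaussian = rnorm(50), Poisson = rpois(50, 2),
                 Uniform = runif(50))
lshap <- lapply(df, shapiro.test)

# extract_wp is a hypothetical helper: it pulls W and the p-value out of
# one "htest" object as a plain named numeric vector.
extract_wp <- function(h) c(statistic = unname(h$statistic), p.value = h$p.value)

# vapply() checks that every column yields exactly two numbers.
res_num <- vapply(lshap, extract_wp, c(statistic = 0, p.value = 0))

# One row per column of df, with ordinary numeric columns.
res_df <- as.data.frame(t(res_num))

# Numeric columns can be sorted directly, e.g. smallest p-value first.
res_df[order(res_df$p.value), ]
```

With numeric columns the result can be filtered, sorted, or passed to other functions without unlisting anything first.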
If you plan on doing anything with the p-values from this exercise, I suggest you start thinking about how to correct for multiple comparisons before you shoot yourself in the foot with a 30-cal.
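For what it's worth, base R's p.adjust() handles the correction. The following is only a sketch on the dummy data from above; which method is appropriate (Holm, BH, or something else) depends on your analysis, and the names p_raw and p_adj are just placeholders:

```r
# Same dummy data as in the example above.
set.seed(42)
df <- data.frame(Gaussian = rnorm(50), Poisson = rpois(50, 2),
                 Uniform = runif(50))

# Raw Shapiro-Wilk p-values, one per column.
p_raw <- vapply(df, function(x) shapiro.test(x)$p.value, numeric(1))

# Holm controls the family-wise error rate;
# BH (Benjamini-Hochberg) controls the false discovery rate.
p_adj <- data.frame(raw  = p_raw,
                    holm = p.adjust(p_raw, method = "holm"),
                    BH   = p.adjust(p_raw, method = "BH"))
p_adj
```

Adjusted p-values are never smaller than the raw ones, so some columns that look non-normal at 0.05 may no longer do so after correction.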
Solution 2
To apply a function over the rows or columns of a data frame, use the apply family:
df <- data.frame(a=rnorm(100), b=rnorm(100))
df.shapiro <- apply(df, 2, shapiro.test)
df.shapiro
$a
Shapiro-Wilk normality test
data: newX[, i]
W = 0.9895, p-value = 0.6276
$b
Shapiro-Wilk normality test
data: newX[, i]
W = 0.9854, p-value = 0.3371
Note that column names are preserved, and df.shapiro is a named list.
Now, if you want, say, a vector of p-values, all you have to do is extract them from the corresponding list elements:
unlist(lapply(df.shapiro, function(x) x$p.value))
        a         b
0.6275521 0.3370931
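One thing to keep in mind with apply(): it converts the data frame to a matrix first, so a single non-numeric column turns everything into character data. A rough sketch of the problem and a possible workaround, assuming a made-up data frame df_mixed with a character column called group:

```r
set.seed(1)
# Hypothetical mixed-type data frame: two numeric columns, one character column.
df_mixed <- data.frame(a = rnorm(100), b = rnorm(100),
                       group = rep(c("x", "y"), 50))

# apply() coerces the data frame to a character matrix here,
# so shapiro.test() receives character vectors and fails.
apply_failed <- inherits(try(apply(df_mixed, 2, shapiro.test), silent = TRUE),
                         "try-error")
apply_failed

# Keeping only the numeric columns and using lapply() avoids the coercion.
is_num <- vapply(df_mixed, is.numeric, logical(1))
shap_num <- lapply(df_mixed[is_num], shapiro.test)

# Named vector of p-values, with a type check on each element.
vapply(shap_num, function(x) x$p.value, numeric(1))
```

If all columns are known to be numeric, as in the question, apply(df, 2, ...) and lapply(df, ...) run the same test on each column.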
Solution 3
Use do.call with rbind and lapply for a simpler, more compact solution:
df <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
do.call(rbind, lapply(df, function(x) shapiro.test(x)[c("statistic", "p.value")]))
#>   statistic p.value
#> a 0.986224  0.3875904
#> b 0.9894938 0.6238027
#> c 0.9652532 0.009694794
Seb Matamoros
Microbiologist, molecular biologist, bio-informatician and bio-statistician. Interested mostly in microbiome studies.
Updated on July 05, 2022
Comments
-
Seb Matamoros almost 2 years
I have a dataframe (let's call it df), containing n=100 columns (C1, C2, ..., C100) and 50 rows (R1, R2, ..., R50). I tested all the columns in the data frame to be sure they are numeric. I want to know if the data in each column has a normal distribution using the shapiro.test() function.
I am able to do it column by column using the code:
> shapiro.test(df$Cn)
or
> shapiro.test(df[,c(Cn)])
However, when I try to do it on several columns at the same time, it doesn't work:
> shapiro.test(df[,c(C1:C100)])
returns the error:
Error in `[.data.frame`(x, complete.cases(x)) : undefined columns selected
I would appreciate it if anyone could suggest a way to do all the tests at the same time, and eventually store the results in a new dataframe/matrix/list/vector.
-
tonytonov over 10 years
The final note is brilliant.
-
Seb Matamoros over 10 years
Thanks, it works perfectly. As to what to do with it, well... I need to do multiple correlations between the different columns of this matrix. I would do non-parametric correlations, but my boss is allergic to non-parametric and insists on parametric. I'll see if it's possible to transform the data to obtain a normal distribution...
-
Gavin Simpson over 10 years
Whatever you do, you need to correct for doing all these tests. If you did 100 tests and used the usual 0.05 significance level (alpha = 0.05), then you are accepting that you'll reject the null hypothesis (H0) on average 5 times in 100 when H0 is correct (i.e. you'll find a significant result where none exists). You need to take account of this when doing multiple tests, so look at Bonferroni and Holm adjustments, FDR (false discovery rate), etc. This can be done via p.adjust().
-
Seb Matamoros over 10 years
Yes, good suggestion. However, we use correlations mainly for exploratory purposes: finding which variables present the most correlations and then refocusing the analysis on those particular variables. Nonetheless, I performed FDR to adjust the p-values, and compared the two sets of results.
-
JASC almost 4 years
Simple and brilliant example of using the apply family.