Using shapiro.test on multiple columns in a data frame

Solution 1

Not that I think this is a sensible approach to data analysis, but the underlying issue of applying a function to the columns of a data frame is a general task that can easily be achieved with sapply() or lapply() (or even apply(), but for data frames one of the first two is the better choice).

Here is an example, using some dummy data:

set.seed(42)
df <- data.frame(Gaussian = rnorm(50), Poisson = rpois(50, 2), 
                 Uniform = runif(50))

Now apply the shapiro.test() function. Because this function returns a complex object for each column, we capture the output in a list, so we use lapply():

lshap <- lapply(df, shapiro.test)
lshap[[1]] ## look at the first column results

R> lshap[[1]]

    Shapiro-Wilk normality test

data:  X[[1L]]
W = 0.9802, p-value = 0.5611

You will need to extract the things you want from these objects, which all have the structure:

R> str(lshap[[1]])
List of 4
 $ statistic: Named num 0.98
  ..- attr(*, "names")= chr "W"
 $ p.value  : num 0.561
 $ method   : chr "Shapiro-Wilk normality test"
 $ data.name: chr "X[[1L]]"
 - attr(*, "class")= chr "htest"

If you want the statistic and p.value components of this object for all elements of lshap, we use sapply() this time, which will nicely arrange the results for us:

lres <- sapply(lshap, `[`, c("statistic","p.value"))

R> lres
          Gaussian Poisson Uniform 
statistic 0.9802   0.9371  0.918   
p.value   0.5611   0.01034 0.001998

Given that you have 100 of these, I'd transpose lres:

R> t(lres)
         statistic p.value 
Gaussian 0.9802    0.5611  
Poisson  0.9371    0.01034 
Uniform  0.918     0.001998
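
For downstream work it can be handier to have a plain numeric data frame rather than the list-matrix that sapply() produces here. A minimal sketch of building one directly from lshap (the unname() call and the column names are my own choices, not part of the original answer):

## build a numeric data frame: one row per column of df, rows named after the columns
res <- data.frame(statistic = sapply(lshap, function(x) unname(x$statistic)),
                  p.value   = sapply(lshap, function(x) x$p.value))
res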

If you plan on doing anything with the p-values from this exercise, I suggest you start thinking about how to correct for multiple comparisons before you shoot yourself in the foot with a 30-cal.
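
For completeness, here is a minimal sketch of such an adjustment with p.adjust(); the choice of methods is purely illustrative:

pvals <- sapply(lshap, function(x) x$p.value)  # raw p-values, one per column
p.adjust(pvals, method = "holm")               # family-wise error rate control
p.adjust(pvals, method = "BH")                 # false discovery rate (Benjamini-Hochberg)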

Solution 2

To apply a function over the rows or columns of a data frame, use one of the apply family of functions:

df <- data.frame(a=rnorm(100), b=rnorm(100))    
df.shapiro <- apply(df, 2, shapiro.test)
df.shapiro
$a

    Shapiro-Wilk normality test

data:  newX[, i]
W = 0.9895, p-value = 0.6276


$b

    Shapiro-Wilk normality test

data:  newX[, i]
W = 0.9854, p-value = 0.3371

Note that column names are preserved, and df.shapiro is a named list.

Now, if you want, say, a vector of p-values, all you have to do is extract them from the corresponding list elements:

unlist(lapply(df.shapiro, function(x) x$p.value))
        a         b 
0.6275521 0.3370931 
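
Equivalently, sapply() or vapply() gets you the same named vector in one step (a small variation, not part of the original answer):

sapply(df.shapiro, function(x) x$p.value)
vapply(df.shapiro, function(x) x$p.value, numeric(1))  # vapply additionally checks each result is a single number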

Solution 3

Use do.call() with rbind and lapply() for a simpler and more compact solution:

df <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
do.call(rbind, lapply(df, function(x) shapiro.test(x)[c("statistic", "p.value")]))
#>   statistic p.value    
#> a 0.986224  0.3875904  
#> b 0.9894938 0.6238027
#> c 0.9652532 0.009694794
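
Note that the object returned by do.call(rbind, ...) is a matrix whose cells are list elements, so flatten the p.value column before using it numerically. A minimal sketch, using the conventional 0.05 cut-off purely for illustration:

res <- do.call(rbind, lapply(df, function(x) shapiro.test(x)[c("statistic", "p.value")]))
pvals <- unlist(res[, "p.value"])   # plain named numeric vector of p-values
names(pvals)[pvals < 0.05]          # columns for which normality is rejected at 0.05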

Comments

  • Seb Matamoros
    Seb Matamoros almost 2 years

    I have a dataframe (let's call it df), containing n=100 columns (C1, C2,..., C100) and 50 rows (R1, R2,...,R50). I tested all the columns in the data frame to be sure they are numeric. I want to know if the data in each column has a normal distribution, using the shapiro.test() function.

    I am able to do it column by column using the code:

    > shapiro.test(df$Cn)
    

    or

    > shapiro.test(df[,c(Cn)])
    

    However, when I try to do it on several columns at the same time, it doesn't work:

    > shapiro.test(df[,c(C1:C100)])
    

    returns the error:

    Error in `[.data.frame`(x, complete.cases(x)) : undefined columns selected

    I would appreciate it if anyone could suggest a way to do all the tests at the same time and, ideally, store the results in a new dataframe/matrix/list/vector.

  • tonytonov
    tonytonov over 10 years
    The final note is brilliant.
  • Seb Matamoros
    Seb Matamoros over 10 years
    Thanks, it works perfectly. As to what to do with it, well... I need to do multiple correlations between the different columns of this matrix. I would do non-parametric correlations, but my boss is allergic to non-parametric and insists on parametric. I'll see if it's possible to transform the data to obtain normal distribution...
  • Gavin Simpson
    Gavin Simpson over 10 years
    Whatever you do, you need to correct for doing all these tests. If you did 100 tests and used the usual 0.05 significance level (alpha = 0.05), then you are accepting that you'll reject the null (H0) on average 5 times in 100 when H0 is correct (i.e. you'll find a significant result where none exists). You need to take account of this when doing multiple tests, so look at Bonferroni & Holm adjustments, FDR (false discovery rates) etc. This can be done via p.adjust().
  • Seb Matamoros
    Seb Matamoros over 10 years
    Yes, good suggestion. However, we use correlations mainly for exploratory purposes: finding which variables present the most correlations and then refocusing the analysis on that particular variable. Nonetheless, I performed FDR to adjust the p-values, and compared the 2 sets of results.
  • JASC
    JASC almost 4 years
    Simple and brilliant example of using the apply family.