How to find significant correlations in a large dataset

22,102

Solution 1

Here's some sample data for reproducibility.

m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))

You can calculate the correlation between two columns using cor. This code loops over all columns except the first one (which contains our response), and calculates the correlation between that column and the first column.

correlations <- vapply(
  the_data[, -1],
  function(x)
  {
    cor(the_data[, 1], x)
  },
  numeric(1)
)

You can then find the column with the largest magnitude of correlation with y using:

correlations[which.max(abs(correlations))]
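
If you also want a p-value for each correlation, sorted from smallest to largest, something along these lines may work. This is only a sketch in base R: it swaps cor for cor.test (from the stats package), and the seed and the names results, r and p_value are illustrative choices, not part of the original answer.

```r
# Rebuild the sample data so the sketch runs on its own.
set.seed(1)  # arbitrary seed, only to make the sketch repeatable
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))

# cor.test() gives the correlation estimate and its p-value for one pair.
results <- do.call(rbind, lapply(names(the_data)[-1], function(v) {
  test <- cor.test(the_data$y, the_data[[v]])
  data.frame(
    variable = v,
    r = unname(test$estimate),
    p_value = test$p.value,
    stringsAsFactors = FALSE
  )
}))

# Sort so the smallest p-values come first.
results <- results[order(results$p_value), ]
head(results)
```

Each row of results then holds one predictor, its correlation with y, and the p-value of the test that the correlation is zero.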

Knowing which variables are correlated with which other variables can be interesting, but please don't draw any big conclusions from this knowledge. You need to have a proper think about what you are trying to understand, and which techniques you need to use. The folks over at Cross Validated can help.

Solution 2

You can use the function rcorr from the package Hmisc.

Using the same demo data from Richie:

m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))

Then:

library(Hmisc)
correlations <- rcorr(as.matrix(the_data))

To access the p-values:

correlations$P

To visualize the correlations, you can use the package corrgram:

library(corrgram)
corrgram(the_data)

This produces a correlogram plot of the_data.

Solution 3

In order to print a list of the significant correlations (p < 0.05), you can use the following.

  1. Using the same demo data from @Richie:

    m <- 40
    n <- 80
    the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
    colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
    
  2. Install Hmisc

    install.packages("Hmisc")
    
  3. Import library and find the correlations (@Carlos)

    library(Hmisc)
    correlations <- rcorr(as.matrix(the_data))
    
  4. Loop over the values printing the significant correlations

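    p_values <- correlations$P
    # The matrix is symmetric with NA on the diagonal, so visit each pair once (j > i).
    for (i in seq_len(nrow(p_values) - 1)) {
      for (j in (i + 1):ncol(p_values)) {
        if (!is.na(p_values[i, j]) && p_values[i, j] < 0.05) {
          print(paste(rownames(p_values)[i], "-", colnames(p_values)[j], ": p =", p_values[i, j]))
        }
      }
    }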
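
The same list of significant pairs can also be collected into a data frame without an explicit loop. The sketch below is an assumption-laden alternative in base R, not the method from the answer: instead of taking the p-values from rcorr, it computes Pearson p-values directly from the t statistic t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom, and it assumes there are no missing values. The names t_stat, keep and significant are illustrative.

```r
# Rebuild the sample data so the sketch runs on its own.
set.seed(1)  # arbitrary seed, only to make the sketch repeatable
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))

# Pairwise Pearson correlations and their two-sided p-values.
r <- cor(the_data)
t_stat <- r * sqrt((n - 2) / (1 - r^2))
p <- 2 * pt(-abs(t_stat), df = n - 2)

# upper.tri() keeps each pair once and drops the diagonal.
keep <- upper.tri(p) & p < 0.05
idx <- which(keep, arr.ind = TRUE)

significant <- data.frame(
  var1 = rownames(p)[idx[, "row"]],
  var2 = colnames(p)[idx[, "col"]],
  r = r[idx],
  p_value = p[idx],
  stringsAsFactors = FALSE
)

# Smallest p-values first.
significant <- significant[order(significant$p_value), ]
significant
```

Having the pairs in a data frame makes it easier to sort, filter, or export them than printing from inside a loop.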
    

Warning

You should not use this to draw any serious conclusions; it is only useful for exploratory analysis and for formulating hypotheses. If you run enough tests, you increase the probability of finding some significant p-values by random chance: https://www.xkcd.com/882/. With 40 variables there are 780 pairwise tests, so at p < 0.05 you would expect about 39 "significant" pairs even if none of the variables were truly related. There are statistical methods that are more suitable for this and that adjust for running multiple tests, e.g. https://en.wikipedia.org/wiki/Bonferroni_correction.
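
As a rough illustration of such an adjustment, base R's p.adjust (stats package) can apply the Bonferroni correction to a vector of p-values. The sketch below tests y against each of the other 39 variables with cor.test and compares raw and adjusted p-values; the Holm column is an extra, less conservative method added for comparison and is not mentioned in the answer above. The seed and variable names are illustrative.

```r
# Rebuild the sample data so the sketch runs on its own.
set.seed(1)  # arbitrary seed, only to make the sketch repeatable
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))

# Raw p-values for y against every other column.
p_values <- vapply(
  the_data[, -1],
  function(x) cor.test(the_data$y, x)$p.value,
  numeric(1)
)

# Bonferroni multiplies each p-value by the number of tests (capped at 1).
adjusted <- p.adjust(p_values, method = "bonferroni")

comparison <- data.frame(
  raw = p_values,
  bonferroni = adjusted,
  holm = p.adjust(p_values, method = "holm")
)

# Show the five smallest raw p-values next to their adjusted values.
head(comparison[order(comparison$raw), ], 5)

# Count how many tests pass the 0.05 threshold before and after adjustment.
sum(p_values < 0.05)
sum(adjusted < 0.05)
```

Because the sample data is pure noise, any raw p-values below 0.05 here are false positives, which is exactly what the adjustment is meant to guard against.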

Author by

Admin

Updated on July 09, 2022

Comments

  • Admin
    Admin almost 2 years

    I'm using R. My dataset has about 40 different variables/vectors and each has about 80 entries. I'm trying to find significant correlations; that is, I want to pick one variable and let R calculate all the correlations of that variable with the other 39 variables.

    I tried to do this by using a linear model with one explanatory variable, that is: Y=a*X+b. Then the lm() command gives me an estimator for a and the p-value of that estimator. I would then go on and use one of the other variables I have for X and try again until I find a p-value that's really small.

    I'm sure this is a common problem. Is there some sort of package or function that can try all these possibilities (brute force), show them, and then maybe even sort them by p-value?

  • Abdul Basit Khan
    Abdul Basit Khan about 5 years
    The issue with this is that all variables are assumed to be numerical. What if the dependent variable i.e. "y" is a factor ?
  • SAVAFA
    SAVAFA about 4 years
    It should be correlations$P[i,j] > 0.05 not correlations$P[i,j] < 0.05
  • toto_tico
    toto_tico about 4 years
    @SAVAFA, the code seems right correlations$P[i,j] < 0.05, I did a mistake above in the description, (p < 0.05) instead of (p > 0.05).