pandas columns correlation with statistical significance

python pandas scipy correlation

74,220

Solution 1

Answer provided by @Shashank is nice. However, if you want a solution in pure pandas, you may like this:

import pandas as pd
from pandas.io.data import DataReader
from datetime import datetime
import scipy.stats  as stats


gdp = pd.DataFrame(DataReader("GDP", "fred", start=datetime(1990, 1, 1)))
vix = pd.DataFrame(DataReader("VIXCLS", "fred", start=datetime(1990, 1, 1)))

#Do it with a pandas regression to get the p value from the F-test
df = gdp.merge(vix,left_index=True, right_index=True, how='left')
vix_on_gdp = pd.ols(y=df['VIXCLS'], x=df['GDP'], intercept=True)
print(df['VIXCLS'].corr(df['GDP']), vix_on_gdp.f_stat['p-value'])

Results:

-0.0422917932738 0.851762475093

Same results as stats function:

#Do it with stats functions. 
df_clean = df.dropna()
stats.pearsonr(df_clean['VIXCLS'], df_clean['GDP'])

Results:

  (-0.042291793273791969, 0.85176247509284908)

To extend to more vairables I give you an ugly loop based approach:

#Add a third field
oil = pd.DataFrame(DataReader("DCOILWTICO", "fred", start=datetime(1990, 1, 1))) 
df = df.merge(oil,left_index=True, right_index=True, how='left')

#construct two arrays, one of the correlation and the other of the p-vals
rho = df.corr()
pval = np.zeros([df.shape[1],df.shape[1]])
for i in range(df.shape[1]): # rows are the number of rows in the matrix.
    for j in range(df.shape[1]):
        JonI        = pd.ols(y=df.icol(i), x=df.icol(j), intercept=True)
        pval[i,j]  = JonI.f_stat['p-value']

Results of rho:

             GDP    VIXCLS  DCOILWTICO
 GDP         1.000000 -0.042292    0.870251
 VIXCLS     -0.042292  1.000000   -0.004612
 DCOILWTICO  0.870251 -0.004612    1.000000

Results of pval:

 [[  0.00000000e+00   8.51762475e-01   1.11022302e-16]
  [  8.51762475e-01   0.00000000e+00   9.83747425e-01]
  [  1.11022302e-16   9.83747425e-01   0.00000000e+00]]

Solution 2

To calculate all the p-values at once, you can use calculate_pvalues function (code below):

df = pd.DataFrame({'A':[1,2,3], 'B':[2,5,3], 'C':[5,2,1], 'D':['text',2,3] })
calculate_pvalues(df)

The output is similar to the corr() (but with p-values):

            A       B       C
    A       0  0.7877  0.1789
    B  0.7877       0  0.6088
    C  0.1789  0.6088       0

Details:

Column D is automatically ignored as it contains text.
p-values are rounded to 4 decimals
You can subset to indicate exact columns: calculate_pvalues(df[['A','B','C']]

Following is the code of the function:

from scipy.stats import pearsonr
import pandas as pd

def calculate_pvalues(df):
    df = df.dropna()._get_numeric_data()
    dfcols = pd.DataFrame(columns=df.columns)
    pvalues = dfcols.transpose().join(dfcols, how='outer')
    for r in df.columns:
        for c in df.columns:
            pvalues[r][c] = round(pearsonr(df[r], df[c])[1], 4)
    return pvalues

Solution 3

Statistical significance denoted in asterisks:

from scipy.stats import pearsonr
import numpy as np
rho = df.corr()
pval = df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(*rho.shape)
p = pval.applymap(lambda x: ''.join(['*' for t in [0.01,0.05,0.1] if x<=t]))
rho.round(2).astype(str) + p

Solution 4

You can use the scipy.stats correlation functions to get the p-value.

For example, if you are looking for a correlation such as pearson correlation, you can use the pearsonr function.

from scipy.stats import pearsonr
pearsonr([1, 2, 3], [4, 3, 7])

Gives output

(0.7205766921228921, 0.48775429164459994)

Where the first value in the tuple is the correlation value, and second is the p-value.

In your case, you can use pandas' dropna function to remove NaN values first.

df_clean = df[['column1', 'column2']].dropna()
pearsonr(df_clean['column1'], df_clean['column2'])

Solution 5

In pandas v0.24.0 a method argument was added to corr. Now, you can do:

import pandas as pd
import numpy as np
from scipy.stats import pearsonr

df = pd.DataFrame({'A':[1,2,3], 'B':[2,5,3], 'C':[5,2,1]})

df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(len(df.columns))

          A         B         C
A  0.000000  0.787704  0.178912
B  0.787704  0.000000  0.608792
C  0.178912  0.608792  0.000000

Please note the workaround with np.eye(len(df.columns)) which is needed, because self-correlations are always set to 1.0 (see https://github.com/pandas-dev/pandas/issues/25726).

View more solutions

74,220

wolfsatthedoor

Updated on July 25, 2022

Comments

wolfsatthedoor almost 2 years
What is the best way, given a pandas dataframe, df, to get the correlation between its columns df.1 and df.2?

I do not want the output to count rows with NaN, which pandas built-in correlation does. But I also want it to output a pvalue or a standard error, which the built-in does not.

SciPy seems to get caught up by the NaNs, though I believe it does report significance.

Data example:
```
     1           2
0    2          NaN
1    NaN         1
2    1           2
3    -4          3
4    1.3         1
5    NaN         NaN
```
- Dalek over 9 years
  
  could you provide an example of your data?
wolfsatthedoor over 9 years

What if there are more than 2 columns, is there a way to get a nice output table for correlations?
BKay over 9 years

df.corr() will give you the correlation structure for the whole data frame but to use the regression calculation approach of the p-value would be messy.
Tim Diekmann almost 6 years

Generally, answers are much more helpful if they include an explanation of what the code is intended to do, and why that solves the problem without introducing others.
Max Ghenis about 5 years

pd.ols was deprecated in v 0.20.0, and DataReader was moved to pandas-datareader: github.com/pydata/pandas-datareader
wjie08 over 3 years

This is great for implementing. Any chance that it can be worked into a sns.heatmap with np.triu as mask?
Susmit over 3 years

Not exactly slick, but this works and gets the desired output, p = pd.DataFrame([[pearsonr(df[c], df[y])[1] for y in df.columns] for c in df.columns], columns=df.columns, index=df.columns).copy() p["type"] = "p" p.index.name="col" p = p.set_index([p.index,"type"]) c = df.corr() c["type"] = "c" c.index.name = "col" c = c.set_index([c.index,"type"]) c.combine_first(p)