pandas columns correlation with statistical significance

74,220

Solution 1

Answer provided by @Shashank is nice. However, if you want a solution in pure pandas, you may like this:

import pandas as pd
from pandas.io.data import DataReader
from datetime import datetime
import scipy.stats  as stats


gdp = pd.DataFrame(DataReader("GDP", "fred", start=datetime(1990, 1, 1)))
vix = pd.DataFrame(DataReader("VIXCLS", "fred", start=datetime(1990, 1, 1)))

#Do it with a pandas regression to get the p value from the F-test
df = gdp.merge(vix,left_index=True, right_index=True, how='left')
vix_on_gdp = pd.ols(y=df['VIXCLS'], x=df['GDP'], intercept=True)
print(df['VIXCLS'].corr(df['GDP']), vix_on_gdp.f_stat['p-value'])

Results:

-0.0422917932738 0.851762475093

Same results as stats function:

#Do it with stats functions. 
df_clean = df.dropna()
stats.pearsonr(df_clean['VIXCLS'], df_clean['GDP'])

Results:

  (-0.042291793273791969, 0.85176247509284908)

To extend to more vairables I give you an ugly loop based approach:

#Add a third field
oil = pd.DataFrame(DataReader("DCOILWTICO", "fred", start=datetime(1990, 1, 1))) 
df = df.merge(oil,left_index=True, right_index=True, how='left')

#construct two arrays, one of the correlation and the other of the p-vals
rho = df.corr()
pval = np.zeros([df.shape[1],df.shape[1]])
for i in range(df.shape[1]): # rows are the number of rows in the matrix.
    for j in range(df.shape[1]):
        JonI        = pd.ols(y=df.icol(i), x=df.icol(j), intercept=True)
        pval[i,j]  = JonI.f_stat['p-value']

Results of rho:

             GDP    VIXCLS  DCOILWTICO
 GDP         1.000000 -0.042292    0.870251
 VIXCLS     -0.042292  1.000000   -0.004612
 DCOILWTICO  0.870251 -0.004612    1.000000

Results of pval:

 [[  0.00000000e+00   8.51762475e-01   1.11022302e-16]
  [  8.51762475e-01   0.00000000e+00   9.83747425e-01]
  [  1.11022302e-16   9.83747425e-01   0.00000000e+00]]

Solution 2

To calculate all the p-values at once, you can use calculate_pvalues function (code below):

df = pd.DataFrame({'A':[1,2,3], 'B':[2,5,3], 'C':[5,2,1], 'D':['text',2,3] })
calculate_pvalues(df) 

The output is similar to the corr() (but with p-values):

            A       B       C
    A       0  0.7877  0.1789
    B  0.7877       0  0.6088
    C  0.1789  0.6088       0

Details:

  • Column D is automatically ignored as it contains text.
  • p-values are rounded to 4 decimals
  • You can subset to indicate exact columns: calculate_pvalues(df[['A','B','C']]

Following is the code of the function:

from scipy.stats import pearsonr
import pandas as pd

def calculate_pvalues(df):
    df = df.dropna()._get_numeric_data()
    dfcols = pd.DataFrame(columns=df.columns)
    pvalues = dfcols.transpose().join(dfcols, how='outer')
    for r in df.columns:
        for c in df.columns:
            pvalues[r][c] = round(pearsonr(df[r], df[c])[1], 4)
    return pvalues

Solution 3

Statistical significance denoted in asterisks:

from scipy.stats import pearsonr
import numpy as np
rho = df.corr()
pval = df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(*rho.shape)
p = pval.applymap(lambda x: ''.join(['*' for t in [0.01,0.05,0.1] if x<=t]))
rho.round(2).astype(str) + p

Correlations with asterisks

Solution 4

You can use the scipy.stats correlation functions to get the p-value.

For example, if you are looking for a correlation such as pearson correlation, you can use the pearsonr function.

from scipy.stats import pearsonr
pearsonr([1, 2, 3], [4, 3, 7])

Gives output

(0.7205766921228921, 0.48775429164459994)

Where the first value in the tuple is the correlation value, and second is the p-value.

In your case, you can use pandas' dropna function to remove NaN values first.

df_clean = df[['column1', 'column2']].dropna()
pearsonr(df_clean['column1'], df_clean['column2'])

Solution 5

In pandas v0.24.0 a method argument was added to corr. Now, you can do:

import pandas as pd
import numpy as np
from scipy.stats import pearsonr

df = pd.DataFrame({'A':[1,2,3], 'B':[2,5,3], 'C':[5,2,1]})

df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(len(df.columns)) 
          A         B         C
A  0.000000  0.787704  0.178912
B  0.787704  0.000000  0.608792
C  0.178912  0.608792  0.000000

Please note the workaround with np.eye(len(df.columns)) which is needed, because self-correlations are always set to 1.0 (see https://github.com/pandas-dev/pandas/issues/25726).

Share:
74,220

Related videos on Youtube

wolfsatthedoor
Author by

wolfsatthedoor

Updated on July 25, 2022

Comments

  • wolfsatthedoor
    wolfsatthedoor almost 2 years

    What is the best way, given a pandas dataframe, df, to get the correlation between its columns df.1 and df.2?

    I do not want the output to count rows with NaN, which pandas built-in correlation does. But I also want it to output a pvalue or a standard error, which the built-in does not.

    SciPy seems to get caught up by the NaNs, though I believe it does report significance.

    Data example:

         1           2
    0    2          NaN
    1    NaN         1
    2    1           2
    3    -4          3
    4    1.3         1
    5    NaN         NaN
    
    • Dalek
      Dalek over 9 years
      could you provide an example of your data?
  • wolfsatthedoor
    wolfsatthedoor over 9 years
    What if there are more than 2 columns, is there a way to get a nice output table for correlations?
  • BKay
    BKay over 9 years
    df.corr() will give you the correlation structure for the whole data frame but to use the regression calculation approach of the p-value would be messy.
  • Tim Diekmann
    Tim Diekmann almost 6 years
    Generally, answers are much more helpful if they include an explanation of what the code is intended to do, and why that solves the problem without introducing others.
  • Max Ghenis
    Max Ghenis about 5 years
    pd.ols was deprecated in v 0.20.0, and DataReader was moved to pandas-datareader: github.com/pydata/pandas-datareader
  • wjie08
    wjie08 over 3 years
    This is great for implementing. Any chance that it can be worked into a sns.heatmap with np.triu as mask?
  • Susmit
    Susmit over 3 years
    Not exactly slick, but this works and gets the desired output, p = pd.DataFrame([[pearsonr(df[c], df[y])[1] for y in df.columns] for c in df.columns], columns=df.columns, index=df.columns).copy() p["type"] = "p" p.index.name="col" p = p.set_index([p.index,"type"]) c = df.corr() c["type"] = "c" c.index.name = "col" c = c.set_index([c.index,"type"]) c.combine_first(p)