How to calculate correlation between all columns and remove highly correlated ones using pandas?

Solution 1

The method described here worked well for me and takes only a few lines of code: https://chrisalbon.com/machine_learning/feature_selection/drop_highly_correlated_features/

import numpy as np

# Create correlation matrix
corr_matrix = df.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# Drop features 
df.drop(to_drop, axis=1, inplace=True)
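
A minimal self-contained sketch of the same recipe on a made-up toy frame (the data and the 0.95 threshold here are just for illustration):

import numpy as np
import pandas as pd

# Toy frame: 'b' is an exact multiple of 'a', 'c' is unrelated
df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [2, 4, 6, 8, 10],
                   'c': [5, 1, 4, 2, 3]})

corr_matrix = df.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
df.drop(to_drop, axis=1, inplace=True)

print(df.columns.tolist())  # ['a', 'c'] -- 'b' was dropped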

Solution 2

Here is the approach which I have used -

def correlation(dataset, threshold):
    col_corr = set()  # Set of the names of columns dropped so far
    corr_matrix = dataset.corr().abs()  # abs() so that strong negative correlations are caught too
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i]  # getting the name of the column to drop
                col_corr.add(colname)
                if colname in dataset.columns:
                    del dataset[colname]  # deleting the column from the dataset

    return dataset

Hope this helps!
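
A minimal usage sketch (the df and the 0.9 threshold are hypothetical). Note that the function also mutates the frame you pass in, so hand it a copy if you want to keep the original:

df_reduced = correlation(df.copy(), 0.9)
print(df_reduced.columns)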

Solution 3

Here is an Auto ML class I created to eliminate multicollinearity between features.

What makes my code unique is that, out of two features that have high correlation with each other, I eliminate the feature that is less correlated with the target! I got the idea from this seminar by Vishal Patel - https://www.youtube.com/watch?v=ioXKxulmwVQ&feature=youtu.be

import pandas as pd

#Feature selection class to eliminate multicollinearity
class MultiCollinearityEliminator():
    
    #Class Constructor
    def __init__(self, df, target, threshold):
        self.df = df
        self.target = target
        self.threshold = threshold

    #Method to create and return the feature correlation matrix dataframe
    def createCorrMatrix(self, include_target = False):
        #Checking whether we should include the target in the correlation matrix
        if (include_target == False):
            df_temp = self.df.drop([self.target], axis=1)

            #Setting method to 'pearson' to prevent issues in case the default method for df.corr() gets changed
            #Setting min_periods to 30 for the sample size to be statistically significant (normal) according to
            #the central limit theorem
            corrMatrix = df_temp.corr(method='pearson', min_periods=30).abs()
        #Target is included for creating the series of feature-to-target correlation - please refer to the notes under the
        #print statement to understand why we create the series of feature-to-target correlation
        elif (include_target == True):
            corrMatrix = self.df.corr(method='pearson', min_periods=30).abs()
        return corrMatrix

    #Method to create and return the feature to target correlation matrix dataframe
    def createCorrMatrixWithTarget(self):
        #After obtaining the list of correlated features, this method will help to view which variables
        #(in the list of correlated features) are least correlated with the target
        #This way, out of the list of correlated features, we can ensure we eliminate the feature that is
        #least correlated with the target
        #This not only helps to sustain the predictive power of the model but also helps in reducing model complexity
        
        #Obtaining the correlation matrix of the dataframe (along with the target)
        corrMatrix = self.createCorrMatrix(include_target = True)                           
        #Creating the required dataframe, then dropping the target row 
        #and sorting by the value of correlation with target (in ascending order)
        corrWithTarget = pd.DataFrame(corrMatrix.loc[:,self.target]).drop([self.target], axis = 0).sort_values(by = self.target)                    
        print(corrWithTarget, '\n')
        return corrWithTarget

    #Method to create and return the list of correlated features
    def createCorrelatedFeaturesList(self):
        #Obtaining the correlation matrix of the dataframe (without the target)
        corrMatrix = self.createCorrMatrix(include_target = False)                          
        colCorr = []
        #Iterating through the columns of the correlation matrix dataframe
        for column in corrMatrix.columns:
            #Iterating through the values (row wise) of the correlation matrix dataframe
            for idx, row in corrMatrix.iterrows():                                            
                if(row[column]>self.threshold) and (row[column]<1):
                    #Adding the features that are not already in the list of correlated features
                    if (idx not in colCorr):
                        colCorr.append(idx)
                    if (column not in colCorr):
                        colCorr.append(column)
        print(colCorr, '\n')
        return colCorr

    #Method to eliminate the least important features from the list of correlated features
    def deleteFeatures(self, colCorr):
        #Obtaining the feature to target correlation matrix dataframe
        corrWithTarget = self.createCorrMatrixWithTarget()                                  
        for idx, row in corrWithTarget.iterrows():
            print(idx, '\n')
            if (idx in colCorr):
                self.df = self.df.drop(idx, axis =1)
                break
        return self.df

    #Method to automatically eliminate multicollinearity
    def autoEliminateMulticollinearity(self):
        #Obtaining the list of correlated features
        colCorr = self.createCorrelatedFeaturesList()                                       
        while colCorr != []:
            #Obtaining the dataframe after deleting the feature (from the list of correlated features)
            #that is least correlated with the target
            self.df = self.deleteFeatures(colCorr)
            #Obtaining the list of correlated features
            colCorr = self.createCorrelatedFeaturesList()                                     
        return self.df
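
A hypothetical usage sketch: it assumes df is a numeric pandas DataFrame containing a column named 'target' and at least 30 rows (because of the min_periods=30 setting above); the 0.9 threshold is made up for illustration.

mce = MultiCollinearityEliminator(df, target='target', threshold=0.9)
df_reduced = mce.autoEliminateMulticollinearity()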

Solution 4

You can test the code below:

# Load libraries
import pandas as pd
import numpy as np
# Create feature matrix with two highly correlated features

X = np.array([[1, 1, 1],
              [2, 2, 0],
              [3, 3, 1],
              [4, 4, 0],
              [5, 5, 1],
              [6, 6, 0],
              [7, 7, 1],
              [8, 7, 0],
              [9, 7, 1]])

# Convert feature matrix into DataFrame
df = pd.DataFrame(X)

# View the data frame
df

# Create correlation matrix
corr_matrix = df.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
# Drop features 
df.drop(to_drop, axis=1)
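
With this toy data, columns 0 and 1 are nearly identical (correlation of roughly 0.98), so column 1 lands in to_drop and the result keeps columns 0 and 2. Note that drop without inplace=True returns a new DataFrame rather than modifying df.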

Solution 5

You can use the following for a given data frame df:

import numpy as np

corr_matrix = df.corr().abs()
high_corr_var = np.where(corr_matrix > 0.8)
high_corr_var = [(corr_matrix.index[x], corr_matrix.columns[y])
                 for x, y in zip(*high_corr_var) if x < y]
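
This lists the highly correlated pairs but does not drop anything. A minimal follow-up sketch, assuming one arbitrary but common policy (keep the first column of each pair, drop the second):

to_drop = {second for first, second in high_corr_var}
df_reduced = df.drop(columns=list(to_drop))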

Comments

  • jax
    jax almost 2 years

    I have a huge data set, and prior to machine learning modeling it is always suggested that you first remove highly correlated descriptors (columns). How can I calculate the column-wise correlation and remove columns with a correlation above a threshold value, say, remove all columns or descriptors having > 0.8 correlation? The reduced data should also retain the headers.

    Example data set

     GA      PN       PC     MBP      GR     AP   
    0.033   6.652   6.681   0.194   0.874   3.177    
    0.034   9.039   6.224   0.194   1.137   3.4      
    0.035   10.936  10.304  1.015   0.911   4.9      
    0.022   10.11   9.603   1.374   0.848   4.566    
    0.035   2.963   17.156  0.599   0.823   9.406    
    0.033   10.872  10.244  1.015   0.574   4.871     
    0.035   21.694  22.389  1.015   0.859   9.259     
    0.035   10.936  10.304  1.015   0.911   4.5       
    

    Please help....

  • cel
    cel about 9 years
    While I totally agree with your reasoning, this does not really answer the question. PCA is a more advanced concept for dimension reduction. But note that using correlations does work, and the question is a reasonable one (though definitely lacking research effort, IMO).
  • jax
    jax about 9 years
    @JamieBull Thanks for your kind reply. Before going to advanced techniques like dimensionality reduction (e.g. PCA) or feature selection methods (e.g. tree-based or SVM-based feature elimination), it is always suggested to remove useless features with the help of basic techniques (like variance calculation or correlation calculation), as I learned from various published works. And as per the second part of your comment, "correlations by calling DataFrame.corr()" would be helpful for my case.
  • cel
    cel about 9 years
    @jax, "it is always suggested to remove useless features with the help of basic techniques" - this is not true. There are various methods which do not require such a preprocessing step.
  • jax
    jax about 9 years
    @cel OK, actually I was following some published work, and they suggested the preprocessing steps. Can you please suggest any such method that does not require these preprocessing steps? Thanks.
  • Jamie Bull
    Jamie Bull about 9 years
    There's a discussion of when you should remove correlated variables before PCA here. It comes down to whether they are correlated because they are both influenced by each other or by a third underlying feature, in which case there is an argument for removing one of them. Or alternatively, they may be correlated but not because they are truly related, in which case there is an argument for keeping both. This depends on understanding the variables and so isn't easily done algorithmically.
  • jax
    jax about 9 years
    @JamieBull Thanks for your reply. I had already visited the link you suggested before posting this. But if you go through the question carefully, that post covers only half of the answer. I have already read a lot, and hopefully I will soon post an answer myself. Thanks a lot for all your support and interest.
  • n1k31t4
    n1k31t4 almost 7 years
    This doesn't seem to work for me. The correlations are found and the pairs that match the threshold (i.e. have a higher correlation) are printed. But the resulting dataframe is only missing one (the first) of the variables that have a high correlation.
  • MyopicVisage
    MyopicVisage over 6 years
    This did not work for me. Please consider rewriting your solution as a method. Error: "ValueError: too many values to unpack (expected 2)".
  • Jeru Luke
    Jeru Luke over 6 years
    It should rather be high_corr_var=[(corr_matrix.index[x],corr_matrix.columns[y]) for x,y in zip(*high_corr_var) if x!=y and x<y]
  • Ryan
    Ryan about 5 years
    The loops you have here skip the first two columns of the corr_matrix, and so the correlation between col1 & col2 is not considered; after that, it looks OK.
  • vcovo
    vcovo about 5 years
    I feel like this solution fails in the following general case: Say you have columns c1, c2, and c3. c1 and c2 are correlated above the threshold, the same goes for c2 and c3. With this solution both c2 and c3 will be dropped even though c3 may not be correlated with c1 above that threshold. I suggest changing: if corr_matrix.iloc[i, j] >= threshold: To: if corr_matrix.iloc[i, j] >= threshold and (corr_matrix.columns[j] not in col_corr):
  • NISHA DAGA
    NISHA DAGA about 5 years
    @vcovo If c1 & c2 are correlated and c2 & c3 are correlated, then there is a high chance that c1 & c3 will also be correlated. Although, if that is not true, then I believe that your suggestion of changing the code is correct.
  • vcovo
    vcovo about 5 years
    They most likely would be correlated, but not necessarily above the same threshold. This led to a significant difference in removed columns for my use case. I ended up with 218 columns instead of 180 when adding the additional condition mentioned in the first comment.
  • NISHA DAGA
    NISHA DAGA about 5 years
    Makes sense. Have updated the code as per your suggestion.
  • poPYtheSailor
    poPYtheSailor about 5 years
    @Ryan How did you fix that?
  • Ryan
    Ryan about 5 years
    @poPYtheSailor Please see my posted solution
  • Sushant Kulkarni
    Sushant Kulkarni over 4 years
    Isn't this flawed? The first column is always dropped even though it might not be highly correlated with any other column. When the upper triangle is selected, none of the first column's values remain.
  • Cherry Wu
    Cherry Wu over 4 years
    Have you tried outputting corr_matrix first to see what it looks like?
  • Ikbel
    Ikbel over 4 years
    I got an error while dropping the selected features; the following code worked for me: df.drop(to_drop, axis=1, inplace=True)
  • Cherry Wu
    Cherry Wu over 4 years
    @ikbelbenabdessamad yeah, your code is better. I just updated that old version code, thank you!
  • borchvm
    borchvm about 4 years
    While this code may provide a solution to the question, it's better to add context as to why/how it works. This can help future users learn, and apply that knowledge to their own code. You are also likely to have positive feedback from users in the form of upvotes, when the code is explained.
  • Bedir Yilmaz
    Bedir Yilmaz almost 4 years
    Hi! Welcome to SO. Thank you for the contribution! Here is a guide on how to share your knowledge: stackoverflow.blog/2011/07/01/…
  • Smart Manoj
    Smart Manoj over 3 years
    @vcovo if only c1 and c2 are correlated, how do we choose which column is best to remove?
  • vcovo
    vcovo over 3 years
    @SmartManoj in my use case I just wanted to minimize the number of columns and thus removed highly correlated ones. I had no preference for which one to keep and thus removed the second one (i.e. the rightmost column). I suppose you could create a metric that takes into account the correlation between each column and all others, and then, when presented with a highly correlated pair, remove the one that is most correlated with all other columns (in order to preserve a little more of the variance).
  • Smart Manoj
    Smart Manoj over 3 years
  • hipoglucido
    hipoglucido over 3 years
    Shouldn't you use the absolute value of the correlation matrix?
  • SQLGIT_GeekInTraining
    SQLGIT_GeekInTraining over 3 years
    I really liked it! Have used it for a model I'm building and really easy to understand - thanks a ton for this.
  • Sunit Gautam
    Sunit Gautam over 3 years
    As of the date of writing this comment, this seems to be working fine. I cross-checked for varying thresholds using other methods provided in answers, and results were identical. Thanks!
  • Rishabh Agrahari
    Rishabh Agrahari about 3 years
    This will drop all columns with corr > 0.95; we want to drop all except one.
  • Anonymous
    Anonymous about 3 years
    Indeed, absolute value makes much more sense as -0.9 is just as strong as 0.9
  • Yiğit Can Taşoğlu
    Yiğit Can Taşoğlu over 2 years
    If we add the abs() function while calculating the correlation value between target and feature, we will not see negative correlation values. This is important because, when we have negative correlation, the code drops the smaller one, which actually has a stronger negative correlation value. /// col_corr = abs(df_model[col.values[0]].corr(df_model[target_var]))
  • Mehran
    Mehran over 2 years
    It should be corr_matrix.where((np.triu(np.ones(corr_matrix.shape), k=1) + np.tril(np.ones(corr_matrix.shape), k=-1)).astype(bool)). Your code does not consider the first column at all.
  • mjoy
    mjoy almost 2 years
    Can you provide an example of how to use?