How to calculate correlation between all columns and remove highly correlated ones using pandas?
Solution 1
The method here worked well for me, only a few lines of code: https://chrisalbon.com/machine_learning/feature_selection/drop_highly_correlated_features/
import numpy as np
# Create correlation matrix
corr_matrix = df.corr().abs()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find features with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
# Drop features
df.drop(to_drop, axis=1, inplace=True)
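For reference, here is a minimal end-to-end sketch of the same approach on a toy DataFrame (the column names a, b, c and the data are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: 'a' and 'b' are nearly identical, 'c' is unrelated
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1.1, 2.0, 3.1, 4.0, 5.1, 6.0],
    "c": [3, 1, 2, 1, 3, 2],
})

corr_matrix = df.corr().abs()
# Keep only the upper triangle (k=1 excludes the diagonal) so each pair is checked once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
df = df.drop(columns=to_drop)
# to_drop is ['b']; df now has columns 'a' and 'c'
```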
Solution 2
Here is the approach I have used:
def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of deleted columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i]  # getting the name of the column
                col_corr.add(colname)
                if colname in dataset.columns:
                    del dataset[colname]  # deleting the column from the dataset

print(dataset)
Hope this helps!
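A quick way to try this out (the function is repeated here so the snippet runs standalone; the toy columns x, y, z are made up for illustration):

```python
import pandas as pd

def correlation(dataset, threshold):
    col_corr = set()  # names of deleted columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i]
                col_corr.add(colname)
                if colname in dataset.columns:
                    del dataset[colname]  # drops the column in place

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],  # y = 2*x, perfectly correlated with x
    "z": [5, 3, 4, 3, 5],   # unrelated to x and y
})
correlation(df, 0.8)
# 'y' is removed in place; 'x' and 'z' remain
```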
Solution 3
Here is an Auto ML class I created to eliminate multicollinearity between features.
What makes my code unique is that out of two features that have high correlation, I have eliminated the feature that is least correlated with the target! I got the idea from this seminar by Vishal Patel Sir - https://www.youtube.com/watch?v=ioXKxulmwVQ&feature=youtu.be
import pandas as pd

#Feature selection class to eliminate multicollinearity
class MultiCollinearityEliminator():

    #Class constructor
    def __init__(self, df, target, threshold):
        self.df = df
        self.target = target
        self.threshold = threshold

    #Method to create and return the feature correlation matrix dataframe
    def createCorrMatrix(self, include_target=False):
        #Checking whether we should include the target in the correlation matrix
        if (include_target == False):
            df_temp = self.df.drop([self.target], axis=1)
            #Setting method to Pearson to prevent issues in case the default method for df.corr() gets changed
            #Setting min_periods to 30 for the sample size to be statistically significant (normal) according to
            #the central limit theorem
            corrMatrix = df_temp.corr(method='pearson', min_periods=30).abs()
        #Target is included for creating the series of feature-to-target correlation - Please refer to the notes under the
        #print statement to understand why we create the series of feature-to-target correlation
        elif (include_target == True):
            corrMatrix = self.df.corr(method='pearson', min_periods=30).abs()
        return corrMatrix

    #Method to create and return the feature-to-target correlation matrix dataframe
    def createCorrMatrixWithTarget(self):
        #After obtaining the list of correlated features, this method will help to view which variables
        #(in the list of correlated features) are least correlated with the target
        #This way, out of the list of correlated features, we can ensure to eliminate the feature that is
        #least correlated with the target
        #This not only helps to sustain the predictive power of the model but also helps in reducing model complexity

        #Obtaining the correlation matrix of the dataframe (along with the target)
        corrMatrix = self.createCorrMatrix(include_target=True)
        #Creating the required dataframe, then dropping the target row
        #and sorting by the value of correlation with the target (in ascending order)
        corrWithTarget = pd.DataFrame(corrMatrix.loc[:, self.target]).drop([self.target], axis=0).sort_values(by=self.target)
        print(corrWithTarget, '\n')
        return corrWithTarget

    #Method to create and return the list of correlated features
    def createCorrelatedFeaturesList(self):
        #Obtaining the correlation matrix of the dataframe (without the target)
        corrMatrix = self.createCorrMatrix(include_target=False)
        colCorr = []
        #Iterating through the columns of the correlation matrix dataframe
        for column in corrMatrix.columns:
            #Iterating through the values (row-wise) of the correlation matrix dataframe
            for idx, row in corrMatrix.iterrows():
                if (row[column] > self.threshold) and (row[column] < 1):
                    #Adding the features that are not already in the list of correlated features
                    if (idx not in colCorr):
                        colCorr.append(idx)
                    if (column not in colCorr):
                        colCorr.append(column)
        print(colCorr, '\n')
        return colCorr

    #Method to eliminate the least important features from the list of correlated features
    def deleteFeatures(self, colCorr):
        #Obtaining the feature-to-target correlation matrix dataframe
        corrWithTarget = self.createCorrMatrixWithTarget()
        for idx, row in corrWithTarget.iterrows():
            print(idx, '\n')
            if (idx in colCorr):
                self.df = self.df.drop(idx, axis=1)
                break
        return self.df

    #Method to run automatically and eliminate multicollinearity
    def autoEliminateMulticollinearity(self):
        #Obtaining the list of correlated features
        colCorr = self.createCorrelatedFeaturesList()
        while colCorr != []:
            #Obtaining the dataframe after deleting the feature (from the list of correlated features)
            #that is least correlated with the target
            self.df = self.deleteFeatures(colCorr)
            #Obtaining the list of correlated features
            colCorr = self.createCorrelatedFeaturesList()
        return self.df
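The class is used by constructing it with a dataframe, the name of the target column, and a threshold, then calling autoEliminateMulticollinearity(). Since the class itself is long, here is a condensed, self-contained sketch of the same idea (my own illustration, not the author's code) that can be run quickly:

```python
import pandas as pd

def eliminate_multicollinearity(df, target, threshold):
    """From each highly correlated feature pair, iteratively drop the
    feature least correlated with the target (condensed sketch of the
    class above, not the class itself)."""
    df = df.copy()
    while True:
        corr = df.drop(columns=[target]).corr().abs()
        cols = list(corr.columns)
        # Collect every feature involved in an above-threshold pair
        correlated = set()
        for i in range(len(cols)):
            for j in range(i + 1, len(cols)):
                if corr.iloc[i, j] > threshold:
                    correlated.update([cols[i], cols[j]])
        if not correlated:
            return df
        # Among those, drop the feature least correlated with the target
        target_corr = df.corr().abs()[target].drop(target)
        weakest = target_corr.loc[sorted(correlated)].idxmin()
        df = df.drop(columns=[weakest])

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 2, 3, 4, 5, 7],   # highly correlated with 'a'
    "c": [6, 1, 3, 1, 5, 2],
    "target": [1, 2, 3, 4, 5, 6],
})
reduced = eliminate_multicollinearity(df, "target", 0.9)
# 'b' is dropped: of the (a, b) pair, it is the one less correlated with target
```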
Solution 4
You can test the code below:
# Load libraries
import pandas as pd
import numpy as np
# Create feature matrix with two highly correlated features
X = np.array([[1, 1, 1],
[2, 2, 0],
[3, 3, 1],
[4, 4, 0],
[5, 5, 1],
[6, 6, 0],
[7, 7, 1],
[8, 7, 0],
[9, 7, 1]])
# Convert feature matrix into DataFrame
df = pd.DataFrame(X)
# View the data frame
df
# Create correlation matrix
corr_matrix = df.corr().abs()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
# Drop features
df.drop(to_drop, axis=1)
Solution 5
You can use the following for a given data frame df:
corr_matrix = df.corr().abs()
high_corr_var = np.where(corr_matrix > 0.8)
high_corr_var = [(corr_matrix.columns[x], corr_matrix.columns[y]) for x, y in zip(*high_corr_var) if x != y and x < y]
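For reference, a runnable sketch of this approach on a toy DataFrame (the columns u, v, w are made up for illustration); the result is a list of correlated column pairs you can then decide what to drop from:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "u": [1, 2, 3, 4, 5],
    "v": [2, 4, 6, 8, 10],  # v = 2*u, perfectly correlated
    "w": [1, 3, 2, 3, 1],   # unrelated
})

corr_matrix = df.corr().abs()
high_corr_var = np.where(corr_matrix > 0.8)
# x < y keeps each pair once and skips the diagonal
high_corr_var = [(corr_matrix.index[x], corr_matrix.columns[y])
                 for x, y in zip(*high_corr_var) if x < y]
# high_corr_var is a list of (feature, feature) pairs, here [('u', 'v')]
```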
jax
Updated on July 05, 2022

Comments
-
jax almost 2 years
I have a huge data set, and prior to machine learning modeling it is always suggested that you first remove highly correlated descriptors (columns). How can I calculate the column-wise correlation and remove columns with a threshold value, say, remove all columns or descriptors having > 0.8 correlation? Also, it should retain the headers in the reduced data.
Example data set
GA     PN      PC      MBP    GR     AP
0.033  6.652   6.681   0.194  0.874  3.177
0.034  9.039   6.224   0.194  1.137  3.4
0.035  10.936  10.304  1.015  0.911  4.9
0.022  10.11   9.603   1.374  0.848  4.566
0.035  2.963   17.156  0.599  0.823  9.406
0.033  10.872  10.244  1.015  0.574  4.871
0.035  21.694  22.389  1.015  0.859  9.259
0.035  10.936  10.304  1.015  0.911  4.5
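For reference, the sample above can be loaded and its column-wise correlation matrix computed like this (a minimal sketch):

```python
import io
import pandas as pd

data = """GA PN PC MBP GR AP
0.033 6.652 6.681 0.194 0.874 3.177
0.034 9.039 6.224 0.194 1.137 3.4
0.035 10.936 10.304 1.015 0.911 4.9
0.022 10.11 9.603 1.374 0.848 4.566
0.035 2.963 17.156 0.599 0.823 9.406
0.033 10.872 10.244 1.015 0.574 4.871
0.035 21.694 22.389 1.015 0.859 9.259
0.035 10.936 10.304 1.015 0.911 4.5"""

df = pd.read_csv(io.StringIO(data), sep=r"\s+")
corr = df.corr().abs()  # 6x6 absolute correlation matrix, headers preserved
```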
Please help....
-
cel about 9 years: While I totally agree with your reasoning, this does not really answer the question. PCA is a more advanced concept for dimension reduction. But note that using correlations does work, and the question is a reasonable one (but definitely lacking research effort, IMO).
-
jax about 9 years: @JamieBull Thanks for your kind reply. Before going to advanced techniques like dimensionality reduction (e.g. PCA) or feature selection methods (e.g. tree-based or SVM-based feature elimination), it is always suggested to remove useless features with the help of basic techniques (like variance or correlation calculation), which I learned from various published works. And as per the second part of your comment, "correlations by calling DataFrame.corr()" would be helpful for my case.
-
cel about 9 years: @jax, "it is always suggested to remove useless feature with the help of basic techniques". This is not true. There are various methods which do not require such a preprocessing step.
-
jax about 9 years: @cel OK, actually I was following some published work, so they suggested the preprocessing steps. Can you please suggest any such method that does not bother with preprocessing steps? Thanks.
-
Jamie Bull about 9 years: There's a discussion of when you should remove correlated variables before PCA here. It comes down to whether they are correlated because they are both influenced by each other or by a third underlying feature, in which case there is an argument for removing one of them. Or, alternatively, where they are correlated but not because they are truly related, in which case there is an argument for keeping both. This depends on understanding the variables, and so isn't easily done algorithmically.
-
jax about 9 years: @JamieBull Thanks for your reply. I had already been there (the web link you suggested) before posting this. But if you go through the question carefully, this covers only half of it; I have already read a lot, and hopefully soon I will post an answer myself. Thanks a lot for all your support and interest.
-
n1k31t4 almost 7 years: This doesn't seem to work for me. The correlations are found and the pairs that match the threshold (i.e. have a higher correlation) are printed. But the resulting dataframe is only missing one (the first) variable that has a high correlation.
-
MyopicVisage over 6 years: This did not work for me. Please consider rewriting your solution as a method. Error: "ValueError: too many values to unpack (expected 2)".
-
Jeru Luke over 6 years: It should rather be
high_corr_var=[(corr_matrix.index[x],corr_matrix.columns[y]) for x,y in zip(*high_corr_var) if x!=y and x<y]
-
Ryan about 5 years: The loops you have here skip the first two columns of the corr_matrix, so the correlation between col1 & col2 is not considered; after that it looks OK.
-
vcovo about 5 years: I feel like this solution fails in the following general case: say you have columns c1, c2, and c3. c1 and c2 are correlated above the threshold, and the same goes for c2 and c3. With this solution, both c2 and c3 will be dropped even though c3 may not be correlated with c1 above that threshold. I suggest changing:
if corr_matrix.iloc[i, j] >= threshold:
to:
if corr_matrix.iloc[i, j] >= threshold and (corr_matrix.columns[j] not in col_corr):
-
NISHA DAGA about 5 years: @vcovo If c1 & c2 are correlated and c2 & c3 are correlated, then there is a high chance that c1 & c3 will also be correlated. Although, if that is not true, then I believe your suggestion of changing the code is correct.
-
vcovo about 5 years: They most likely would be correlated, but not necessarily above the same threshold. This led to a significant difference in removed columns for my use case. I ended up with 218 columns instead of 180 when adding the additional condition mentioned in the first comment.
-
NISHA DAGA about 5 years: Makes sense. I have updated the code as per your suggestion.
-
poPYtheSailor about 5 years: @Ryan How did you fix that?
-
Ryan about 5 years: @poPYtheSailor Please see my posted solution.
-
Sushant Kulkarni over 4 years: Isn't this flawed? The first column is always dropped even though it might not be highly correlated with any other column; when the upper triangle is selected, none of the first column's values remain.
-
Cherry Wu over 4 years: Have you ever output corr_matrix and seen what it looks like first?
-
Ikbel over 4 years: I got an error while dropping the selected features; the following code worked for me:
df.drop(to_drop, axis=1, inplace=True)
-
Cherry Wu over 4 years: @ikbelbenabdessamad Yeah, your code is better. I just updated that old version of the code, thank you!
-
borchvm about 4 years: While this code may provide a solution to the question, it's better to add context as to why/how it works. This can help future users learn and apply that knowledge to their own code. You are also likely to have positive feedback from users, in the form of upvotes, when the code is explained.
-
Bedir Yilmaz almost 4 years: Hi! Welcome to SO. Thank you for the contribution! Here is a guide on how to share your knowledge: stackoverflow.blog/2011/07/01/…
-
Smart Manoj over 3 years: @vcovo If c1 and c2 are only correlated with each other, how do we choose the best column to remove?
-
vcovo over 3 years: @SmartManoj In my use case I just wanted to minimize the number of columns, and thus removed highly correlated ones. I had no preference for which one to keep and so removed the second one (i.e. the rightmost column). I suppose you could create a metric that takes into account the correlation between each column and all others, and then, when presented with a highly correlated pair, remove the one that is most correlated with all other columns (in order to preserve a little more of the variance).
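The tie-breaking metric described in this comment could be sketched like so (a hypothetical helper of my own, not vcovo's actual code): for a highly correlated pair, drop the column whose mean absolute correlation with all other columns is higher.

```python
import pandas as pd

def pick_column_to_drop(df, col_a, col_b):
    # For a highly correlated pair, prefer dropping the column that is
    # more correlated with everything else, to preserve more variance.
    corr = df.corr().abs()
    mean_a = corr[col_a].drop(col_a).mean()
    mean_b = corr[col_b].drop(col_b).mean()
    return col_a if mean_a >= mean_b else col_b

df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1, 2, 3, 5],   # highly correlated with 'a'
    "c": [2, 4, 6, 8],   # c = 2*a
    "d": [1, 0, 1, 0],
})
# Of the (a, b) pair, 'b' has the (slightly) higher mean correlation
# with the remaining columns, so it is the one chosen to drop here
```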
-
hipoglucido over 3 years: Shouldn't you use the absolute value of the correlation matrix?
-
SQLGIT_GeekInTraining over 3 years: I really liked it! I have used it for a model I'm building, and it's really easy to understand - thanks a ton for this.
-
Sunit Gautam over 3 years: As of the date of writing this comment, this seems to be working fine. I cross-checked for varying thresholds using other methods provided in answers, and the results were identical. Thanks!
-
Rishabh Agrahari about 3 years: This will drop all columns with corr > 0.95; we want to drop all except one.
-
Anonymous about 3 years: Indeed, absolute value makes much more sense, as -0.9 is just as strong as 0.9.
-
Yiğit Can Taşoğlu over 2 years: If we add the abs() function while calculating the correlation value between target and feature, we will not see negative correlation values. This is important because, when we have negative correlations, the code drops the smaller one, which has the stronger negative correlation value. /// col_corr = abs(df_model[col.values[0]].corr(df_model[target_var]))
-
Mehran over 2 years: It should be
corr_matrix.where((np.triu(np.ones(corr_matrix.shape), k=1) + np.tril(np.ones(corr_matrix.shape), k=-1)).astype(bool))
Your code does not consider the first column at all.
-
mjoy almost 2 years: Can you provide an example of how to use this?