Pandas Correlation Groupby

python pandas group-by correlation

38,840

Solution 1

You pretty much figured out all the pieces, just need to combine them:

>>> df.groupby('ID')[['Val1','Val2']].corr()

             Val1      Val2
ID                         
A  Val1  1.000000  0.500000
   Val2  0.500000  1.000000
B  Val1  1.000000  0.385727
   Val2  0.385727  1.000000

In your case, printing out a 2x2 for each ID is excessively verbose. I don't see an option to print a scalar correlation instead of the whole matrix, but you can do something simple like this if you only have two variables:

>>> df.groupby('ID')[['Val1','Val2']].corr().iloc[0::2,-1]

ID       
A   Val1    0.500000
B   Val1    0.385727

For the more general case of 3+ variables

For 3 or more variables, it is not straightforward to create concise output but you could do something like this:

groups = list('Val1', 'Val2', 'Val3', 'Val4')
df2 = pd.DataFrame()
for i in range( len(groups)-1): 
    df2 = df2.append( df.groupby('ID')[groups].corr().stack()
                        .loc[:,groups[i],groups[i+1]:].reset_index() )

df2.columns = ['ID', 'v1', 'v2', 'corr']
df2.set_index(['ID','v1','v2']).sort_index()

Note that if we didn't have the groupby element, it would be straightforward to use an upper or lower triangle function from numpy. But since that element is present, it is not so easy to produce concise output in a more elegant manner as far as I can tell.

Solution 2

One more simple solution:

df.groupby('ID')[['Val1','Val2']].corr().unstack().iloc[:,1]

Solution 3

In the above answer; since ix has been depreciated use iloc instead with some minor other changes:

df.groupby('ID')[['Val1','Val2']].corr().iloc[0::2][['Val2']] # to get pandas DataFrame

df.groupby('ID')[['Val1','Val2']].corr().iloc[0::2]['Val2'] # to get pandas Series

38,840

Author by

Gohawks

Updated on July 09, 2022

Comments

Gohawks almost 2 years

Assuming I have a dataframe similar to the below, how would I get the correlation between 2 specific columns and then group by the 'ID' column? I believe the Pandas 'corr' method finds the correlation between all columns. If possible I would also like to know how I could find the 'groupby' correlation using the .agg function (i.e. np.correlate).

What I have:

ID  Val1    Val2    OtherData   OtherData
A   5       4       x           x
A   4       5       x           x
A   6       6       x           x
B   4       1       x           x
B   8       2       x           x
B   7       9       x           x
C   4       8       x           x
C   5       5       x           x
C   2       1       x           x

What I need:

ID  Correlation_Val1_Val2
A   0.12
B   0.22
C   0.05

Thanks!

Recents

Why Is PNG file with Drop Shadow in Flutter Web App Grainy?

How to troubleshoot crashes detected by Google Play Store for Flutter app

Cupertino DateTime picker interfering with scroll behaviour

Why does awk -F work for most letters, but not for the letter "t"?

Flutter change focus color and icon color but not works

How to print and connect to printer using flutter desktop via usb?

Critical issues have been reported with the following SDK versions: com.google.android.gms:play-services-safetynet:17.0.0

Flutter Dart - get localized country name from country code

navigatorState is null when using pushNamed Navigation onGenerateRoutes of GetMaterialPage

Android Sdk manager not found- Flutter doctor error

Flutter Laravel Push Notification without using any third party like(firebase,onesignal..etc)

How to change the color of ElevatedButton when entering text in TextField

Partial Correlation in Python

Group By a Column and Sum contents of another column with Python

Python - Pandas subtotals on groupby

Pandas - Split dataframe into multiple dataframes based on dates?

python split a pandas data frame by week or month and group the data based on these sp

Save the output of a pandas groupby operation to CSV

selecting a particular row from groupby object in python

Pandas : Sum multiple columns and get results in multiple columns

Groupby column and find min and max of each group

Pandas GroupBy : How to get top n values based on a column