Find groups with high cross correlation matrix in Matlab

11,352

Solution 1

This is a good problem for hierarchical clustering. Complete linkage clustering gives you compact clusters; all you have to do is determine the cutoff distance at which two clusters should be considered different.

First, you need to convert the correlation matrix to a dissimilarity matrix. Assuming the correlations lie between 0 and 1 (as here, where only the strength of the correlation matters, not its sign), 1-correlation works well: high correlations get a score close to 0, and low correlations get a score close to 1. Assume that the correlations are stored in an array corrMat:

%# select the entries below the diagonal, column by column
%# (same order as the vector returned by pdist; unlike find(), this keeps correlations that are exactly 0)
belowDiag = tril(true(size(corrMat)),-1);
dissimilarity = 1 - corrMat(belowDiag)';

%# decide on a cutoff on the dissimilarity scale
%# (remember that e.g. 0.4 corresponds to a correlation of 0.6!)
cutoff = 0.5;

%# perform complete linkage clustering
Z = linkage(dissimilarity,'complete');

%# group the data into clusters
%# (cutoff is at a correlation of 0.5)
groups = cluster(Z,'cutoff',cutoff,'criterion','distance')
groups =
     2
     3
     2
     2
     3
     2
     1

To confirm that the clustering looks right, you can visualize the dendrogram:

dendrogram(Z,0,'colorthreshold',cutoff)
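
As a worked illustration, the steps above can be applied to the 7x7 example matrix from the question. This is only a minimal sketch: it assumes the Statistics and Machine Learning Toolbox is available (for linkage and cluster), the names L, mask and groupsMax are made up for this example, and the 'maxclust' call is an alternative that is not part of the original answer, for the case where the number of classes is known in advance.

```matlab
%# lower-triangular correlation matrix from the question
L = [ 1,   0,   0,   0,   0,   0,   0
      .2,  1,   0,   0,   0,   0,   0
      .8,  .15, 1,   0,   0,   0,   0
      .9,  .17, .8,  1,   0,   0,   0
      .23, .8,  .15, .14, 1,   0,   0
      .7,  .13, .77, .83, .11, 1,   0
      .1,  .21, .19, .11, .17, .16, 1 ];

%# entries below the diagonal, column by column (the order pdist uses)
mask = tril(true(size(L)), -1);
dissimilarity = 1 - L(mask)';

%# complete linkage, then cut the tree at a dissimilarity of 0.5
Z = linkage(dissimilarity, 'complete');
groups = cluster(Z, 'cutoff', 0.5, 'criterion', 'distance');

%# alternative (not in the original answer): ask for at most 3 clusters
groupsMax = cluster(Z, 'maxclust', 3);
```

On this matrix both calls should put rows 1, 3, 4 and 6 in one group, rows 2 and 5 in a second, and row 7 on its own, matching the classes listed in the question. The numeric labels themselves are arbitrary.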

Solution 2

You can use the following method instead of creating the dissimilarity matrix.

Z = linkage(corrMat,'complete','correlation')

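With this syntax, linkage treats each row of corrMat as an observation and passes the 'correlation' metric to pdist, so the distance between two signals is one minus the correlation between their rows of corrMat, not 1-corrMat(i,j) itself. It is therefore best given the full symmetric matrix rather than only the lower triangle. You can then plot the dendrogram as follows: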

dendrogram(Z);

One way to check the dendrogram is its maximum height. With complete linkage, the last merge happens at the largest pairwise distance, so when the distances are 1-corrMat the maximum height should equal 1-min(corrMat), taken over the off-diagonal pairs. If the minimum correlation is 0, the maximum height of the tree should be 1. If it is -1 (perfect negative correlation), the height should be 2.
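
The height check can be sketched as follows, under the assumption that the dissimilarities 1-corrMat are handed to linkage directly in pdist's vector form. The small matrix C is hypothetical data invented for this illustration, and squareform and linkage require the Statistics and Machine Learning Toolbox.

```matlab
%# hypothetical full, symmetric correlation matrix with unit diagonal
C = [  1,   .8,  .1, -.3
       .8,  1,   .2, -.2
       .1,  .2,  1,   .6
      -.3, -.2,  .6,  1 ];

%# dissimilarity matrix with zero diagonal, converted to pdist's vector form
D = 1 - C;
Z = linkage(squareform(D), 'complete');

%# with complete linkage the last merge happens at the largest dissimilarity
maxHeight = max(Z(:,3));
expected  = 1 - min(C(:));
fprintf('max height %.2f, 1 - min(corr) %.2f\n', maxHeight, expected);
```

Because the smallest correlation in this made-up matrix is negative, the tree comes out taller than 1, which is the situation described above for signed correlations.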

Author: user1641496

Updated on June 03, 2022

Comments

  • user1641496
    user1641496 almost 2 years

    Given a lower triangular matrix (100x100) containing cross-correlation values, where entry 'ij' is the correlation between signals 'i' and 'j', a high value means that the two signals belong to the same class of objects. Knowing that there are at most four distinct classes in the data set, does someone know of a fast and effective way to assign all the signals to these classes, rather than searching and cross-checking all the entries against each other? The following 7x7 matrix may help illustrate the point:

     1      0       0       0       0       0       0
    .2      1       0       0       0       0       0
    .8      .15     1       0       0       0       0
    .9      .17     .8      1       0       0       0
    .23     .8      .15     .14     1       0       0
    .7      .13     .77     .83     .11     1       0
    .1      .21     .19     .11     .17     .16     1
    

    there are three classes in this example:

    class 1: rows <1 3 4 6>,
    class 2: rows <2 5>,
    class 3: rows <7>
    
  • user1641496
    user1641496 over 11 years
    Thanks indeed for the answers guys.
  • Jonas
    Jonas over 11 years
    @user1641496: If you found an answer helpful, please consider upvoting/accepting it.
  • lukebuehler
    lukebuehler over 10 years
    just found this link for more information about hierarchical clustering for correlations: research.stowers-institute.org/efg/R/Visualization/cor-cluster
  • Grzenio
    Grzenio almost 10 years
    Correlation is between -1 and 1 in general, no?
  • Jonas
    Jonas almost 10 years
    @Grzenio: depends on the definition of correlation, but in principle, yes. Here, I assume the correlation is 0-1 since the strength, but not the sign, of the correlation is of interest.