How to use Pearson Correlation as distance metric in Scikit-learn Agglomerative clustering

You can define a custom affinity matrix as a function which takes in your data and returns the affinity matrix:

from scipy.stats import pearsonr
import numpy as np

def pearson_affinity(M):
    return 1 - np.array([[pearsonr(a, b)[0] for a in M] for b in M])

Then you can call agglomerative clustering with this as the affinity function (you also have to change the linkage, since 'ward' only works with Euclidean distance):

from sklearn.cluster import AgglomerativeClustering

cluster = AgglomerativeClustering(n_clusters=3, linkage='average',
                                  affinity=pearson_affinity)
cluster.fit(X)
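If the nested pearsonr loop gets slow for larger inputs, the same affinity matrix can be computed in one vectorized call with np.corrcoef (a sketch; it assumes, as above, that the rows of M are the observations to be clustered):

```python
import numpy as np

def pearson_affinity_fast(M):
    # np.corrcoef correlates the rows of M with each other,
    # so 1 - np.corrcoef(M) matches the pearsonr double loop above
    return 1 - np.corrcoef(M)
```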

Note that it doesn't seem to work very well for your data for some reason:

cluster.labels_
Out[107]: 
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0])
Author: pdubois
Updated on July 25, 2022

Comments

  • pdubois
    pdubois over 1 year

    I have the following data:

    State            Murder   Assault  UrbanPop  Rape
    Alabama          13.200   236      58        21.200
    Alaska           10.000   263      48        44.500
    Arizona          8.100    294      80        31.000
    Arkansas         8.800    190      50        19.500
    California       9.000    276      91        40.600
    Colorado         7.900    204      78        38.700
    Connecticut      3.300    110      77        11.100
    Delaware         5.900    238      72        15.800
    Florida          15.400   335      80        31.900
    Georgia          17.400   211      60        25.800
    Hawaii           5.300    46       83        20.200
    Idaho            2.600    120      54        14.200
    Illinois         10.400   249      83        24.000
    Indiana          7.200    113      65        21.000
    Iowa             2.200    56       57        11.300
    Kansas           6.000    115      66        18.000
    Kentucky         9.700    109      52        16.300
    Louisiana        15.400   249      66        22.200
    Maine            2.100    83       51        7.800
    Maryland         11.300   300      67        27.800
    Massachusetts    4.400    149      85        16.300
    Michigan         12.100   255      74        35.100
    Minnesota        2.700    72       66        14.900
    Mississippi      16.100   259      44        17.100
    Missouri         9.000    178      70        28.200
    Montana          6.000    109      53        16.400
    Nebraska         4.300    102      62        16.500
    Nevada           12.200   252      81        46.000
    New Hampshire    2.100    57       56        9.500
    New Jersey       7.400    159      89        18.800
    New Mexico       11.400   285      70        32.100
    New York         11.100   254      86        26.100
    North Carolina   13.000   337      45        16.100
    North Dakota     0.800    45       44        7.300
    Ohio             7.300    120      75        21.400
    Oklahoma         6.600    151      68        20.000
    Oregon           4.900    159      67        29.300
    Pennsylvania     6.300    106      72        14.900
    Rhode Island     3.400    174      87        8.300
    South Carolina   14.400   279      48        22.500
    South Dakota     3.800    86       45        12.800
    Tennessee        13.200   188      59        26.900
    Texas            12.700   201      80        25.500
    Utah             3.200    120      80        22.900
    Vermont          2.200    48       32        11.200
    Virginia         8.500    156      63        20.700
    Washington       4.000    145      73        26.200
    West Virginia    5.700    81       39        9.300
    Wisconsin        2.600    53       66        10.800
    Wyoming          6.800    161      60        15.600
    

    I use this data to perform hierarchical clustering of the states. This is the full working code:

    import pandas as pd
    from sklearn.cluster import AgglomerativeClustering

    df = pd.read_table("http://dpaste.com/031VZPM.txt")
    samples = df["State"].tolist()
    ndf = df[["Murder", "Assault", "UrbanPop", "Rape"]]
    X = ndf.to_numpy()  # as_matrix() was removed in recent pandas versions

    cluster = AgglomerativeClustering(n_clusters=3, linkage='complete',
                                      affinity='euclidean').fit(X)
    label = cluster.labels_
    outclust = list(zip(label, samples))
    outclust_df = pd.DataFrame(outclust, columns=["Clusters", "Samples"])

    for clust in outclust_df.groupby("Clusters"):
        print(clust)
    

    Notice that this method uses Euclidean distance. What I want instead is the 1 - Pearson correlation distance. In R it looks like this:

    dat <- read.table("http://dpaste.com/031VZPM.txt",sep="\t",header=TRUE)
    dist2 = function(x) as.dist(1-cor(t(x), method="pearson"))
    dat = dat[c("Murder","Assault","UrbanPop","Rape")]
    hclust(dist2(dat), method="ward.D")
    

    How can I achieve that using Scikit-learn AgglomerativeClustering? I understand that there is a 'precomputed' argument for affinity, but I'm not sure how to use it to address my problem.
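    One way to use the 'precomputed' route, mirroring the R dist2 function, is to compute the 1 - correlation matrix yourself and pass it to fit. A sketch on a few toy rows standing in for the USArrests data (note that recent scikit-learn versions renamed this parameter from affinity to metric; in scikit-learn < 1.2 use affinity='precomputed' instead):

    ```python
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    # Toy rows standing in for the USArrests matrix
    # (columns: Murder, Assault, UrbanPop, Rape)
    X = np.array([[13.2, 236.0, 58.0, 21.2],
                  [10.0, 263.0, 48.0, 44.5],
                  [ 2.1,  57.0, 56.0,  9.5],
                  [ 2.2,  56.0, 57.0, 11.3]])

    # 1 - Pearson correlation between rows,
    # like as.dist(1 - cor(t(x), method="pearson")) in R
    D = 1 - np.corrcoef(X)

    # 'precomputed' tells scikit-learn that D is already a distance matrix
    cluster = AgglomerativeClustering(n_clusters=2, linkage='average',
                                      metric='precomputed').fit(D)
    print(cluster.labels_)
    ```

    As with the custom affinity function, 'ward' linkage is not available here, since it is defined only for Euclidean distances.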

  • pdubois
    pdubois over 8 years
    The code doesn't work for me. For example, pearson_affinity(X) fails. And by the way, why df.values in the function? Is M a np matrix or a pandas DataFrame?
  • maxymoo
    maxymoo over 8 years
    OK, fixed the code now and changed it to one-minus; the clusters are a little better ;) M is an np.array
  • iMad
    iMad about 6 years
    @maxymoo can you please explain why 'ward' linkage only works with Euclidean distance?
  • maxymoo
    maxymoo about 6 years
    @iMad it's just from the definition of ward's method, although this seems to be an active area of research, see e.g. journals.plos.org/plosone/article?id=10.1371/…