scipy linkage format

33,161

Solution 1

This is from the scipy.cluster.hierarchy.linkage() function documentation, I think it's a pretty clear description for the output format:

A (n-1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n + i. A cluster with an index less than n corresponds to one of the original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.

Do you need something more?

Solution 2

I agree with https://stackoverflow.com/users/1167475/mortonjt that the documentation does not fully explain the indexing of intermediate clusters, while I do agree with the https://stackoverflow.com/users/1354844/dkar that the format is otherwise precisely explained.

Using the example data from this question: Tutorial for scipy.cluster.hierarchy

A = np.array([[0.1,   2.5],
              [1.5,   .4 ],
              [0.3,   1  ],
              [1  ,   .8 ],
              [0.5,   0  ],
              [0  ,   0.5],
              [0.5,   0.5],
              [2.7,   2  ],
              [2.2,   3.1],
              [3  ,   2  ],
              [3.2,   1.3]])

A linkage matrix can be built using the single (i.e, the closest matching points):

z = hac.linkage(a, method="single")

 array([[  7.        ,   9.        ,   0.3       ,   2.        ],
        [  4.        ,   6.        ,   0.5       ,   2.        ],
        [  5.        ,  12.        ,   0.5       ,   3.        ],
        [  2.        ,  13.        ,   0.53851648,   4.        ],
        [  3.        ,  14.        ,   0.58309519,   5.        ],
        [  1.        ,  15.        ,   0.64031242,   6.        ],
        [ 10.        ,  11.        ,   0.72801099,   3.        ],
        [  8.        ,  17.        ,   1.2083046 ,   4.        ],
        [  0.        ,  16.        ,   1.5132746 ,   7.        ],
        [ 18.        ,  19.        ,   1.92353841,  11.        ]])

As the documentation explains the clusters below n (here: 11) are simply the data points in the original matrix A. The intermediate clusters going forward, are indexed successively.

Thus, clusters 7 and 9 (the first merge) are merged into cluster 11, clusters 4 and 6 into 12. Then observe line three, merging clusters 5 (from A) and 12 (from the not-shown intermediate cluster 12) resulting with a Within-Cluster Distance (WCD) of 0.5. The single method entails that the new WCS is 0.5, which is the distance between A[5] and the closest point in cluster 12, A[4] and A[6]. Let's check:

 In [198]: norm([a[5]-a[4]])
 Out[198]: 0.70710678118654757
 In [199]: norm([a[5]-a[6]])
 Out[199]: 0.5

This cluster should now be intermediate cluster 13, which subsequently is merged with A[2]. Thus, the new distance should be the closest between the points A[2] and A[4,5,6].

 In [200]: norm([a[2]-a[4]])
 Out[200]: 1.019803902718557
 In [201]: norm([a[2]-a[5]])
 Out[201]: 0.58309518948452999
 In [202]: norm([a[2]-a[6]])
 Out[202]: 0.53851648071345048

Which, as can be seen also checks out, and explains the intermediate format of new clusters.

Solution 3

The scipy documentation is accurate as dkar pointed out ... but it's a little bit difficult to turn the returned data into something that is usable for further analysis.

In my opinion they should include the ability to return the data in a tree like data structure. The code below will iterate through the matrix and build a tree:

from scipy.cluster.hierarchy import linkage
import numpy as np

a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
centers = np.concatenate((a, b),)

def create_tree(centers):
    clusters = {}
    to_merge = linkage(centers, method='single')
    for i, merge in enumerate(to_merge):
        if merge[0] <= len(to_merge):
            # if it is an original point read it from the centers array
            a = centers[int(merge[0]) - 1]
        else:
            # other wise read the cluster that has been created
            a = clusters[int(merge[0])]

        if merge[1] <= len(to_merge):
            b = centers[int(merge[1]) - 1]
        else:
            b = clusters[int(merge[1])]
        # the clusters are 1-indexed by scipy
        clusters[1 + i + len(to_merge)] = {
            'children' : [a, b]
        }
        # ^ you could optionally store other info here (e.g distances)
    return clusters

print create_tree(centers)
Share:
33,161
geo_pythoncl
Author by

geo_pythoncl

Updated on July 09, 2022

Comments

  • geo_pythoncl
    geo_pythoncl almost 2 years

    I have written my own clustering routine and would like to produce a dendrogram. The easiest way to do this would be to use scipy dendrogram function. However, this requires the input to be in the same format that the scipy linkage function produces. I cannot find an example of how the output of this is formatted. I was wondering whether someone out there can enlighten me.

  • mortonjt
    mortonjt over 7 years
    Yeah, more information would be extremely helpful. For instance, if I were to enumerate through all of the indices, what sort of a traversal of it be? How exactly are the nodes labeled? Could you please if a clear and precise example, and step through exactly how this is formatted, along with the tree and all of the labels corresponding to each of the nodes.
  • Blake Regalia
    Blake Regalia almost 4 years
    what is members referring to in the for loop? it's undefined
  • David Bernat
    David Bernat almost 4 years
    It has been a while since I reviewed this code. Members should refer to the indices which are deigned to be in the cluster. If members is not present, is it possible you have configured the model to create more clusters than necessary? Worse case scenario, perhaps you can simply set members to an empty set if the key is not present? Sorry I cannot be more helpful.
  • codeananda
    codeananda almost 3 years
    This is such a concisely explained answer and makes it completely clear what the scipy docs actually mean!
  • sol
    sol over 2 years
    should be added to the doc. doc writers should understand that a step-by-step and comprehensive example should be provided in the doc for each function!
  • vk1011
    vk1011 over 2 years
    So well explained, thank you! The matrix finally makes sense to me!