Use Distance Matrix in scipy.cluster.hierarchy.linkage()?

26,165

Solution 1

It seems that indeed we cannot directly pass the redundant square matrix in, although the documentation claims we can do so.

To benefit anyone who faces the same problem in the future, I write my solution as an additional answer here. So the copy-and-paste guys can just proceed with the clustering.

Use the following snippet to condense the matrix and happily proceed.

import scipy.spatial.distance as ssd
# convert the redundant n*n square matrix form into a condensed nC2 array
    distArray = ssd.squareform(distMatrix) # distArray[{n choose 2}-{n-i choose 2} + (j-i-1)] is the distance between points i and j

Please correct me if I am wrong.

Solution 2

For now you should pass in the 'condensed distance matrix', i.e. just the upper triangle of the distance matrix in vector form:

y = M[np.triu_indices(n,1)]

From the discussion of @hongbo-zhu-cn's pull request it looks as though the solution will be to add an extra keyword argument to the linkage function that will allow the user to explicitly specify that they are passing in an n x n distance matrix rather than an m x n observation matrix.

Share:
26,165

Related videos on Youtube

Sibbs Gambling
Author by

Sibbs Gambling

Updated on July 09, 2022

Comments

  • Sibbs Gambling
    Sibbs Gambling almost 2 years

    I have a distance matrix n*n M where M_ij is the distance between object_i and object_j. So as expected, it takes the following form:

       /  0     M_01    M_02    ...    M_0n\
       | M_10    0      M_12    ...    M_1n |
       | M_20   M_21     0      ...    M2_n |
       |                ...                 |
       \ M_n0   M_n2    M_n2    ...      0 / 
    

    Now I wish to cluster these n objects with hierarchical clustering. Python has an implementation of this called scipy.cluster.hierarchy.linkage(y, method='single', metric='euclidean').

    Its documentation says:

    y must be a {n \choose 2} sized vector where n is the number of original observations paired in the distance matrix.

    y : ndarray

    A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. This is the form that pdist returns. Alternatively, a collection of m observation vectors in n dimensions may be passed as an m by n array.

    I am confused by this description of y. Can I directly feed my M in as the input y?


    Update

    @hongbo-zhu-cn has raised this issue up in GitHub. This is exactly what I am concerning about. However, as a newbie to GitHub, I don't know how it works and therefore have no idea how this issue is dealt with.

    • HongboZhu
      HongboZhu over 10 years
      you can either checkout your own copy of scipy source code and update the linkage() function in scipy/cluster/hierarchy.py and compile your own version of Scipy, or simply avoid giving redundant distance matrix as input to linkage(). (use squareform() to convert it to condensed form).
    • Sibbs Gambling
      Sibbs Gambling over 10 years
      @HongboZhu Oh, yes, I nearly forgot that I could do this!
  • HongboZhu
    HongboZhu over 10 years
    there is a scipy function scipy.spatial.distance.squareform() doing the conversion: docs.scipy.org/doc/scipy/reference/generated/…
  • HongboZhu
    HongboZhu over 10 years
    I think it is the way to go for the moment. You can compare your results obtained by using condensed distance matrix as input and results obtained by using observations as input.