Clustering cosine similarity matrix


You can easily do this using spectral clustering. You can use a ready-made implementation such as the one in sklearn, or implement it yourself; it is a rather simple algorithm.

Here is a piece of code that does it in Python using sklearn:

import numpy as np
from sklearn.cluster import SpectralClustering

# the cosine similarity matrix from the question (rows/columns: A, B, C, D)
mat = np.array([[1., .1, .6, .4],
                [.1, 1., .1, .2],
                [.6, .1, 1., .7],
                [.4, .2, .7, 1.]])
SpectralClustering(2).fit_predict(mat)
>>> array([0, 1, 0, 0], dtype=int32)

As you can see, it returns the clustering you mentioned.
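By default, SpectralClustering treats its input as a feature matrix and builds an RBF affinity from the rows. Since your matrix is already a similarity matrix, it is worth knowing the affinity='precomputed' option, which uses the matrix directly as the graph affinity. A minimal sketch, reusing mat from the snippet above (for this matrix it should give the same grouping, B alone and A, C, D together):

# tell sklearn that `mat` is already an affinity (similarity) matrix
# rather than raw feature vectors
SpectralClustering(2, affinity='precomputed').fit_predict(mat)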

The algorithm takes the top k eigenvectors of the input matrix corresponding to the largest eigenvalues, then runs the k-means algorithm on the new matrix. Here is a simple snippet that does this for your matrix:

from sklearn.cluster import KMeans

# np.linalg.eigh returns eigenvalues in ascending order, so the last two
# columns of eigen_vectors correspond to the two largest eigenvalues
eigen_values, eigen_vectors = np.linalg.eigh(mat)
KMeans(n_clusters=2, init='k-means++').fit_predict(eigen_vectors[:, 2:4])
>>> array([0, 1, 0, 0], dtype=int32)

Note that the implementation of the algorithm in the sklearn library may differ from mine. The example I gave is the simplest way of doing it. There are some good tutorials available online that describe the spectral clustering algorithm in depth.

For cases where you want the algorithm to figure out the number of clusters by itself, you can use density-based clustering algorithms like DBSCAN:

from sklearn.cluster import DBSCAN

# min_samples=1 makes every point a core point, so no point is labelled as noise
DBSCAN(min_samples=1).fit_predict(mat)
>>> array([0, 1, 2, 2])

Comments

  • Stefan D over 3 years

    A few questions on stackoverflow mention this problem, but I haven't found a concrete solution.

    I have a square matrix which consists of cosine similarities (values between 0 and 1), for example:

      |  A  |  B  |  C  |  D
    A | 1.0 | 0.1 | 0.6 |  0.4
    B | 0.1 | 1.0 | 0.1 |  0.2
    C | 0.6 | 0.1 | 1.0 |  0.7
    D | 0.4 | 0.2 | 0.7 |  1.0
    

    The square matrix can be of any size. I want to get clusters (I don't know how many) which maximize the values between the elements in the cluster. I.e. for the above example I should get two clusters:

    1. B
    2. A, C, D

    The reason is that C & D have the highest value between them, and A & C also have the highest value between them.

    An item can be in only one cluster.

    Recall is not that important for this problem, but precision is very important. It is acceptable to output three clusters: 1) B, 2) A, 3) C, D . But it is not acceptable to output any solution where B is in a cluster with another element.

    I think the diagonal (1.0) is confusing me. My data is guaranteed to have at least one cluster of 2+ elements, and I want to find as many clusters as possible without sacrificing precision.

    I will have to implement this in Python.

  • Stefan D about 9 years
    Both the KMeans algorithm and SpectralClustering assume that the number of clusters is known. In my problem the number of clusters is not known, and cannot be estimated reliably. But thanks for pointing me to the sklearn clustering algorithms. I tried them all, Affinity Propagation gives the best results. I may try to optimize it or try to create a Python module for FLAME clustering: en.wikipedia.org/wiki/FLAME_clustering
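    For reference, a minimal sketch of running Affinity Propagation directly on the similarity matrix: affinity='precomputed' tells sklearn the input is already a similarity matrix, and no number of clusters is required up front (the preference parameter can be tuned if it finds too many or too few clusters):

      import numpy as np
      from sklearn.cluster import AffinityPropagation

      # cosine similarity matrix from the question (rows/columns: A, B, C, D)
      sim = np.array([[1., .1, .6, .4],
                      [.1, 1., .1, .2],
                      [.6, .1, 1., .7],
                      [.4, .2, .7, 1.]])

      # affinity='precomputed' means `sim` is used directly as the similarity matrix
      labels = AffinityPropagation(affinity='precomputed').fit_predict(sim)
      print(labels)  # one cluster label per item; the number of clusters is chosen automatically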
  • Ashkan about 9 years
    I see, you want to do the clustering without specifying the number of clusters. I will add one more example of such a clustering algorithm to my answer, besides the Affinity Propagation you are using, just in case.
  • Ashkan about 9 years
    Also, could you tell me why the diagonal is confusing you? It is reasonable that the similarity of an element to itself be the maximum value 1.
  • Stefan D about 9 years
    Thank you! This worked, although not as straightforward. DBSCAN assumes distance between items, while cosine similarity is the exact opposite. To make it work I had to convert my cosine similarity matrix to distances (i.e. subtract from 1.00). Then I had to tweak the eps parameter. It achieves OK results now.
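    A minimal sketch of the conversion described above (the eps value here is purely illustrative and needs tuning on real data):

      import numpy as np
      from sklearn.cluster import DBSCAN

      # cosine similarity matrix from the question
      sim = np.array([[1., .1, .6, .4],
                      [.1, 1., .1, .2],
                      [.6, .1, 1., .7],
                      [.4, .2, .7, 1.]])

      dist = 1.0 - sim  # convert cosine similarity to a distance-like matrix
      # metric='precomputed' tells DBSCAN the input is already a distance matrix;
      # eps=0.45 is an example value only
      labels = DBSCAN(eps=0.45, min_samples=2, metric='precomputed').fit_predict(dist)
      print(labels)  # -1 marks noise, i.e. items not assigned to any cluster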
  • user2253546 about 7 years
    @Leo-T Just a small correction to your post concerning how spectral clustering works. It is stated that "The algorithm takes the top k eigenvectors of the input matrix corresponding to the largest eigenvalues ...", however, it does just the opposite. It returns the first k eigenvectors of the Laplacian corresponding to the k smallest eigenvalues.
  • Admin over 6 years
    @StefanD I have a problem similar to yours. Thanks for the wonderful question. Can't we use the cosine similarity matrix directly for DBSCAN, without converting it to cosine distance?
  • Ancalagon BerenLuthien about 6 years
    I wish to use a specific cut-off threshold of cosine similarity to cluster the data. Is there any lib that can do this? For example, if two vectors' cosine similarity is < 0.1, then put them into the same cluster.
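    One simple way to get a hard cut-off is to link the items whose similarity clears the threshold and take the connected components of the resulting graph. A rough sketch using scipy (the 0.5 threshold is only an example; it links pairs whose similarity is above the cut-off, so flip the comparison if the intent is the opposite):

      import numpy as np
      from scipy.sparse import csr_matrix
      from scipy.sparse.csgraph import connected_components

      sim = np.array([[1., .1, .6, .4],
                      [.1, 1., .1, .2],
                      [.6, .1, 1., .7],
                      [.4, .2, .7, 1.]])

      threshold = 0.5                          # example cut-off on cosine similarity
      adjacency = csr_matrix(sim > threshold)  # edge wherever similarity exceeds the cut-off
      n_clusters, labels = connected_components(adjacency, directed=False)
      print(n_clusters, labels)                # number of clusters and a label per item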
  • Ashkan over 5 years
    @user2253546 are you sure? I am certain it is the largest eigenvalues! Can you point me to some resources?
  • Dingkun Liu over 2 years
    I think spectral clustering uses the Laplacian matrix, which is the degree matrix minus the affinity matrix, to compute the eigenvectors. So in this case you should zero out the diagonal of the affinity matrix.
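    To make the last two comments concrete, a rough sketch of the Laplacian variant: zero out the diagonal, form L = D - A, and cluster on the eigenvectors belonging to the smallest eigenvalues of L (the opposite end of the spectrum from the affinity matrix itself):

      import numpy as np
      from sklearn.cluster import KMeans

      A = np.array([[1., .1, .6, .4],
                    [.1, 1., .1, .2],
                    [.6, .1, 1., .7],
                    [.4, .2, .7, 1.]])
      np.fill_diagonal(A, 0.0)        # zero out the self-similarities
      D = np.diag(A.sum(axis=1))      # degree matrix
      L = D - A                       # unnormalised graph Laplacian

      # np.linalg.eigh returns eigenvalues in ascending order, so the first two
      # columns are the eigenvectors for the two smallest eigenvalues of L
      eigen_values, eigen_vectors = np.linalg.eigh(L)
      labels = KMeans(n_clusters=2, n_init=10).fit_predict(eigen_vectors[:, :2])
      print(labels)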