Efficiently calculate cosine similarity using scikit-learn


Solution 1

To improve performance, replace the list comprehensions with vectorized code. This is easy to do with SciPy's pdist and squareform, as shown in the snippet below:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import pdist, squareform

titles = [
    'A New Hope',
    'The Empire Strikes Back',
    'Return of the Jedi',
    'The Phantom Menace',
    'Attack of the Clones',
    'Revenge of the Sith',
    'The Force Awakens',
    'A Star Wars Story',
    'The Last Jedi',
    ]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)
# pdist's 'cosine' metric yields pairwise cosine *distances* (1 - similarity)
cs_title = squareform(pdist(X.toarray(), 'cosine'))

Demo:

In [87]: X
Out[87]: 
<9x21 sparse matrix of type '<type 'numpy.int64'>'
    with 30 stored elements in Compressed Sparse Row format>

In [88]: X.toarray()          
Out[88]: 
array([[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]], dtype=int64)

In [89]: vectorizer.get_feature_names()
Out[89]: 
[u'attack',
 u'awakens',
 u'back',
 u'clones',
 u'empire',
 u'force',
 u'hope',
 u'jedi',
 u'last',
 u'menace',
 u'new',
 u'of',
 u'phantom',
 u'return',
 u'revenge',
 u'sith',
 u'star',
 u'story',
 u'strikes',
 u'the',
 u'wars']

In [90]: np.set_printoptions(precision=2)

In [91]: print(cs_title)
[[ 0.    1.    1.    1.    1.    1.    1.    1.    1.  ]
 [ 1.    0.    0.75  0.71  0.75  0.75  0.71  1.    0.71]
 [ 1.    0.75  0.    0.71  0.5   0.5   0.71  1.    0.42]
 [ 1.    0.71  0.71  0.    0.71  0.71  0.67  1.    0.67]
 [ 1.    0.75  0.5   0.71  0.    0.5   0.71  1.    0.71]
 [ 1.    0.75  0.5   0.71  0.5   0.    0.71  1.    0.71]
 [ 1.    0.71  0.71  0.67  0.71  0.71  0.    1.    0.67]
 [ 1.    1.    1.    1.    1.    1.    1.    0.    1.  ]
 [ 1.    0.71  0.42  0.67  0.71  0.71  0.67  1.    0.  ]]

Notice that X.toarray().shape yields (9, 21), because in the toy example above there are 9 titles and 21 distinct words, whereas cs_title is a 9 by 9 array. (The demo was run on Python 2; on scikit-learn 1.0 and later, get_feature_names() is replaced by get_feature_names_out().)
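
Also, since pdist computes cosine distance rather than similarity, the diagonal of cs_title is 0 rather than 1. If you want the similarity matrix itself, subtract the distances from 1, or skip the dense conversion entirely with scikit-learn's cosine_similarity, which accepts the sparse matrix directly. A minimal sketch, reusing X and cs_title from above:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Operates on the sparse matrix directly (no .toarray() needed) and
# returns the full pairwise similarity matrix in a single vectorized call.
sim_title = cosine_similarity(X)

# Relationship to the pdist-based result: similarity = 1 - distance.
assert np.allclose(sim_title, 1 - cs_title)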

Solution 2

You can cut the number of calculations by more than half by taking into account two properties of the cosine similarity of two vectors:

  1. The cosine similarity of a vector with itself is one.
  2. The cosine similarity of vector x with vector y is the same as the cosine similarity of vector y with vector x.

Therefore, you only need to calculate the elements above the diagonal (or, equivalently, those below it).

EDIT: Here's how you could calculate it. Note especially that cs is just a dummy function to take the place of a real calculation of the similarity coefficient.

title1 = 'A four word title'
title2 = 'A five word title'
title3 = 'A six word title'
title4 = 'A seven word title'

titles = [title1, title2, title3, title4]
N = len(titles)

import numpy as np

similarity_matrix = np.zeros((N, N))  # N x N array of float zeros

cs = lambda a, b: 10*a + b  # just a 'pretend' calculation of the coefficient

for m in range(N):
    similarity_matrix[m, m] = 1             # property 1: self-similarity is 1
    for n in range(m + 1, N):               # fill the upper triangle only
        similarity_matrix[m, n] = cs(m, n)
        similarity_matrix[n, m] = similarity_matrix[m, n]  # property 2: symmetry

print(similarity_matrix)

Here's the result.

[[  1.   1.   2.   3.]
 [  1.   1.  12.  13.]
 [  2.  12.   1.  23.]
 [  3.  13.  23.   1.]]
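
The same fill pattern works with a real similarity function in place of the dummy cs. Here is a minimal sketch, assuming the documents have already been vectorized into the rows of a dense array (vecs and cosine_sim below are illustrative, not from the original code):

import numpy as np

def cosine_sim(u, v):
    # cosine similarity of two 1-D vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

vecs = np.random.rand(4, 21)   # stand-in for the vectorized titles
N = len(vecs)
similarity_matrix = np.eye(N)  # diagonal is 1 by property 1

for m in range(N):
    for n in range(m + 1, N):  # upper triangle only, mirrored by property 2
        similarity_matrix[m, n] = similarity_matrix[n, m] = cosine_sim(vecs[m], vecs[n])

print(similarity_matrix)
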
Comments

  • user7347576
    user7347576 about 2 years

    After preprocessing and transforming (BOW, TF-IDF) the data, I need to calculate the cosine similarity of each element of the dataset with every other element. Currently, I do this:

    cs_title = [cosine_similarity(a, b) for a in tr_title for b in tr_title]
    cs_abstract = [cosine_similarity(a, b) for a in tr_abstract for b in tr_abstract]
    cs_mesh = [cosine_similarity(a, b) for a in pre_mesh for b in pre_mesh]
    cs_pt = [cosine_similarity(a, b) for a in pre_pt for b in pre_pt]
    

    In this example, each input variable, e.g. tr_title, is a SciPy sparse matrix. However, this code runs extremely slowly. What can I do to optimise the code so that it runs more quickly?

  • user7347576
    user7347576 over 7 years
    I considered this but was not sure how to implement it to produce the equivalent output.
  • Bill Bell
    Bill Bell over 7 years
    The code you included in your question, for instance, cs_pt = [cosine_similarity(a, b) for a in pre_pt for b in pre_pt] will produce a vector. But wouldn't you want a matrix for each collection of cosine similarities?