Cosine similarity between each row in a DataFrame in Python

python pandas dataframe scikit-learn

35,584

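You can pass the DataFrame straight to sklearn.metrics.pairwise.cosine_similarity. It treats each row as a vector and returns an n x n array of pairwise similarities, so no manual loop is needed.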

Demo

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame(np.random.randint(0, 2, (3, 5)))

df
##     0  1  2  3  4
##  0  1  1  1  0  0
##  1  0  0  1  1  1
##  2  0  1  0  1  0

cosine_similarity(df)
##  array([[ 1.        ,  0.33333333,  0.40824829],
##         [ 0.33333333,  1.        ,  0.40824829],
##         [ 0.40824829,  0.40824829,  1.        ]])
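
Since cosine_similarity returns a plain NumPy array, the row labels are lost. If you want to stay in pandas, one option is to wrap the result back into a DataFrame. The following is a minimal sketch, not the only way to do it: the values are the same rows as the demo output above, hard-coded so the result is reproducible, and the labels "a", "b", "c" are hypothetical. The last lines cross-check the result with plain NumPy.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Same rows as the demo output above, hard-coded instead of random.
df = pd.DataFrame(
    [[1, 1, 1, 0, 0],
     [0, 0, 1, 1, 1],
     [0, 1, 0, 1, 0]],
    index=["a", "b", "c"],  # hypothetical row labels, not from the original demo
)

# cosine_similarity returns an ndarray; re-attach the row labels on both axes.
sim = pd.DataFrame(cosine_similarity(df), index=df.index, columns=df.index)

# Cross-check with plain NumPy: scale each row to unit length, then take dot products.
values = df.to_numpy(dtype=float)
unit = values / np.linalg.norm(values, axis=1, keepdims=True)
manual = unit @ unit.T
```

With labels attached, a single pair can be looked up by name, for example sim.loc["a", "b"].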

Authored by

Jayanth Prakash Kulkarni

Project Assistant at IISc. Undergrad, MSRIT. Interested in Machine Learning, Reinforcement Learning and Game theory.

Updated on July 09, 2022

Comments

Jayanth Prakash Kulkarni almost 2 years
I have a DataFrame containing multiple vectors, each with 3 entries. Each row is a vector in my representation. I need to calculate the cosine similarity between each pair of these vectors. Is it better to convert this to a matrix representation, or is there a cleaner approach within the DataFrame itself?

Here is the code that I have tried.
```
import pandas as pd
from scipy import spatial
df = pd.DataFrame([X,Y,Z]).T
similarities = df.values.tolist()

for x in similarities:
    for y in similarities:
        result = 1 - spatial.distance.cosine(x, y)
```
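
Note that the loop above overwrites result on every iteration, so only the last pair survives. A minimal sketch of the same SciPy-based idea that keeps every pairwise value might look like the following. The X, Y, Z values here are hypothetical stand-ins, since the question does not define them, and the pdist/squareform line is shown only as one possible vectorized alternative.

```python
import numpy as np
import pandas as pd
from scipy import spatial

# Hypothetical stand-ins for X, Y, Z, which the question does not define.
X = [1.0, 0.0, 2.0]
Y = [0.0, 1.0, 2.0]
Z = [1.0, 1.0, 0.0]
df = pd.DataFrame([X, Y, Z]).T  # after the transpose, each row is one 3-entry vector

rows = df.to_numpy()
n = len(rows)

# Store every pairwise similarity instead of overwriting a single variable.
result = np.empty((n, n))
for i in range(n):
    for j in range(n):
        result[i, j] = 1 - spatial.distance.cosine(rows[i], rows[j])

# Vectorized alternative: condensed pairwise cosine distances, expanded to a
# square matrix (zero diagonal), then converted from distance to similarity.
condensed = spatial.distance.pdist(rows, metric="cosine")
vectorized = 1 - spatial.distance.squareform(condensed)
```

Both result and vectorized hold the full similarity matrix; the explicit loop is easier to follow, while the pdist version avoids computing each pair twice.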