create cosine similarity matrix numpy
Solution 1
Let m be the array
m = np.array([
[ 0.0072427 , 0.00669255, 0.00785213, 0.00845336, 0.01042869],
[ 0.00710799, 0.00668831, 0.00772334, 0.00777796, 0.01049965],
[ 0.00741872, 0.00650899, 0.00772273, 0.00729002, 0.00919407],
[ 0.00717589, 0.00627021, 0.0069514 , 0.0079332 , 0.01069545],
[ 0.00617369, 0.00590539, 0.00738468, 0.00761699, 0.00886915]
])
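Per Wikipedia (Cosine_Similarity), the similarity of two vectors A and B is A·B / (‖A‖ ‖B‖).
We can calculate the numerator, the pairwise dot products between the columns of m, with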
d = m.T @ m
The norms ‖A‖ of the columns are
norm = (m * m).sum(0, keepdims=True) ** .5
Then the similarities are
d / norm / norm.T
[[ 1. 0.9994 0.9979 0.9973 0.9977]
[ 0.9994 1. 0.9993 0.9985 0.9981]
[ 0.9979 0.9993 1. 0.998 0.9958]
[ 0.9973 0.9985 0.998 1. 0.9985]
[ 0.9977 0.9981 0.9958 0.9985 1. ]]
The distances are
1 - d / norm / norm.T
[[ 0. 0.0006 0.0021 0.0027 0.0023]
[ 0.0006 0. 0.0007 0.0015 0.0019]
[ 0.0021 0.0007 0. 0.002 0.0042]
[ 0.0027 0.0015 0.002 0. 0.0015]
[ 0.0023 0.0019 0.0042 0.0015 0. ]]
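The expressions above compare the columns of m (m.T @ m, with the norms summed along axis 0). The question asks for similarities between rows, so for an array with one vector per row the same steps can be applied along the other axis. A minimal sketch, where the function name cosine_similarity_rows and the 3 x 2 example data are illustrative and not part of the original answer:

```python
import numpy as np

def cosine_similarity_rows(x):
    """Pairwise cosine similarity between the rows of a 2-D array (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    dots = x @ x.T                                        # (M, M) row dot products
    norms = np.sqrt((x * x).sum(axis=1, keepdims=True))   # (M, 1) row norms
    return dots / norms / norms.T                         # assumes no all-zero rows

# Illustrative data: 3 vectors of dimension 2, one per row.
x = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 3.0]])
sim = cosine_similarity_rows(x)
print(sim.shape)          # (3, 3)
print(np.round(sim, 4))
```

For an M x N input this returns an M x M matrix, and 1 - sim gives the corresponding distances.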
Solution 2
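Let x be your array, with one vector per row
import numpy as np
from scipy.spatial.distance import cosine
m, n = x.shape
# one distance per pair of rows, so the result is m x m
distances = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        # cosine distance between row i and row j
        distances[i, j] = cosine(x[i, :], x[j, :])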
Author: Sal
Updated on October 14, 2022
Comments
- Sal (over 1 year ago):
Suppose I have a numpy matrix like the following:
array([array([ 0.0072427 , 0.00669255, 0.00785213, 0.00845336, 0.01042869]),
       array([ 0.00710799, 0.00668831, 0.00772334, 0.00777796, 0.01049965]),
       array([ 0.00741872, 0.00650899, 0.00772273, 0.00729002, 0.00919407]),
       array([ 0.00717589, 0.00627021, 0.0069514 , 0.0079332 , 0.01069545]),
       array([ 0.00617369, 0.00590539, 0.00738468, 0.00761699, 0.00886915])], dtype=object)
How can I generate a 5 x 5 matrix where each entry is the cosine similarity of the two corresponding rows in my original matrix?
e.g. the value at row 0, column 2 would be the cosine similarity between the first and third rows of the original matrix.
Here's what I've tried:
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine
import numpy as np

# features is a column in my artist_meta data frame
# where each value is a numpy array of 5 floating point values, similar to the
# form of the matrix referenced above but larger in volume
items_mat = np.array(artist_meta['features'].values)
dist_out = 1-pairwise_distances(items_mat, metric="cosine")
The above code gives me the following error:
ValueError: setting an array element with a sequence.
Not sure why I'm getting this because each array is of the same length (5), which I've verified.
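One likely cause, assuming artist_meta['features'].values is a 1-D object array whose elements are themselves arrays (as the dtype=object in the printout suggests): NumPy cannot cast such an array directly to a 2-D float array, even when every element has the same length. Stacking the elements first gives a regular numeric matrix. A sketch with made-up data standing in for the data frame column:

```python
import numpy as np

# Stand-in for artist_meta['features'].values (hypothetical data):
# a 1-D object array holding three separate float arrays.
features = np.empty(3, dtype=object)
features[0] = np.array([1.0, 2.0, 3.0])
features[1] = np.array([4.0, 5.0, 6.0])
features[2] = np.array([7.0, 8.0, 9.0])

print(features.shape, features.dtype)    # (3,) object

# Stack the per-row arrays into one 2-D float array.
items_mat = np.vstack(list(features))
print(items_mat.shape, items_mat.dtype)  # (3, 3) float64
```

The stacked items_mat is an ordinary 2-D float array, which is the form pairwise_distances(items_mat, metric="cosine") works on.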
- Sal (over 7 years ago): Sure - the matrix in the original post has been updated to reflect the first five rows of the one I am computing. Even on computing the cosine similarity of the first five rows I run into the error.
- DYZ (over 7 years ago): So, as I said before, assuming that f is your matrix, 1-pairwise_distances(f, metric="cosine") gives no errors whatsoever.
- Ismael EL ATIFI (almost 5 years ago): To optimize your code, you can divide m by norm once before doing m.T @ m. It saves the division by norm.T.
- Catbuilts (about 3 years ago): I agree with @IsmaelELATIFI. The optimized code is:
norm = (m * m).sum(0, keepdims=True) ** .5
m_norm = m/norm
similarity_matrix = m_norm.T @ m_norm
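A quick check that normalizing first gives the same matrix as dividing afterwards, sketched on random data (the array m below is made up for illustration; like the answer, it compares columns):

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.random((4, 3))          # illustrative data: 3 column vectors of length 4

norm = (m * m).sum(0, keepdims=True) ** .5   # (1, 3) column norms

# Divide after the matrix product, as in the answer.
after = (m.T @ m) / norm / norm.T

# Normalize the columns first, as suggested in the comments.
m_norm = m / norm
before = m_norm.T @ m_norm

print(np.allclose(after, before))   # True
```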
- Ivan Gonzalez (about 2 years ago): Just to add ^^. When you have unit vectors, the cosine similarity is the same as just the dot product.
- kmf (about 2 years ago): Shouldn't it be m @ m.T? If I have an M x N matrix, so M vectors each of them N-dimensional, I want to have an M x M distance matrix. [M x N] @ [N x M] = [M x M].
- Nguai al (about 2 years ago): Per the formula, shouldn't it be d / (norm * norm.T)?
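Both spellings give the same result: norm has shape (1, n) and norm.T has shape (n, 1), so norm * norm.T broadcasts to the n x n table of norm products, and dividing by it once matches dividing by norm and then by norm.T. A small check on made-up data:

```python
import numpy as np

m = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])        # illustrative data, not from the thread

d = m.T @ m
norm = (m * m).sum(0, keepdims=True) ** .5

two_divisions = d / norm / norm.T
one_division = d / (norm * norm.T)

print((norm * norm.T).shape)                     # (2, 2)
print(np.allclose(two_divisions, one_division))  # True
```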