cosine similarity on large sparse matrix with numpy

Solution 1

Same problem here. I've got a big, non-sparse matrix. It fits in memory just fine, but cosine_similarity crashes for whatever unknown reason, probably because they copy the matrix one time too many somewhere. So I made it compare small batches of rows "on the left" instead of the entire matrix:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine_similarity_n_space(m1, m2, batch_size=100):
    assert m1.shape[1] == m2.shape[1]
    ret = np.ndarray((m1.shape[0], m2.shape[0]))
    for row_i in range(0, int(m1.shape[0] / batch_size) + 1):
        start = row_i * batch_size
        end = min([(row_i + 1) * batch_size, m1.shape[0]])
        if end <= start:
            break # cause I'm too lazy to elegantly handle edge cases
        rows = m1[start: end]
        sim = cosine_similarity(rows, m2) # rows is O(1) size
        ret[start: end] = sim
    return ret

No crashes for me; YMMV. Try different batch sizes to make it faster. I used to only compare 1 row at a time, and it took about 30X as long on my machine.
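
If you want to pick a batch size empirically, a quick timing loop like the one below can help. This is just a sketch: the matrix shape and the candidate batch sizes are made up for illustration, and it reuses the cosine_similarity_n_space function defined above.

import time
import numpy as np

m = np.random.rand(5000, 300)            # illustrative data; substitute your own matrix
for batch_size in (1, 10, 100, 1000):
    t0 = time.perf_counter()
    cosine_similarity_n_space(m, m, batch_size=batch_size)
    print(batch_size, time.perf_counter() - t0)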

Stupid yet effective sanity check:

import random
while True:
    m = np.random.rand(random.randint(1, 100), random.randint(1, 100))
    n = np.random.rand(random.randint(1, 100), m.shape[1])
    assert np.allclose(cosine_similarity(m, n), cosine_similarity_n_space(m, n))

Solution 2

I would run it in chunks like this:

from sklearn.metrics.pairwise import cosine_similarity

# Change chunk_size to control resource consumption and speed
# Higher chunk_size means more memory/RAM needed but also faster 
chunk_size = 500 
matrix_len = your_matrix.shape[0] # your_matrix is a dense (not sparse) numpy.ndarray

def similarity_cosine_by_chunk(start, end):
    if end > matrix_len:
        end = matrix_len
    return cosine_similarity(X=your_matrix[start:end], Y=your_matrix) # scikit-learn function

for chunk_start in xrange(0, matrix_len, chunk_size):
    cosine_similarity_chunk = similarity_cosine_by_chunk(chunk_start, chunk_start+chunk_size)
    # Handle cosine_similarity_chunk (e.g. write it to a timestamped file and close the file)
    # Do not open the same file again, or you may run out of memory after a few chunks
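
As a concrete way to handle each chunk without keeping everything in memory, one option is to save every chunk to its own file and let it go out of scope before the next iteration. This is just a sketch (the .npy file naming is my own choice, not part of the original answer), written with range for Python 3:

import numpy as np

for chunk_start in range(0, matrix_len, chunk_size):
    chunk = similarity_cosine_by_chunk(chunk_start, chunk_start + chunk_size)
    # one file per chunk; nothing from previous chunks stays in memory
    np.save('cosine_chunk_{}.npy'.format(chunk_start), chunk)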

Solution 3

You're running out of memory because you're trying to store a 65000x65000 matrix. Note that the matrix you're constructing is not sparse at all. np.random.rand generates a random number between 0 and 1. So there aren't enough zeros for csr_matrix to actually compress your data. In fact, there are almost surely no zeros at all.
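
You can see this directly: with the matrix from the question, the CSR representation stores just as many values as the dense array. A quick check (using the same names as in the question; nnz counts the explicitly stored entries):

>>> import numpy as np
>>> from scipy import sparse
>>> mat = np.random.rand(65000, 10)
>>> sparse_mat = sparse.csr_matrix(mat)
>>> mat.size
650000
>>> sparse_mat.nnz    # almost surely also 650000: rand() essentially never returns exactly 0
650000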

If you look closely at your MemoryError traceback, you can see that cosine_similarity tries to use the sparse dot product if possible:

MemoryError                  Traceback (most recent call last)
    887         Y_normalized = normalize(Y, copy=True)
    888 
--> 889     K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
    890 
    891     return K

So the problem isn't with cosine_similarity, it's with your matrix. Try initializing an actual sparse matrix (with only 1% of entries nonzero, for example) like this:

>>> a = np.zeros((65000, 10))
>>> i = np.random.rand(a.size)
>>> a.flat[i < 0.01] = 1        # Select 1% of indices and set to 1
>>> a = sparse.csr_matrix(a)

Then, on a machine with 32GB RAM (8GB RAM was not enough for me), the following runs with no memory error:

>>> b = cosine_similarity(a)
>>> b
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

Author: Sal (updated on June 06, 2022)

Comments

  • Sal (almost 2 years ago)

    The code below causes my system to run out of memory before it completes.

    Can you suggest a more efficient means of computing the cosine similarity on a large matrix, such as the one below?

    I would like to have the cosine similarity computed for each of the 65000 rows in my original matrix (mat) relative to all of the others so that the result is a 65000 x 65000 matrix where each element is the cosine similarity between two rows in the original matrix.

    import numpy as np
    from scipy import sparse
    from sklearn.metrics.pairwise import cosine_similarity
    
    mat = np.random.rand(65000, 10)
    
    sparse_mat = sparse.csr_matrix(mat)
    
    similarities = cosine_similarity(sparse_mat)
    

    After running that last line I always run out of memory and the program either freezes or crashes with a MemoryError. This occurs whether I run it locally with 8 GB of RAM or on a 64 GB EC2 instance.

  • Sal (over 7 years ago)
    My mistake - I didn't intend for it to be a sparse matrix. I intended to take my original matrix (mat) and use it to compute a 65000 x 65000 matrix where each element represents the cosine similarity between two rows. I've updated my question to reflect this change. Do you know how I can compute the cosine similarities of all of the rows with respect to each other on the original matrix?
  • Praveen (over 7 years ago)
    @Sal Well, a 65000x65000 matrix has a size of about 31.5 GiB, so on a 64 GB machine you should be fine. But you wouldn't be able to store two such matrices, so if Python tries to copy the matrix during its construction, you'll be in trouble. Those instances where the computer seems to freeze are probably cases where it's swapping (or paging). Maybe you should just let it run for a while, say overnight, as far as computing the matrix is concerned. But to actually do anything with the matrix, you'll have a hard time. (The size arithmetic is sketched after these comments.)
  • MERose (over 6 years ago)
    For Python 3, replace xrange with range.
  • Shivam Agrawal (over 5 years ago)
    I am confused by this line: cosine_similarity_chunk = similarity_cosine_by_chunk(chunk_start, chunk_start+chunk_size). Why are we overwriting the cosine_similarity_chunk variable again and again?
  • Yogesh Yadav (over 4 years ago)
    @ShivamAgrawal That line is calculating the cosine similarity in chunks. With every pass you have the cosine similarity for that chunk only.
  • igorkf (over 3 years ago)
    This is a solution that worked very well for me. Thanks.
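
For reference, the back-of-the-envelope arithmetic behind Praveen's 31.5 GiB figure (just a quick sketch; one dense float64 entry takes 8 bytes):

>>> round(65000 * 65000 * 8 / 2.0**30, 1)   # GiB for one dense float64 65000x65000 matrix
31.5

Two such matrices (about 63 GiB) no longer fit in 64 GB of RAM, which is why any extra copy made during construction is enough to exhaust memory.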