Reverse sort and argsort in python

34,559

I don't think there's any real need to skip the toarray. The v array will be only n_docs long, which is dwarfed by the size of the n_docs × n_terms tf-idf matrix in practical situations. Also, it will be quite dense since any term shared by two documents will give them a non-zero similarity. Sparse matrix representations only pay off when the matrix you're storing is very sparse (I've seen >80% figures for Matlab and assume that Scipy will be similar, though I don't have an exact figure).

The double sort can be skipped by doing

v = v.toarray()
vi = np.argsort(v, axis=0)[::-1]
vs = v[vi]

Btw., your use of np.inner on sparse matrices is not going to work with the latest versions of NumPy; the safe way of taking an inner product of two sparse matrices is

v = (tfidf * tfidf[idx, :]).transpose()
Share:
34,559
tdc
Author by

tdc

Machine Learning Researcher

Updated on September 28, 2020

Comments

  • tdc
    tdc over 3 years

    I'm trying to write a function in Python (still a noob!) which returns indices and scores of documents ordered by the inner products of their tfidf scores. The procedure is:

    • Compute vector of inner products between doc idx and all other documents
    • Sort in descending order
    • Return the "scores" and indices from the second one to the end (i.e. not itself)

    The code I have at the moment is:

    import h5py
    import numpy as np
    
    def get_related(tfidf, idx) :
        ''' return the top documents '''
    
        # calculate inner product   
        v = np.inner(tfidf, tfidf[idx].transpose())
    
        # sort
        vs = np.sort(v.toarray(), axis=0)[::-1]
        scores = vs[1:,]
    
        # sort indices
        vi = np.argsort(v.toarray(), axis=0)[::-1]
        idxs = vi[1:,] 
    
        return (scores, idxs)
    

    where tfidf is a sparse matrix of type '<type 'numpy.float64'>'.

    This seems inefficient, as the sort is performed twice (sort() then argsort()), and the results have to then be reversed.

    • Can this be done more efficiently?
    • Can this be done without converting the sparse matrix using toarray()?
  • tdc
    tdc over 12 years
    Thanks for the swift response. Just wondering, do you know how the toarray() function works - I take it that it doesn't make a copy of the data
  • Fred Foo
    Fred Foo over 12 years
    @tdc: it does make a copy. And it fills in the zero positions.
  • Fred Foo
    Fred Foo over 12 years
    @tdc: I just realised that there's one more important optimization to make: you should be using CSR sparse matrices. In any other representation, the inner product computation will be suboptimal.
  • tdc
    tdc over 12 years
    1) can I do things like sorting without making a copy? 2) how expensive is the translation from csc to csr?
  • Fred Foo
    Fred Foo over 12 years
    1) Not that I know. 2) Very cheap. I believe it's just a matter of rearranging some indices, without the data being actually copied.
  • ely
    ely over 11 years
    There is a parentheses error in the final line of this and I can't tell what your mean. Does the transpose apply to the indexed inner array, or to the result after multiplying them? Is the last closing parentheses on the right just a typo, or not?