Efficient way to normalize a Scipy Sparse Matrix


Solution 1

This has been implemented in scikit-learn sklearn.preprocessing.normalize.

from sklearn.preprocessing import normalize
w_normalized = normalize(w, norm='l1', axis=1)

axis=1 normalizes each row so it sums to one; axis=0 normalizes each column instead. Pass the optional argument copy=False to modify the matrix in place.
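
For example, a minimal sketch (the matrix w here is illustrative random test data, not from the original post):

import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import normalize

# Small random CSR matrix used only to demonstrate the call.
w = sp.random(10, 10, density=0.1, format='csr', random_state=0)

w_rows = normalize(w, norm='l1', axis=1)     # each non-empty row sums to 1
w_cols = normalize(w, norm='l1', axis=0)     # each non-empty column sums to 1
normalize(w, norm='l1', axis=1, copy=False)  # modifies w in place

print(w_rows.sum(axis=1))                    # ones for every row that has entries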

Solution 2

While Aaron's answer is correct, I implemented a solution for the case where I wanted to normalize with respect to the maximum of the absolute values, which sklearn does not offer. My method takes the nonzero entries, locates them in the csr_matrix.data array, and replaces the values there quickly.

import numpy as np

def normalize_sparse(csr_matrix):
    # Row index of each nonzero entry (aligned with csr_matrix.data).
    nonzero_rows = csr_matrix.nonzero()[0]
    for idx in np.unique(nonzero_rows):
        # Positions in the data array that belong to row idx.
        data_idx = np.where(nonzero_rows == idx)[0]
        abs_max = np.max(np.abs(csr_matrix.data[data_idx]))
        if abs_max != 0:
            # Scale the row in place so its maximum absolute value becomes 1.
            csr_matrix.data[data_idx] = 1. / abs_max * csr_matrix.data[data_idx]

In contrast to sunan's solution, this method does not require casting the matrix into dense format (which could cause memory problems), and it uses no matrix multiplications either. I tested the method on a sparse matrix of shape (35'000, 486'000) and it took ~18 seconds.
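
If the Python-level loop over rows becomes the bottleneck, the same idea can be vectorized over the CSR structure. The following is my own rough sketch of that variation (not the answerer's code): it computes each row's maximum absolute value from the data/indptr arrays and then rescales with a diagonal matrix.

import numpy as np
import scipy.sparse as sp

def normalize_rows_by_absmax(mat):
    """Sketch: scale each row of a sparse matrix by its maximum absolute value."""
    mat = mat.tocsr()
    abs_data = np.abs(mat.data)
    row_max = np.zeros(mat.shape[0])
    nonempty = np.diff(mat.indptr) > 0          # rows that store at least one entry
    # reduceat over the row start offsets gives the per-row maximum of abs_data.
    row_max[nonempty] = np.maximum.reduceat(abs_data, mat.indptr[:-1][nonempty])
    # Leave all-zero rows untouched instead of dividing by zero.
    scale = np.divide(1.0, row_max, out=np.zeros_like(row_max), where=row_max != 0)
    return sp.diags(scale) @ mat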

Solution 3

Here is my solution:

  • transpose A
  • calculate the sum of each column
  • build a diagonal matrix B from the reciprocals of those sums
  • A*B gives the normalized (still transposed) matrix
  • transpose the result C back

    import scipy.sparse as sp
    import numpy as np
    import math
    
    minf = 0.0001
    
    A = sp.lil_matrix((5,5))
    b = np.arange(0,5)
    A.setdiag(b[:-1], k=1)
    A.setdiag(b)
    print(A.todense())
    A = A.T
    print(A.todense())
    
    sum_of_col = A.sum(0).tolist()
    print(sum_of_col)
    c = []
    for i in sum_of_col:
        for j in i:
            if math.fabs(j) < minf:
                c.append(0)
            else:
                c.append(1/j)
    
    print(c)
    
    B = sp.lil_matrix((5,5))
    B.setdiag(c)
    print(B.todense())
    
    C = A*B
    print(C.todense())
    C = C.T
    print(C.todense())

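As a side note, the two transposes can be avoided by scaling the rows directly (one of the comments below makes the same point). A rough sketch of the same diagonal-scaling idea, with names of my own choosing:

import numpy as np
import scipy.sparse as sp

def normalize_rows(A, minf=0.0001):
    # Row sums as a flat array.
    row_sums = np.asarray(A.sum(axis=1)).ravel()
    # Reciprocal of each row sum; rows whose sum is (numerically) zero get a scale of 0.
    scale = np.zeros_like(row_sums, dtype=float)
    nonzero = np.abs(row_sums) >= minf
    scale[nonzero] = 1.0 / row_sums[nonzero]
    # Left-multiplying by a diagonal matrix scales the rows.
    return sp.diags(scale) @ A.tocsr()
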
Solution 4

I found this to be an elegant way of doing it without using inbuilt functions.

import scipy.sparse as sp

def normalize(W):
    #Find the row scalars as a Matrix_(n,1)
    rowSumW = sp.csr_matrix(W.sum(axis=1))
    rowSumW.data = 1/rowSumW.data

    #Find the diagonal matrix to scale the rows
    rowSumW = rowSumW.transpose()
    scaling_matrix = sp.diags(rowSumW.toarray()[0])

    return scaling_matrix.dot(W)  
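
A quick usage sketch with a small made-up matrix (illustrative only), calling the normalize() defined above:

import numpy as np
import scipy.sparse as sp

W = sp.csr_matrix(np.array([[1.0, 0.0, 1.0],
                            [0.0, 2.0, 0.0],
                            [3.0, 0.0, 1.0]]))
W_norm = normalize(W)
print(W_norm.sum(axis=1))   # each row sums to 1
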
Updated on December 17, 2020

Comments

  • sterne
    sterne over 3 years

    I'd like to write a function that normalizes the rows of a large sparse matrix (such that they sum to one).

    from pylab import *
    import scipy.sparse as sp
    
    def normalize(W):
        z = W.sum(0)
        z[z < 1e-6] = 1e-6
        return W / z[None,:]
    
    w = (rand(10,10)<0.1)*rand(10,10)
    w = sp.csr_matrix(w)
    w = normalize(w)
    

    However this gives the following exception:

    File "/usr/lib/python2.6/dist-packages/scipy/sparse/base.py", line 325, in __div__
         return self.__truediv__(other)
    File "/usr/lib/python2.6/dist-packages/scipy/sparse/compressed.py", line 230, in  __truediv__
       raise NotImplementedError
    

    Are there any reasonably simple solutions? I have looked at this, but am still unclear on how to actually do the division.

  • Leo
    Leo almost 9 years
Note that if you normalize by features (axis=0), then the returned matrix is of type 'csc' even if w was a 'csr'. This may be unpleasant if you were counting on it being a 'csr'.
  • Abdus Khazi
    Abdus Khazi over 4 years
I think doing so many transpose operations can be avoided. All we have to do is scale the rows instead of the columns. stackoverflow.com/a/59365444/6286927
  • Tommaso Guerrini
    Tommaso Guerrini almost 4 years
The other thing is: if one has a train and a test set, how does one 'fit' the normalization on the train set?
  • Anton Antonov
    Anton Antonov over 2 years
    What is the fill-in ratio of the (35'000, 486'000) matrix you did tests with?
  • AlexConfused
    AlexConfused over 2 years
    1,657,114 entries were non-zero which results in a ratio of ~0.0096% in my case.