Efficient way to normalize a Scipy Sparse Matrix
Solution 1
This has been implemented in scikit-learn sklearn.preprocessing.normalize.
from sklearn.preprocessing import normalize
w_normalized = normalize(w, norm='l1', axis=1)
axis=1
should normalize by rows, axis=0
to normalize by column. Use the optional argument copy=False
to modify the matrix in place.
Solution 2
While Aarons answer is correct, I implemented a solution when I wanted to normalize with respect to the maximum of the absolute values, which sklearn is not offering. My method uses the nonzero entries and finds them in the csr_matrix.data array to replace values there quickly.
def normalize_sparse(csr_matrix):
nonzero_rows = csr_matrix.nonzero()[0]
for idx in np.unique(nonzero_rows):
data_idx = np.where(nonzero_rows==idx)[0]
abs_max = np.max(np.abs(csr_matrix.data[data_idx]))
if abs_max != 0:
csr_matrix.data[data_idx] = 1./abs_max * csr_matrix.data[data_idx]
In contrast to sunan's solution, this method does not require any casting of the matrix into dense format (which could raise memory problems) and no matrix multiplications either. I tested the method on a sparse matrix of shape (35'000, 486'000) and it took ~ 18 seconds.
Solution 3
here is my solution.
- transpose A
- calculate sum of each col
- format diagonal matrix B with reciprocal of sum
- A*B equals normalization
-
transpose C
import scipy.sparse as sp import numpy as np import math minf = 0.0001 A = sp.lil_matrix((5,5)) b = np.arange(0,5) A.setdiag(b[:-1], k=1) A.setdiag(b) print A.todense() A = A.T print A.todense() sum_of_col = A.sum(0).tolist() print sum_of_col c = [] for i in sum_of_col: for j in i: if math.fabs(j)<minf: c.append(0) else: c.append(1/j) print c B = sp.lil_matrix((5,5)) B.setdiag(c) print B.todense() C = A*B print C.todense() C = C.T print C.todense()
Solution 4
I found this as an elegant way of doing it without using inbuilt functions.
import scipy.sparse as sp
def normalize(W):
#Find the row scalars as a Matrix_(n,1)
rowSumW = sp.csr_matrix(W.sum(axis=1))
rowSumW.data = 1/rowSumW.data
#Find the diagonal matrix to scale the rows
rowSumW = rowSumW.transpose()
scaling_matrix = sp.diags(rowSumW.toarray()[0])
return scaling_matrix.dot(W)
sterne
And through it all, it offers me protection. A lot of love and affection. I'm loving Python instead.
Updated on December 17, 2020Comments
-
sterne over 3 years
I'd like to write a function that normalizes the rows of a large sparse matrix (such that they sum to one).
from pylab import * import scipy.sparse as sp def normalize(W): z = W.sum(0) z[z < 1e-6] = 1e-6 return W / z[None,:] w = (rand(10,10)<0.1)*rand(10,10) w = sp.csr_matrix(w) w = normalize(w)
However this gives the following exception:
File "/usr/lib/python2.6/dist-packages/scipy/sparse/base.py", line 325, in __div__ return self.__truediv__(other) File "/usr/lib/python2.6/dist-packages/scipy/sparse/compressed.py", line 230, in __truediv__ raise NotImplementedError
Are there any reasonably simple solutions? I have looked at this, but am still unclear on how to actually do the division.
-
Leo almost 9 yearsNote that if you normalize by features (axis=0) then the returned matrix is of type 'csc' even if w was a 'csr'. This may be unpleasant if you counted on it being a 'csr'
-
Abdus Khazi over 4 yearsI think doing so many transpose operations can be avoided. All we have to do is to scale the rows instead of the columns. stackoverflow.com/a/59365444/6286927
-
Tommaso Guerrini almost 4 yearsthe other thing is that if one has train and test set how does one 'fit' the normalization on train set?
-
Anton Antonov over 2 yearsWhat is the fill-in ratio of the (35'000, 486'000) matrix you did tests with?
-
AlexConfused over 2 years1,657,114 entries were non-zero which results in a ratio of ~0.0096% in my case.