Performing PCA on large sparse matrix by using sklearn
Solution 1
Due to the nature of PCA, even if the input is a sparse matrix, the output is not. You can check this with a quick example:
>>> from sklearn.decomposition import TruncatedSVD
>>> from scipy import sparse as sp
Create a random sparse matrix with 0.01% of its data as non-zeros.
>>> X = sp.rand(1000, 1000, density=0.0001)
Apply PCA to it:
>>> clf = TruncatedSVD(100)
>>> Xpca = clf.fit_transform(X)
Now, check the results:
>>> type(X)
scipy.sparse.coo.coo_matrix
>>> type(Xpca)
numpy.ndarray
>>> import numpy as np
>>> print(np.count_nonzero(Xpca), Xpca.size)
95000 100000
which suggests that 95000 of the entries are non-zero, however,
>>> np.isclose(Xpca, 0, atol=1e-15).sum(), Xpca.size
(99481, 100000)
99481 elements are close to 0 (< 1e-15), but not exactly 0.
In short: for a PCA, even if the input is a sparse matrix, the output is not. Thus, if you try to extract 100,000,000 (1e8) components from your matrix, you will end up with a 1e8 x n_features (in your example 1e8 x 1617899) dense matrix, which of course cannot be held in memory.
I'm not an expert statistician, but I believe there is currently no workaround for this in scikit-learn. It is not a problem of scikit-learn's implementation; it is simply the mathematical nature of their sparse PCA (by means of sparse SVD) that makes the result dense.
The only workaround that might work for you is to start from a small number of components and increase it until you reach a balance between the data you can keep in memory and the percentage of the variance explained (which you can calculate as follows):
>>> clf.explained_variance_ratio_.sum()
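The loop suggested above can be sketched as follows (a minimal illustration, with an example variance target of 0.50 and an arbitrary schedule of component counts; these values are not from the original answer):

```python
from scipy import sparse as sp
from sklearn.decomposition import TruncatedSVD

# A small random sparse matrix standing in for the real data.
X = sp.rand(1000, 200, density=0.01, random_state=0)

target = 0.50  # fraction of variance we want explained (example value)
for n_components in (10, 25, 50, 100):
    clf = TruncatedSVD(n_components, random_state=0)
    Xpca = clf.fit_transform(X)
    explained = clf.explained_variance_ratio_.sum()
    print(n_components, "components ->", round(explained, 3), "explained")
    if explained >= target:
        break
```

Each iteration keeps only the dense `n_samples x n_components` score matrix in memory, so you can stop growing `n_components` as soon as either the variance target is met or memory becomes a concern.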
Solution 2
PCA(X) is SVD(X - mean(X)). Even if X is a sparse matrix, X - mean(X) is always a dense matrix. Thus, randomized SVD (TruncatedSVD) on it is not as efficient as randomized SVD of a sparse matrix. However, a delayed evaluation of X - mean(X) can avoid expanding the sparse matrix X into the dense matrix X - mean(X). Delayed evaluation enables efficient PCA of a sparse matrix using randomized SVD.
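The general idea of centering without materializing the dense matrix can be sketched with a scipy `LinearOperator` that applies X - mean(X) implicitly (this is an illustration of the concept, not the delayedsparse package's actual code):

```python
import numpy as np
from scipy import sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

X = sp.rand(500, 300, density=0.01, format='csr', random_state=0)
mu = np.asarray(X.mean(axis=0)).ravel()  # column means, shape (300,)

def matvec(v):
    # (X - 1 mu^T) v = X v - (mu . v) 1, without forming X - mean(X)
    v = np.ravel(v)
    return X @ v - (mu @ v) * np.ones(X.shape[0])

def rmatvec(v):
    # (X - 1 mu^T)^T v = X^T v - (sum v) mu
    v = np.ravel(v)
    return X.T @ v - v.sum() * mu

A = LinearOperator(shape=X.shape, matvec=matvec, rmatvec=rmatvec,
                   dtype=np.float64)

# Truncated SVD of the implicitly centered matrix == PCA of X.
U, s, Vt = svds(A, k=10)
scores = U * s  # PCA scores, shape (500, 10)
```

Only matrix-vector products with the sparse X and the mean vector are ever computed, so memory stays proportional to the sparse data plus the requested components.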
This mechanism is implemented in my package: https://github.com/niitsuma/delayedsparse/
You can see the code of the PCA using this mechanism here: https://github.com/niitsuma/delayedsparse/blob/master/delayedsparse/pca.py
Performance comparisons to existing methods show that this mechanism drastically reduces the required memory size: https://github.com/niitsuma/delayedsparse/blob/master/demo-pca.sh
A more detailed description of this technique can be found in my patent: https://patentscope2.wipo.int/search/ja/detail.jsf?docId=JP225380312
khassan
Updated on November 13, 2020

Comments
-
khassan over 3 years: I am trying to apply PCA on a huge sparse matrix; the following link says that randomizedPCA of sklearn can handle a sparse matrix of scipy sparse format: Apply PCA on very large sparse matrix
However, I always get an error. Can someone point out what I am doing wrong?
Input matrix 'X_train' contains numbers in float64:
>>> type(X_train)
<class 'scipy.sparse.csr.csr_matrix'>
>>> X_train.shape
(2365436, 1617899)
>>> X_train.ndim
2
>>> X_train[0]
<1x1617899 sparse matrix of type '<type 'numpy.float64'>'
    with 81 stored elements in Compressed Sparse Row format>
I am trying to do:
>>> from sklearn.decomposition import RandomizedPCA
>>> pca = RandomizedPCA()
>>> pca.fit(X_train)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 567, in fit
    self._fit(check_array(X))
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 334, in check_array
    copy, force_all_finite)
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 239, in _ensure_sparse_format
    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
If I try to convert to a dense matrix, I run out of memory:
>>> pca.fit(X_train.toarray())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 949, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/coo.py", line 274, in toarray
    B = self._process_toarray_args(order, out)
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/base.py", line 800, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
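A quick back-of-the-envelope check (not part of the original thread) shows why `.toarray()` must fail here: a dense float64 copy of the question's matrix needs `n_samples * n_features * 8` bytes.

```python
# Memory required for a dense float64 copy of the question's matrix.
n_samples, n_features = 2365436, 1617899
bytes_needed = n_samples * n_features * 8
print(round(bytes_needed / 1e12, 1), "TB")  # about 30 TB
```

No realistic amount of RAM covers that, so the sparse representation has to be kept throughout.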
-
cel over 8 years: Did you see this answer in the question you have linked? stackoverflow.com/a/10721425/2272172
-
khassan over 8 years: Yes, but I want to know if there is a way to apply PCA on a huge sparse matrix (if possible using python and sklearn).
-
cel over 8 years: So you already used TruncatedSVD and it did not work? If so, please document that in your question as well.
-
khassan over 8 years: TruncatedSVD works if I set a small n_components, e.g. 100, but if I set it to 1,000,000 it fails.
-
khassan over 8 years: In fact, even setting n_components = 3000 for TruncatedSVD gives a MemoryError.
-
cel over 8 years: I am not familiar enough with the underlying math. Are you sure you can actually calculate a full PCA in the sparse case? That would imply that everything that happens during the transformation stays sparse, which seems like a bold assumption.
-
khassan over 8 years: @cel it would be great if it were possible; if I could build the PCA of the whole matrix, I could then select eigenvectors covering 99% or 99.99% of the variance.
-
khassan over 8 years: I see. I was able to decrease the number of features from 1.6M to 500 (just enough to fit in memory). It looks like it is impossible to perform SVD on a huge matrix unless you have very large RAM.
-
Ali Pardhan almost 2 years: Currently with sklearn, we can do pca = PCA(0.99), where 0.99 is the target total explained_variance_ratio_.
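This last comment can be illustrated with a small example (random dense data invented for the demo; note that `PCA` with a float `n_components` still requires a dense input, so it does not by itself solve the sparse-matrix memory problem discussed above):

```python
import numpy as np
from sklearn.decomposition import PCA

# A float n_components in (0, 1) keeps the smallest number of components
# whose cumulative explained variance reaches that fraction.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

pca = PCA(n_components=0.99)
Xt = pca.fit_transform(X)
print(pca.n_components_, "components,",
      round(pca.explained_variance_ratio_.sum(), 4), "variance explained")
```

The fitted `pca.n_components_` attribute reports how many components were actually kept.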