Performing PCA on large sparse matrix by using sklearn
Solution 1
Due to the nature of PCA, even if the input is a sparse matrix, the output is not. You can check this with a quick example:
>>> from sklearn.decomposition import TruncatedSVD
>>> from scipy import sparse as sp
Create a random sparse matrix with 0.01% of its data as non-zeros.
>>> X = sp.rand(1000, 1000, density=0.0001)
Apply PCA to it:
>>> clf = TruncatedSVD(100)
>>> Xpca = clf.fit_transform(X)
Now, check the results:
>>> type(X)
scipy.sparse.coo.coo_matrix
>>> type(Xpca)
numpy.ndarray
>>> import numpy as np
>>> print(np.count_nonzero(Xpca), Xpca.size)
95000 100000
which suggests that 95000 of the entries are non-zero, however,
>>> np.isclose(Xpca, 0, atol=1e-15).sum(), Xpca.size
(99481, 100000)
99481 elements are close to 0 (< 1e-15), but not exactly 0.
In short: for a PCA, even if the input is a sparse matrix, the output is not. Thus, if you try to extract 100,000,000 (1e8) components from your matrix, you will end up with a 1e8 x n_features (in your example 1e8 x 1617899) dense matrix, which of course cannot be held in memory.
I'm not an expert statistician, but I believe there is currently no workaround for this in scikit-learn. It is not a problem of scikit-learn's implementation; it is simply the mathematical nature of their sparse PCA (by means of sparse SVD) that makes the result dense.
The only workaround that might work for you is to start from a small number of components and increase it until you reach a balance between the data you can keep in memory and the percentage of the variance explained (which you can calculate as follows):
>>> clf.explained_variance_ratio_.sum()
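The loop suggested above can be sketched as follows (a minimal illustration, with an example variance target of 0.50 and an arbitrary schedule of component counts; these values are not from the original answer):

```python
from scipy import sparse as sp
from sklearn.decomposition import TruncatedSVD

# A small random sparse matrix standing in for the real data.
X = sp.rand(1000, 200, density=0.01, random_state=0)

target = 0.50  # fraction of variance we want explained (example value)
for n_components in (10, 25, 50, 100):
    clf = TruncatedSVD(n_components, random_state=0)
    Xpca = clf.fit_transform(X)
    explained = clf.explained_variance_ratio_.sum()
    print(n_components, "components ->", round(explained, 3), "explained")
    if explained >= target:
        break
```

Each iteration keeps only the dense `n_samples x n_components` score matrix in memory, so you can stop growing `n_components` as soon as either the variance target is met or memory becomes a concern.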
Solution 2
PCA(X) is SVD(X - mean(X)). Even if X is a sparse matrix, X - mean(X) is always a dense matrix. Thus, randomized SVD (TruncatedSVD) on it is not as efficient as randomized SVD of a sparse matrix. However, a delayed evaluation of X - mean(X) can avoid expanding the sparse matrix X into the dense matrix X - mean(X). Delayed evaluation enables efficient PCA of a sparse matrix using randomized SVD.
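The general idea of centering without materializing the dense matrix can be sketched with a scipy `LinearOperator` that applies X - mean(X) implicitly (this is an illustration of the concept, not the delayedsparse package's actual code):

```python
import numpy as np
from scipy import sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

X = sp.rand(500, 300, density=0.01, format='csr', random_state=0)
mu = np.asarray(X.mean(axis=0)).ravel()  # column means, shape (300,)

def matvec(v):
    # (X - 1 mu^T) v = X v - (mu . v) 1, without forming X - mean(X)
    v = np.ravel(v)
    return X @ v - (mu @ v) * np.ones(X.shape[0])

def rmatvec(v):
    # (X - 1 mu^T)^T v = X^T v - (sum v) mu
    v = np.ravel(v)
    return X.T @ v - v.sum() * mu

A = LinearOperator(shape=X.shape, matvec=matvec, rmatvec=rmatvec,
                   dtype=np.float64)

# Truncated SVD of the implicitly centered matrix == PCA of X.
U, s, Vt = svds(A, k=10)
scores = U * s  # PCA scores, shape (500, 10)
```

Only matrix-vector products with the sparse X and the mean vector are ever computed, so memory stays proportional to the sparse data plus the requested components.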
This mechanism is implemented in my package: https://github.com/niitsuma/delayedsparse/
You can see the code of the PCA using this mechanism here: https://github.com/niitsuma/delayedsparse/blob/master/delayedsparse/pca.py
Performance comparisons to existing methods show that this mechanism drastically reduces the required memory size: https://github.com/niitsuma/delayedsparse/blob/master/demo-pca.sh
A more detailed description of this technique can be found in my patent: https://patentscope2.wipo.int/search/ja/detail.jsf?docId=JP225380312
khassan
Updated on November 13, 2020

Comments
-
khassan over 3 years: I am trying to apply PCA on a huge sparse matrix; the following link says that randomizedPCA of sklearn can handle a sparse matrix of scipy sparse format: Apply PCA on very large sparse matrix
However, I always get an error. Can someone point out what I am doing wrong?
Input matrix 'X_train' contains numbers in float64:
>>> type(X_train)
<class 'scipy.sparse.csr.csr_matrix'>
>>> X_train.shape
(2365436, 1617899)
>>> X_train.ndim
2
>>> X_train[0]
<1x1617899 sparse matrix of type '<type 'numpy.float64'>'
    with 81 stored elements in Compressed Sparse Row format>
I am trying to do:
>>> from sklearn.decomposition import RandomizedPCA
>>> pca = RandomizedPCA()
>>> pca.fit(X_train)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 567, in fit
    self._fit(check_array(X))
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 334, in check_array
    copy, force_all_finite)
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 239, in _ensure_sparse_format
    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
If I try to convert to a dense matrix, I run out of memory:
>>> pca.fit(X_train.toarray())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 949, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/coo.py", line 274, in toarray
    B = self._process_toarray_args(order, out)
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/base.py", line 800, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
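A quick back-of-the-envelope check (not part of the original thread) shows why `.toarray()` must fail here: a dense float64 copy of the question's matrix needs `n_samples * n_features * 8` bytes.

```python
# Memory required for a dense float64 copy of the question's matrix.
n_samples, n_features = 2365436, 1617899
bytes_needed = n_samples * n_features * 8
print(round(bytes_needed / 1e12, 1), "TB")  # about 30 TB
```

No realistic amount of RAM covers that, so the sparse representation has to be kept throughout.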
-
cel over 8 years: Did you see this answer in the question you have linked? stackoverflow.com/a/10721425/2272172
-
khassan over 8 years: Yes, but I want to know if there is a way to apply PCA on a huge sparse matrix (if possible using python and sklearn).
-
cel over 8 years: So you already used TruncatedSVD and it did not work? If so, please document that in your question as well.
-
khassan over 8 years: TruncatedSVD works if I set a small n_components, e.g. 100, but if I set it to 1,000,000 it fails.
-
khassan over 8 years: In fact, even setting n_components = 3000 for TruncatedSVD gives a MemoryError.
-
cel over 8 years: I am not familiar enough with the underlying math. Are you sure you can actually calculate a full PCA in the sparse case? That would imply that everything that happens during the transformation stays sparse, which seems like a bold assumption.
-
khassan over 8 years: @cel it would be great if it were possible; if I could build the PCA of the whole matrix, I could then select eigenvectors covering 99% or 99.99% of the variance.
-
khassan over 8 years: I see. I was able to decrease the number of features from 1.6M to 500 (just enough to fit in memory). It looks like it is impossible to perform SVD on a huge matrix unless you have very large RAM.
-
Ali Pardhan almost 2 years: Currently with sklearn, we can do pca = PCA(0.99), where 0.99 is the target total explained_variance_ratio_.
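This last comment can be illustrated with a small example (random dense data invented for the demo; note that `PCA` with a float `n_components` still requires a dense input, so it does not by itself solve the sparse-matrix memory problem discussed above):

```python
import numpy as np
from sklearn.decomposition import PCA

# A float n_components in (0, 1) keeps the smallest number of components
# whose cumulative explained variance reaches that fraction.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

pca = PCA(n_components=0.99)
Xt = pca.fit_transform(X)
print(pca.n_components_, "components,",
      round(pca.explained_variance_ratio_.sum(), 4), "variance explained")
```

The fitted `pca.n_components_` attribute reports how many components were actually kept.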