Python MemoryError when doing fitting with Scikit-learn

Solution 1

Take a look at this part of your stack trace:

    531     def _pre_compute_svd(self, X, y):
    532         if sparse.issparse(X) and hasattr(X, 'toarray'):
--> 533             X = X.toarray()
    534         U, s, _ = np.linalg.svd(X, full_matrices=0)
    535         v = s ** 2

The algorithm you're using relies on numpy's linear algebra routines to do SVD. Those can't handle sparse matrices, so the code simply converts the input to a regular dense array. The first step of that conversion is to allocate an all-zero array of the full shape, which is then filled in with the values stored in the sparse matrix. Sounds easy enough, but let's do the math. A float64 element (the default dtype, which you're probably using if you don't know what you're using) takes 8 bytes. So, based on the array shape you've provided, the new zero-filled array will be:

183576 * 101507 * 8 = 149,073,992,256 bytes ~= 150 gigabytes
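
To double-check that figure, here is a minimal sketch of the same arithmetic in Python. The shape comes from the question; the 8-byte element size assumes float64, and the variable names are only for illustration.

```python
# Shape of X_train as reported in the question.
n_samples, n_features = 183576, 101507

# Assumption: the matrix holds float64 values, 8 bytes each.
bytes_per_element = 8

# Memory needed by the dense array that X.toarray() tries to allocate.
dense_bytes = n_samples * n_features * bytes_per_element

print("bytes:", dense_bytes)
print("decimal GB: %.1f" % (dense_bytes / 1e9))
print("binary GiB: %.1f" % (dense_bytes / 2.0**30))
```

The two printed sizes differ only in whether a gigabyte is counted as 10^9 or 2^30 bytes.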

Your system's memory manager almost certainly refused that allocation outright, since it is roughly six times the 24GB of RAM you have. But what can you do about it?

First off, that looks like a fairly ridiculous number of features. I don't know anything about your problem domain or what your features are, but my gut reaction is that you need to do some dimensionality reduction here.
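
One possible way to do that reduction without ever densifying the input is scikit-learn's TruncatedSVD, which accepts sparse matrices directly. This is only a sketch on small random stand-in data; the matrix sizes, n_components=10, and alpha=1.0 are arbitrary placeholders, not recommendations for your data.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)

# Small random stand-in for the real data: mostly zeros, stored as CSR.
dense = rng.rand(200, 500)
dense[dense < 0.95] = 0.0
X = sparse.csr_matrix(dense)
y = rng.rand(200)

# Project the 500 sparse features down to 10 dense components.
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X)

# The reduced matrix is small and dense, so an ordinary Ridge fit is cheap.
model = Ridge(alpha=1.0).fit(X_reduced, y)
print(X_reduced.shape)
```

Whether a low-rank projection keeps enough signal depends entirely on your features, so treat this as something to try rather than a guaranteed fix.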

Second, you can try to fix the algorithm's mishandling of sparse matrices. It's choking on numpy.linalg.svd here, so you might be able to use scipy.sparse.linalg.svds instead. I don't know the algorithm in question, but it might not be amenable to sparse matrices. Even if you use the appropriate sparse linear algebra routines, it might produce (or internally use) some non-sparse matrices with sizes similar to your data. Using a sparse matrix representation to represent non-sparse data will only result in using more space than you would have originally, so this approach might not work. Proceed with caution.
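
As a rough illustration of what the sparse routine looks like, the sketch below calls scipy.sparse.linalg.svds on a small random sparse matrix. Note that svds only computes the k largest singular triplets (k=5 is an arbitrary choice here), so it is not a drop-in replacement for the full np.linalg.svd call in the traceback.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

rng = np.random.RandomState(0)

# Small random stand-in matrix, roughly 10% nonzero, stored as CSR.
dense = rng.rand(60, 40)
dense[dense < 0.9] = 0.0
X = sparse.csr_matrix(dense)

# Compute only the k largest singular values and vectors,
# working on the sparse matrix without converting it to dense.
k = 5
U, s, Vt = svds(X, k=k)

print(U.shape, s.shape, Vt.shape)
```

Keep in mind that the returned U has shape (n_samples, k) and is dense, which is the kind of non-sparse intermediate result the caveat above refers to.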

Solution 2

The relevant option here is gcv_mode. It can take 3 values: "auto", "svd" and "eigen". By default, it is set to "auto", which has the following behavior: use the svd mode if n_samples > n_features, otherwise use the eigen mode.

Since in your case n_samples > n_features, the svd mode is chosen. However, the svd mode currently doesn't handle sparse data properly. scikit-learn should be fixed to use proper sparse SVD instead of the dense SVD.

As a workaround, I would force the eigen mode by passing gcv_mode="eigen", since this mode should properly handle sparse data. However, n_samples is quite large in your case. Since the eigen mode builds a kernel matrix (and thus has n_samples ** 2 memory complexity), the kernel matrix may not fit in memory. In that case, I would just reduce the number of samples (the eigen mode can handle a very large number of features without problem, though).
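
A minimal sketch of that workaround follows, using a small random sparse matrix as a stand-in for the real data (the sizes are arbitrary assumptions). It also estimates the size of the n_samples x n_samples kernel matrix for the shape in the question, assuming it is stored as dense float64.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import RidgeCV

rng = np.random.RandomState(0)

# Small random stand-in for the real data, stored as CSR.
dense = rng.rand(50, 300)
dense[dense < 0.95] = 0.0
X_small = sparse.csr_matrix(dense)
y_small = rng.rand(50)

# Same candidate alphas as in the question.
alphas = [1e-3, 1e-2, 1e-1, 1e0, 1e1]

# Force the eigen mode instead of letting "auto" pick one.
rr = RidgeCV(alphas=alphas, gcv_mode="eigen")
rr.fit(X_small, y_small)
print("selected alpha:", rr.alpha_)

# Assumption: the kernel matrix is dense float64 (8 bytes per entry).
n_samples = 183576
kernel_bytes = n_samples ** 2 * 8
print("kernel matrix, decimal GB: %.1f" % (kernel_bytes / 1e9))
```

For the full data set that estimate is even larger than the dense feature matrix, which is why the sample count would have to be cut substantially before the eigen mode becomes practical.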

In any case, since both n_samples and n_features are quite large, you are pushing this implementation to its limits (even with a proper sparse SVD).

Also see https://github.com/scikit-learn/scikit-learn/issues/1921

Author: Nyxynyx

Updated on May 04, 2020

Comments

  • Nyxynyx, almost 4 years ago

    I am running Python 2.7 (64-bit) on a Windows 8 64-bit system with 24GB memory. Fitting with the usual sklearn.linear_model.Ridge runs fine.

    Problem: However, when fitting with sklearn.linear_model.RidgeCV(alphas=alphas), I run into the MemoryError shown below on the line rr.fit(X_train, y_train), which executes the fitting procedure.

    How can I prevent this error?

    Code snippet

    from sklearn.linear_model import RidgeCV


    def fit(X_train, y_train):
        alphas = [1e-3, 1e-2, 1e-1, 1e0, 1e1]

        rr = RidgeCV(alphas=alphas)
        rr.fit(X_train, y_train)

        return rr


    rr = fit(X_train, y_train)
    

    Error

    MemoryError                               Traceback (most recent call last)
    <ipython-input-41-a433716e7179> in <module>()
          1 # Fit Training set
    ----> 2 rr = fit(X_train, y_train)
    
    <ipython-input-35-9650bd58e76c> in fit(X_train, y_train)
          3 
          4     rr = RidgeCV(alphas=alphas)
    ----> 5     rr.fit(X_train, y_train)
          6 
          7     return rr
    
    C:\Python27\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
        696                                   gcv_mode=self.gcv_mode,
        697                                   store_cv_values=self.store_cv_values)
    --> 698             estimator.fit(X, y, sample_weight=sample_weight)
        699             self.alpha_ = estimator.alpha_
        700             if self.store_cv_values:
    
    C:\Python27\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
        608             raise ValueError('bad gcv_mode "%s"' % gcv_mode)
        609 
    --> 610         v, Q, QT_y = _pre_compute(X, y)
        611         n_y = 1 if len(y.shape) == 1 else y.shape[1]
        612         cv_values = np.zeros((n_samples * n_y, len(self.alphas)))
    
    C:\Python27\lib\site-packages\sklearn\linear_model\ridge.pyc in _pre_compute_svd(self, X, y)
        531     def _pre_compute_svd(self, X, y):
        532         if sparse.issparse(X) and hasattr(X, 'toarray'):
    --> 533             X = X.toarray()
        534         U, s, _ = np.linalg.svd(X, full_matrices=0)
        535         v = s ** 2
    
    C:\Python27\lib\site-packages\scipy\sparse\compressed.pyc in toarray(self, order, out)
        559     def toarray(self, order=None, out=None):
        560         """See the docstring for `spmatrix.toarray`."""
    --> 561         return self.tocoo(copy=False).toarray(order=order, out=out)
        562 
        563     ##############################################################
    
    C:\Python27\lib\site-packages\scipy\sparse\coo.pyc in toarray(self, order, out)
        236     def toarray(self, order=None, out=None):
        237         """See the docstring for `spmatrix.toarray`."""
    --> 238         B = self._process_toarray_args(order, out)
        239         fortran = int(B.flags.f_contiguous)
        240         if not fortran and not B.flags.c_contiguous:
    
    C:\Python27\lib\site-packages\scipy\sparse\base.pyc in _process_toarray_args(self, order, out)
        633             return out
        634         else:
    --> 635             return np.zeros(self.shape, dtype=self.dtype, order=order)
        636 
        637 
    
    MemoryError: 
    

    Code

    print type(X_train)
    print X_train.shape
    

    Result

    <class 'scipy.sparse.csr.csr_matrix'>
    (183576, 101507)