scikit-learn joblib bug: multiprocessing pool self.value out of range for 'i' format code, only with large numpy arrays


As a workaround you can try to memory map your data explicitly and manually, as explained in the joblib documentation.

Edit #1: Here is the important part:

from sklearn.externals import joblib

# Dump the big array to disk once, then reload it as a memory-mapped array so
# that the worker processes can share it instead of receiving a pickled copy.
joblib.dump(X_train, some_filename)   # some_filename: any writable file path
X_train = joblib.load(some_filename, mmap_mode='r+')

Then pass this memmapped data to GridSearchCV under scikit-learn 0.15+.
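Put together, a minimal end-to-end sketch against the versions from the question (scikit-learn 0.15 with its bundled joblib); the dump path and the small stand-in data are placeholders, not part of the original code:

import numpy as np
from sklearn.externals import joblib
from sklearn import grid_search
from sklearn.linear_model import SGDClassifier


def main():
    # Stand-in data; in the real case X_train is the ~8 GB array from the question.
    X_train = np.random.randn(10000, 100)
    y_train = np.random.randint(0, 2, size=10000)

    # Dump once, then reload as a memmap so the worker processes re-open the
    # file on disk instead of unpickling a full copy of the array.
    joblib.dump(X_train, 'X_train.joblib')                    # placeholder path
    X_train = joblib.load('X_train.joblib', mmap_mode='r+')   # now a numpy.memmap

    model = SGDClassifier(penalty='elasticnet', n_iter=10, shuffle=True)
    param_grid = {'alpha': 10.0 ** -np.arange(1, 7)}
    gs = grid_search.GridSearchCV(model, param_grid, n_jobs=8, verbose=10)
    gs.fit(X_train, y_train)

if __name__ == '__main__':   # guard required for multiprocessing workers on Windows
    main()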

Edit #2: Furthermore, if you use the 32-bit version of Anaconda, each Python process will be limited to 2 GB of address space, which can also be the limiting factor here.
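One quick way to check whether the interpreter you are running is a 32-bit or 64-bit build:

import struct
import sys

# 32 on a 32-bit build, 64 on a 64-bit build
print("%d-bit Python" % (struct.calcsize('P') * 8))
print("64-bit: %s" % (sys.maxsize > 2 ** 32))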

I just found a bug in numpy.save under Python 3.4, but even once that is fixed, the subsequent call to mmap will fail with:

OSError: [WinError 8] Not enough storage is available to process this command

So please use a 64-bit version of Python (with Anaconda, as AFAIK there are currently no other 64-bit packages for numpy / scipy / scikit-learn==0.15.0b1).

Edit #3: I found another issue that might be causing excessive memory usage under Windows: currently joblib.Parallel memory maps input data with mmap_mode='c' by default. This copy-on-write setting seems to cause Windows to exhaust the paging file and sometimes triggers "[error 1455] the paging file is too small for this operation to complete" errors. Setting mmap_mode='r' or mmap_mode='r+' does not trigger that problem. I will run tests to see if I can change the default mode in the next version of joblib.
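If you drive joblib.Parallel yourself rather than through GridSearchCV, you can already override that default. A minimal sketch, assuming the joblib 0.8 API bundled with scikit-learn 0.15 (the data shape and thresholds are arbitrary examples):

import numpy as np
from sklearn.externals.joblib import Parallel, delayed

if __name__ == '__main__':   # guard required for multiprocessing on Windows
    X = np.random.randn(200000, 100)   # ~160 MB of float64

    # Arguments larger than max_nbytes are dumped to a temporary file and passed
    # to the workers as memmaps; mmap_mode='r' avoids the copy-on-write ('c')
    # pages that can exhaust the Windows paging file.
    means = Parallel(n_jobs=4, max_nbytes='10M', mmap_mode='r')(
        delayed(np.mean)(X[i::4]) for i in range(4))
    print(means)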

Comments

  • László, almost 2 years ago:

    My code runs fine with smaller test samples, like 10,000 rows of data in X_train and y_train. When I run it on millions of rows, I get the error below. Is the bug in a package, or can I do something differently? I am using Python 2.7.7 from Anaconda 2.0.1, and I have put the pool.py from Anaconda's multiprocessing package and the parallel.py from scikit-learn's externals on my Dropbox for you.

    The test script is:

    import numpy as np
    import sklearn
    from sklearn.linear_model import SGDClassifier
    from sklearn import grid_search
    import multiprocessing as mp
    
    
    def main():
        print("Started.")
    
        print("numpy:", np.__version__)
        print("sklearn:", sklearn.__version__)
    
        n_samples = 1000000
        n_features = 1000

        # ~8 GB of float64 training data (1e6 x 1e3 x 8 bytes)
        X_train = np.random.randn(n_samples, n_features)
        y_train = np.random.randint(0, 2, size=n_samples)
    
        print("input data size: %.3fMB" % (X_train.nbytes / 1e6))
    
        model = SGDClassifier(penalty='elasticnet', n_iter=10, shuffle=True)
        param_grid = [{
            'alpha' : 10.0 ** -np.arange(1,7),
            'l1_ratio': [.05, .15, .5, .7, .9, .95, .99, 1],
        }]
        # n_jobs=8 spawns worker processes; joblib memmaps the big array to a temp file
        gs = grid_search.GridSearchCV(model, param_grid, n_jobs=8, verbose=100)
        gs.fit(X_train, y_train)
        print(gs.grid_scores_)
    
    if __name__=='__main__':
        mp.freeze_support()
        main()
    

    This results in the output:

    Vendor:  Continuum Analytics, Inc.
    Package: mkl
    Message: trial mode expires in 28 days
    Started.
    ('numpy:', '1.8.1')
    ('sklearn:', '0.15.0b1')
    input data size: 8000.000MB
    Fitting 3 folds for each of 48 candidates, totalling 144 fits
    Memmaping (shape=(1000000L, 1000L), dtype=float64) to new file c:\users\laszlos\appdata\local\temp\4\joblib_memmaping_pool_6172_78765976\6172-284752304-75223296-0.pkl
    Failed to save <type 'numpy.ndarray'> to .npy file:
    Traceback (most recent call last):
      File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 240, in save
        obj, filename = self._write_array(obj, filename)
      File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 203, in _write_array
        self.np.save(filename, array)
      File "C:\Anaconda\lib\site-packages\numpy\lib\npyio.py", line 453, in save
        format.write_array(fid, arr)
      File "C:\Anaconda\lib\site-packages\numpy\lib\format.py", line 406, in write_array
        array.tofile(fp)
    ValueError: 1000000000 requested and 268435456 written
    
    Memmaping (shape=(1000000L, 1000L), dtype=float64) to old file c:\users\laszlos\appdata\local\temp\4\joblib_memmaping_pool_6172_78765976\6172-284752304-75223296-0.pkl
    [the MKL "trial mode expires in 28 days" banner is printed again, once by each of the 8 worker processes]
    Traceback (most recent call last):
      File "S:\laszlo\gridsearch_largearray.py", line 33, in <module>
        main()
      File "S:\laszlo\gridsearch_largearray.py", line 28, in main
        gs.fit(X_train, y_train)
      File "C:\Anaconda\lib\site-packages\sklearn\grid_search.py", line 597, in fit
        return self._fit(X, y, ParameterGrid(self.param_grid))
      File "C:\Anaconda\lib\site-packages\sklearn\grid_search.py", line 379, in _fit
        for parameters in parameter_iterable
      File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\parallel.py", line 651, in __call__
        self.retrieve()
      File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\parallel.py", line 503, in retrieve
        self._output.append(job.get())
      File "C:\Anaconda\lib\multiprocessing\pool.py", line 558, in get
        raise self._value
    struct.error: integer out of range for 'i' format code
    

    EDIT: ogrisel's answer does work with manual memory mapping on scikit-learn 0.15.0b1. Don't forget to run only one such script at a time, otherwise you can still run out of memory by spawning too many workers. (My run takes ~60 GB of RAM on data that is ~12.5 GB as CSV, with 8 workers.)