Pickling large NumPy array

Solution 1

With regard to your first question, this is a bug, and an old one at that. There's an enlightening discussion about it here: http://python.6.x6.nabble.com/test-gzip-test-tarfile-failure-om-AMD64-td1830323.html

The reason for the error is described here: http://www.littleredbat.net/mk/files/grimoire.html#contents_item_2.1, which says:

"The simplest and most basic type are integers, which are represented as a C long. Their size is therefore dependent on the platform you're using; on a 32-bit machine, they can range from -2147483647 to 2147483647. Python programs can determine the highest possible value for an integer by looking at sys.maxint; the lowest possible value will usually be -sys.maxint - 1."
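
To make the limit concrete, here is a rough sketch of the arithmetic for the array shape used in the question. It assumes 8-byte float64 elements and that the byte count has to fit in a signed 32-bit integer, as the `pack("<i", n)` call in the traceback suggests. The helper name `array_nbytes` is made up for illustration, and the snippet is written for Python 3.

```python
INT32_MAX = 2**31 - 1   # largest value a signed 32-bit integer can hold
FLOAT64_BYTES = 8       # assumed element size (np.random.random returns float64)

def array_nbytes(shape, itemsize=FLOAT64_BYTES):
    """Number of bytes in the raw data buffer of an array with this shape."""
    n = itemsize
    for dim in shape:
        n *= dim
    return n

for rows in (20000, 30000, 100000):
    n = array_nbytes((rows, 200, 50))
    print(rows, n, n <= INT32_MAX)
# 20000 1600000000 True
# 30000 2400000000 False
# 100000 8000000000 False
```

Under these assumptions, a 20000-row slice stays below the limit while 30000 rows and the full array do not.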

This error is not a common one, because most people, when faced with a very large numpy array, will use np.save or np.savez. These write NumPy's own binary .npy/.npz format (a short header followed by the raw data) instead of pushing the data buffer through pickle as one huge string. Pickle, by contrast, relies on the array's __reduce__ method, shown further down.

To show that the error is purely about the array being too large for pickle (the first slice below is 1.6 GB, the second is 2.4 GB):

>>> import numpy as np
>>> import pickle
>>> test_rand = np.random.random((100000,200,50))
>>> x = pickle.dumps(test_rand[:20000], -1)
>>> x = pickle.dumps(test_rand[:30000], -1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.3.dev0-py2.7.egg/dill/dill.py", line 194, in dumps
    dump(obj, file, protocol, byref, fmode)#, strictio)
  File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.3.dev0-py2.7.egg/dill/dill.py", line 184, in dump
    pik.dump(obj)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.3.dev0-py2.7.egg/dill/dill.py", line 181, in save_numpy_array
    pik.save_reduce(_create_array, (f, args, state, npdict), obj=obj)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 401, in save_reduce
    save(args)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 562, in save_tuple
    save(element)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 562, in save_tuple
    save(element)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 486, in save_string
    self.write(BINSTRING + pack("<i", n) + obj)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
>>> 

However, calling __reduce__ directly works for the full array:

>>> x = test_rand.__reduce__()
>>> type(x)
<type 'tuple'>
>>> x[0]     
<built-in function _reconstruct>
>>> x[1]
(<type 'numpy.ndarray'>, (0,), 'b')
>>> x[2][0:3]
(1, (100000, 200, 50), dtype('float64'))
>>> len(x[2][4])
8000000000
>>> x[2][4][:100]
'Y\xa4}\xdf\x84\xdf\xe1?\xfe\x1fd\xe3\xf2\xab\xe2?\x80\xe4\xfe\x17\xfb\xd6\xc2?\xd73\x92\xc9N]\xe8?\x90\xbc\xe3@\xdcO\xc9?\x18\x9dX\x12MG\xc4?(\x0f\x8f\xf9}\xf6\xb1?\xd0\x90O\xe2\x9b\xf1\xed?_\x99\x06\xacY\x9e\xe2?\xe7\xf8\x15\xa8\x13\x91\xe2?\x96}\xffH\xda\xc3\xd4?@\t\xae_"\xe0\xda?y<%\x8a'

And if you'd like to burn out your fan, print x.

You'll also notice that a reference to the function in x[0] gets saved along with the data. On loading, pickle calls that function to rebuild the numpy array from the pickled state.
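
For the np.save route mentioned above, a minimal sketch might look like the following. It uses a much smaller stand-in array and a temporary directory so it runs quickly; the file name `test.npy` and the variable names are placeholders, not anything from the original post.

```python
import os
import tempfile

import numpy as np

# Small stand-in for the (100000, 200, 50) array from the question.
test_rand = np.random.random((100, 200, 50))

path = os.path.join(tempfile.mkdtemp(), "test.npy")
np.save(path, test_rand)      # writes a short header followed by the raw data

loaded = np.load(path)        # reads the whole array back into memory
print(loaded.shape, loaded.dtype)

# With mmap_mode="r" the file is memory-mapped, so only the parts that are
# actually indexed get read from disk.
mapped = np.load(path, mmap_mode="r")
print(mapped[:5].shape)
```

The memory-mapped form can be handy when the array is larger than you want to hold in RAM at once.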

Solution 2

As an alternative to pickle, especially for very large datasets, you may wish to consider a Python interface to a binary data format such as HDF5 (e.g., h5py). For a discussion of its pros and cons, see this question and the first answer.
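
As a rough illustration of the h5py route, the sketch below writes an array to an HDF5 file and reads a slice back. It assumes h5py is installed; the dataset name "test_rand", the file name, and the small stand-in shape are arbitrary choices for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Small stand-in for the (100000, 200, 50) array from the question.
test_rand = np.random.random((100, 200, 50))
path = os.path.join(tempfile.mkdtemp(), "test.h5")

# Write the array as a single named dataset.
with h5py.File(path, "w") as f:
    f.create_dataset("test_rand", data=test_rand)

# Read it back; slicing the dataset only loads the requested rows.
with h5py.File(path, "r") as f:
    dset = f["test_rand"]
    shape = dset.shape
    first_rows = dset[:10]

print(shape, first_rows.shape)
```

Because slices are read on demand, you don't have to load the whole dataset to inspect part of it.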

Solution 3

To answer the first question, "What is actually going on in this error?", here is my guess.

Pickle is trying to save your NumPy array as packed binary data. It's saving each integer as a four-byte signed integer (the "i" code). However, numpy.random.random creates floats (which should be eight-byte "d" codes rather than four-byte "i" codes). I have no idea why pickle would do it this way. It's also entirely possible that the "i" is actually for saving some piece of information other than the values of your array. I'm just guessing that the error arises because a value of your array does not fit in four bytes.
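
One way to check what the "i" code is being used for is to look at the failing line in the question's traceback, `self.write(BINSTRING + pack("<i", n) + obj)`. The sketch below imitates that framing in Python 3, where the payload is a bytes object. The function name `frame_binstring` is made up for illustration, and this is a simplified imitation, not pickle's actual implementation.

```python
import struct

BINSTRING = b"T"  # pickle opcode for a string with a 4-byte length prefix

def frame_binstring(payload):
    """Imitate the traceback line: opcode + 4-byte signed length + raw bytes."""
    return BINSTRING + struct.pack("<i", len(payload)) + payload

# "i" is a 4-byte signed integer, "d" an 8-byte double.
print(struct.calcsize("<i"), struct.calcsize("<d"))   # 4 8

framed = frame_binstring(b"\x00" * 16)
print(framed[:5])                                     # b'T\x10\x00\x00\x00'
```

In this reading, the packed integer is the length of the byte string, which would explain why only the total size of the array matters and not its individual values.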

What versions of Python and NumPy are you using?

Author by

David Kelley

Updated on June 04, 2022

Comments

  • David Kelley about 2 years

    I have a large 3d numpy array that I'd like to preserve. My first approach is simply to use pickle, but this seems to lead to a poorly explained error.

    import pickle
    import numpy as np

    test_rand = np.random.random((100000,200,50))
    with open('models/test.pkl', 'wb') as save_file:
        pickle.dump(test_rand, save_file, -1)
    
    ---------------------------------------------------------------------------
    error                                     Traceback (most recent call last)
    <ipython-input-18-511e30b08440> in <module>()
          1 with open('models/test.pkl', 'wb') as save_file:
    ----> 2         pickle.dump(test_rand, save_file, -1)
          3 
    
    C:\Users\g1dak02\AppData\Local\Continuum\Anaconda\lib\pickle.pyc in dump(obj, file, protocol)
       1368 
       1369 def dump(obj, file, protocol=None):
    -> 1370     Pickler(file, protocol).dump(obj)
       1371 
       1372 def dumps(obj, protocol=None):
    
    C:\Users\g1dak02\AppData\Local\Continuum\Anaconda\lib\pickle.pyc in dump(self, obj)
        222         if self.proto >= 2:
        223             self.write(PROTO + chr(self.proto))
    --> 224         self.save(obj)
        225         self.write(STOP)
        226 
    
    C:\Users\g1dak02\AppData\Local\Continuum\Anaconda\lib\pickle.pyc in save(self, obj)
        329 
        330         # Save the reduce() output and finally memoize the object
    --> 331         self.save_reduce(obj=obj, *rv)
        332 
        333     def persistent_id(self, obj):
    
    C:\Users\g1dak02\AppData\Local\Continuum\Anaconda\lib\pickle.pyc in save_reduce(self, func, args, state, listitems, dictitems, obj)
        417 
        418         if state is not None:
    --> 419             save(state)
        420             write(BUILD)
        421 
    
    C:\Users\g1dak02\AppData\Local\Continuum\Anaconda\lib\pickle.pyc in save(self, obj)
        284         f = self.dispatch.get(t)
        285         if f:
    --> 286             f(self, obj) # Call unbound method with explicit self
        287             return
        288 
    
    C:\Users\g1dak02\AppData\Local\Continuum\Anaconda\lib\pickle.pyc in save_tuple(self, obj)
        560         write(MARK)
        561         for element in obj:
    --> 562             save(element)
        563 
        564         if id(obj) in memo:
    
    C:\Users\g1dak02\AppData\Local\Continuum\Anaconda\lib\pickle.pyc in save(self, obj)
        284         f = self.dispatch.get(t)
        285         if f:
    --> 286             f(self, obj) # Call unbound method with explicit self
        287             return
        288 
    
    C:\Users\g1dak02\AppData\Local\Continuum\Anaconda\lib\pickle.pyc in save_string(self, obj, pack)
        484                 self.write(SHORT_BINSTRING + chr(n) + obj)
        485             else:
    --> 486                 self.write(BINSTRING + pack("<i", n) + obj)
        487         else:
        488             self.write(STRING + repr(obj) + '\n')
    
    error: integer out of range for 'i' format code
    

    So the two questions I have are as follows:

    • What is actually going on in this error?
    • How should I go about saving the array to disk?

    I am using Python 2.7.8 and NumPy 1.9.0.