Pickle File too large to load

Solution 1

Looks like you're in a bit of a pickle! ;-) Hopefully after this, you'll never want to use pickle again. It's just not a very good data storage format.

Anyways, for this answer I'm assuming your Document class looks a bit like this. If not, comment with your actual Document class:

class Document(object): # <-- object part is very important! If it's not there, the format is different!
    def __init__(self, title, date, text): # assuming all strings
        self.title = title
        self.date = date
        self.text = text

I made some simple test data with this class:

d = [Document(title='foo', text='foo is good', date='1/1/1'), Document(title='bar', text='bar is better', date='2/2/2'), Document(title='baz', text='no one likes baz :(', date='3/3/3')]

Pickled it with protocol 2 (pickle.HIGHEST_PROTOCOL for Python 2.x):

>>> s = pickle.dumps(d, 2)
>>> s
'\x80\x02]q\x00(c__main__\nDocument\nq\x01)\x81q\x02}q\x03(U\x04dateq\x04U\x051/1/1q\x05U\x04textq\x06U\x0bfoo is goodq\x07U\x05titleq\x08U\x03fooq\tubh\x01)\x81q\n}q\x0b(h\x04U\x052/2/2q\x0ch\x06U\rbar is betterq\rh\x08U\x03barq\x0eubh\x01)\x81q\x0f}q\x10(h\x04U\x053/3/3q\x11h\x06U\x13no one likes baz :(q\x12h\x08U\x03bazq\x13ube.'

And disassembled it with pickletools:

>>> pickletools.dis(s)
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: (    MARK
    6: c        GLOBAL     '__main__ Document'
   25: q        BINPUT     1
   27: )        EMPTY_TUPLE
   28: \x81     NEWOBJ
   29: q        BINPUT     2
   31: }        EMPTY_DICT
   32: q        BINPUT     3
   34: (        MARK
   35: U            SHORT_BINSTRING 'date'
   41: q            BINPUT     4
   43: U            SHORT_BINSTRING '1/1/1'
   50: q            BINPUT     5
   52: U            SHORT_BINSTRING 'text'
   58: q            BINPUT     6
   60: U            SHORT_BINSTRING 'foo is good'
   73: q            BINPUT     7
   75: U            SHORT_BINSTRING 'title'
   82: q            BINPUT     8
   84: U            SHORT_BINSTRING 'foo'
   89: q            BINPUT     9
   91: u            SETITEMS   (MARK at 34)
   92: b        BUILD
   93: h        BINGET     1
   95: )        EMPTY_TUPLE
   96: \x81     NEWOBJ
   97: q        BINPUT     10
   99: }        EMPTY_DICT
  100: q        BINPUT     11
  102: (        MARK
  103: h            BINGET     4
  105: U            SHORT_BINSTRING '2/2/2'
  112: q            BINPUT     12
  114: h            BINGET     6
  116: U            SHORT_BINSTRING 'bar is better'
  131: q            BINPUT     13
  133: h            BINGET     8
  135: U            SHORT_BINSTRING 'bar'
  140: q            BINPUT     14
  142: u            SETITEMS   (MARK at 102)
  143: b        BUILD
  144: h        BINGET     1
  146: )        EMPTY_TUPLE
  147: \x81     NEWOBJ
  148: q        BINPUT     15
  150: }        EMPTY_DICT
  151: q        BINPUT     16
  153: (        MARK
  154: h            BINGET     4
  156: U            SHORT_BINSTRING '3/3/3'
  163: q            BINPUT     17
  165: h            BINGET     6
  167: U            SHORT_BINSTRING 'no one likes baz :('
  188: q            BINPUT     18
  190: h            BINGET     8
  192: U            SHORT_BINSTRING 'baz'
  197: q            BINPUT     19
  199: u            SETITEMS   (MARK at 153)
  200: b        BUILD
  201: e        APPENDS    (MARK at 5)
  202: .    STOP

Looks complex! But really, it's not so bad. pickle is basically a stack machine: each ALL_CAPS identifier you see is an opcode, which manipulates the internal "stack" in some way during decoding. If we were trying to parse some complex structure, this would matter more, but luckily we're just making a simple list of essentially-tuples. All this "code" is doing is constructing a bunch of objects on the stack, and then pushing the entire stack into a list.

The one thing we DO need to care about is the BINPUT / BINGET opcodes you see scattered around. Basically, these are for 'memoization': to reduce the data footprint, pickle saves a string the first time with BINPUT <id>, and then if it comes up again, instead of re-dumping it, simply emits a BINGET <id> to retrieve it from the cache.
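
You can see the memoization in action with a tiny sketch (Python 2; any object dumped twice works the same way):

import pickle, pickletools

s = 'spam'
# The string is written once (SHORT_BINSTRING + BINPUT), and the second
# occurrence is just a short BINGET reference back into the memo.
pickletools.dis(pickle.dumps([s, s], 2))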

Also, another complication! There's more than just SHORT_BINSTRING - there's the regular BINSTRING for strings longer than 255 bytes, and some fun unicode variants as well. I'll just assume that you're using Python 2 with all ASCII strings. Again, comment if this isn't a correct assumption.
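
To see the difference between the two (a minimal sketch, Python 2): SHORT_BINSTRING ('U') carries a one-byte length, while BINSTRING ('T') carries a four-byte little-endian length:

import pickle, pickletools

pickletools.dis(pickle.dumps('x' * 10, 2))   # 'U' SHORT_BINSTRING, 1-byte length
pickletools.dis(pickle.dumps('x' * 300, 2))  # 'T' BINSTRING, 4-byte little-endian length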

OK, so we need to stream the file until we hit a '\x81' byte (NEWOBJ). Then, we need to scan forward until we hit a '(' (MARK) character. Then, until we hit a 'u' (SETITEMS), we read pairs of key/value strings - there should be 3 pairs total, one for each field.

So, let's do this. Here's my script to read pickle data in a streaming fashion. It's far from perfect, since I just hacked it together for this answer, and you'll need to modify it a lot, but it's a good start.

pickledata = '\x80\x02]q\x00(c__main__\nDocument\nq\x01)\x81q\x02}q\x03(U\x04dateq\x04U\x051/1/1q\x05U\x04textq\x06U\x0bfoo is goodq\x07U\x05titleq\x08U\x03fooq\tubh\x01)\x81q\n}q\x0b(h\x04U\x052/2/2q\x0ch\x06T\x14\x05\x00\x00bar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterq\rh\x08U\x03barq\x0eubh\x01)\x81q\x0f}q\x10(h\x04U\x053/3/3q\x11h\x06U\x13no one likes baz :(q\x12h\x08U\x03bazq\x13ube.'

# simulate a file here
import StringIO
picklefile = StringIO.StringIO(pickledata)

import pickle # just for opcode names
import struct # binary unpacking

def try_memo(f, v, cache):
    # If the next opcode is a BINPUT, remember v under its one-byte memo key.
    opcode = f.read(1)
    if opcode == pickle.BINPUT:
        cache[f.read(1)] = v
    elif opcode == pickle.LONG_BINPUT:
        print 'skipping LONG_BINPUT to save memory, LONG_BINGET will probably not be used'
        f.read(4)
    else:
        f.seek(f.tell() - 1) # not a memo opcode; rewind

def try_read_string(f, opcode, cache):
    if opcode in [ pickle.SHORT_BINSTRING, pickle.BINSTRING ]:
        # SHORT_BINSTRING has a 1-byte length, BINSTRING a 4-byte little-endian one
        length_type = 'B' if opcode == pickle.SHORT_BINSTRING else '<i'
        str_length = struct.unpack(length_type, f.read(struct.calcsize(length_type)))[0]
        value = f.read(str_length)
        try_memo(f, value, cache)
        return value
    elif opcode == pickle.BINGET:
        return cache[f.read(1)]
    elif opcode == pickle.LONG_BINGET:
        raise Exception('Unexpected LONG_BINGET? Key ' + f.read(4))
    else:
        raise Exception('Invalid opcode ' + repr(opcode) + ' at pos ' + str(f.tell()))

memo_cache = {}
while True:
    c = picklefile.read(1)
    if c == pickle.NEWOBJ:
        while picklefile.read(1) != pickle.MARK:
            pass # scan forward to field instantiation
        fields = {}
        while True:
            opcode = picklefile.read(1)
            if opcode == pickle.SETITEMS:
                break
            key = try_read_string(picklefile, opcode, memo_cache)
            value = try_read_string(picklefile, picklefile.read(1), memo_cache)
            fields[key] = value
        print 'Document', fields
        # insert into sqlite here
    elif c == pickle.STOP:
        break

This correctly reads my test data in pickle protocol 2 (modified to have a long string):

$ python picklereader.py
Document {'date': '1/1/1', 'text': 'foo is good', 'title': 'foo'}
Document {'date': '2/2/2', 'text': 'bar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is better', 'title': 'bar'}
Document {'date': '3/3/3', 'text': 'no one likes baz :(', 'title': 'baz'}
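
Since the end goal is a SQL database, here's a rough sketch of what the "# insert into sqlite here" line could become. The database file, table, and column names are my own inventions, not anything from the original question:

import sqlite3

conn = sqlite3.connect('documents.db')  # hypothetical file name
conn.execute('CREATE TABLE IF NOT EXISTS documents (title TEXT, date TEXT, text TEXT)')

def insert_document(fields):
    # call this from the reader loop in place of the print statement
    conn.execute('INSERT INTO documents VALUES (?, ?, ?)',
                 (fields.get('title'), fields.get('date'), fields.get('text')))

# ... run the streaming reader, calling insert_document(fields) per Document ...
conn.commit()
conn.close()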

Good luck!

Solution 2

You didn't pickle your data incrementally. You pickled your data monolithically and repeatedly. Each time around the loop, you destroyed whatever output data you had (open(..., 'wb') truncates the output file) and re-wrote all of the data again. Additionally, if your program ever stopped and then restarted with new input data, the old output data was lost.
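
Here's a minimal demonstration of the truncation (throwaway file name, Python 2):

# Every open(..., 'wb') truncates the file, so only the last write survives.
with open('demo.bin', 'wb') as f:
    f.write('first')
with open('demo.bin', 'wb') as f:
    f.write('second')
print open('demo.bin').read()  # prints 'second' -- 'first' is gone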

I do not know why your objects list didn't cause an out-of-memory error while you were pickling, since it grew to the same size as the object that pickle.load() wants to create.

Here is how you could have created the pickle file incrementally:

def save_objects(objects): 
    with open('objects.pkl', 'ab') as output:  # Note: `ab` appends the data
        pickle.dump(objects, output, pickle.HIGHEST_PROTOCOL)

def Main():
    ...
    #objects=[] <-- lose the objects list
    with open('links2.txt', 'rb') as infile:
        for link in infile: 
            ... 
            save_objects(article)

Then you could have incrementally read the pickle file like so:

import pickle
with open('objects.pkl', 'rb') as pickle_file:
    try:
        while True:
            article = pickle.load(pickle_file)
            print article
    except EOFError:
        pass

The choices I can think of are:

  • Try cPickle. It might help (see the snippet after this list).
  • Try streaming-pickle
  • Read your pickle file in a 64-bit environment with lots and lots of RAM
  • Re-crawl the original data, this time actually incrementally storing the data, or storing it in a database. Without the inefficiency of constantly re-writing your pickle output file, your crawling might go significantly faster this time.
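
For the cPickle option, it is a drop-in replacement in Python 2 (same pickle format, implemented in C):

# cPickle is unlikely to shrink the memory the loaded object itself needs,
# but it is noticeably faster and has less interpreter overhead.
try:
    import cPickle as pickle
except ImportError:
    import pickle
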
Author: Vineeth Bhuvanagiri, updated on July 05, 2022

Comments

  • Vineeth Bhuvanagiri, almost 2 years ago:

    The problem that I am having is that I have a very large pickle file (2.6 GB) that I am trying to open, but each time I do so I get a memory error. I realize now that I should have used a database to store all the information, but it's too late now. The pickle file contains dates and text from the U.S. Congressional Record that was crawled from the internet (took about 2 weeks to run).

    Is there any way I can access the information that I dumped into the pickle file incrementally, or a way to convert the pickle file into a SQL database or something else that I can open without having to re-input all the data? I really don't want to have to spend another 2 weeks re-crawling the Congressional Record and inputting the data into a database.

    Thanks a bunch for your help

    EDIT:

    code for how the object gets pickled:

    def save_objects(objects): 
        with open('objects.pkl', 'wb') as output: 
            pickle.dump(objects, output, pickle.HIGHEST_PROTOCOL)
    
    def Main():   
        Links()
        file = open('datafile.txt', 'w')
        objects = []
        with open('links2.txt', 'rb') as infile:
            for link in infile: 
                print(link)
                title, text, date = Get_full_text(link)
                article=Doccument(title, date, text)
                if text != None:
                    write_to_text(date, text)
                    objects.append(article)
                    save_objects(objects)
    

    This is the program with the error:

    def Main():
        file = open('objects1.pkl', 'rb') 
        object = pickle.load(file)
    
  • Vineeth Bhuvanagiri, over 9 years ago:
    Thanks a bunch. I ran the crawler on one computer and then tried to view the pickle file on another machine. It works on the original machine, which has more memory.
  • matth, over 9 years ago:
    Oh, then the answer is obvious: extract the data from the pickle file on the original computer.