How can I speed up unpickling large objects if I have plenty of RAM?


Solution 1

You're probably bound by Python object creation/allocation overhead, not the unpickling itself. If so, there is little you can do to speed this up, except not creating all the objects. Do you need the entire structure at once? If not, you could use lazy population of the data structure (for example: represent parts of the structure by pickled strings, then unpickle them only when they are accessed).
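That lazy-population idea can be sketched roughly like this (`LazyPickle` is a hypothetical helper, not a standard-library class; the sketch uses the plain `pickle` module, whereas the question uses `cPickle`):

```python
import pickle

class LazyPickle(object):
    """Hold a pickled payload; deserialize it only on first access."""

    def __init__(self, payload):
        # Store this part of the structure as a pickled bytes blob.
        self._blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
        self._value = None
        self._loaded = False

    def get(self):
        # Pay the object-creation cost only when this part is needed.
        if not self._loaded:
            self._value = pickle.loads(self._blob)
            self._loaded = True
        return self._value

# Parts of a larger structure, each unpickled on demand.
parts = {name: LazyPickle(list(range(1000))) for name in ("a", "b", "c")}
value = parts["a"].get()  # only part "a" pays the unpickling cost here
```

Accessing one part deserializes only that part; the rest stay as cheap byte strings until touched.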

Solution 2

I had great success in reading a ~750 MB igraph data structure (a binary pickle file) using cPickle itself. This was achieved by simply wrapping up the pickle load call as mentioned here.

An example snippet in your case would be something like:

import cPickle as pickle
import gc

with open("bigNetworkXGraph.pickle", "rb") as f:
    # disable garbage collector while loading
    gc.disable()
    try:
        graph = pickle.load(f)
    finally:
        # enable garbage collector again
        gc.enable()

This definitely isn't the most apt way to do it; however, it reduces the time required drastically.
(For me, it reduced from 843.04s to 41.28s, around 20x)

Solution 3

Why don't you try marshaling your data and storing it in RAM using memcached, for example? Yes, it has some limitations, but as this points out, marshaling is way faster (20 to 30 times) than pickling.
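A minimal sketch of that marshal round trip (plain `marshal.dumps`/`loads`, with pickle shown for comparison; the memcached step is omitted, and the speedup figure is the linked article's claim, not verified here):

```python
import marshal
import pickle

# A large structure made only of core built-in types; marshal does not
# support arbitrary classes, so this is where it applies.
data = {i: [i, i * 2, str(i)] for i in range(100000)}

# marshal's format is CPython-internal and version-specific, so it is
# only safe as a local cache (e.g. a value stored in memcached), never
# as a long-term or cross-version storage format.
blob = marshal.dumps(data)
restored = marshal.loads(blob)

# pickle round trip shown for comparison.
restored_p = pickle.loads(pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL))
```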

Of course, you should also spend some time optimizing your data structure to minimize the amount and complexity of the data you want stored.

Solution 4

This is ridiculous.

I have a huge ~150MB dictionary (collections.Counter actually) that I was reading and writing using cPickle in the binary format.

Writing it took about 3 min.
I stopped reading it in at the 16 min mark, with my RAM completely choked up.

I'm now using marshal, and it takes:

    write: ~3s
    read: ~5s

I poked around a bit, and came across this article.
Guess I've never looked at the pickle source, but it builds an entire VM to reconstruct the dictionary?
There should be a note about performance on very large objects in the documentation IMHO.
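The round trip above might look roughly like this (converting the Counter to a plain dict first is an assumption, since marshal only handles exact built-in types and the answer doesn't show the call):

```python
import marshal
from collections import Counter

counts = Counter({"a": 3, "b": 1})

# marshal only serializes exact built-in types, so a dict subclass like
# Counter is converted to a plain dict before dumping.
blob = marshal.dumps(dict(counts))

# Rebuild the Counter from the plain dict on the way back in.
restored = Counter(marshal.loads(blob))
```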

Solution 5

I'm also trying to speed up the loading/storing of networkx graphs. I'm using the adjacency_graph method to convert the graph to something serialisable, see for instance this code:

import pickle

from networkx.generators import fast_gnp_random_graph
from networkx.readwrite import json_graph

G = fast_gnp_random_graph(4000, 0.7)

with open('/tmp/graph.pickle', 'wb+') as f:
  data = json_graph.adjacency_data(G)
  pickle.dump(data, f)

with open('/tmp/graph.pickle', 'rb') as f:
  d = pickle.load(f)
  H = json_graph.adjacency_graph(d)

However, this adjacency_graph conversion method is quite slow, so time gained in pickling is probably lost on converting.

So this actually doesn't speed things up, bummer. Running this code gives the following timings:

N=1000

    0.666s ~ generating
    0.790s ~ converting
    0.237s ~ storing
    0.295s ~ loading
    1.152s ~ converting

N=2000

    2.761s ~ generating
    3.282s ~ converting
    1.068s ~ storing
    1.105s ~ loading
    4.941s ~ converting

N=3000

    6.377s ~ generating
    7.644s ~ converting
    2.464s ~ storing
    2.393s ~ loading
    12.219s ~ converting

N=4000

    12.458s ~ generating
    19.025s ~ converting
    8.825s ~ storing
    8.921s ~ loading
    27.601s ~ converting

This superlinear growth is probably due to the number of edges growing roughly quadratically with the number of nodes. Here is a test gist, in case you want to try it yourself:

https://gist.github.com/wires/5918834712a64297d7d1



conradlee

Updated on April 19, 2020

Comments

  • conradlee
    conradlee about 4 years

It's taking me up to an hour to read a 1-gigabyte NetworkX graph data structure using cPickle (it's 1 GB when stored on disk as a binary pickle file).

    Note that the file quickly loads into memory. In other words, if I run:

    import cPickle as pickle
    
    f = open("bigNetworkXGraph.pickle","rb")
    binary_data = f.read() # This part doesn't take long
    graph = pickle.loads(binary_data) # This takes ages
    

    How can I speed this last operation up?

Note that I have tried pickling the data using both binary protocols (1 and 2), and it doesn't seem to make much difference which protocol I use. Also note that although I am using the "loads" (meaning "load string") function above, it is loading binary data, not ASCII data.

I have 128 GB of RAM on the system I'm using, so I'm hoping that somebody will tell me how to increase some read buffer buried in the pickle implementation.

    • Enrico Carlesso
      Enrico Carlesso about 14 years
      Do you really mean 128GB of ram? Or 128MB?
  • SilentGhost
    SilentGhost about 14 years
He has 128 GB of RAM; why would he do all the splitting?
  • wump
    wump about 14 years
That probably won't help; the reading part is fast enough, and there is enough memory, so unpickling directly from the stream won't gain much.
  • conradlee
    conradlee about 14 years
    No, I mean 128 GB - it's a big machine. I've also got 24 cores to use, so a parallel solution would be nice, although I guess the GIL will not really make this possible.
  • conradlee
    conradlee about 14 years
That's the first thing I tried. I show the more complicated way of loading a pickle file to illustrate that reading the binary data into RAM does not seem to be the bottleneck.
  • conradlee
    conradlee about 14 years
Alright, I can think of some ways to break up this data to some extent (putting node attributes in different files), but the edges of the graph object alone take up a lot of memory, and if I have to store these in different files and re-populate my graph every time I use it, then what's the point of serialization? I might as well just store my graph data in an edge-list.
  • wump
    wump about 14 years
I indeed don't think serialization is the best solution for your problem. Pickle was never meant to be scalable for huge data structures. This is more the realm of database-like formats that support random access and on-demand loading.
  • Enrico Carlesso
    Enrico Carlesso about 14 years
Wooops! BIG machine :) Sorry for the misunderstanding!
  • alvas
    alvas over 7 years
    why do you need to enable gc again? And would pickle from Python3 with protocol=4 be more efficient?
  • Tejas Shah
    Tejas Shah over 7 years
@alvas isn't it good practice to keep gc enabled by default? Also, I haven't tried protocol=4 from Python 3 yet w.r.t. the above scenario.
  • Matthew
    Matthew about 5 years
    @TejasShah did you by chance compare to joblib? I did this with a joblib load and it was the same time with and without toggling the garbage collection.
  • Cyzanfar
    Cyzanfar about 4 years
    "this" link doesn't exist anymore.
  • ktdrv
    ktdrv about 4 years
    Almost 10 years later? No surprise. I guess you'll just have to take my word for it or benchmark it yourself.