Techniques for working with large Numpy arrays?

Solution 1

I feel your pain... You sometimes end up storing several times the size of your array in values you will later discard. When processing one item in your array at a time, this is irrelevant, but can kill you when vectorizing.

I'll use an example from work for illustration purposes. I recently coded the algorithm described here using numpy. It is a color map algorithm, which takes an RGB image, and converts it into a CMYK image. The process, which is repeated for every pixel, is as follows:

  1. Use the most significant 4 bits of every RGB value, as indices into a three-dimensional look up table. This determines the CMYK values for the 8 vertices of a cube within the LUT.
  2. Use the least significant 4 bits of every RGB value to interpolate within that cube, based on the vertex values from the previous step. The most efficient way of doing this requires computing 16 arrays of uint8s the size of the image being processed. For a 24-bit RGB image, that is equivalent to needing roughly 6x the storage of the image to process it.

A couple of things you can do to handle this:

1. Divide and conquer

Maybe you cannot process a 1,000x1,000 array in a single pass. But if you can do it with a python for loop iterating over 10 arrays of 100x1,000, it is still going to beat a python iterator over 1,000,000 items by a very wide margin! It's going to be slower than a single pass, yes, but not by as much.
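A minimal sketch of the idea (the helper name, chunk size, and the toy operation are mine, not from the question):

```python
import numpy as np

def process_in_chunks(arr, func, chunk_rows=100):
    """Apply func to horizontal slices of arr, so temporaries
    created inside func are only chunk-sized, not image-sized."""
    out = np.empty_like(arr)
    for start in range(0, arr.shape[0], chunk_rows):
        stop = start + chunk_rows
        out[start:stop] = func(arr[start:stop])
    return out

a = np.random.rand(1000, 1000)
# 10 vectorized passes over 100x1000 slices instead of one
# giant pass (or a million-item Python loop)
result = process_in_chunks(a, lambda x: np.sqrt(x) * 2 + 1)
```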

2. Cache expensive computations

This relates directly to my interpolation example above, and is harder to come by, although worth keeping an eye open for. Because I am interpolating on a three-dimensional cube with 4 bits in each dimension, there are only 16x16x16 possible outcomes, which can be stored in 16 arrays of 16x16x16 bytes. So I can precompute them and store them using 64KB of memory, and look up the values one by one for the whole image, rather than redoing the same operations for every pixel at huge memory cost. This already pays off for images as small as 64x64 pixels, and basically allows processing images with 6x the number of pixels without having to subdivide the array.
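The same trick in its simplest form, not the CMYK algorithm itself: precompute every possible outcome of a per-value computation once, then turn the whole-image operation into a cheap fancy-indexing lookup (the `expensive` function here is a made-up stand-in):

```python
import numpy as np

def expensive(v):
    # Stand-in for a costly per-value computation
    return (np.sqrt(v.astype(np.float64)) * 255.0 / np.sqrt(255.0)).astype(np.uint8)

# A uint8 input has only 256 possible values, so precompute them all once...
lut = expensive(np.arange(256, dtype=np.uint8))

# ...and map an entire image with a single indexing operation,
# instead of redoing the computation for every pixel
img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
mapped = lut[img]
```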

3. Use your dtypes wisely

If your intermediate values can fit in a single uint8, don't use an array of int32s! This can turn into a nightmare of mysterious errors due to silent overflows, but if you are careful, it can provide a big saving of resources.
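To make both the saving and the danger concrete (sizes are illustrative):

```python
import numpy as np

a32 = np.zeros((1000, 1000), dtype=np.int32)   # ~4 MB
a8 = np.zeros((1000, 1000), dtype=np.uint8)    # ~1 MB, 4x smaller

# The danger: uint8 arithmetic wraps around silently
x = np.array([200], dtype=np.uint8)
y = x + x   # 400 does not fit in a uint8, so this wraps to 400 % 256
```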

Solution 2

First and most important trick: allocate a few big arrays, and use and recycle portions of them, instead of bringing to life and discarding/garbage collecting lots of temporary arrays. Sounds a little bit old-fashioned, but with careful programming the speed-up can be impressive. (You have better control of alignment and data locality, so numeric code can be made more efficient.)
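A quick sketch of the recycling idea, using the `out=` argument that most numpy ufuncs accept (the loop body is just an illustrative computation):

```python
import numpy as np

n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# Preallocate the work buffers once...
out = np.empty(n)
tmp = np.empty(n)

# ...and reuse them on every iteration via out=, so this loop
# allocates no new arrays at all
for _ in range(10):
    np.multiply(a, b, out=tmp)
    np.add(tmp, a, out=out)
```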

Second: use numpy.memmap and hope that OS caching of accesses to the disk is efficient enough.
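For example (file path and shape are arbitrary here):

```python
import os
import tempfile
import numpy as np

# An array backed by a file on disk; only the pages you actually
# touch need to be resident in RAM
path = os.path.join(tempfile.mkdtemp(), "big.dat")
mm = np.memmap(path, dtype=np.float64, mode="w+", shape=(1000, 1000))
mm[:] = 1.0
mm.flush()

# Reopen read-only and work on a slice without loading the whole file
ro = np.memmap(path, dtype=np.float64, mode="r", shape=(1000, 1000))
total = ro[:100].sum()
```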

Third: as pointed out by @Jaime, work on block sub-matrices, if the whole matrix is too big.

EDIT:

Avoid unnecessary list comprehensions, as pointed out in this answer on SE.
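The point being that a list comprehension materializes a full Python list of boxed objects before numpy ever sees the data; staying with array operations avoids that entirely (toy example):

```python
import numpy as np

a = np.arange(1_000_000)

# Wasteful: builds a million-element Python list first
# slow = np.array([x * 2 for x in a])

# The vectorized equivalent stays in C the whole time
fast = a * 2
```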

Solution 3

The dask.array library provides a numpy interface that uses blocked algorithms to handle larger-than-memory arrays with multiple cores.
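A small taste, assuming dask is installed (shapes and chunk sizes are arbitrary):

```python
import dask.array as da

# A 10000x10000 array split into 1000x1000 blocks; nothing is
# materialized until .compute() is called, and blocks can be
# processed in parallel across cores
x = da.ones((10000, 10000), chunks=(1000, 1000))
y = (x + x.T).sum()
result = y.compute()
```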

You could also look into Spartan, Distarray, and Biggus.

Solution 4

If it is possible for you, use numexpr. For numeric calculations like a**2 + b**2 + 2*a*b (with a and b being arrays) it

  1. will compile machine code that will execute fast and with minimal memory overhead, taking care of memory locality stuff (and thus cache optimization) if the same array occurs several times in your expression,

  2. uses all cores of your dual or quad core CPU,

  3. is an extension to numpy, not an alternative.

For medium and large sized arrays, it is faster than numpy alone.

Take a look at the web page given above, there are examples that will help you understand if numexpr is for you.
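The expression from the answer, written with numexpr (assuming it is installed):

```python
import numpy as np
import numexpr as ne

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Evaluated in a single pass over the data, in parallel, without
# allocating full-size temporaries for a**2, b**2 and 2*a*b
result = ne.evaluate("a**2 + b**2 + 2*a*b")
```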



Updated on January 23, 2022

Comments

  • Noob Saibot over 2 years

    There are times when you have to perform many intermediate operations on one, or more, large Numpy arrays. This can quickly result in MemoryErrors. In my research so far, I have found that Pickling (Pickle, CPickle, Pytables etc.) and gc.collect() are ways to mitigate this. I was wondering if there are any other techniques experienced programmers use when dealing with large quantities of data (other than removing redundancies in your strategy/code, of course).

    Also, if there's one thing I'm sure of, it's that nothing is free. With some of these techniques, what are the trade-offs (i.e., speed, robustness, etc.)?

  • Jaime over 11 years
    @NoobSaibot Could you explain what you have in mind with "variable arrays"?
  • Noob Saibot over 11 years
    Sorry. Variable-sized arrays. In your Cache expensive computations section, you give a genius example for when your values exist in a finite set. How would you apply #2 if your example called for x,y,z possible outcomes of any size?