Difference in performance between numpy and matlab


Solution 1

It would be wrong to say "Matlab is always faster than NumPy" or vice versa. Often their performance is comparable. When using NumPy, to get good performance you have to keep in mind that NumPy's speed comes from calling underlying functions written in C/C++/Fortran. It performs well when you apply those functions to whole arrays. In general, you get poorer performance when you call those NumPy functions on smaller arrays or scalars in a Python loop.

What's wrong with a Python loop, you ask? Every iteration through the Python loop is a call to a next method. Every use of [] indexing is a call to a __getitem__ method. Every += is a call to __iadd__. Every dotted attribute lookup (such as in np.dot) involves function calls. Those function calls add up to a significant hindrance to speed. These hooks give Python expressive power -- indexing means something different for strings than for dicts, for example. Same syntax, different meanings. The magic is accomplished by giving the objects different __getitem__ methods.

But that expressive power comes at a cost in speed. So when you don't need all that dynamic expressivity, to get better performance, try to limit yourself to NumPy function calls on whole arrays.
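
Here is a minimal illustration of that overhead (my own sketch, not part of the original answer): summing an array with a Python loop versus a single NumPy call. The loop pays the per-iteration dispatch cost on every element; the vectorized call pays it once.

import numpy as np
import timeit

a = np.random.random(10 ** 5)

def looped():
    # every iteration goes through next(), __getitem__ and __iadd__ dispatch
    total = 0.0
    for i in range(len(a)):
        total += a[i]
    return total

def vectorized():
    # a single call into NumPy's compiled summation loop
    return a.sum()

print(timeit.timeit(looped, number=10))      # typically orders of magnitude slower
print(timeit.timeit(vectorized, number=10))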

So, remove the for-loop; use "vectorized" equations when possible. For example, instead of

for i in range(m):
    delta3 = -(x[i,:]-a3[i,:])*a3[i,:]* (1 - a3[i,:])    

you can compute delta3 for each i all at once:

delta3 = -(x-a3)*a3*(1-a3)

Whereas in the for-loop delta3 is a vector, with the vectorized equation delta3 is a matrix (one row per i).


Some of the computations in the for-loop do not depend on i and therefore should be lifted outside the loop. For example, sum2 looks like a constant:

sum2 = sparse.beta*(-float(sparse.rho)/rhoest + float(1.0 - sparse.rho) / (1.0 - rhoest) )

Here is a runnable example with an alternative implementation (alt) of your code (orig).

My timeit benchmark shows a 6.8x improvement in speed:

In [52]: %timeit orig()
1 loops, best of 3: 495 ms per loop

In [53]: %timeit alt()
10 loops, best of 3: 72.6 ms per loop

import numpy as np


class Bunch(object):
    """ http://code.activestate.com/recipes/52308 """
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

m, n, p = 10 ** 4, 64, 25

sparse = Bunch(
    theta1=np.random.random((p, n)),
    theta2=np.random.random((n, p)),
    b1=np.random.random((p, 1)),
    b2=np.random.random((n, 1)),
)

x = np.random.random((m, n))
a3 = np.random.random((m, n))
a2 = np.random.random((m, p))
a1 = np.random.random((m, n))
sum2 = np.random.random((p, ))
sum2 = sum2[:, np.newaxis]

def orig():
    partial_j1 = np.zeros(sparse.theta1.shape)
    partial_j2 = np.zeros(sparse.theta2.shape)
    partial_b1 = np.zeros(sparse.b1.shape)
    partial_b2 = np.zeros(sparse.b2.shape)
    delta3t = (-(x - a3) * a3 * (1 - a3)).T
    for i in range(m):
        delta3 = delta3t[:, i:(i + 1)]
        sum1 = np.dot(sparse.theta2.T, delta3)
        delta2 = (sum1 + sum2) * a2[i:(i + 1), :].T * (1 - a2[i:(i + 1), :].T)
        partial_j1 += np.dot(delta2, a1[i:(i + 1), :])
        partial_j2 += np.dot(delta3, a2[i:(i + 1), :])
        partial_b1 += delta2
        partial_b2 += delta3
        # delta3: (64, 1)
        # sum1: (25, 1)
        # delta2: (25, 1)
        # a1[i:(i+1),:]: (1, 64)
        # partial_j1: (25, 64)
        # partial_j2: (64, 25)
        # partial_b1: (25, 1)
        # partial_b2: (64, 1)
        # a2[i:(i+1),:]: (1, 25)
    return partial_j1, partial_j2, partial_b1, partial_b2


def alt():
    delta3 = (-(x - a3) * a3 * (1 - a3)).T
    sum1 = np.dot(sparse.theta2.T, delta3)
    delta2 = (sum1 + sum2) * a2.T * (1 - a2.T)
    # delta3: (64, 10000)
    # sum1: (25, 10000)
    # delta2: (25, 10000)
    # a1: (10000, 64)
    # a2: (10000, 25)
    partial_j1 = np.dot(delta2, a1)
    partial_j2 = np.dot(delta3, a2)
    partial_b1 = delta2.sum(axis=1)
    partial_b2 = delta3.sum(axis=1)
    return partial_j1, partial_j2, partial_b1, partial_b2

answer = orig()
result = alt()
for a, r in zip(answer, result):
    try:
        assert np.allclose(np.squeeze(a), r)
    except AssertionError:
        print(a.shape)
        print(r.shape)
        raise

Tip: Notice that I left the shape of each intermediate array in the comments. Knowing the shapes of the arrays helped me understand what your code was doing. The shapes of the arrays can help guide you toward the right NumPy functions to use. Or at least, paying attention to the shapes can help you know if an operation is sensible. For example, when you compute

np.dot(A, B)

and A.shape = (n, m) and B.shape = (m, p), then np.dot(A, B) will be an array of shape (n, p).
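
For instance, a quick check of that shape rule (a trivial sketch of my own):

import numpy as np

A = np.ones((3, 4))   # (n, m) with n=3, m=4
B = np.ones((4, 2))   # (m, p) with m=4, p=2
C = np.dot(A, B)
print(C.shape)        # (3, 2), i.e. (n, p)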


It can help to build the arrays in C_CONTIGUOUS order (at least, when using np.dot). There might be as much as a 3x speed-up by doing so:

Below, x is the same as xf except that x is C_CONTIGUOUS and xf is F_CONTIGUOUS -- and the same relationship for y and yf.

import numpy as np

m, n, p = 10 ** 4, 64, 25
x = np.random.random((n, m))
xf = np.asarray(x, order='F')

y = np.random.random((m, n))
yf = np.asarray(y, order='F')

assert np.allclose(x, xf)
assert np.allclose(y, yf)
assert np.allclose(np.dot(x, y), np.dot(xf, y))
assert np.allclose(np.dot(x, y), np.dot(xf, yf))

%timeit benchmarks show the difference in speed:

In [50]: %timeit np.dot(x, y)
100 loops, best of 3: 12.9 ms per loop

In [51]: %timeit np.dot(xf, y)
10 loops, best of 3: 27.7 ms per loop

In [56]: %timeit np.dot(x, yf)
10 loops, best of 3: 21.8 ms per loop

In [53]: %timeit np.dot(xf, yf)
10 loops, best of 3: 33.3 ms per loop
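
As an aside (my addition, not part of the original answer): you can check how an array is laid out via its flags attribute, and convert between layouts with np.ascontiguousarray / np.asfortranarray:

import numpy as np

x = np.random.random((64, 10000))   # C_CONTIGUOUS by default
xf = np.asfortranarray(x)           # same values, Fortran order
print(x.flags['C_CONTIGUOUS'])      # True
print(xf.flags['F_CONTIGUOUS'])     # True
xc = np.ascontiguousarray(xf)       # copy back to C order when needed
assert np.allclose(x, xc)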

Regarding benchmarking in Python:

It can be misleading to use the difference between pairs of time.time() calls to benchmark the speed of code in Python. You need to repeat the measurement many times. It's better to disable the automatic garbage collector. It is also important to measure large spans of time (such as at least 10 seconds' worth of repetitions) to avoid errors due to poor resolution of the clock timer and to reduce the significance of time.time() call overhead. Rather than writing all that code yourself, use the timeit module that Python provides. I'm essentially using it to time the pieces of code here, except that I'm calling it through an IPython terminal for convenience.

I'm not sure if this is affecting your benchmarks, but be aware it could make a difference. In the question I linked to, according to time.time two pieces of code differed by a factor of 1.7x while benchmarks using timeit showed the pieces of code ran in essentially identical amounts of time.
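
For reference, here is a minimal sketch of using the timeit module directly (my example; the array and statement are arbitrary):

import timeit

setup = "import numpy as np; x = np.random.random((1000, 1000))"
# repeat() returns the total time of each of 5 runs of 100 calls;
# the minimum is the value least distorted by system noise. timeit
# also disables the garbage collector while it measures.
times = timeit.repeat("x.sum()", setup=setup, repeat=5, number=100)
print(min(times) / 100)   # best per-call time, in seconds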

Solution 2

I would start with in-place operations to avoid allocating new arrays every time:

partial_j1 += np.dot(delta2, a1[i,:].reshape(1,a1.shape[1]))
partial_j2 += np.dot(delta3, a2[i,:].reshape(1,a2.shape[1]))
partial_b1 += delta2
partial_b2 += delta3

You can replace this expression:

a1[i,:].reshape(1,a1.shape[1])

with a simpler and faster one (thanks to Bi Rico):

a1[i:i+1]
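
The slice form is faster because it returns a (1, n) view of the same memory without going through an explicit reshape call. A quick check (my own sketch):

import numpy as np

a1 = np.random.random((10000, 64))
row = a1[3:4]            # shape (1, 64): a view, no copy
print(row.shape)
print(row.base is a1)    # True: it shares a1's memory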

Also, this line:

sum2 = sparse.beta*(-float(sparse.rho)/rhoest + float(1.0 - sparse.rho) / (1.0 - rhoest))

seems to be the same at each iteration, so you don't need to recompute it inside the loop.

And, as a probably minor optimization, you can replace all occurrences of x[i,:] with x[i].

Finally, if you can afford to allocate m times more memory, you can follow unutbu's suggestion and vectorize the loop, replacing:

for i in range(m):
    delta3 = -(x[i]-a3[i])*a3[i]* (1 - a3[i])

with:

delta3 = -(x-a3)*a3*(1-a3)

And you can always use Numba to gain significant speed without vectorizing (and without using more memory), as sketched below.
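
Here is a minimal sketch of the Numba approach (assuming Numba is installed; the function name delta3_loop is mine, not from the original code). The per-row loop stays, but @njit compiles it to machine code, so the Python-level dispatch overhead disappears:

import numpy as np
from numba import njit  # assumes Numba is available

@njit
def delta3_loop(x, a3):
    # same element-wise computation as the original loop body,
    # compiled by Numba instead of interpreted by Python
    m, n = x.shape
    delta3 = np.empty((m, n))
    for i in range(m):
        for j in range(n):
            delta3[i, j] = -(x[i, j] - a3[i, j]) * a3[i, j] * (1.0 - a3[i, j])
    return delta3

x = np.random.random((10000, 64))
a3 = np.random.random((10000, 64))
assert np.allclose(delta3_loop(x, a3), -(x - a3) * a3 * (1 - a3))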

Solution 3

Differences in performance between numpy and matlab have always frustrated me. They often boil down, in the end, to the underlying LAPACK libraries. As far as I know, matlab uses the full ATLAS LAPACK by default, while numpy uses a LAPACK light. Matlab reckons people don't care about space and bulk, while numpy reckons people do. Similar question with a good answer.



Comments

  • pabaldonedo almost 2 years

    I am computing the backpropagation algorithm for a sparse autoencoder. I have implemented it in python using numpy and in matlab. The code is almost the same, but the performance is very different. The time matlab takes to complete the task is 0.252454 seconds while numpy takes 0.973672151566, that is almost four times more. I will call this code several times later in a minimization problem, so this difference leads to several minutes of delay between the implementations. Is this normal behaviour? How could I improve the performance in numpy?

    Numpy implementation:

    sparse.rho is a tuning parameter, sparse.nodes is the number of nodes in the hidden layer (25), sparse.input (64) is the number of nodes in the input layer, theta1 and theta2 are the weight matrices for the first and second layer respectively with dimensions 25x64 and 64x25, m is equal to 10000, rhoest has a dimension of (25,), x has a dimension of 10000x64, a3 10000x64 and a2 10000x25.

    UPDATE: I have introduced changes in the code following some of the ideas of the responses. The performance is now numpy: 0.65 vs matlab: 0.25.

    partial_j1 = np.zeros(sparse.theta1.shape)
    partial_j2 = np.zeros(sparse.theta2.shape)
    partial_b1 = np.zeros(sparse.b1.shape)
    partial_b2 = np.zeros(sparse.b2.shape)
    t = time.time()
    
    delta3t = (-(x-a3)*a3*(1-a3)).T
    
    for i in range(m):
    
        delta3 = delta3t[:,i:(i+1)]
        sum1 =  np.dot(sparse.theta2.T,delta3)
        delta2 = ( sum1 + sum2 ) * a2[i:(i+1),:].T* (1 - a2[i:(i+1),:].T)
        partial_j1 += np.dot(delta2, a1[i:(i+1),:])
        partial_j2 += np.dot(delta3, a2[i:(i+1),:])
        partial_b1 += delta2
        partial_b2 += delta3
    
    print "Backprop time:", time.time() -t
    

    Matlab implementation:

    tic
    for i = 1:m
    
        delta3 = -(data(i,:)-a3(i,:)).*a3(i,:).*(1 - a3(i,:));
        delta3 = delta3.';
        sum1 =  W2.'*delta3;
        sum2 = beta*(-sparsityParam./rhoest + (1 - sparsityParam) ./ (1.0 - rhoest) );
        delta2 = ( sum1 + sum2 ) .* a2(i,:).' .* (1 - a2(i,:).');
        W1grad = W1grad + delta2* a1(i,:);
        W2grad = W2grad + delta3* a2(i,:);
        b1grad = b1grad + delta2;
        b2grad = b2grad + delta3;
    end
    toc
    
    • Koustav Ghosal over 10 years
      There is a module called mlabwrap. You can use matlab as a python library by importing it. The syntax is very simple. You will find the source and detailed documentation here: mlabwrap.sourceforge.net
    • Bakuriu over 10 years
      Take a look at cython. The difference in time is expected, since MATLAB has a JIT and CPython doesn't. If all the code were a single numpy call then the times would be similar, but what you see could be interpreting overhead. Writing an extension with cython is really easy and you might achieve big gains by adding some types to variables in the right places.
    • hpaulj over 10 years
      What's the shape of data? Specifically how does m compare with the other dimension?
    • pabaldonedo over 10 years
      m = 10000, x is a 10000x64 matrix, theta1 is a 25x64 matrix and theta2 64x25.
    • hpaulj over 10 years
      If you can't work with x as a whole matrix, it is better to loop on the small dimension than the large one. But that may require some ingenuity.
  • pabaldonedo over 10 years
    I have checked and the inplace operations make almost no difference.
  • Bi Rico over 10 years
    a1[i,:].reshape(1,a1.shape[1]) can be written as a1[i:i+1]
  • user2304916 over 10 years
    In this case I can hardly believe LAPACK is to blame, since they only use the dot product. More probably MATLAB does some JIT to speed up the loop.
  • user2304916 over 10 years
    Bi Rico, I don't think so.
  • hpaulj over 10 years
    My experience is that numpy runs about the same speed (or at worst half) as an older Matlab or Octave. But new Matlab versions appear to be vectorizing or compiling (jit) more aggressively. For someone experienced in 'old' Matlab, for i = 1:m and a3(i,:) are slow-code flags.
  • pabaldonedo over 10 years
    Precomputing delta3 before the for-loop and taking sum2 outside helps (I have updated the question), but it is still over twice as slow as matlab. What also impresses me is that the time it takes matlab to compute delta3 inside the for-loop is almost the same as it takes numpy to access a row of the precomputed delta3 matrix as I have now. Is numpy always this slow compared to matlab?
  • pabaldonedo over 10 years
    Thanks for your thorough response, but the operation sum1+sum2 crashes on my computer: sum1 has dimensions (25, 10000) while sum2 is (25,).
  • pabaldonedo over 10 years
    I have changed the summation by adding a previous line as follows: sum2 = np.dot(sum2.reshape(-1,1),np.ones((1,sum1.shape[1]))). Now it works. Is there any better way to do it? Thanks very much for your response.
  • unutbu over 10 years
    You could use sum2 = sum2[:, np.newaxis] to convert sum2 from an array of shape (25,) to an array of shape (25,1). NumPy broadcasting will take care of "upgrading" it to shape (25, 10000) without consuming unnecessary memory repeating the same values 10000 times. sum2[:, np.newaxis] is about 4300x faster than np.dot(sum2.reshape(-1,1),np.ones((1,sum1.shape[1]))) on my computer. Of course, we're only doing this once so the total speed gain is negligible. Still, it is a good trick to know.
  • Amro over 10 years
    fwiw, MATLAB stopped using ATLAS in favor of Intel MKL a while ago (starting with v7 I think, that's more than 10 years ago). You can have NumPy compiled against MKL as well. Christoph Gohlke provides NumPy-MKL windows binaries: lfd.uci.edu/~gohlke/pythonlibs/#numpy
  • pabaldonedo over 10 years
    Thanks so much for that trick. I had searched for it before, because I had run into the same problem several times, but I had not found it. Now NumPy runs the code in 0.125s (a great improvement). To make a fair comparison, I have recoded the Matlab script to avoid the for-loop, so now it looks like the new NumPy code. Matlab is still faster, running in 0.02s. And in Matlab, for sum2 I am using a dot product because I do not know any trick like the one you told me for NumPy.
  • pabaldonedo over 10 years
    I have slightly improved the NumPy results to 0.10s by changing the line sum1 = np.dot(sparse.theta2.T, delta3) to sum1 = scipy.linalg.blas.dgemm(alpha=1.0, a=sparse.theta2.T, b=delta3, trans_b=False), because both matrices are Fortran-contiguous. Could I optimize it even more by redesigning it so that all the operations (or at least most of them) were between Fortran-contiguous matrices or between C-contiguous matrices, and then applying the most convenient dot product operator, or would the improvement be negligible?
  • unutbu over 10 years
    Yes, using C_CONTIGUOUS arrays would be much faster if using np.dot (and probably for many other NumPy functions as well). I've added some code above showing that np.dot(x, y) is about 3x faster using C_CONTIGUOUS arrays than np.dot(xf, yf) using F_CONTIGUOUS arrays. So if you can build your arrays as C_CONTIGUOUS from the start, you might see a performance gain.
  • hpaulj over 10 years
    sum2 = np.random.random((p,1)) produces the (25,1) array right away.
  • unutbu over 10 years
    @hpaulj: That's true, but pabaldonedo is starting with an array of shape (25, ). He needs a way to reshape it to (25, 1). np.random.random((p, )) is just an array I made to serve as a substitute for his real array.
  • Philliproso over 10 years
    Yes, I agree this is more likely to be a feature of the JIT. Could this speed be improved with the introduction of numpypy? Matlab's JIT is pretty amazing at finding syntactically similar matlab routines and calling pre-compiled bits of C code. If you code in matlab like you would code in C, it's as fast as actually coding in C, because it's already running pre-compiled C.
  • Thomas Browne over 9 years
    Using the Intel MKL version of Numpy (Anaconda Pro) I get a 21x improvement. So MKL is more than 3x faster on this code than "standard" numpy.
  • Cris Luengo almost 5 years
    "It would be wrong to say 'Matlab is always faster than NumPy' or vice versa." Except that MATLAB is much, much faster at interpreting code than Python. MATLAB compiles code on the fly (JIT compiler); Python doesn't, so it reinterprets every line of code every time it is executed. Python is slower than MATLAB.