Cython inline function with numpy array as parameter
Solution 1
The problem is that assigning a numpy array (or, equivalently, passing it in as a function argument) is not just a simple assignment, but a "buffer extraction" which populates a struct and pulls out the stride and pointer information into local variables needed for fast indexing. If you're iterating over a moderate number of elements, this O(1) overhead is easily amortized over the loop, but that is certainly not the case for small functions.
Improving this is high on many people's wishlist, but it's a non-trivial change. See, e.g., the discussion at http://groups.google.com/group/cython-users/browse_thread/thread/8fc8686315d7f3fe
Solution 2
More than 3 years have passed since the question was posted and there have been great progress in the meantime. On this code (Update 2 of the question):
# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np
cdef inline inc(np.ndarray[np.int32_t, ndim=2] arr, int i, int j):
arr[i, j]+= 1
def test1(np.ndarray[np.int32_t, ndim=2] arr):
cdef int i,j
for i in xrange(arr.shape[0]):
for j in xrange(arr.shape[1]):
inc(arr, i, j)
def test2(np.ndarray[np.int32_t, ndim=2] arr):
cdef int i,j
for i in xrange(arr.shape[0]):
for j in xrange(arr.shape[1]):
arr[i,j] += 1
I get the following timings:
arr = np.zeros((1000,1000), dtype=np.int32)
%timeit test1(arr)
%timeit test2(arr)
1 loops, best of 3: 354 ms per loop
1000 loops, best of 3: 1.02 ms per loop
So the problem is reproducible even after more than 3 years. Cython now has typed memoryviews, AFAIK it was introduced in Cython 0.16, so not available at the time the question was posted. With this:
# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np
cdef inline inc(int[:, ::1] tmv, int i, int j):
tmv[i, j]+= 1
def test3(np.ndarray[np.int32_t, ndim=2] arr):
cdef int i,j
cdef int[:, ::1] tmv = arr
for i in xrange(tmv.shape[0]):
for j in xrange(tmv.shape[1]):
inc(tmv, i, j)
def test4(np.ndarray[np.int32_t, ndim=2] arr):
cdef int i,j
cdef int[:, ::1] tmv = arr
for i in xrange(tmv.shape[0]):
for j in xrange(tmv.shape[1]):
tmv[i,j] += 1
With this I get:
arr = np.zeros((1000,1000), dtype=np.int32)
%timeit test3(arr)
%timeit test4(arr)
1000 loops, best of 3: 977 µs per loop
1000 loops, best of 3: 838 µs per loop
We are almost there and already faster than the old-fashioned way! Now, the inc()
function is eligible to be declared nogil
, so let's declare it so! But oops:
Error compiling Cython file:
[...]
cdef inline inc(int[:, ::1] tmv, int i, int j) nogil:
^
[...]
Function with Python return type cannot be declared nogil
Aaah, I totally missed that the void
return type was missing! Once again but now with void
:
cdef inline void inc(int[:, ::1] tmv, int i, int j) nogil:
tmv[i, j]+= 1
And finally I get:
%timeit test3(arr)
%timeit test4(arr)
1000 loops, best of 3: 843 µs per loop
1000 loops, best of 3: 853 µs per loop
As fast as manual inlining!
Now, just for fun, I tried Numba on this code:
import numpy as np
from numba import autojit, jit
@autojit
def inc(arr, i, j):
arr[i, j] += 1
@autojit
def test5(arr):
for i in xrange(arr.shape[0]):
for j in xrange(arr.shape[1]):
inc(arr, i, j)
I get:
arr = np.zeros((1000,1000), dtype=np.int32)
%timeit test5(arr)
100 loops, best of 3: 4.03 ms per loop
Even though it's 4.7x slower than Cython, most likely because the JIT compiler failed to inline inc()
, I think it is AWESOME! All I needed to do is to add @autojit
and didn't have to mess up the code with clumsy type declarations; 88x speedup for next to nothing!
I have tried other things with Numba, such as
@jit('void(i4[:],i4,i4)')
def inc(arr, i, j):
arr[i, j] += 1
or nopython=True
but failed to improve it any further.
Improving inlining is on the Numba developers' list, we only need to file more requests to make it have higher priority. ;)
Solution 3
You are passing the array to inc()
as a Python object of type numpy.ndarray
. Passing Python objects is expensive due to issues like reference counting, and it seems to prevent inlining. If you pass the array the C way, i.e. as a pointer, test1()
becomes even faster than test2()
on my machine:
cimport numpy as np
cdef inline inc(int* arr, int i):
arr[i] += 1
def test1(np.ndarray[np.int32_t] arr):
cdef int i
for i in xrange(len(arr)):
inc(<int*>arr.data, i)
Maxim
Updated on June 05, 2022Comments
-
Maxim about 2 years
Consider code like this:
import numpy as np cimport numpy as np cdef inline inc(np.ndarray[np.int32_t] arr, int i): arr[i]+= 1 def test1(np.ndarray[np.int32_t] arr): cdef int i for i in xrange(len(arr)): inc(arr, i) def test2(np.ndarray[np.int32_t] arr): cdef int i for i in xrange(len(arr)): arr[i] += 1
I used ipython to measure speed of test1 and test2:
In [7]: timeit ttt.test1(arr) 100 loops, best of 3: 6.13 ms per loop In [8]: timeit ttt.test2(arr) 100000 loops, best of 3: 9.79 us per loop
Is there a way to optimize test1? Why doesn't cython inline this function as told?
UPDATE: Actually what I need is multidimension code like this:
# cython: infer_types=True # cython: boundscheck=False # cython: wraparound=False import numpy as np cimport numpy as np cdef inline inc(np.ndarray[np.int32_t, ndim=2] arr, int i, int j): arr[i, j] += 1 def test1(np.ndarray[np.int32_t, ndim=2] arr): cdef int i,j for i in xrange(arr.shape[0]): for j in xrange(arr.shape[1]): inc(arr, i, j) def test2(np.ndarray[np.int32_t, ndim=2] arr): cdef int i,j for i in xrange(arr.shape[0]): for j in xrange(arr.shape[1]): arr[i,j] += 1
Timing for it:
In [7]: timeit ttt.test1(arr) 1 loops, best of 3: 647 ms per loop In [8]: timeit ttt.test2(arr) 100 loops, best of 3: 2.07 ms per loop
Explicit inlining gives 300x speedup. And my real function is quite big so inlining it makes code maintainability much worse
UPDATE2:
# cython: infer_types=True # cython: boundscheck=False # cython: wraparound=False import numpy as np cimport numpy as np cdef inline inc(np.ndarray[np.float32_t, ndim=2] arr, int i, int j): arr[i, j]+= 1 def test1(np.ndarray[np.float32_t, ndim=2] arr): cdef int i,j for i in xrange(arr.shape[0]): for j in xrange(arr.shape[1]): inc(arr, i, j) def test2(np.ndarray[np.float32_t, ndim=2] arr): cdef int i,j for i in xrange(arr.shape[0]): for j in xrange(arr.shape[1]): arr[i,j] += 1 cdef class FastPassingFloat2DArray(object): cdef float* data cdef int stride0, stride1 def __init__(self, np.ndarray[np.float32_t, ndim=2] arr): self.data = <float*>arr.data self.stride0 = arr.strides[0]/arr.dtype.itemsize self.stride1 = arr.strides[1]/arr.dtype.itemsize def __getitem__(self, tuple tp): cdef int i, j cdef float *pr, r i, j = tp pr = (self.data + self.stride0*i + self.stride1*j) r = pr[0] return r def __setitem__(self, tuple tp, float value): cdef int i, j cdef float *pr, r i, j = tp pr = (self.data + self.stride0*i + self.stride1*j) pr[0] = value cdef inline inc2(FastPassingFloat2DArray arr, int i, int j): arr[i, j]+= 1 def test3(np.ndarray[np.float32_t, ndim=2] arr): cdef int i,j cdef FastPassingFloat2DArray tmparr = FastPassingFloat2DArray(arr) for i in xrange(arr.shape[0]): for j in xrange(arr.shape[1]): inc2(tmparr, i,j)
Timings:
In [4]: timeit ttt.test1(arr) 1 loops, best of 3: 623 ms per loop In [5]: timeit ttt.test2(arr) 100 loops, best of 3: 2.29 ms per loop In [6]: timeit ttt.test3(arr) 1 loops, best of 3: 201 ms per loop
-
Maxim over 13 yearsOk, and what about 2d and 3d arrays?
-
Sven Marnach over 13 years@Maxim: Your own code only works for one-dimensional arrays, so I provided a faster version for this case only. (Note that
ndim=1
is implicit if you don't provide an explicitndim
parameter tondarray
.) When I addndim=2
to your code and timetest1()
andtest2()
with a 50x50 array, there is hardly any performance difference between them on by machine. -
Maxim over 13 yearsPlease see update in the question. Here I get huge performance difference on ndim=2 also (which is expected because if inc is not inlined it acquires and releases numpy buffer on every call). And passing just pointer is not enough in nD case, because you will also need to know sizes in each dimension, and passing them all makes function look bad, and makes each array access complicated....
-
Maxim over 13 yearsYeah, that was my question. I will just accept your answer for now.