efficient loop over numpy array

Solution 1

Since the answers have stopped coming and none was entirely satisfactory, I am posting my own solution for the record.

It is my understanding that it is the per-element assignment, not the nested loops as I initially thought, that makes Python slow in this case. Using a library or compiled code eliminates that per-element overhead and performance improves dramatically.

from __future__ import print_function
import numpy as np
from numba import jit

N = 10000
vect = np.arange(N, dtype=np.float32)

vect[N // 2] = 1
vect[N // 4] = 1
dupl = np.zeros(N, dtype=np.int32)

print("init done")
# uncomment to enable compiled function
#@jit
def duplicates(i, counter, dupl, vect):
    eps = 0.01
    ns = len(vect)
    for j in range(i+1, ns):
        # replace if to use approx comparison
        #if abs(vect[i] - vect[j]) < eps:
        if vect[i] == vect[j]:
            dupl[counter] = j
            counter += 1
    return counter

counter = 0
for i in range(N):
    counter = duplicates(i, counter, dupl, vect)

print("counter =", counter)
print(dupl[0:counter])

Tests

# no jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]

elapsed 10.135 s

# with jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]

elapsed 0.480 s

The performance of the compiled version (with @jit uncommented) is close to C performance, ~0.1 - 0.2 s. Perhaps eliminating the last loop could improve the performance even further. The gap is even larger when the approximate (eps) comparison is enabled: the interpreted version slows down dramatically, while the compiled version barely changes.

# no jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]

elapsed 109.218 s

# with jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]

elapsed 0.506 s

This is a ~200x difference. In the real code I had to put both loops inside the function, as well as use a function template to handle variable types, so it was a bit more complex, but not by much.
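
For reference, here is a minimal sketch of that variant (my reconstruction under the same setup as above, not the exact production code): both loops live inside the jitted function, so only a single Python-level call remains.

from numba import jit

@jit(nopython=True)
def duplicates_all(vect, dupl):
    # both loops run in compiled code, so no per-iteration Python overhead remains
    eps = 0.01
    counter = 0
    n = len(vect)
    for i in range(n):
        for j in range(i + 1, n):
            if abs(vect[i] - vect[j]) < eps:
                dupl[counter] = j
                counter += 1
    return counter

counter = duplicates_all(vect, dupl)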

Solution 2

This solution, using the numpy_indexed package, has O(n log n) complexity and is fully vectorized, so it is likely not far from C performance.

import numpy as np
import numpy_indexed as npi

# indices of all elements that occur more than once
dpl = np.flatnonzero(npi.multiplicity(vect) > 1)
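
Roughly the same result can be obtained without the numpy_indexed dependency using np.unique (a sketch, assuming vect is defined as in the question):

import numpy as np

# count each unique value, then map the counts back onto the original
# positions via the inverse index to get a per-element multiplicity
_, inverse, counts = np.unique(vect, return_inverse=True, return_counts=True)
dpl = np.flatnonzero(counts[inverse] > 1)
print(dpl)  # [   1 2500 5000]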

Solution 3

Python itself is a highly dynamic, and therefore slow, language. The idea in numpy is to use vectorization and avoid explicit loops. In this case, you can use np.equal.outer. You can start with

a = np.equal.outer(vect, vect)

Now, for example, to find the sum:

>>> np.sum(a)
10006

To find the indices of the elements that are equal to some other element, you can do

>>> np.fill_diagonal(a, 0)
>>> np.nonzero(np.any(a, axis=0))[0]
array([   1, 2500, 5000])

Timing

def find_vec():
    a = np.equal.outer(vect, vect)
    s = np.sum(a)
    np.fill_diagonal(a, 0)
    return np.sum(a), np.nonzero(np.any(a, axis=0))[0]

>>> %timeit find_vec()
1 loops, best of 3: 214 ms per loop

def find_loop():
    dupl = []
    counter = 0
    for i in range(N):
        for j in range(i+1, N):
            if vect[i] == vect[j]:
                dupl.append(j)
                counter += 1
    return dupl

>>> %timeit find_loop()
1 loops, best of 3: 8.51 s per loop
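
The same outer-comparison idea carries over to the approximate (eps) comparison mentioned in the question. A sketch, with eps and the expected indices taken from the question's setup (note the full N x N intermediate array is memory-hungry):

eps = 0.01
a = np.abs(vect[:, None] - vect) < eps  # broadcasted pairwise differences
np.fill_diagonal(a, False)

>>> np.nonzero(np.any(a, axis=0))[0]
array([   1, 2500, 5000])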

Solution 4

The obvious question is why you want to do this in this way. NumPy arrays are intended to be opaque data structures: by this I mean that NumPy arrays are intended to be created inside the NumPy system, with operations sent in to the NumPy subsystem to deliver a result. In other words, NumPy should be a black box into which you throw requests and out of which come results.

So given the code above, I am not at all surprised that NumPy performance is worse than dreadful.

The following should be effectively what you want, I believe, but done the NumPy way:

import numpy as np

N = 10000
vect = np.arange(float(N))
vect[N // 2] = 1
vect[N // 4] = 1

print([np.where(a == vect)[0] for a in vect][1])

# Delivers [1, 2500, 5000]
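
As the comments below point out, the question only wants the repeats (2500 and 5000), not the first occurrence. A hypothetical follow-up along the same lines (a sketch, not part of the original answer):

# for every unique value, keep all matching indices except the first occurrence
repeats = sorted({int(j) for a in np.unique(vect) for j in np.where(a == vect)[0][1:]})
print(repeats)  # [2500, 5000]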

Solution 5

Approach #1

You can simulate that iterator-dependency criterion for a vectorized solution using a triangular matrix. This is based on this post, which dealt with multiplication involving iterator dependency. To perform the elementwise equality of each element in vect against all the others, we can use NumPy broadcasting. Finally, we can use np.count_nonzero to get the count, as it is very efficient at counting True values in boolean arrays.

So, we would have a solution like so -

mask = np.triu(vect[:,None] == vect,1)
counter = np.count_nonzero(mask)
dupl = np.where(mask)[1]
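
For example, checked against the setup from the question (a quick sketch reproducing its reference output):

import numpy as np

N = 10000
vect = np.arange(N, dtype=np.float32)
vect[N // 2] = 1
vect[N // 4] = 1

mask = np.triu(vect[:, None] == vect, 1)  # upper triangle only, diagonal excluded
print(np.count_nonzero(mask))             # 3
print(np.where(mask)[1])                  # [2500 5000 5000]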

If you only care about the count counter, we could have two more approaches as listed next.

Approach #2

We can avoid the triangular matrix entirely: get the full count, subtract the contribution from the diagonal elements, and then halve the remaining count, since the lower and upper triangular regions contribute identically.

So, we would have a modified solution like so -

counter = (np.count_nonzero(vect[:,None] == vect) - vect.size)//2

Approach #3

Here's an entirely different approach that uses the fact that the count of each unique element contributes a cumulatively summed (cumsum) term to the final total.

So, with that idea in mind, we would have a third approach like so -

count = np.bincount(vect.astype(int))  # bincount needs non-negative ints; np.unique(vect, return_counts=True)[1] handles floats directly
idx = count[count>1]
id_arr = np.ones(idx.sum(),dtype=int)
id_arr[0] = 0
id_arr[idx[:-1].cumsum()] = -idx[:-1]+1
counter = np.sum(id_arr.cumsum())
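
A small worked example of the cumsum trick (illustration only, not from the original answer): one value repeated 3 times and another repeated 2 times gives 3*2/2 + 2*1/2 = 4 duplicate pairs.

import numpy as np

idx = np.array([3, 2])                      # counts of the values that repeat
id_arr = np.ones(idx.sum(), dtype=int)      # [1 1 1 1 1]
id_arr[0] = 0                               # [0 1 1 1 1]
id_arr[idx[:-1].cumsum()] = -idx[:-1] + 1   # [0 1 1 -2 1]
print(id_arr.cumsum())                      # [0 1 2 0 1] -> 0..c-1 within each group
print(id_arr.cumsum().sum())                # 4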

Comments

  • ink
    ink almost 2 years

    Versions of this question have already been asked but I have not found a satisfactory answer.

    Problem: given a large numpy vector, find indices of the vector elements which are duplicated (a variation of that could be comparison with tolerance).

    So the problem is ~O(N^2) and memory bound (at least from the current algorithm's point of view). I wonder why, whatever I tried, Python is 100x or more slower than equivalent C code.

    import numpy as np
    N = 10000
    vect = np.arange(float(N))
    vect[N // 2] = 1
    vect[N // 4] = 1
    dupl = []
    print("init done")
    counter = 0
    for i in range(N):
        for j in range(i+1, N):
            if vect[i] == vect[j]:
                dupl.append(j)
                counter += 1
    
    print("counter =", counter)
    print(dupl)
    # For simplicity, this code ignores repeated indices 
    # which can be trimmed later. Ref output is
    # counter = 3
    # [2500, 5000, 5000]
    

    I tried using numpy iterators but they are even worse (~ x4-5) http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html

    Using N=10,000 I'm getting 0.1 sec in C, 12 sec in Python (code above), 40 sec in Python using np.nditer, 50 sec in Python using np.ndindex. I pushed it to N=160,000 and the timing scales as N^2 as expected.

    • Tavian Barnes
      Tavian Barnes over 7 years
      Because Python is slow?
    • kjaquier
      kjaquier over 7 years
      numpy arrays are efficient when using built-in numpy functions (which are implemented in C). Python loops are slow whether you use numpy or not. Try to implement your algorithm using only numpy functions. Using built-in Python functions and/or comprehensions should also increase performance (less than numpy, but more than plain loops).
    • ink
      ink over 7 years
      As such, loops in Python are not bad. And what's so difficult about looping anyway. I suspect it's nested looping which is killing this code (creating another context?)
    • ink
      ink over 7 years
      I hate to answer my own question but I finally solved by resorting to Numba. Remembered it initially but then forgot. The timing is pretty much what C compiled code gives me and there is still one Python loop. So the outer loop is not a problem. I appreciate all the comments about using libraries. Indeed, they must be used in order to get the best performance. However I find it hard to remember all these calls and thinking in terms of code rather than libs is easier for me.
    • hpaulj
      hpaulj over 7 years
      That nditer page ends with a cython example. That's where you gain some speed. Otherwise nditer is just a way of handling broadcasting for multiple inputs and outputs.
  • vz0
    vz0 over 7 years
    What is the speedup gain by using this approach?
  • Ami Tavory
    Ami Tavory over 7 years
    @vz0 Relative to the loop?
  • vz0
    vz0 over 7 years
    OP posted in his question that it takes 12 seconds to run the pure python version. How much time would it take to run with your version?
  • Ami Tavory
    Ami Tavory over 7 years
    @vz0 See edit - loop takes by me around 8.5 secs, vs. 214 ms for the vectorized version.
  • vz0
    vz0 over 7 years
    Is the vectorized version still O(N**2) ?
  • Ami Tavory
    Ami Tavory over 7 years
    @vz0 Yes, but it's with very very different constants.
  • ink
    ink over 7 years
    apart from probably needing a custom function if I need to compare element diffs with eps, this solution will need a massive matrix as N grows
  • ink
    ink over 7 years
    Avoiding loops and vectorisation are not the same thing. Using vectorised code, you can use the full potential of your processor by doing multiple ops at once as well as have better memory prefetch. The sample code is fundamentally memory bound so vectorisation is strictly speaking useless there. Regardless, you need to loop over your vectorised code. Another disadvantage is your approach explodes quadratically as N grows.
  • ink
    ink over 7 years
    nice one. i thought about np.where and tried it too. the problem is, i don't need 1 in the answer, only 2500 and 5000, ie repeats. secondly, the generalisation to near similarity is not so neat and a bit slower. but, yes, your one liner is as quick as C.
  • ink
    ink over 7 years
    unfortunately the dictionary approach won't work for me because actually i need to compare array elements within certain tolerance abs(vect[i] - vect[j]) < eps
  • ink
    ink over 7 years
    Unfortunately this is not suitable on the same grounds as Ami Tavory's suggestion. Try N=160000
  • vz0
    vz0 over 7 years
    You can then normalize the values in the dict to eps. abs(vect[i] - vect[j]) < eps -> abs(vect[i] - vect[j]) / eps < eps / eps -> abs(vect[i]/eps - vect[j]/eps) < 1.
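
    A minimal sketch of that normalization idea (a hypothetical helper, not from any of the posted answers): quantize each value by eps and compare only elements that fall in the same or an adjacent bucket.

    from collections import defaultdict

    def approx_duplicates(vect, eps):
        # elements within eps of each other can only land in the same
        # bucket or in one of the two neighbouring buckets
        buckets = defaultdict(list)
        for i, x in enumerate(vect):
            buckets[int(round(x / eps))].append(i)
        dupl = []
        for key, idxs in buckets.items():
            cand = idxs + buckets.get(key - 1, []) + buckets.get(key + 1, [])
            for i in idxs:
                for j in cand:
                    if j > i and abs(vect[i] - vect[j]) < eps:
                        dupl.append(j)
        return sorted(dupl)

    # with the question's setup this returns [2500, 5000, 5000]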