Finding the Index of N biggest elements in Python Array / List Efficiently

20,101

Solution 1

Have you looked at the built-in numpy argsort method?:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html

I can sort an array with 300,000 random floats in about 29 ms on my machine using that method.

def f(a,N):
    return np.argsort(a)[::-1][:N]

Solution 2

L = [4, 1, 0, 8, 5, 2]
sorted(range(len(L)), key=lambda i:L[i])

Solution 3

You can use heapq to do this easily enough:

>>> heapq.nlargest(3, zip(a, itertools.count()))
[(8, 3), (5, 4), (4, 5)]

Tuples are sorted by sorting on the first value, then the second, etc... This means that we can simply make a tuple of (value, index) and sort, giving us the indices of the values (the values are also given, but we can easily throw these away).

I am using zip() and itertools.count() as enumerate gives us the wrong order, so they will be sorted by index, rather than by value. Alternatively, you could also do ((value, index) for index, value in enumerate(a)), but I feel that is less clear.

Another alternative is to give a key, doing heapq.nlargest(3, enumerate(a), key=operator.itemgetter(1)).

Solution 4

Another way to use heapq

heapq.nlargest(n, range(len(a)), key=a.__getitem__)

As commented elsewhere, it won't beat sorting unless a is very large and n<<len(a) because sorting is a relatively fast operation in Python. However eventually a slow O(n) will always beat the O(n*log(n))

Share:
20,101

Related videos on Youtube

Willian Fuks
Author by

Willian Fuks

Just a regular data geek :) Just implemented a Causal Impact algorithm on top of TensorFlow Probability, if you are working with time series you can check the project here.

Updated on August 25, 2020

Comments

  • Willian Fuks
    Willian Fuks over 3 years

    I'm sorry in advance if this is a duplicated question, I looked for this information but still couldn't find it.

    Is it possible to arrange a numpy array (or python list) by using the indexes of the N biggest elements in decreasing order very efficiently?

    For instance, the array:

    a = array([4, 1, 0, 8, 5, 2])
    

    The indexes of the biggest elements in decreasing order would give (considering N = 6, all the elements are included):

    8 --> 3

    5 --> 4

    4 --> 0

    2 --> 5

    1 --> 1

    0 --> 2

    result = [3, 4, 0, 5, 1, 2]
    

    I know how to make it using a somewhat silly approach (like sorting the array and searching for each of the N numbers for their indexes), but I was wondering if is there any efficient library like bottleneck or heapq or maybe a pythonic approach to make this very fast. I have to apply it in several arrays with 300k elements each so that's why performance is an issue.

    Thanks in advance!

    UPDATE

    I read the answers and decided to timeit them using a 300k of random integers, here are the results:

    solution 1: sorted(range(len(a)), key=lambda i:a[i]) time: 230 ms

    solution 2: heapq.nlargest(len(a), zip(a, itertools.count())) time: 396 ms

    solution 3: heapq.nlargest(len(a), enumerate(a), key=operator.itemgetter(1)) time: 864 ms

    solution 4: def f(a,N): return np.argsort(a)[::-1][:N] (N = len(a)) time: 104 ms

    Thanks a lot for the fast and very good answers!

    • senderle
      senderle over 11 years
    • senderle
      senderle over 11 years
      If you follow the trail of duplicates, this answer pops up, which seems promising -- although the post is by the developer, a fact that the answer does not disclose...
    • lizzie
      lizzie about 11 years
      In your test, what is the value of N ? As explained above, using heapq is efficient is N is rather small compare to len(a).
    • Superbest
      Superbest about 9 years
      How do you modify these for N < len(a)?
    • stanleyli
      stanleyli almost 8 years
      I agree with @lizzie. Can you provide the value of N and len(a) in your experiment? I think heapq.nlargest should be more efficient than np.argsort if N is much smaller than len(a).
    • tinix84
      tinix84 about 4 years
  • mgilson
    mgilson over 11 years
    More clever than me. +1 to you sir.
  • Gareth Rees
    Gareth Rees over 11 years
    key = L.__getitem__ is an alternative (that may be a bit faster in some cases).
  • inspectorG4dget
    inspectorG4dget over 11 years
    @GarethRees: You're right! I hadn't thought of that. lambdas /are/ slow
  • Gareth Rees
    Gareth Rees over 11 years
    I tried a simple test and didn't see much difference, so it wouldn't be wrong to go with the lambda.
  • Matt
    Matt over 11 years
    The python documentation recommends using sorted() instead of heapq.nlargest() when working with large lists, although they don't clarify how big a "large list" is. docs.python.org/library/heapq.html
  • Gareth Latty
    Gareth Latty over 11 years
    @Matt I can't find the suggestion saying that in the docs - but that may well be the case - I'd suggest the OP runs some timeit tests to find out what's the most efficient for his use.
  • Willian Fuks
    Willian Fuks over 11 years
    This worked amazingly well! In my machine it's taking 104 ms (it's very busy right now), later on I'll try it again, but so far this has been the fastest solution. Tnx!
  • Willian Fuks
    Willian Fuks over 11 years
    I tried to make it work using the getitem but since I'm very newbie in python couldn't make it work correctly, but the solution using lambda worked well here, thanks for the help!
  • inspectorG4dget
    inspectorG4dget over 11 years
    for getitem: change, let key=L.__getitem__
  • Matt
    Matt over 11 years
    @Lattyware At docs.python.org/library/heapq.html it says "The latter two functions perform best for smaller values of n. For larger values, it is more efficient to use the sorted() function. Also, when n==1, it is more efficient to use the built-in min() and max() functions."
  • Henry Thornton
    Henry Thornton over 11 years
    @joshadel This function argsort's first and then returns the top-N values. Is there a Numpy/Scipy function that is equivalent to the Python heapq.nlargest(N, a) but for finding the top-N index values without argsort'ing the entire array?
  • JoshAdel
    JoshAdel over 11 years
    @dbv maybe something like berkeleyanalytics.com/bottleneck/…, but that seems to only work for the smallest values.
  • Piyush Deshmukh
    Piyush Deshmukh over 8 years
    Yes, you are correct that slow O(n) beats O(n*log(n)) for large n, but heapq module has been implemented intelligently, taking care that the n value passed to nlargest() function will use heap only if n is relatively small and switch to sorting when n is significantly larger and tending to sizeof list.
  • Joffer
    Joffer over 7 years
    nlargest from at least v2.7.11 will use a heap regardless, but v3.5.2 behaves as you describe @sinister
  • Joffer
    Joffer over 7 years
    actually, in v3.5.2, sorted is only used when n is larger than the size of the iterable @sinister