Finding the Index of N biggest elements in Python Array / List Efficiently
Solution 1
Have you looked at the built-in numpy argsort
method?:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
I can sort an array with 300,000 random floats in about 29 ms on my machine using that method.
def f(a,N):
return np.argsort(a)[::-1][:N]
Solution 2
L = [4, 1, 0, 8, 5, 2]
sorted(range(len(L)), key=lambda i:L[i])
Solution 3
You can use heapq
to do this easily enough:
>>> heapq.nlargest(3, zip(a, itertools.count()))
[(8, 3), (5, 4), (4, 5)]
Tuples are sorted by sorting on the first value, then the second, etc... This means that we can simply make a tuple of (value, index)
and sort, giving us the indices of the values (the values are also given, but we can easily throw these away).
I am using zip()
and itertools.count()
as enumerate gives us the wrong order, so they will be sorted by index, rather than by value. Alternatively, you could also do ((value, index) for index, value in enumerate(a))
, but I feel that is less clear.
Another alternative is to give a key, doing heapq.nlargest(3, enumerate(a), key=operator.itemgetter(1))
.
Solution 4
Another way to use heapq
heapq.nlargest(n, range(len(a)), key=a.__getitem__)
As commented elsewhere, it won't beat sorting unless a is very large and n<<len(a)
because sorting is a relatively fast operation in Python. However eventually a slow O(n) will always beat the O(n*log(n))
Related videos on Youtube
Willian Fuks
Just a regular data geek :) Just implemented a Causal Impact algorithm on top of TensorFlow Probability, if you are working with time series you can check the project here.
Updated on August 25, 2020Comments
-
Willian Fuks over 3 years
I'm sorry in advance if this is a duplicated question, I looked for this information but still couldn't find it.
Is it possible to arrange a numpy array (or python list) by using the indexes of the N biggest elements in decreasing order very efficiently?
For instance, the array:
a = array([4, 1, 0, 8, 5, 2])
The indexes of the biggest elements in decreasing order would give (considering N = 6, all the elements are included):
8 --> 3
5 --> 4
4 --> 0
2 --> 5
1 --> 1
0 --> 2
result = [3, 4, 0, 5, 1, 2]
I know how to make it using a somewhat silly approach (like sorting the array and searching for each of the N numbers for their indexes), but I was wondering if is there any efficient library like bottleneck or heapq or maybe a pythonic approach to make this very fast. I have to apply it in several arrays with 300k elements each so that's why performance is an issue.
Thanks in advance!
UPDATE
I read the answers and decided to timeit them using a 300k of random integers, here are the results:
solution 1:
sorted(range(len(a)), key=lambda i:a[i])
time: 230 mssolution 2:
heapq.nlargest(len(a), zip(a, itertools.count()))
time: 396 mssolution 3:
heapq.nlargest(len(a), enumerate(a), key=operator.itemgetter(1))
time: 864 mssolution 4:
def f(a,N): return np.argsort(a)[::-1][:N] (N = len(a))
time: 104 msThanks a lot for the fast and very good answers!
-
senderle over 11 yearspossible duplicate of How to get the N maximum values in a numpy array?
-
senderle over 11 yearsIf you follow the trail of duplicates, this answer pops up, which seems promising -- although the post is by the developer, a fact that the answer does not disclose...
-
lizzie about 11 yearsIn your test, what is the value of N ? As explained above, using heapq is efficient is N is rather small compare to len(a).
-
Superbest about 9 yearsHow do you modify these for
N < len(a)
? -
stanleyli almost 8 yearsI agree with @lizzie. Can you provide the value of
N
andlen(a)
in your experiment? I thinkheapq.nlargest
should be more efficient thannp.argsort
ifN
is much smaller thanlen(a)
. -
tinix84 about 4 years
-
-
mgilson over 11 yearsMore clever than me. +1 to you sir.
-
Gareth Rees over 11 years
key = L.__getitem__
is an alternative (that may be a bit faster in some cases). -
inspectorG4dget over 11 years@GarethRees: You're right! I hadn't thought of that.
lambda
s /are/ slow -
Gareth Rees over 11 yearsI tried a simple test and didn't see much difference, so it wouldn't be wrong to go with the
lambda
. -
Matt over 11 yearsThe python documentation recommends using sorted() instead of heapq.nlargest() when working with large lists, although they don't clarify how big a "large list" is. docs.python.org/library/heapq.html
-
Gareth Latty over 11 years@Matt I can't find the suggestion saying that in the docs - but that may well be the case - I'd suggest the OP runs some
timeit
tests to find out what's the most efficient for his use. -
Willian Fuks over 11 yearsThis worked amazingly well! In my machine it's taking 104 ms (it's very busy right now), later on I'll try it again, but so far this has been the fastest solution. Tnx!
-
Willian Fuks over 11 yearsI tried to make it work using the getitem but since I'm very newbie in python couldn't make it work correctly, but the solution using lambda worked well here, thanks for the help!
-
inspectorG4dget over 11 yearsfor getitem: change, let
key=L.__getitem__
-
Matt over 11 years@Lattyware At docs.python.org/library/heapq.html it says "The latter two functions perform best for smaller values of n. For larger values, it is more efficient to use the sorted() function. Also, when n==1, it is more efficient to use the built-in min() and max() functions."
-
Henry Thornton over 11 years@joshadel This function argsort's first and then returns the top-N values. Is there a Numpy/Scipy function that is equivalent to the Python heapq.nlargest(N, a) but for finding the top-N index values without argsort'ing the entire array?
-
JoshAdel over 11 years@dbv maybe something like berkeleyanalytics.com/bottleneck/…, but that seems to only work for the smallest values.
-
Piyush Deshmukh over 8 yearsYes, you are correct that slow
O(n)
beatsO(n*log(n))
for largen
, butheapq
module has been implemented intelligently, taking care that then
value passed tonlargest()
function will use heap only if n is relatively small and switch to sorting whenn
is significantly larger and tending to sizeof list. -
Joffer over 7 yearsnlargest from at least v2.7.11 will use a heap regardless, but v3.5.2 behaves as you describe @sinister
-
Joffer over 7 yearsactually, in v3.5.2, sorted is only used when n is larger than the size of the iterable @sinister