Java - Looking for something faster than PriorityQueue

Solution 1

A heap-based priority queue is a good data structure for this problem. Just as a sanity check, verify that you are using the queue correctly.

If you want the M highest-weight items, use a min-queue, where the top of the heap is the smallest item. Adding every item to a max-queue and then examining the top M items when done is not efficient.

For each item, if there are fewer than M items in the queue, add the current item. Otherwise, peek at the top of the heap. If the top is smaller than the current item, remove it and add the current item instead; otherwise, discard the current item. When all items have been processed, the queue contains the M highest-weight items.

Some heaps have a shortcut API for replacing the top of the heap, but Java's PriorityQueue does not. Even so, the big-O complexity is the same.
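For concreteness, here is a minimal sketch of that bounded-size min-heap approach using the JDK's PriorityQueue. The Element class below only stands in for the one described in the question (an int key and a double weight with a getter), so treat those names as assumptions rather than the asker's actual code.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopMByWeight {

    // Minimal stand-in for the Element class described in the question.
    static class Element {
        final int key;
        final double weight;
        Element(int key, double weight) { this.key = key; this.weight = weight; }
        double getWeight() { return weight; }
    }

    // Keep a min-heap of at most m elements: the head is always the smallest of
    // the current "best m", so it is the one to evict. Overall cost is O(n log m).
    static List<Element> topM(Iterable<Element> items, int m) {
        PriorityQueue<Element> heap =
                new PriorityQueue<>(m, Comparator.comparingDouble(Element::getWeight));
        for (Element e : items) {
            if (heap.size() < m) {
                heap.offer(e);
            } else if (heap.peek().getWeight() < e.getWeight()) {
                heap.poll();   // drop the lightest of the current best m
                heap.offer(e); // and keep the heavier element instead
            }
        }
        return new ArrayList<>(heap); // unsorted; sort descending if ranking matters
    }
}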

Solution 2

In addition to the suggested "peek at the top of the heap" algorithm, which gives you O(n log m) complexity for getting the top-m of n items, here are two more options.

Option 1: Use a Fibonacci heap.

The JDK's PriorityQueue implementation is a balanced binary heap. You should be able to squeeze more performance out of a Fibonacci heap implementation. It will have amortized constant time insert, while inserting into a binary heap has complexity Ω(log n) in the size of the heap. If you're doing that for every element, then you're at Ω(n log n). Finding the top-m of n items using a Fib heap has complexity O(n + m log n). Combine this with the suggestion to only ever insert m elements into the heap, and you have O(n + m log m), which is as close to linear time as you're going to get.
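The JDK does not ship a Fibonacci heap, so the sketch below is written against a hypothetical FibonacciHeap interface; the interface, its method names, and the assumption that the heap orders elements by descending weight are all illustrative, not a real library API. Element is the stand-in class from the earlier sketch.

import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: a real program would use a third-party Fibonacci heap
// implementation, and its method names will likely differ from these.
interface FibonacciHeap<E> {
    void insert(E e);   // amortized O(1)
    E extractMin();     // amortized O(log n); "min" is per the heap's own comparator
    int size();
}

public class FibHeapTopM {

    // Assumes the heap was constructed to order Element by descending weight, so
    // extractMin() hands back the heaviest remaining item. Inserting all n items
    // costs O(n) amortized; the m extractions cost O(m log n).
    static List<Element> topM(Iterable<Element> items, int m, FibonacciHeap<Element> heap) {
        for (Element e : items) {
            heap.insert(e);
        }
        List<Element> best = new ArrayList<>();
        while (best.size() < m && heap.size() > 0) {
            best.add(heap.extractMin());
        }
        return best;
    }
}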

Option 2: Traverse the list M times.

You should be able to get the kth-largest element in a set in expected O(n) time with a quickselect-style partitioning. Simply read everything into a list and do the following:

kthLargest(k, xs)
  Pick a random pivot element p from the list
    (the first one will do if your list is already in random order).
  Go over the list once and split it into three lists:
     Left:  elements smaller than p.
     Equal: elements equal to p.
     Right: elements larger than p.
  If the Right list has at least k elements, return kthLargest(k, Right).
  If Right and Equal together have at least k elements, return p.
  Otherwise, return kthLargest(k - right.size - equal.size, Left).

That gives you expected O(n) time. Running that m times, you should be able to get the top-m objects in your set in O(nm) time, which will be strictly less than n log n for sufficiently large n and sufficiently small m. For example, getting the top-10 over a million items will take half as long as using a binary heap priority queue, all other things being equal.
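Here is one possible Java rendering of that routine, again treating Element and getWeight() as stand-ins for the question's class; the three-way split is what keeps it from looping when many weights are tied.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class QuickSelect {

    private static final Random RNG = new Random();

    // Returns an element whose weight is the k-th largest (1-based) in xs.
    // Expected O(n) time. Element is the same stand-in class as in the first sketch.
    static Element kthLargest(int k, List<Element> xs) {
        Element p = xs.get(RNG.nextInt(xs.size()));   // random pivot
        List<Element> left = new ArrayList<>();       // weights strictly below the pivot
        List<Element> right = new ArrayList<>();      // weights strictly above the pivot
        int equal = 0;                                // weights tied with the pivot
        for (Element e : xs) {
            if (e.getWeight() < p.getWeight()) {
                left.add(e);
            } else if (e.getWeight() > p.getWeight()) {
                right.add(e);
            } else {
                equal++;
            }
        }
        if (right.size() >= k) {
            return kthLargest(k, right);              // answer lies among the larger weights
        }
        if (right.size() + equal >= k) {
            return p;                                 // the k-th largest has the pivot's weight
        }
        return kthLargest(k - right.size() - equal, left);
    }
}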

Solution 3

If M is suitably small, then sorting all elements may waste a lot of computing time. You could put only the first M objects in a priority queue (e.g. a heap with the minimal element on top), and then iterate over the remaining elements: every time an element is larger than the top of the heap, remove the top and push the new element onto the heap.

Alternatively, you could iterate over the whole array once to find a statistical threshold value for which you can be very sure there are more than M objects with a larger value (this will require some assumptions about the values, e.g. that they are approximately normally distributed). You can then limit the sort to the elements above that threshold.
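As a rough sketch of that idea (assuming weights are approximately normally distributed, and reusing the stand-in Element class from the first example): estimate the mean and standard deviation in one pass, pick a z-score conservative enough that well over M values should exceed the threshold, and sort only what survives the filter. The class name and the z value are illustrative choices, not anything from the question.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ThresholdPrefilter {

    // Rough sketch: assumes the weights are roughly normally distributed.
    static List<Element> topM(List<Element> xs, int m) {
        double sum = 0, sumSq = 0;
        for (Element e : xs) {
            sum += e.getWeight();
            sumSq += e.getWeight() * e.getWeight();
        }
        int n = xs.size();
        double mean = sum / n;
        double stdDev = Math.sqrt(Math.max(0, sumSq / n - mean * mean));

        // z chosen so the expected number of survivors is well above m (here it
        // keeps roughly the top 10% under normality). This is a guess that must
        // be validated: fall back to the full list if it keeps fewer than m.
        double z = 1.3;
        double threshold = mean + z * stdDev;

        List<Element> candidates = new ArrayList<>();
        for (Element e : xs) {
            if (e.getWeight() >= threshold) candidates.add(e);
        }
        if (candidates.size() < m) {
            candidates = new ArrayList<>(xs);   // estimate was too aggressive
        }
        candidates.sort(Comparator.comparingDouble(Element::getWeight).reversed());
        return candidates.subList(0, Math.min(m, candidates.size()));
    }
}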


Comments

  • BigG
    BigG almost 2 years

    I'm using Java on a large amount of data.

    [I'll try to simplify the problem as much as possible.]

    I have a small class (Element) containing an int KEY and a double WEIGHT (with getters and setters).

    I read a lot of these objects from a file, and I have to get the best (highest-weight) M objects.

    At the moment I'm using a PriorityQueue with a Comparator written to compare two Elements; it works, but it's too slow.

    Do you know (I know you do) any faster way to do that?

    Thank you

  • Lawrence Dol
    Lawrence Dol over 14 years
    And, of course, this comparator is good provided it is guaranteed that the difference between i and j never exceeds Integer.MAX_VALUE.
  • brady
    brady over 14 years
    In general, subtraction is a poor choice for implementing comparison on floating-point values (the question clearly states that the weight is a double). If the difference is less than one, it will incorrectly be coerced to zero when casting the result to an int. (A correct comparator is sketched after these comments.)
  • Apocalisp
    Apocalisp over 14 years
    Good suggestion. The complexity of this algorithm is O(n log m) for getting the top-m of n total elements.
  • Michael Borgwardt
    Michael Borgwardt over 14 years
    Your claim about the speed difference factor between a Fibonacci heap and a binary heap assumes a binary logarithm and no difference in constant factors, i.e. it's probably untrue.
  • Jørgen Fogh
    Jørgen Fogh over 14 years
    @Software Monkey: True. @erickson: I hadn't noticed that we were using floating-point values.
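To make the point in these last comments concrete: a comparator that subtracts the values (or casts the difference to an int) can overflow with large ints and truncates fractional differences between doubles to zero. The safe idiom is Integer.compare / Double.compare, or the Comparator factory methods. A small sketch, again using the stand-in Element class from the first example:

import java.util.Comparator;

public class WeightComparators {

    // The pattern the comments warn about: a fractional difference such as 0.4
    // is truncated to 0 by the int cast, and a large difference can overflow.
    static final Comparator<Element> BROKEN =
            (a, b) -> (int) (a.getWeight() - b.getWeight());

    // Safe: Double.compare handles the full double range (including NaN) consistently.
    static final Comparator<Element> BY_WEIGHT =
            (a, b) -> Double.compare(a.getWeight(), b.getWeight());

    // Equivalent, using the JDK 8+ factory method.
    static final Comparator<Element> BY_WEIGHT_FACTORY =
            Comparator.comparingDouble(Element::getWeight);
}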