Fast algorithm for repeated calculation of percentile?

Solution 1

You can do it with two heaps. I'm not sure whether there's a less 'contrived' solution, but this one gives O(log n) time per insertion, and heaps are included in the standard libraries of most programming languages.

The first heap (heap A) holds the smallest 75% of the elements; the second (heap B) holds the rest (the largest 25%). Heap A is a max-heap, so its largest element is on top; heap B is a min-heap, so its smallest element is on top.

  1. Adding an element.

Check whether the new element x is <= max(A). If it is, add it to heap A; otherwise add it to heap B.
If x went into heap A and A is now too big (it holds more than 75% of the elements), remove the largest element from A (O(log n)) and add it to heap B (also O(log n)).
Do the mirror-image move if heap B became too big: remove its smallest element and add it to heap A.

  2. Finding the "0.75 median"

Just take the largest element of A (or the smallest of B). This takes O(1) or O(log n) time, depending on the heap implementation; peeking at the top of a binary heap is O(1).

edit
As Dolphin noted, we need to specify precisely how big each heap should be for every n if we want an exact answer. For example, if size(A) = floor(n * 0.75) and B holds the rest, then for every n > 0, array[array.size * 3/4] = min(B).
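The sizing rule in this edit can be sketched as follows. This is only an illustration, assuming `double` values and the standard `std::priority_queue`; the class name `RunningP75` and its members are hypothetical and not taken from the answer.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical helper: tracks sorted[n * 3 / 4] over all values added so far.
class RunningP75 {
public:
    void add(double x) {
        // Route x by comparing it with the top of the lower heap A.
        if (!a_.empty() && x <= a_.top()) {
            a_.push(x);
        } else {
            b_.push(x);
        }
        // Restore the invariant size(A) == floor(n * 3 / 4).
        const std::size_t n = a_.size() + b_.size();
        const std::size_t want = n * 3 / 4;
        while (a_.size() > want) { b_.push(a_.top()); a_.pop(); }
        while (a_.size() < want) { a_.push(b_.top()); b_.pop(); }
    }

    // Precondition: at least one value has been added (B is then non-empty).
    double get() const { return b_.top(); }

private:
    // Heap A: max-heap holding the smallest floor(n * 3 / 4) values.
    std::priority_queue<double> a_;
    // Heap B: min-heap holding the remaining (largest) values.
    std::priority_queue<double, std::vector<double>, std::greater<double>> b_;
};
```

With this rule heap B is never empty once a value has been added, so `get()` reads min(B) rather than max(A).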

Solution 2

If an approximate answer is good enough, you can use a histogram instead of keeping every value in memory.

For each new value, add it to the appropriate bin. To calculate the 75th percentile, traverse the bins in order, summing their counts until 75% of the population size is reached. The percentile value lies between the low and high bounds of the bin you stopped at.

This gives O(B) complexity per query, where B is the number of bins, i.e. range_size/bin_size. (Choose a bin_size appropriate to your use case.)
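A minimal sketch of the histogram idea, assuming fixed-width bins over a known range [lo, hi) and clamping out-of-range values into the edge bins. The class name `HistogramPercentile` and its interface are hypothetical; they are not the API of any existing library.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical fixed-width histogram over [lo, hi).
// Assumption: values outside the range are clamped into the first or last bin.
class HistogramPercentile {
public:
    HistogramPercentile(double lo, double hi, std::size_t bins)
        : lo_(lo), width_((hi - lo) / bins), counts_(bins, 0), total_(0) {}

    // O(1): increment the count of the bin that x falls into.
    void add(double x) {
        const double pos = (x - lo_) / width_;
        std::size_t i = 0;
        if (pos >= static_cast<double>(counts_.size())) {
            i = counts_.size() - 1;
        } else if (pos > 0) {
            i = static_cast<std::size_t>(pos);
        }
        ++counts_[i];
        ++total_;
    }

    // O(B): returns the [low, high) bounds of the bin where the running
    // count first reaches p * total. Precondition: total > 0, 0 < p <= 1.
    std::pair<double, double> bounds(double p) const {
        const double target = p * total_;
        std::size_t seen = 0;
        for (std::size_t i = 0; i < counts_.size(); ++i) {
            seen += counts_[i];
            if (seen >= target) {
                return {lo_ + i * width_, lo_ + (i + 1) * width_};
            }
        }
        // Not reached when the precondition holds; fall back to the last bin.
        return {lo_ + (counts_.size() - 1) * width_, lo_ + counts_.size() * width_};
    }

private:
    double lo_;
    double width_;
    std::vector<std::size_t> counts_;
    std::size_t total_;
};
```

Memory use depends only on the number of bins, not on how many values were added; the price is that the answer is only known to within one bin width.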

I have implemented this logic in a JVM library: https://github.com/IBM/HBPE which you can use as a reference.

Author by

martinus

Software engineer at dynatrace. Bitcoin enthusiast, C++ developer

Updated on July 09, 2022

Comments

  • martinus
    martinus over 1 year

    In an algorithm I have to calculate the 75th percentile of a data set whenever I add a value. Right now I am doing this:

    1. Get value x
    2. Insert x in an already sorted array at the back
    3. swap x down until the array is sorted
    4. Read the element at position array[array.size * 3/4]

    Point 3 is O(n), and the rest is O(1), but this is still quite slow, especially if the array gets larger. Is there any way to optimize this?
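For reference, the four steps above can be sketched roughly like this. It is a simplified illustration, assuming `double` values; the class name `SortedPercentile75` is hypothetical. It locates the insertion point with a binary search instead of swapping down from the back, but the insert still shifts O(n) elements, so the overall cost per value stays O(n).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical baseline: keeps every value in a sorted vector.
class SortedPercentile75 {
public:
    void add(double x) {
        // The binary search is O(log n), but vector::insert shifts O(n) elements.
        auto pos = std::upper_bound(values_.begin(), values_.end(), x);
        values_.insert(pos, x);
    }

    // Precondition: at least one value has been added.
    double get() const { return values_[values_.size() * 3 / 4]; }

private:
    std::vector<double> values_;
};
```

A baseline like this is mostly useful for cross-checking a faster implementation against the same input.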

    UPDATE

    Thanks Nikita! Since I am using C++ this is the solution easiest to implement. Here is the code:

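    #include <algorithm>   // std::push_heap, std::pop_heap
    #include <functional>  // std::less, std::greater
    #include <vector>
    
    template<class T>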
    class IterativePercentile {
    public:
      /// Percentile has to be in range [0, 1)
      IterativePercentile(double percentile)
        : _percentile(percentile)
      { }
    
      // Adds a number in O(log(n))
      void add(const T& x) {
        if (_lower.empty() || x <= _lower.front()) {
          _lower.push_back(x);
          std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
        } else {
          _upper.push_back(x);
          std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
        }
    
        unsigned size_lower = (unsigned)((_lower.size() + _upper.size()) * _percentile) + 1;
        if (_lower.size() > size_lower) {
          // lower to upper
          std::pop_heap(_lower.begin(), _lower.end(), std::less<T>());
          _upper.push_back(_lower.back());
          std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
          _lower.pop_back();
        } else if (_lower.size() < size_lower) {
          // upper to lower
          std::pop_heap(_upper.begin(), _upper.end(), std::greater<T>());
          _lower.push_back(_upper.back());
          std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
          _upper.pop_back();
        }            
      }
    
      /// Access the percentile in O(1); requires at least one prior add()
      const T& get() const {
        return _lower.front();
      }
    
      void clear() {
        _lower.clear();
        _upper.clear();
      }
    
    private:
      double _percentile;
      std::vector<T> _lower;
      std::vector<T> _upper;
    };