C++ Efficiently Calculating a Running Median


Solution 1

The streaming median is computed using two heaps. All the numbers less than or equal to the current median are in the left heap, which is arranged so that the maximum number is at the root of the heap. All the numbers greater than or equal to the current median are in the right heap, which is arranged so that the minimum number is at the root of the heap. Note that numbers equal to the current median can be in either heap. The count of numbers in the two heaps never differs by more than 1.

When the process begins, the two heaps are initially empty. The first number in the input sequence is added to one of the heaps (it doesn't matter which) and returned as the first streaming median. The second number in the input sequence is then added to the other heap; if the root of the right heap is less than the root of the left heap, the two heaps are swapped, and the average of the two numbers is returned as the second streaming median.

Then the main algorithm begins. Each subsequent number in the input sequence is compared to the current median, and added to the left heap if it is less than the current median or to the right heap if it is greater than the current median; if the input number is equal to the current median, it is added to whichever heap has the smaller count, or to either heap arbitrarily if they have the same count. If that causes the counts of the two heaps to differ by more than 1, the root of the larger heap is removed and inserted into the smaller heap. The current median is then the root of the larger heap if the counts differ, or the average of the roots of the two heaps if they are the same size.

Code in Scheme and Python is available at my blog.
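
For comparison, here is a minimal C++ sketch of the same two-heap idea. This is not the blog code, and it is simplified: it always rebalances by heap size after each insertion rather than steering on the current median, which maintains the same invariant described above.

#include <functional>
#include <iostream>
#include <queue>
#include <vector>

// Minimal two-heap running median sketch (illustrative only).
class RunningMedian
{
public:
    void add(double x)
    {
        // Numbers <= the left root go into the max-heap, larger numbers into the min-heap.
        if (left.empty() || x <= left.top())
            left.push(x);
        else
            right.push(x);

        // Keep the heap sizes within 1 of each other.
        if (left.size() > right.size() + 1)
        {
            right.push(left.top());
            left.pop();
        }
        else if (right.size() > left.size() + 1)
        {
            left.push(right.top());
            right.pop();
        }
    }

    double median() const
    {
        if (left.size() > right.size())
            return left.top();
        if (right.size() > left.size())
            return right.top();
        return (left.top() + right.top()) / 2.0;
    }

private:
    std::priority_queue<double> left;                                             // max-heap
    std::priority_queue<double, std::vector<double>, std::greater<double>> right; // min-heap
};

int main()
{
    RunningMedian rm;
    for (double x : {5.0, 2.0, 8.0, 1.0, 9.0})
    {
        rm.add(x);
        std::cout << "median = " << rm.median() << '\n';
    }
}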

Solution 2

The Jeff McClintock running median estimate. It requires keeping only two values. This example iterates over an array of sampled values (CPU consumption). It seems to converge relatively quickly (about 100 samples) to an estimate of the median. The idea is that at each iteration the median inches toward the input signal at a constant rate. The rate depends on what magnitude you estimate the median to be. I use the running average as an estimate of the magnitude of the median, which determines the size of each increment of the median. If you need your median accurate to about 1%, use a step size of 0.01 * the average.

#include <cmath> // std::copysign (the original used MSVC's _copysign)

float median = 0.0f;
float average = 0.0f;

for (float sample : samples) // samples: the array of sampled values (e.g. CPU consumption)
{
    average += ( sample - average ) * 0.1f;  // rough running average
    median  += std::copysign( average * 0.01f, sample - median ); // nudge the median toward the sample
}


Solution 3

One solution would be to maintain an order statistic tree, inserting each element of the sequence in turn and then computing the median of the elements currently in the tree (the element at the middle index).

This would take O(lg n) time per insertion and O(lg n) time per median, for a total of O(n lg n) time, plus O(n) space.
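
If you would rather not implement the tree yourself, GCC ships an order statistic tree in its policy-based data structures extension. The sketch below is not part of the original answer and assumes that non-standard libstdc++ extension (__gnu_pbds); keys are stored as (value, counter) pairs so duplicate values are allowed.

#include <ext/pb_ds/assoc_container.hpp> // GCC-only policy-based data structures
#include <ext/pb_ds/tree_policy.hpp>
#include <iostream>
#include <utility>

using namespace __gnu_pbds;

// An order statistic tree over (value, insertion counter) pairs so duplicate values can coexist.
typedef tree<std::pair<int, long>, null_type, std::less<std::pair<int, long> >,
             rb_tree_tag, tree_order_statistics_node_update> order_statistic_tree;

int main()
{
    order_statistic_tree t;
    long counter = 0;
    for (int x : {5, 2, 8, 2, 9})
    {
        t.insert(std::make_pair(x, counter++));
        // find_by_order(k) returns an iterator to the k-th smallest element (0-based),
        // so the (upper) median sits at index size() / 2.
        std::cout << "median = " << t.find_by_order(t.size() / 2)->first << '\n';
    }
}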

Solution 4

Here is a C++ balanced tree structure that provides the ability to query by index in the sorted list. Since it maintains all values in sorted order, this is not quite as efficient as the two-heaps approach, but it offers some additional flexibility. For example, it could also give you a running quartile.

template <typename T>
class Node
{
public:
    T key;
    Node* left;
    Node* right;
    size_t size; // number of nodes in the subtree rooted at this node

    Node(T k) : key(k)
    {
        isolate();
    }

    ~Node()
    {
        delete(left);
        delete(right);
    }

    void isolate()
    {
        left = NULL;
        right = NULL;
        size = 1;
    }

    void recount()
    {
        size = 1 + (left ? left->size : 0) + (right ? right->size : 0);
    }

    Node<T>* rotateLeft()
    {
        Node<T>* c = right;
        Node<T>* gc = right->left;
        right = gc;
        c->left = this;
        recount();
        c->recount();
        return c;
    }

    Node<T>* rotateRight()
    {
        Node<T>* c = left;
        Node<T>* gc = left->right;
        left = gc;
        c->right = this;
        recount();
        c->recount();
        return c;
    }

    // Rebalance this subtree by size (weight balance); returns the new subtree root.
    Node<T>* balance()
    {
        size_t lcount = left ? left->size : 0;
        size_t rcount = right ? right->size : 0;
        if((lcount + 1) * 2 < (rcount + 1))
        {
            size_t lcount2 = right->left ? right->left->size : 0;
            size_t rcount2 = right->right ? right->right->size : 0;
            if(lcount2 > rcount2)
                right = right->rotateRight();
            return rotateLeft();
        }
        else if((rcount + 1) * 2 <= (lcount + 1))
        {
            size_t lcount2 = left->left ? left->left->size : 0;
            size_t rcount2 = left->right ? left->right->size : 0;
            if(lcount2 < rcount2)
                left = left->rotateLeft();
            return rotateRight();
        }
        else
        {
            recount();
            return this;
        }
    }

    // Insert newNode into this subtree; returns the (possibly new) subtree root.
    Node<T>* insert(Node<T>* newNode)
    {
        if(newNode->key < key)
        {
            if(left)
                left = left->insert(newNode);
            else
                left = newNode;
        }
        else
        {
            if(right)
                right = right->insert(newNode);
            else
                right = newNode;
        }
        return balance();
    }

    // Return the node at the given 0-based index in sorted order.
    Node<T>* get(size_t index)
    {
        size_t lcount = left ? left->size : 0;
        if(index < lcount)
            return left->get(index);
        else if(index > lcount)
            return right ? right->get(index - lcount - 1) : NULL;
        else
            return this;
    }

    // Return the node with key k; if found and outIndex is non-null, write its sorted index there.
    Node<T>* find(T k, size_t start, size_t* outIndex)
    {
        if(k < key)
            return left ? left->find(k, start, outIndex) : NULL;
        else if(k > key)
            return right ? right->find(k, left ? start + left->size + 1 : start + 1, outIndex) : NULL;
        else
        {
            if(outIndex)
                *outIndex = start + (left ? left->size : 0);
            return this;
        }
    }

    // Unlink the node at the given index, returning it via *outNode; returns the new subtree root.
    Node<T>* remove_by_index(size_t index, Node<T>** outNode)
    {
        size_t lcount = left ? left->size : 0;
        if(index < lcount)
            left = left->remove_by_index(index, outNode);
        else if(index > lcount)
            right = right->remove_by_index(index - lcount - 1, outNode);
        else
        {
            *outNode = this;
            size_t rcount = right ? right->size : 0;
            if(lcount < rcount)
                return left ? right->insert(left) : right;
            else
                return right ? left->insert(right) : left;
        }
        return balance();
    }

    // Unlink one node with key k, returning it via *outNode; returns the new subtree root.
    Node<T>* remove_by_value(T k, Node<T>** outNode)
    {
        if(k < key)
        {
            if(!left)
                throw "not found";
            left = left->remove_by_value(k, outNode);
        }
        else if(k > key)
        {
            if(!right)
                throw "not found";
            right = right->remove_by_value(k, outNode);
        }
        else
        {
            *outNode = this;
            size_t lcount = left ? left->size : 0;
            size_t rcount = right ? right->size : 0;
            if(lcount < rcount)
                return left ? right->insert(left) : right;
            else
                return right ? left->insert(right) : left;
        }
        return balance();
    }
};

template <typename T>
class MyReasonablyEfficientRunningSortedIndexedCollection
{
private:
    Node<T>* root;
    Node<T>* spare;

public:
    MyReasonablyEfficientRunningSortedIndexedCollection() : root(NULL), spare(NULL)
    {
    }

    ~MyReasonablyEfficientRunningSortedIndexedCollection()
    {
        delete(root);
        delete(spare);
    }

    void insert(T key)
    {
        if(spare)
            spare->key = key;
        else
            spare = new Node<T>(key);
        if(root)
            root = root->insert(spare);
        else
            root = spare;
        spare = NULL;
    }

    void drop_by_index(size_t index)
    {
        if(!root || index >= root->size)
            throw "out of range";
        delete(spare);
        root = root->remove_by_index(index, &spare);
        spare->isolate();
    }

    void drop_by_value(T key)
    {
        if(!root)
            throw "out of range";
        delete(spare);
        root = root->remove_by_value(key, &spare);
        spare->isolate();
    }

    T get(size_t index)
    {
        if(!root || index >= root->size)
            throw "out of range";
        return root->get(index)->key;
    }

    size_t find(T key)
    {
        size_t outIndex;
        Node<T>* node = root ? root->find(key, 0, &outIndex) : NULL;
        if(node)
            return outIndex;
        else
            throw "not found";
    }

    size_t size()
    {
        return root ? root->size : 0;
    }
};
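
As a usage sketch (not part of the original answer), a sliding-window median can be built on top of this collection by pairing it with a std::deque that remembers arrival order; the window size of 5 below is arbitrary.

#include <cstddef>
#include <deque>
#include <iostream>

// Assumes the Node and collection classes above are in scope.
int main()
{
    const size_t window = 5;
    MyReasonablyEfficientRunningSortedIndexedCollection<double> sorted;
    std::deque<double> arrivals; // values in the order they arrived

    for (double x : {3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0})
    {
        sorted.insert(x);
        arrivals.push_back(x);
        if (arrivals.size() > window)
        {
            sorted.drop_by_value(arrivals.front()); // evict the oldest value
            arrivals.pop_front();
        }
        // Lower median: the element at index (size - 1) / 2 of the sorted collection.
        std::cout << "median = " << sorted.get((sorted.size() - 1) / 2) << '\n';
    }
}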

Solution 5

Algorithm for a rolling-window median:

The median is the middle value of a sorted array.

A simple rolling implementation uses a queue (d_queue) and a sorted_array (any implementation: binary tree, skip list, etc.).

d_queue is an array that supports pushing onto the tail and shifting (popping) from the front.

sorted_array is an array into which you insert each value in order, at the position found by binary search.

I used a queue (a first-in-first-out array) to track the order in which values were added, so I know which item to remove from the sorted array when the queue grows longer than the wanted window size. To expire elements by timestamp or some running index instead, it is possible to add another queue, check whether its first element is too old, and decide whether to remove the first value from both queues.

To calculate the median efficiently I use the sorted-array technique: each new item is inserted at its sorted place, so the array is always sorted. The steps are as follows (see the sketch after this list):

  1. The insert:

    • Insert at the ordered place in the sorted_array,
    • and push the value onto the queue.
  2. The remove:

    • If the first element of d_queue is off the window, or (if you keep another queue of indexes) its first index is too old, then:
      • remove the first item from the d_queue(s),
      • and binary search for it in the sorted_array and remove it.
  3. To get the median:

    • Use the value(s) in the middle of the sorted_array.
    • If the sorted_array length is odd, use the item in the middle.
    • If the sorted_array length is even, use the average of the two items in the middle.
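
Here is a minimal sketch of this recipe (not part of the original answer), using a std::deque for arrival order and a sorted std::vector with binary search. Note that inserting into or erasing from the vector is O(n) because of element shifting; a balanced tree or skip list would make those steps O(log n).

#include <algorithm>
#include <deque>
#include <iostream>
#include <vector>

// One step of the rolling-window median: insert, expire, then read the middle.
double rolling_median_step(std::deque<double>& d_queue, std::vector<double>& sorted_array,
                           double sample, size_t window)
{
    // 1. The insert: place the sample at its sorted position and remember arrival order.
    sorted_array.insert(std::upper_bound(sorted_array.begin(), sorted_array.end(), sample), sample);
    d_queue.push_back(sample);

    // 2. The remove: once the window is full, drop the oldest sample from both structures.
    if (d_queue.size() > window)
    {
        double oldest = d_queue.front();
        d_queue.pop_front();
        sorted_array.erase(std::lower_bound(sorted_array.begin(), sorted_array.end(), oldest));
    }

    // 3. The median: middle element for odd length, average of the two middle elements for even length.
    size_t n = sorted_array.size();
    if (n % 2 == 1)
        return sorted_array[n / 2];
    return (sorted_array[n / 2 - 1] + sorted_array[n / 2]) / 2.0;
}

int main()
{
    std::deque<double> d_queue;
    std::vector<double> sorted_array;
    for (double x : {3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0})
        std::cout << "median = " << rolling_median_step(d_queue, sorted_array, x, 5) << '\n';
}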


Author: Edge (updated on June 29, 2020)

Comments

  • Edge
    Edge almost 4 years

    Those of you who have read my previous questions know about my work on understanding and implementing quicksort and quickselect, as well as some other basic algorithms.

    Quickselect is used to calculate the kth smallest element in an unsorted list, and this concept can also be used to find the median in an unsorted list.

    This time, I need help devising an efficient technique to calculate the running median. Quickselect isn't a good choice, because it needs to recalculate every time the list changes; since it has to restart from scratch each time, it can't take advantage of previous calculations, so I'm looking for a different (possibly similar) algorithm that is more efficient for running medians.

    • Adrian
      Adrian over 9 years
      This can be done in expected linear time by using the partition step from the quicksort algorithm, but it has worst-case time n^2. Pick a random point in your collection as the pivot and move the other elements so that elements smaller than the pivot are on the left and larger or equal elements are on the right. If the pivot ends up in the middle, it is the median; if not, recurse into the chunk that contains the median (the larger chunk). Repeat. Another algorithm that guarantees linear time is median of medians, described in CLRS and, I believe, also on Wikipedia. Look those up.
  • Edge
    Edge about 12 years
    Is that tree type good for this purpose? I haven't heard of it before.
  • Fred Foo
    Fred Foo about 12 years
    Order statistic trees allow indexing, i.e. getting the k'th smallest element of a sequence, in logarithmic time.
  • Edge
    Edge about 12 years
    Will this work with an unsorted list?
  • Fred Foo
    Fred Foo about 12 years
    Yes, you just insert the items into the tree and they will get sorted. An order statistic tree is an augmented BST.
  • Edge
    Edge about 12 years
    I notice the page you linked contains pseudocode, not actual code. I understand basic BSTs, but not much more, and implementing that tree from pseudocode would take a while.
  • Edge
    Edge about 12 years
    Actually, never mind. It looks pretty simple to me. Are there any other references that discuss in more depth how the order statistic tree works?
  • Edge
    Edge about 12 years
    Is there any code for implementation with C++? Thanks for the reply by the way, appreciated. I like the verbose explanation.
  • Edge
    Edge about 12 years
    Also, does your idea only work on sorted lists, or unsorted lists as well?
  • Fred Foo
    Fred Foo about 12 years
  • Chisholm
    Chisholm about 12 years
    @Andrew, have you looked at boost accumulators?
  • Edge
    Edge about 12 years
    I know nothing of boost functions.
  • user448810
    user448810 about 12 years
    Andrew: The input need not be sorted. The heaps do the sorting internally.
  • Edge
    Edge about 12 years
    I'm having a lot of trouble implementing this in C++. I don't really know where to start, the python code on your blog seems very different.
  • Edge
    Edge about 12 years
    There are lots of references to pop and push functions, but no implementations of them.
  • user448810
    user448810 about 12 years
    For the Scheme solution, the top of the second page of the blog entry has three links to different implementations of priority queues, using three different algorithms. All of the Python comments import priority queues from Python's standard library. And C++ provides a priority queue in its STL, which you may decide to use instead of writing your own priority queue.
  • Notinlist
    Notinlist over 8 years
    Adding elements to the system is OK. But how removing old elements would work?
  • orzechow
    orzechow about 7 years
    While this solution is very efficient, beware of the following caveats: 1) the convergence rate depends on the signal amplitude (compare step responses with different offsets and amplitudes), so it does not converge for signals near zero! 2) for a near-constant input signal this estimate introduces jitter with amplitude average * 0.01 at the sample rate 3) it deflects on short impulses (which a true median does not, which is why the median is popular as a salt-and-pepper noise filter)
  • Fabio says Reinstate Monica
    Fabio says Reinstate Monica almost 7 years
    Hello, welcome to SO. Your answer contains only code. It would be better if you could also add some commentary to explain what it does and how. Can you please edit your answer and add it? Thank you!
  • Jader Dias
    Jader Dias over 6 years
    Implementation in Go github.com/JaderDias/movingmedian
  • sircolinton
    sircolinton almost 5 years
    Using the rolling standard deviation instead of the rolling mean to scale the step-increment may be an interesting direction for log-normally distributed data (which accounts for many/most naturally arising processes). Using a measure of variability instead of mean will address the issue of jitter raised by orzechow. Unfortunately, variance is not robust, so jitter may be re-introduced by large transient outliers. It might then be interesting to stack filters, which I would expect to have the effect of increasing the effective window width.