Percentiles of Live Data Capture


Solution 1

I believe there are many good approximate algorithms for this problem. A good first-cut approach is to simply use a fixed-size array (say 1K worth of data). Fix some probability p. For each request, with probability p, write its response time into the array (replacing the oldest time in there). Since the array is a subsampling of the live stream and since subsampling preserves the distribution, doing the statistics on that array will give you an approximation of the statistics of the full, live stream.

This approach has several advantages: it requires no a priori information, and it's easy to code. You can build it quickly and experimentally determine, for your particular server, at what point growing the buffer has only a negligible effect on the answer. That is the point where the approximation is sufficiently precise.
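
For illustration, here is a minimal Java sketch of such a fixed-size sample buffer. As the comments below point out, the unbiased version of this idea is reservoir sampling: each new value is kept with probability bufferSize/samplesSeen, and the slot it replaces is chosen at random rather than being the oldest entry. The sketch follows that corrected variant; the class and method names are my own, not part of the answer.

import java.util.Arrays;
import java.util.Random;

// Minimal sketch of a fixed-size sample buffer (reservoir sampling): each new value
// is kept with probability size/samplesSeen and replaces a randomly chosen slot,
// so the buffer stays an (approximately) unbiased subsample of the whole stream.
class ResponseTimeReservoir {
    private final double[] buffer;
    private final Random random = new Random();
    private long samplesSeen = 0;

    ResponseTimeReservoir(final int size) {
        buffer = new double[size];
    }

    void record(final double responseTimeMs) {
        samplesSeen++;
        if (samplesSeen <= buffer.length) {
            buffer[(int) samplesSeen - 1] = responseTimeMs;          // fill phase
        } else if (random.nextDouble() < (double) buffer.length / samplesSeen) {
            buffer[random.nextInt(buffer.length)] = responseTimeMs;  // random eviction
        }
    }

    // Approximate percentile (0-100), computed over the current subsample only.
    double percentile(final double p) {
        final int n = (int) Math.min(samplesSeen, buffer.length);
        final double[] copy = Arrays.copyOf(buffer, n);
        Arrays.sort(copy);
        final int idx = (int) Math.ceil(p / 100.0 * n) - 1;
        return copy[Math.max(0, Math.min(idx, n - 1))];
    }
}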

If you find that you need too much memory to give you statistics that are precise enough, then you'll have to dig further. Good keywords are: "stream computing", "stream statistics", and of course "percentiles". You can also try ire_and_curses's approach (Solution 2 below).

Solution 2

If you want to keep the memory usage constant as you get more and more data, then you're going to have to resample that data somehow. That implies that you must apply some sort of rebinning scheme. You can wait until you acquire a certain amount of raw inputs before beginning the rebinning, but you cannot avoid it entirely.

So your question is really asking "what's the best way of dynamically binning my data?". There are lots of approaches, but if you want to minimise your assumptions about the range or distribution of values you may receive, then a simple approach is to average over buckets of fixed size k, with logarithmically distributed widths. For example, let's say you want to hold 1000 values in memory at any one time. Pick a size for k, say 100. Pick your minimum resolution, say 1ms. Then (a short code sketch follows the list):

  • The first bucket deals with values between 0-1ms (width=1ms)
  • Second bucket: 1-3ms (w=2ms)
  • Third bucket: 3-7ms (w=4ms)
  • Fourth bucket: 7-15ms (w=8ms)
  • ...
  • Tenth bucket: 511-1023ms (w=512ms)
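
A minimal sketch of how a value could be mapped to one of these log-width buckets. The helper name and the clamping behaviour for out-of-range values are my own assumptions, not part of the answer.

// Hypothetical helper: maps a non-negative response time in ms to a 0-based bucket
// index with logarithmically growing widths (0-1ms, 1-3ms, 3-7ms, 7-15ms, ...),
// matching the list above. Values beyond the last bucket are clamped into it.
static int bucketIndex(final int millis, final int numBuckets) {
    final int idx = 31 - Integer.numberOfLeadingZeros(millis + 1); // floor(log2(millis + 1))
    return Math.min(idx, numBuckets - 1);
}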

This type of log-scaled approach is similar to the chunking schemes used in some hash table, filesystem, and memory allocation algorithms. It works well when your data has a large dynamic range.

As new values come in, you can choose how you want to resample, depending on your requirements. For example, you could track a moving average, use a first-in-first-out scheme, or some other more sophisticated method. See the Kademlia algorithm for one approach (used by BitTorrent).

Ultimately, rebinning must lose some information. Your choices regarding the binning will determine the specifics of what information is lost. Another way of saying this is that a constant-size memory store implies a trade-off between dynamic range and sampling fidelity; how you make that trade-off is up to you, but like any sampling problem, there's no getting around this basic fact.

If you're really interested in the pros and cons, then no answer on this forum can hope to be sufficient. You should look into sampling theory. There's a huge amount of research on this topic available.

For what it's worth, I suspect that your server times will have a relatively small dynamic range, so a more relaxed scaling to allow higher sampling of common values may provide more accurate results.

Edit: To answer your comment, here's an example of a simple binning algorithm (a rough code sketch follows the list).

  • You store 1000 values, in 10 bins. Each bin therefore holds 100 values. Assume each bin is implemented as a dynamic array (a 'list', in Perl or Python terms).
  • When a new value comes in:

    • Determine which bin it should be stored in, based on the bin limits you've chosen.
    • If the bin is not full, append the value to the bin list.
    • If the bin is full, remove the value at the top of the bin list, and append the new value to the bottom of the bin list. This means old values are thrown away over time.
  • To find the 90th percentile, sort bin 10. The 90th percentile is the first value in the sorted list (element 900 of 1000).
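
For concreteness, here is a rough Java sketch of this scheme. The class and method names are my own, the bin limits are assumed to be supplied up front, and, as a later comment points out, reading the 90th percentile off the last bin only works if the bins really do fill evenly.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the 10-bin, 100-values-per-bin scheme described above.
// Each bin keeps at most 'capacity' values; when a bin is full, its oldest value is dropped.
class BinnedSampler {
    private final int[] upperLimits;          // upper bound of each bin in ms, ascending
    private final List<Deque<Integer>> bins;
    private final int capacity;

    BinnedSampler(final int[] upperLimits, final int capacity) {
        this.upperLimits = upperLimits.clone();
        this.capacity = capacity;
        this.bins = new ArrayList<>();
        for (int i = 0; i < upperLimits.length; i++) {
            bins.add(new ArrayDeque<>());
        }
    }

    void add(final int millis) {
        int i = 0;
        while (i < upperLimits.length - 1 && millis >= upperLimits[i]) {
            i++;                              // values beyond the last limit land in the last bin
        }
        final Deque<Integer> bin = bins.get(i);
        if (bin.size() == capacity) {
            bin.removeFirst();                // bin is full: throw away the oldest value
        }
        bin.addLast(millis);
    }

    // 90th percentile as described above: sort the last bin and take its smallest value
    // (roughly element 900 of 1000, assuming every bin is actually full).
    int percentile90() {
        final List<Integer> last = new ArrayList<>(bins.get(bins.size() - 1));
        if (last.isEmpty()) {
            throw new IllegalStateException("not enough data yet");
        }
        Collections.sort(last);
        return last.get(0);
    }
}

With the log-width limits from earlier (1, 3, 7, 15, ..., 1023) and a capacity of 100, this holds at most 1000 values in total.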

If you don't like throwing away old values, then you can implement some alternative scheme to use instead. For example, when a bin becomes full (reaches 100 values, in my example), you could take the average of the oldest 50 elements (i.e. the first 50 in the list), discard those elements, and then append the new average element to the bin, leaving you with a bin of 51 elements that now has space to hold 49 new values. This is a simple example of rebinning.
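
One possible shape of that averaging step, reusing the java.util.Deque-backed bins from the sketch above (again just an illustration, not the answer's own code):

// Hypothetical rebinning step: when a bin reaches capacity, average its oldest half
// into a single value, so the bin keeps a trace of old data while regaining space.
static void rebinByAveraging(final Deque<Integer> bin, final int capacity) {
    if (bin.size() < capacity) {
        return;                               // nothing to do until the bin is full
    }
    final int half = capacity / 2;
    long sum = 0;
    for (int i = 0; i < half; i++) {
        sum += bin.removeFirst();             // discard the oldest 'half' values...
    }
    bin.addLast((int) (sum / half));          // ...and append their average, as described
}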

Another example of rebinning is downsampling; throwing away every 5th value in a sorted list, for example.

I hope this concrete example helps. The key point to take away is that there are lots of ways of achieving a constant memory aging algorithm; only you can decide what is satisfactory given your requirements.

Solution 3

I once published a blog post on this topic. The blog is now defunct, but the article is included in full below.

The basic idea is to give up the requirement for an exact calculation in favor of a statement like "95% of responses take 500ms-600ms or less" (standing in for any exact percentile value that falls between 500ms and 600ms).


As we’ve recently started feeling that response times of one of our webapps got worse, we decided to spend some time tweaking the app’s performance. As a first step, we wanted to get a thorough understanding of current response times. For performance evaluations, using minimum, maximum or average response times is a bad idea: “The ‘average’ is the evil of performance optimization and often as helpful as ‘average patient temperature in the hospital’” (MySQL Performance Blog). Instead, performance tuners should be looking at the percentile: “A percentile is the value of a variable below which a certain percent of observations fall” (Wikipedia). In other words: the 95th percentile is the time in which 95% of requests finished. Therefore, a performance goal related to the percentile could be similar to “The 95th percentile should be lower than 800 ms”. Setting such performance goals is one thing, but efficiently tracking them for a live system is another.

I’ve spent quite some time looking for existing implementations of percentile calculations (e.g. here or here). All of them required storing response times for each and every request and either calculating the percentile on demand or adding new response times in sorted order. This was not what I wanted. I was hoping for a solution that would allow memory- and CPU-efficient live statistics for hundreds of thousands of requests. Storing response times for hundreds of thousands of requests and calculating the percentile on demand sounds neither CPU nor memory efficient.

The solution I was hoping for simply doesn’t seem to exist. On second thought, I came up with another idea: for the type of performance evaluation I was looking for, it’s not necessary to get the exact percentile. An approximate answer like “the 95th percentile is between 850ms and 900ms” would totally suffice. Lowering the requirements this way makes an implementation extremely easy, especially if upper and lower borders for the possible results are known. For example, I’m not interested in response times higher than several seconds – they are extremely bad anyway, regardless of whether they take 10 seconds or 15 seconds.

So here is the idea behind the implementation:

  1. Define an arbitrary number of response time buckets (e.g. 0-100ms, 100-200ms, 200-400ms, 400-800ms, 800-1200ms, …)
  2. Count the total number of responses, and the number of responses that fall into each bucket (for a response time of 360ms, increment the counter for the 200ms – 400ms bucket)
  3. Estimate the n-th percentile by summing the bucket counters, in order, until the sum exceeds n percent of the total

It’s that simple. And here is the code.

Some highlights:

public void increment(final int millis) {
    final int i = index(millis);
    if (i < _limits.length) {
        _counts[i]++;
    }
    _total++;
}
 
public int estimatePercentile(final double percentile) {
    if (percentile < 0.0 || percentile > 100.0) {
        throw new IllegalArgumentException("percentile must be between 0.0 and 100.0, was " + percentile);
    }
 
    for (final Percentile p : this) {
        if (percentile - p.getPercentage() <= 0.0001) {
            return p.getLimit();
        }
    }
    return Integer.MAX_VALUE;
}
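
The highlights above reference fields and helpers that the excerpt leaves out (_limits, _counts, _total, index(), and the Percentile objects produced while iterating). Below is a hedged sketch of one way the surrounding class might look; the names mirror the excerpt, but the bodies of index(), the iterator, and the Percentile class are my own assumptions, not the original blog code. The two highlighted methods are repeated so the class is self-contained.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical completion of the excerpt above: one counter per bucket, plus the
// total number of responses; percentiles are estimated from cumulative bucket counts.
public class PercentileCounter implements Iterable<PercentileCounter.Percentile> {

    private final int[] _limits;  // upper bucket limits in ms, ascending (e.g. 50, 100, 150, ...)
    private final int[] _counts;  // number of responses that fell into each bucket
    private int _total;           // total number of responses seen, including out-of-range ones

    public PercentileCounter(final int... limits) {
        _limits = limits.clone();
        _counts = new int[limits.length];
    }

    // Maps a response time to the first bucket whose upper limit covers it; values
    // above the last limit return _limits.length and are only counted in _total.
    private int index(final int millis) {
        int i = 0;
        while (i < _limits.length && millis > _limits[i]) {
            i++;
        }
        return i;
    }

    public void increment(final int millis) {
        final int i = index(millis);
        if (i < _limits.length) {
            _counts[i]++;
        }
        _total++;
    }

    public int estimatePercentile(final double percentile) {
        if (percentile < 0.0 || percentile > 100.0) {
            throw new IllegalArgumentException(
                    "percentile must be between 0.0 and 100.0, was " + percentile);
        }
        // Walk the buckets in order and return the first upper limit whose
        // cumulative percentage reaches the requested percentile.
        for (final Percentile p : this) {
            if (percentile - p.getPercentage() <= 0.0001) {
                return p.getLimit();
            }
        }
        return Integer.MAX_VALUE; // requested percentile lies beyond the last bucket
    }

    @Override
    public Iterator<Percentile> iterator() {
        final List<Percentile> snapshot = new ArrayList<>();
        long cumulative = 0;
        for (int i = 0; i < _limits.length; i++) {
            cumulative += _counts[i];
            final double pct = _total == 0 ? 0.0 : 100.0 * cumulative / _total;
            snapshot.add(new Percentile(_limits[i], pct));
        }
        return snapshot.iterator();
    }

    public static final class Percentile {
        private final int limit;
        private final double percentage;

        Percentile(final int limit, final double percentage) {
            this.limit = limit;
            this.percentage = percentage;
        }

        public int getLimit() { return limit; }               // upper limit of this bucket in ms
        public double getPercentage() { return percentage; }  // cumulative % of responses <= limit
    }
}

For example, new PercentileCounter(100, 200, 400, 800, 1200) would reproduce the buckets from the list above.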

This approach only requires two int values (= 8 bytes) per bucket, allowing 128 buckets to be tracked with 1K of memory. That is more than sufficient for analysing the response times of a web application at a granularity of 50ms. Additionally, for the sake of performance, I’ve intentionally implemented this without any synchronization (e.g. AtomicIntegers), knowing that some increments might get lost.

By the way, using Google Charts and 60 percentile counters, I was able to create a nice graph out of one hour of collected response times:

percentiles graph

Solution 4

(It's been quite some time since this question was asked, but I'd like to point out a few related research papers)

There has been a significant amount of research on approximate percentiles of data streams in the past few years. A few interesting papers with full algorithm definitions:

All of these papers propose algorithms with sub-linear space complexity for the computation of approximate percentiles over a data stream.

Solution 5

Try the simple algorithm defined in the paper “Sequential Procedure for Simultaneous Estimation of Several Percentiles” (Raatikainen). It’s fast, requires 2*m+3 markers (for m percentiles), and converges to an accurate approximation quickly.


Comments

  • Jason Kresowaty
    Jason Kresowaty over 2 years

    I am looking for an algorithm that determines percentiles for live data capture.

    For example, consider the development of a server application.

    The server might have response times as follows: 17 ms, 33 ms, 52 ms, 60 ms, 55 ms, etc.

    It is useful to report the 90th percentile response time, 80th percentile response time, etc.

    The naive algorithm is to insert each response time into a list. When statistics are requested, sort the list and get the values at the proper positions.

    Memory usage scales linearly with the number of requests.

    Is there an algorithm that yields "approximate" percentile statistics given limited memory usage? For example, let's say I want to solve this problem in a way that I process millions of requests but only want to use say one kilobyte of memory for percentile tracking (discarding the tracking for old requests is not an option since the percentiles are supposed to be for all requests).

    Also require that there is no a priori knowledge of the distribution. For example, I do not want to specify any ranges of buckets ahead of time.

  • Jason Kresowaty
    Jason Kresowaty over 14 years
    Thank you for your good insights, but I can't glean enough from this to actually do an implementation. The links you gave don't mention percentiles or "rebinning". You wouldn't happen to know of any references which are dedicated to the topic at hand?
  • Jason Kresowaty
    Jason Kresowaty over 14 years
    <If you really are doing statistics on a server application> I am interested in collecting more kinds of statistics, not just response times. It is not always easy to determine proper bounds. So, I'm looking for a general-purpose solution. Thanks.
  • ire_and_curses
    ire_and_curses over 14 years
    @binarycoder: I've added an example to my answer to try and make what I'm saying a little more concrete. Hope it helps.
  • Jason Kresowaty
    Jason Kresowaty over 14 years
    I dunno. This replacement algorithm would seem to clearly introduce bias against old data. This is why I'd really appreciate a proper mathematical argument as to the robustness of any solution.
  • redtuna
    redtuna over 14 years
    If the live data is taken from some distribution D, then a subsampling -any subsampling- will also derive from D. If the live data instead is not taken from some distribution, then a list of percentiles might not be the most enlightening thing to look for.
  • Jason Kresowaty
    Jason Kresowaty over 14 years
    Keywords are helpful. Searching for "quantile" and "stream" brings up all kinds of research on this subject! All of the techniques seem a lot more involved than any of the algorithms suggested here, which is why I'm hesitant to mark anything as "the answer".
  • Jason Kresowaty
    Jason Kresowaty over 14 years
    I am accepting this as the "best" answer. But to do an unbiased "reservoir sampling" p must be reservoirSize/totalSamplesSoFar. Also, the element to evict must be chosen at random (not the oldest).
  • Jason Kresowaty
    Jason Kresowaty over 14 years
    Although some applications will need a more sophisticated bucketing algorithm, that sure is a really cool way to display percentile data!
  • sfussenegger
    sfussenegger over 14 years
    I've just changed the colors of the chart (was j.mp/kj6sW) and the result is even cooler. Now it's quite easy to get approximate percentiles for the last 60 minutes of the application's responses. Some applications might need exact data, but for most web applications (and similar servers) it should be perfectly sufficient.
  • Julien Genestoux
    Julien Genestoux almost 12 years
    That's a nice pragmatic approach! LIKE
  • Nicolas Mommaerts
    Nicolas Mommaerts over 11 years
    Awesome! Was looking for something for a Java algorithm like this, thanks!
  • LordOfThePigs
    LordOfThePigs over 10 years
    It seems to me your example wouldn't really work well. It assumes that you've sized your buckets perfectly and that you get 100 values per bucket. This is a pretty strong assumption. Your buckets are not very likely to be sized to receive exactly the same number of values, and therefore the smallest value of your 10th bucket is probably not your 90th percentile.
  • redtuna
    redtuna over 9 years
    Thank you @JasonKresowaty, you are absolutely right. I described a biased sampling, and your change makes it unbiased. One might argue that biasing in favor of new data is a good thing as you want recent statistics, but now we have both approaches here so people can choose what is best for them.