In which order should floats be added to get the most precise result?

Solution 1

Your instinct is basically right: sorting in ascending order (of magnitude) usually improves things somewhat. Consider the case where we're adding single-precision (32-bit) floats, and there are 1 billion values equal to 1 / (1 billion), and one value equal to 1. If the 1 comes first, then the sum will come to 1, since 1 + (1 / 1 billion) is 1 due to loss of precision. Each addition has no effect at all on the total.
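
To make that concrete, here is a minimal demonstration (my own illustration, not code from the answer) that a single such addition is simply absorbed in single precision:

#include <cassert>

int main()
{
    float big = 1.0f;
    float tiny = 1e-9f;        // about 2^-30, far below the 2^-23 spacing of floats around 1.0
    float sum = big + tiny;    // the tiny value is rounded away entirely
    assert(sum == big);
    return 0;
}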

If the small values come first, they will at least sum to something, although even then I have 2^30 of them, whereas after 2^25 or so I'm back in the situation where each one individually isn't affecting the total any more. So I'm still going to need more tricks.

That's an extreme case, but in general adding two values of similar magnitude is more accurate than adding two values of very different magnitudes, since you "discard" fewer bits of precision in the smaller value that way. By sorting the numbers, you group values of similar magnitude together, and by adding them in ascending order you give the small values a "chance" of cumulatively reaching the magnitude of the bigger numbers.
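
In code terms, and assuming the std::vector<float> v from the question, the "sort by ascending magnitude, then accumulate" idea might look like this (a sketch, not a complete program):

#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

float sorted_sum(std::vector<float> v)    // take a copy so we can reorder it
{
    std::sort(v.begin(), v.end(),
              [](float a, float b) { return std::fabs(a) < std::fabs(b); });
    return std::accumulate(v.begin(), v.end(), 0.0f);
}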

Still, if negative numbers are involved it's easy to "outwit" this approach. Consider three values to sum, {1, -1, 1 billionth}. The arithmetically correct sum is 1 billionth, but if my first addition involves the tiny value then my final sum will be 0. Of the 6 possible orders, only 2 are "correct" - {1, -1, 1 billionth} and {-1, 1, 1 billionth}. All 6 orders give results that are accurate at the scale of the largest-magnitude value in the input (0.0000001% out), but for 4 of them the result is inaccurate at the scale of the true solution (100% out). The particular problem you're solving will tell you whether the former is good enough or not.
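
A three-line illustration of that case (again my own sketch, not from the answer):

float one = 1.0f, minus_one = -1.0f, tiny = 1e-9f;
float bad  = (tiny + one) + minus_one;   // tiny is absorbed into 1.0f, so the result is 0.0f
float good = (one + minus_one) + tiny;   // the large values cancel first, so the result is 1e-9f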

In fact, you can play a lot more tricks than just adding them in sorted order. If you have lots of very small values, a middling number of medium values, and a small number of large values, then it might be most accurate to first add up all the small ones, then separately total the medium ones, add those two totals together, and then add the large ones. It's not at all trivial to find the most accurate combination of floating-point additions, but to cope with really bad cases you can keep a whole array of running totals at different magnitudes, add each new value to the total that best matches its magnitude, and when a running total starts to get too big for its magnitude, add it into the next total up and start a new one. Taken to its logical extreme, this process is equivalent to performing the sum in an arbitrary-precision type (so you'd just do that). But given the simplistic choice of adding in ascending or descending order of magnitude, ascending is the better bet.
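
As a rough sketch of that "array of running totals at different magnitudes" idea (my own illustration, and it omits the carry-into-the-next-bin step described above): each value is added to a bin chosen by its binary exponent, so values of similar magnitude accumulate together, and the bins are then combined in ascending order of magnitude.

#include <cmath>
#include <vector>

float binned_sum(const std::vector<float>& values)
{
    std::vector<float> bins(300, 0.0f);   // one bin per possible float exponent
    for (float x : values)
    {
        int exp = 0;
        std::frexp(x, &exp);              // x == mantissa * 2^exp, with |mantissa| in [0.5, 1)
        bins[exp + 150] += x;             // accumulate alongside values of similar magnitude
    }
    float total = 0.0f;
    for (float b : bins)                  // combine from smallest magnitude to largest
        total += b;
    return total;
}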

It does have some relation to real-world programming, since there are some cases where your calculation can go very badly wrong if you accidentally chop off a "heavy" tail consisting of a large number of values, each of which is too small to individually affect the sum, or if you throw away too much precision from a lot of small values that individually only affect the last few bits of the sum. In cases where the tail is negligible anyway, you probably don't care: for example, if you're only adding together a small number of values in the first place and you're only using a few significant figures of the sum.

Solution 2

There is also an algorithm designed for this kind of accumulation operation, called Kahan Summation, that you should probably be aware of.

According to Wikipedia,

The Kahan summation algorithm (also known as compensated summation) significantly reduces the numerical error in the total obtained by adding a sequence of finite precision floating point numbers, compared to the obvious approach. This is done by keeping a separate running compensation (a variable to accumulate small errors).

In pseudocode, the algorithm is:

function kahanSum(input)
 var sum = input[1]
 var c = 0.0          //A running compensation for lost low-order bits.
 for i = 2 to input.length
  y = input[i] - c    //So far, so good: c is zero.
  t = sum + y         //Alas, sum is big, y small, so low-order digits of y are lost.
  c = (t - sum) - y   //(t - sum) recovers the high-order part of y; subtracting y recovers -(low part of y)
  sum = t             //Algebraically, c should always be zero. Beware eagerly optimising compilers!
 next i               //Next time around, the lost low part will be added to y in a fresh attempt.
return sum

Solution 3

I tried out the extreme example in the answer supplied by Steve Jessop.

#include <iostream>
#include <iomanip>
#include <cmath>

int main()
{
    long billion = 1000000000;
    double big = 1.0;
    double small = 1e-9;
    double expected = 2.0;

    double sum = big;                   // add the large value first...
    for (long i = 0; i < billion; ++i)
        sum += small;                   // ...then a billion tiny ones
    std::cout << std::scientific << std::setprecision(1) << big << " + " << billion << " * " << small << " = " <<
        std::fixed << std::setprecision(15) << sum <<
        "    (difference = " << std::fabs(expected - sum) << ")" << std::endl;

    sum = 0;
    for (long i = 0; i < billion; ++i)
        sum += small;                   // add the tiny values first...
    sum += big;                         // ...then the large one
    std::cout  << std::scientific << std::setprecision(1) << billion << " * " << small << " + " << big << " = " <<
        std::fixed << std::setprecision(15) << sum <<
        "    (difference = " << std::fabs(expected - sum) << ")" << std::endl;

    return 0;
}

I got the following result:

1.0e+00 + 1000000000 * 1.0e-09 = 2.000000082740371    (difference = 0.000000082740371)
1000000000 * 1.0e-09 + 1.0e+00 = 1.999999992539933    (difference = 0.000000007460067)

The error in the first line is more than ten times bigger than in the second.

If I change the doubles to floats in the code above, I get:

1.0e+00 + 1000000000 * 1.0e-09 = 1.000000000000000    (difference = 1.000000000000000)
1000000000 * 1.0e-09 + 1.0e+00 = 1.031250000000000    (difference = 0.968750000000000)

Neither answer is even close to 2.0 (but the second is slightly closer).

Using the Kahan summation (with doubles) as described by Daniel Pryden:

#include <iostream>
#include <iomanip>
#include <cmath>

int main()
{
    long billion = 1000000000;
    double big = 1.0;
    double small = 1e-9;
    double expected = 2.0;

    double sum = big;
    double c = 0.0;                     // running compensation for lost low-order bits
    for (long i = 0; i < billion; ++i) {
        double y = small - c;           // apply the compensation to the next summand
        double t = sum + y;             // low-order digits of y are lost here...
        c = (t - sum) - y;              // ...and recovered here as the new compensation
        sum = t;
    }

    std::cout << "Kahan sum  = " << std::fixed << std::setprecision(15) << sum <<
        "    (difference = " << std::fabs(expected - sum) << ")" << std::endl;

    return 0;
}

I get exactly 2.0:

Kahan sum  = 2.000000000000000    (difference = 0.000000000000000)

And even if I change the doubles to floats in the code above, I get:

Kahan sum  = 2.000000000000000    (difference = 0.000000000000000)

It would seem that Kahan is the way to go!

Solution 4

There is a class of algorithms that solve this exact problem, without the need to sort or otherwise re-order the data.

In other words, the summation can be done in one pass over the data. This also makes such algorithms applicable in situations where the dataset is not known in advance, e.g. if the data arrives in real time and the running sum needs to be maintained.

Here is the abstract of a recent paper:

We present a novel, online algorithm for exact summation of a stream of floating-point numbers. By “online” we mean that the algorithm needs to see only one input at a time, and can take an arbitrary length input stream of such inputs while requiring only constant memory. By “exact” we mean that the sum of the internal array of our algorithm is exactly equal to the sum of all the inputs, and the returned result is the correctly-rounded sum. The proof of correctness is valid for all inputs (including nonnormalized numbers but modulo intermediate overflow), and is independent of the number of summands or the condition number of the sum. The algorithm asymptotically needs only 5 FLOPs per summand, and due to instruction-level parallelism runs only about 2--3 times slower than the obvious, fast-but-dumb “ordinary recursive summation” loop when the number of summands is greater than 10,000. Thus, to our knowledge, it is the fastest, most accurate, and most memory efficient among known algorithms. Indeed, it is difficult to see how a faster algorithm or one requiring significantly fewer FLOPs could exist without hardware improvements. An application for a large number of summands is provided.

Source: Algorithm 908: Online Exact Summation of Floating-Point Streams.
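
For a flavour of what one-pass compensated summation looks like in code, here is a sketch of Neumaier's variant of Kahan summation (note: this is my own illustration, not Algorithm 908 from the paper; unlike plain Kahan it also copes with summands larger than the running sum):

#include <cmath>
#include <vector>

double neumaier_sum(const std::vector<double>& values)
{
    double sum = 0.0;
    double c = 0.0;                       // accumulated rounding error
    for (double x : values)
    {
        double t = sum + x;
        if (std::fabs(sum) >= std::fabs(x))
            c += (sum - t) + x;           // low-order bits of x were lost
        else
            c += (x - t) + sum;           // low-order bits of sum were lost
        sum = t;
    }
    return sum + c;                       // apply the correction once at the end
}

As with Kahan summation, aggressive compiler optimisations (e.g. -ffast-math) can eliminate the compensation terms, so they should be avoided here.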

Solution 5

I think you can do better than sorting the numbers before you accumulate them, because during the process of accumulation the accumulator gets bigger and bigger. If you have a large number of similar values, you will start to lose precision quickly. Here is what I would suggest instead:

while the list has multiple elements
    remove the two smallest elements from the list
    add them and put the result back in
the single element in the list is the result

Of course this algorithm will be most efficient with a priority queue instead of a list. C++ code:

template <typename Queue>
void reduce(Queue& queue)
{
    typedef typename Queue::value_type vt;
    while (queue.size() > 1)
    {
        vt x = queue.top();
        queue.pop();
        vt y = queue.top();
        queue.pop();
        queue.push(x + y);
    }
}

driver:

#include <iterator>
#include <queue>

template <typename Iterator>
typename std::iterator_traits<Iterator>::value_type
reduce(Iterator begin, Iterator end)
{
    typedef typename std::iterator_traits<Iterator>::value_type vt;
    std::priority_queue<vt> positive_queue;
    positive_queue.push(0);
    std::priority_queue<vt> negative_queue;
    negative_queue.push(0);
    for (; begin != end; ++begin)
    {
        vt x = *begin;
        if (x < 0)
        {
            negative_queue.push(x);     // negatives go in as-is (all entries <= 0)
        }
        else
        {
            positive_queue.push(-x);    // positives go in negated (all entries <= 0)
        }
    }
    reduce(positive_queue);
    reduce(negative_queue);
    // negative_queue.top() now holds the sum of the negatives;
    // positive_queue.top() holds minus the sum of the positives
    return negative_queue.top() - positive_queue.top();
}

The numbers in both queues are stored as non-positive values because top yields the largest element, and with everything stored negated (or already negative) that is the element of smallest magnitude, which is what we want. I could have provided more template arguments to the queue to customise its ordering, but this approach seems simpler.
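
A hypothetical driver (my own example, assuming the two reduce overloads above are in the same translation unit; the values are arbitrary):

#include <iostream>
#include <vector>

int main()
{
    std::vector<double> v = { 1.0, -1.0, 1e-9, 0.5, -0.25 };
    std::cout.precision(17);
    std::cout << reduce(v.begin(), v.end()) << std::endl;   // mathematically 0.250000001
    return 0;
}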

Author: Yippie-Ki-Yay (updated June 30, 2020)

Comments

  • Yippie-Ki-Yay
    Yippie-Ki-Yay almost 4 years

    This was a question I was asked at my recent interview and I want to know the answer (I don't actually remember the theory of numerical analysis, so please help me :)

    If we have some function, which accumulates floating-point numbers:

    std::accumulate(v.begin(), v.end(), 0.0);
    

    v is a std::vector<float>, for example.

    • Would it be better to sort these numbers before accumulating them?

    • Which order would give the most precise answer?

    I suspect that sorting the numbers in ascending order would actually make the numerical error less, but unfortunately I can't prove it myself.

    P.S. I do realize this probably has nothing to do with real world programming, just being curious.

    • Zéychin
      Zéychin almost 13 years
      This actually has everything to do with real-world programming. However, many applications don't really CARE about the absolute best accuracy of the calculation as long as it's 'pretty close'. Engineering applications? Extremely important. Medical applications? Extremely important. Large-scale statistics? Somewhat less accuracy is acceptable.
    • Konrad Rudolph
      Konrad Rudolph almost 13 years
      I think this qualifies as programming related so yes, it’s welcome here.
    • Martin York
      Martin York almost 13 years
      Please don't answer unless you actually know and can point to a page that explains your reasoning in detail. There is already so much crap about floating point numbers flying around, we don't want to add to it. If you think you know, STOP, because if you only think you know then you are probably wrong.
    • BЈовић
      BЈовић almost 13 years
      @Zéychin "Engineering applications? Extremely important. Medical applications? Extremely important."??? I think you would be surprised if you knew the truth :)
    • Zéychin
      Zéychin almost 13 years
      @VJo Alright, to be fair, not all Engineering and Medical applications need as much accuracy as possible. Many do. If a coordinate of a computer-assisted medical tool is off by a few hundredths of a radian, for example, there could be some severe problems.
    • BЈовић
      BЈовић almost 13 years
      @Zeychin Absolute error is irrelevant. What is important is relative error. If a few hundredths of a radian is 0.001%, then who cares?
    • Tony Delroy
      Tony Delroy almost 13 years
      @Martin: guess we better ask Steve and Andrew to delete their answers then... no supporting pages cited ;-P
    • rds
      rds almost 13 years
      This question can actually be extended to any language; this is more of a CPU issue than of a language issue.
    • Steve Jessop
      Steve Jessop almost 13 years
      @Tony: appeal to an authority is one way to do it, "things every programmer needs to know about floating point" or what-have-you. I'm not trying to solve the general problem, "how best to add up floating point numbers", this question is quite specifically looking at the effects of sorting. I'm not sure that an authoritative reference would concern itself with that, since sorting is inadequate to solve the general problem, so it's a bit of a curiosity once you have a killer case to "disprove" it. But if anyone has any good links it certainly wouldn't do any harm to produce them :-)
    • rjmunro
      rjmunro almost 13 years
      @Zéychin no application needs "as much accuracy as possible" because it is possible to use arbitrary precision algorithms running on supercomputer clusters to calculate results accurate to thousands of bits. Every application needs a certain level of accuracy and no more. There is no point trying to work out doses of drugs to more accuracy than can be manufactured, for example.
    • Zéychin
      Zéychin almost 13 years
      Perhaps I should have been more verbose: as much accuracy as possible given storage constraints/application requirements. If one has a 64-bit floating point number and the order of operations causes the answers to be significantly flawed, then a rearranged algorithm should be considered, or what is the point of using the extended precision if it's being wasted?
    • Mohammad Alaggan
      Mohammad Alaggan about 8 years
      I really recommend this reading: "what every computer scientist needs to know about floating point" perso.ens-lyon.fr/jean-michel.muller/goldberg.pdf
  • Konrad Rudolph
    Konrad Rudolph almost 13 years
    +1 for explanation. This is somewhat counter-intuitive since addition is usually numerically stable (unlike subtraction and division).
  • MSN
    MSN almost 13 years
    @Konrad, it may be numerically stable, but it isn't precise given different magnitudes of operands :)
  • 6502
    6502 almost 13 years
    What if the array contains also -1?
  • Steve Jessop
    Steve Jessop almost 13 years
    @6502: they're sorted in order of magnitude, so the -1 comes at the end. If the true value of the total is of magnitude 1, then that's fine. If you're adding together three values: 1 / billion, 1 and -1, then you'd get 0, at which point you have to answer the interesting practical question - do you need an answer that's accurate at the scale of the true sum, or do you only need an answer that's accurate at the scale of the largest values? For some practical applications, the latter is good enough, but when it isn't you need a more sophisticated approach. Quantum physics uses renormalization.
  • Neil
    Neil almost 13 years
    If you're going to stick with this simple scheme, I would always add the two numbers with the lowest magnitude and reinsert the sum in the set. (Well, probably a merge sort would work best here. You could use the portion of the array containing the previously summed numbers as a working area for the partial sums.)
  • Rudy Velthuis
    Rudy Velthuis almost 13 years
    Very informative post. I thought I knew about floating point, but this never occurred to me, although it makes perfect sense. Must add this to my article on floats. <g>
  • Chris A.
    Chris A. almost 13 years
    +1 lovely addition to this thread. Any compiler that "eagerly optimizes" those statements should be banned.
  • Steve Jessop
    Steve Jessop almost 13 years
    @Rudy: note though that I got my example wrong on the first attempt, despite having the luxury of choosing the numbers myself...
  • littlebroccoli
    littlebroccoli almost 13 years
    Nice answer. Would you mind explaining the relationship between the magnitude of the number and the rounding error? I seem to recall that there are many more floating point numbers crammed in between 0 and 1, but there are fewer numbers as you go higher towards +MAX_FLOAT, and therefore the rounding error would be larger for those larger numbers.
  • Andrew Stein
    Andrew Stein almost 13 years
    I tried out the example above. My conclusions are too long for a comment, so I have added an answer...
  • Steve Jessop
    Steve Jessop almost 13 years
    @Kevin Panko: The simple version is that a single-precision float has 24 binary digits, the greatest of which is the largest set bit in the number. So if you add together two numbers that differ in magnitude by more than 2^24, you suffer total loss of the smaller value, and if they differ in magnitude by a smaller degree then you lose a corresponding number of bits of accuracy of the smaller number.
  • Rudy Velthuis
    Rudy Velthuis almost 13 years
    @Steve: I didn't really look at the code. I found the idea of adding many small numbers to slowly get into the realm of the larger ones a good advice. Of course that only works if there are many small ones, but still, solid general advice.
  • Steve Jessop
    Steve Jessop almost 13 years
    My "big" value is equal to 1, not 1e9. Your second answer, added in increasing size order, is mathematically correct (1 billion, plus a billion billionths, is 1 billion and 1), although more by luck any any general soundness of the method :-) Note that double doesn't suffer bad loss of precision in adding together a billion billionths, since it has 52 significant bits, whereas IEEE float only has 24 and would.
  • Rudy Velthuis
    Rudy Velthuis almost 13 years
    Just thought of something: instead of ordering in order of magnitude, simply sort the numbers. Then split the numbers at 0: add the positive numbers from 0 on, add the negative numbers also from 0 on (i.e. backward), and only then add the two results you obtained. This avoids subtracting two almost equal numbers (or adding a positive and a negative value of almost the same magnitude) and obtaining a very small value as an intermediate result. IOW, sorting the numbers in order of magnitude makes it rather likely that two numbers of equal magnitude but opposite sign get added together.
  • Neil
    Neil almost 13 years
    @Steve With just positive numbers, it's really easy, since you'll only need the original sorted list and a second queue (which you may be allowed to carve out of the original list), but I'm sure it can be done with numbers of mixed signs too.
  • Neil
    Neil almost 13 years
    @Rudy actually adding numbers of mixed signs is harder than I thought, just imagine your numbers were MAX_ODD_FLOAT + 1, 1.25, and -0.875, then you can get three different answers depending on the order in which you add them...
  • Steve Jessop
    Steve Jessop almost 13 years
    @Rudy: that falls somewhere on the sliding scale from "rubbish" to "use a bigfloat library and set the precision to 2^11 + 52". My reduction of 6502's case, {1 billionth, 1, -1} still kills it, since the positive sum is 1, the negative sum is -1, hence the total is 0. It's all a trade-off of working space and time vs accuracy, and what kinds of nasty inputs you have to deal with.
  • Rudy Velthuis
    Rudy Velthuis almost 13 years
    @Steve: sure, but the tiny number is killed anyway, when there are nearly as many big numbers as there are tiny ones. My gut feeling tells me that you should only introduce sign at the last moment. I'd have to think about it a little longer to produce proof, though. Hey, I'm a dentist, not a mathematician. <g>
  • Andrew Stein
    Andrew Stein almost 13 years
    @Steve, my error, apologies. I have updated the example code to what you intended.
  • Steve Jessop
    Steve Jessop almost 13 years
    @Rudy: I don't know if you can fairly say that it's "killed anyway": if by some combination of cleverness and luck we add the values in order {1, -1, 1 billionth} then we're fine. For that case, of the 6 possible orders 2 are good and 4 are bad. So the question becomes, which orders are we "allowed" to consider? The question points to ascending or descending, which I've brazenly changed to ascending/descending magnitude, you've suggested a scheme for two partial sums, Daniel Pryden another scheme with two running values -- it's all heuristics until we just throw super-precision at it.
  • Rudy Velthuis
    Rudy Velthuis almost 13 years
    @Steve: I guess you're right. Fact is that sorting does make a difference and your explanation made that very clear.
  • MSalters
    MSalters almost 13 years
    It's a simple method to almost double the precision, by using two summation variables sum and c of differing magnitude. It can be trivially extended to N variables.
  • Steve Jessop
    Steve Jessop almost 13 years
    @Jerry: yep, that's what Neil said, although he didn't specify a priority queue. It doesn't maximize accuracy, it's killed by {1 billionth, 1, -1}, a case constructed specifically to have the property that if your first addition involves the tiny value, you lose.
  • Konrad Rudolph
    Konrad Rudolph almost 13 years
    @ChrisA. well you can explicitly control this on all compilers that count (e.g. via -ffast-math on GCC).
  • Steve Jessop
    Steve Jessop almost 13 years
    Kahan still has limited precision, but to construct a killer case you need both the main sum and the error accumulator c to contain values much bigger than the next summand. This means the summand is much, much smaller than the main sum, so there are going to have to be an awful lot of them to add up to much. Especially with double arithmetic.
  • Jerry Coffin
    Jerry Coffin almost 13 years
    @Steve: sorry, I didn't look back quite far enough, and missed that.
  • Chris A.
    Chris A. almost 13 years
    @Konrad Rudolph thanks for pointing out that this is a possible optimization with -ffast-math. What I learned from this discussion and this link is that if you care about numerical accuracy you should probably avoid using -ffast-math, but that in many applications where you may be CPU-bound but don't care about precise numerical computations (game programming, for instance), -ffast-math is reasonable to use. Thus, I'd like to amend my strongly worded "banned" comment.
  • NPE
    NPE almost 13 years
    @Inverse: There are still brick-and-mortar libraries around. Alternatively, buying the PDF online costs $5-$15 (depending on whether you're an ACM member). Lastly, DeepDyve seem to be offering to lend the paper for 24 hours for $2.99 (if you're new to DeepDyve, you might even be able to get it for free as part of their free trial): deepdyve.com/lp/acm/…
  • Pascal Cuoq
    Pascal Cuoq over 9 years
    “That will give you more additional precision than any other technique can” Do you realize that your answer comes more than one year after an earlier late answer that described how to use exact summation?
  • Jeff Hammond
    Jeff Hammond about 9 years
    The "long double" type is horrible and you shouldn't use it.
  • G. Cohen
    G. Cohen about 5 years
    Using double precision variables for sum, c, t, y will help. You also need to add sum -= c just before return sum.
  • ZachB
    ZachB over 4 years
    This sounds somewhat like the method of Malcolm 1971 or, more so, its variant that uses the exponent by Demmel and Hida ("Algorithm 3"). There's another algorithm out there that does a carry-based loop like yours, but I can't find it at the moment.
  • rcgldr
    rcgldr over 4 years
    @ZachB - the concept is similar to bottom up merge sort for linked list, which also uses a small array, where array[i] points to list with 2^i nodes. I don't know how far back this goes. In my case, it was self discovery back in the 1970's.