How dangerous is it to compare floating point values?


Solution 1

First of all, floating point values are not "random" in their behavior. Exact comparison can and does make sense in plenty of real-world usages. But if you're going to use floating point you need to be aware of how it works. Erring on the side of assuming floating point works like real numbers will get you code that quickly breaks. Erring on the side of assuming floating point results have large random fuzz associated with them (like most of the answers here suggest) will get you code that appears to work at first but ends up having large-magnitude errors and broken corner cases.

Before anything else, if you want to program with floating point, you should read this:

What Every Computer Scientist Should Know About Floating-Point Arithmetic

Yes, read all of it. If that's too much of a burden, you should use integers/fixed point for your calculations until you have time to read it. :-)

Now, with that said, the biggest issues with exact floating point comparisons come down to:

  1. The fact that lots of values you may write in the source, or read in with scanf or strtod, do not exist as floating point values and get silently converted to the nearest approximation. This is what demon9733's answer was talking about.

  2. The fact that many results get rounded due to not having enough precision to represent the actual result. An easy example where you can see this is adding x = 0x1fffffe and y = 1 as floats. Here, x has 24 bits of precision in the mantissa (ok) and y has just 1 bit, but when you add them, their bits are not in overlapping places, and the result would need 25 bits of precision. Instead, it gets rounded (to 0x2000000 in the default rounding mode); see the demonstration after this list.

  3. The fact that many results get rounded due to needing infinitely many places for the correct value. This includes both rational results like 1/3 (which you're familiar with from decimal where it takes infinitely many places) but also 1/10 (which also takes infinitely many places in binary, since 5 is not a power of 2), as well as irrational results like the square root of anything that's not a perfect square.

  4. Double rounding. On some systems (particularly x86), floating point expressions are evaluated in higher precision than their nominal types. This means that when one of the above types of rounding happens, you'll get two rounding steps, first a rounding of the result to the higher-precision type, then a rounding to the final type. As an example, consider what happens in decimal if you round 1.49 to an integer (1), versus what happens if you first round it to one decimal place (1.5) then round that result to an integer (2). This is actually one of the nastiest areas to deal with in floating point, since the behaviour of the compiler (especially for buggy, non-conforming compilers like GCC) is unpredictable.

  5. Transcendental functions (trig, exp, log, etc.) are not specified to have correctly rounded results; the result is just specified to be correct within one unit in the last place of precision (usually referred to as 1ulp).
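To make points 2 and 3 concrete, here is a minimal, self-contained C demonstration of my own (assuming IEEE 754 floats and doubles and the default round-to-nearest mode; it is a sketch, not part of the original discussion):

#include <stdio.h>

int main (void)
{
  /* Point 2: 0x1fffffe (33554430) fits exactly in a float's 24-bit
     mantissa, but the exact sum 33554431 needs 25 bits, so the
     addition is rounded to 0x2000000 (33554432). */
  float x = 0x1fffffe;
  float y = 1.0f;
  printf ("0x1fffffe + 1 as float = %.1f (exact sum is 33554431)\n",
          (double)(x + y));

  /* Point 3: 0.1 needs infinitely many binary places, so the literal
     is already an approximation before any arithmetic happens. */
  printf ("0.1 stored as a double = %.20f\n", 0.1);
  return 0;
}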

When you're writing floating point code, you need to keep in mind what you're doing with the numbers that could cause the results to be inexact, and make comparisons accordingly. Often it will make sense to compare with an "epsilon", but that epsilon should be based on the magnitude of the numbers you are comparing, not an absolute constant. (In cases where an absolute constant epsilon would work, that's strongly indicative that fixed point, not floating point, is the right tool for the job!)

Edit: In particular, a magnitude-relative epsilon check should look something like:

if (fabs(x-y) < K * FLT_EPSILON * fabs(x+y))

Where FLT_EPSILON is the constant from float.h (replace it with DBL_EPSILON for doubles or LDBL_EPSILON for long doubles) and K is a constant you choose such that the accumulated error of your computations is definitely bounded by K units in the last place (and if you're not sure you got the error bound calculation right, make K a few times bigger than what your calculations say it should be).

Finally, note that if you use this, some special care may be needed near zero, since FLT_EPSILON does not make sense for denormals. A quick fix would be to make it:

if (fabs(x-y) < K * FLT_EPSILON * fabs(x+y) || fabs(x-y) < FLT_MIN)

and likewise substitute DBL_MIN if using doubles.
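Putting the relative check and the near-zero fix together, a complete helper might look like the sketch below (the function name relative_eq and the placeholder value of K are mine; K still has to come from your own error-bound analysis):

#include <math.h>
#include <float.h>

/* Sketch of the magnitude-relative comparison described above, for
   floats. K = 4 is only a placeholder; derive it from the accumulated
   error bound of the computation that produced x and y. */
#define K 4

int relative_eq (float x, float y)
{
  return fabsf (x - y) < K * FLT_EPSILON * fabsf (x + y)
      || fabsf (x - y) < FLT_MIN;  /* special care near zero/denormals */
}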

Solution 2

Since 0 is exactly representable as an IEEE 754 floating-point number (or using any other implementation of f-p numbers I've ever worked with), comparison with 0 is probably safe. You might get bitten, however, if your program computes a value (such as theView.frame.origin.x) which you have reason to believe ought to be 0 but which your computation cannot guarantee to be 0.

To clarify a little, a computation such as:

areal = 0.0

will (unless your language or system is broken) create a value such that (areal==0.0) returns true, but another computation such as

areal = 1.386 - 2.1*(0.66)

may not.
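To see that concretely, a quick test of my own (on an IEEE 754 system this will likely print a tiny nonzero value, since none of 1.386, 2.1, and 0.66 is exactly representable in binary):

#include <stdio.h>

int main (void)
{
  double areal = 1.386 - 2.1 * 0.66;
  printf ("areal = %.17g, (areal == 0.0) is %s\n",
          areal, areal == 0.0 ? "true" : "false");
  return 0;
}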

If you can assure yourself that your computations produce values which are 0 (and not just that they produce values which ought to be 0) then you can go ahead and compare f-p values with 0. If you can't assure yourself to the required degree, best stick to the usual approach of 'toleranced equality'.

In the worst cases the careless comparison of f-p values can be extremely dangerous: think avionics, weapons-guidance, power-plant operations, vehicle navigation, almost any application in which computation meets the real world.

For Angry Birds, not so dangerous.

Solution 3

I want to give a bit of a different answer than the others. They are great for answering your question as stated but probably not for what you need to know or what your real problem is.

Floating point in graphics is fine! But there is almost no need to ever compare floats directly. Why would you need to do that? Graphics uses floats to define intervals. And checking whether a float is within an interval also defined by floats is always well defined and merely needs to be consistent, not accurate or precise! As long as a pixel (which is also an interval!) can be assigned, that's all graphics needs.

So if you want to test whether your point is inside a [0..width[ range, this is just fine. Just make sure you define inclusion consistently. For example, always define inside as (x >= 0 && x < width). The same goes for intersection or hit tests.
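For example, consistent half-open interval tests might look like the following sketch (the helper names are mine, not from any graphics API):

#include <stdbool.h>

/* Half-open inclusion test: x belongs to [0..width[. What matters is
   that every caller uses the same convention, so each coordinate falls
   into exactly one pixel interval. */
static bool inside_range (float x, float width)
{
  return x >= 0.0f && x < width;
}

/* The same convention applied to a 1-D hit test against [lo..hi[. */
static bool hits_interval (float x, float lo, float hi)
{
  return x >= lo && x < hi;
}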

However, if you are abusing a graphics coordinate as some kind of flag, like for example to see if a window is docked or not, you should not do this. Use a boolean flag that is separate from the graphics presentation layer instead.

Solution 4

Comparing to zero can be a safe operation, as long as the zero wasn't a calculated value (as noted in an above answer). The reason for this is that zero is a perfectly representable number in floating point.

Speaking of perfectly representable values: in single precision you get 24 bits of significand, scaled by a power of two. So 1, 2, and 4 are perfectly representable, as are .5, .25, and .125. As long as all your important bits fit in those 24 bits, you are golden. So 10.625 can be represented precisely.

This is great, but will quickly fall apart under pressure. Two scenarios spring to mind:

  1. When a calculation is involved. Don't trust that sqrt(3)*sqrt(3) == 3. It just won't be that way. And it probably won't be within an epsilon, as some of the other answers suggest.

  2. When any non-power-of-2 (NPOT) value is involved. It may sound odd, but 0.1 is an infinite series in binary, so any calculation involving a number like this will be imprecise from the start.

(Oh and the original question mentioned comparisons to zero. Don't forget that -0.0 is also a perfectly valid floating-point value.)
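A small sketch of my own illustrating all three points (the printed values are typical for IEEE 754 doubles but not guaranteed by the C standard):

#include <math.h>
#include <stdio.h>

int main (void)
{
  /* 10.625 = 8 + 2 + 0.5 + 0.125 fits easily in the significand, so
     building it from exact pieces gives exactly the literal's value. */
  printf ("10.625 exact? %d\n", 10.625 == 10.0 + 0.5 + 0.125);

  /* A calculation is involved: sqrt(3)*sqrt(3) is typically not 3. */
  double s = sqrt (3.0);
  printf ("sqrt(3)*sqrt(3) == 3.0? %d (value: %.17g)\n",
          s * s == 3.0, s * s);

  /* -0.0 is a valid value, and it compares equal to 0.0. */
  printf ("-0.0 == 0.0? %d\n", -0.0 == 0.0);
  return 0;
}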

Solution 5

[The 'right answer' glosses over selecting K. Selecting K ends up being just as ad-hoc as selecting VISIBLE_SHIFT but selecting K is less obvious because unlike VISIBLE_SHIFT it is not grounded on any display property. Thus pick your poison - select K or select VISIBLE_SHIFT. This answer advocates selecting VISIBLE_SHIFT and then demonstrates the difficulty in selecting K]

Precisely because of rounding errors, you should not use comparison of 'exact' values for logical operations. In your specific case of a position on a visual display, it can't possibly matter if the position is 0.0 or 0.0000000003 - the difference is invisible to the eye. So your logic should be something like:

#define VISIBLE_SHIFT    0.0001        // for example
if (fabs(theView.frame.origin.x) < VISIBLE_SHIFT) { /* ... */ }

However, in the end, 'invisible to the eye' will depend on your display properties. If you can upper bound the display (you should be able to); then choose VISIBLE_SHIFT to be a fraction of that upper bound.
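For instance, if your coordinates are normalized to [0..1], you could derive the shift from that bound (a sketch of my own; max_display_width_pixels is an assumed parameter, not from any real display API):

/* Sketch: with coordinates normalized to [0..1], any shift smaller
   than one pixel on the largest display you support is invisible. */
static double visible_shift (double max_display_width_pixels)
{
  return 1.0 / max_display_width_pixels;
}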

Now, the 'right answer' rests upon K so let's explore picking K. The 'right answer' above says:

K is a constant you choose such that the accumulated error of your computations is definitely bounded by K units in the last place (and if you're not sure you got the error bound calculation right, make K a few times bigger than what your calculations say it should be)

So we need K. If getting K is more difficult and less intuitive than selecting my VISIBLE_SHIFT, then you'll decide what works for you. To find K we are going to write a test program that looks at a bunch of K values so we can see how it behaves. It ought to be obvious how to choose K, if the 'right answer' is usable. No?

We are going to use, as the 'right answer' details:

if (fabs(x-y) < K * DBL_EPSILON * fabs(x+y) || fabs(x-y) < DBL_MIN)

Let's just try all values of K:

#include <math.h>
#include <float.h>
#include <stdio.h>

int main (void)
{
  double x = 1e-13;
  double y = 0.0;

  double K = 1e22;
  int i;

  for (i = 0; i < 32; i++, K = K/10.0)
    {
      printf ("K:%40.16lf -> ", K);

      if (fabs(x-y) < K * DBL_EPSILON * fabs(x+y) || fabs(x-y) < DBL_MIN)
        printf ("YES\n");
      else
        printf ("NO\n");
    }
  return 0;
}
ebg@ebg$ gcc -o test test.c
ebg@ebg$ ./test
K:10000000000000000000000.0000000000000000 -> YES
K: 1000000000000000000000.0000000000000000 -> YES
K:  100000000000000000000.0000000000000000 -> YES
K:   10000000000000000000.0000000000000000 -> YES
K:    1000000000000000000.0000000000000000 -> YES
K:     100000000000000000.0000000000000000 -> YES
K:      10000000000000000.0000000000000000 -> YES
K:       1000000000000000.0000000000000000 -> NO
K:        100000000000000.0000000000000000 -> NO
K:         10000000000000.0000000000000000 -> NO
K:          1000000000000.0000000000000000 -> NO
K:           100000000000.0000000000000000 -> NO
K:            10000000000.0000000000000000 -> NO
K:             1000000000.0000000000000000 -> NO
K:              100000000.0000000000000000 -> NO
K:               10000000.0000000000000000 -> NO
K:                1000000.0000000000000000 -> NO
K:                 100000.0000000000000000 -> NO
K:                  10000.0000000000000000 -> NO
K:                   1000.0000000000000000 -> NO
K:                    100.0000000000000000 -> NO
K:                     10.0000000000000000 -> NO
K:                      1.0000000000000000 -> NO
K:                      0.1000000000000000 -> NO
K:                      0.0100000000000000 -> NO
K:                      0.0010000000000000 -> NO
K:                      0.0001000000000000 -> NO
K:                      0.0000100000000000 -> NO
K:                      0.0000010000000000 -> NO
K:                      0.0000001000000000 -> NO
K:                      0.0000000100000000 -> NO
K:                      0.0000000010000000 -> NO

Ah, so K should be 1e16 or larger if I want 1e-13 to be 'zero'. That makes sense: with x = 1e-13 and y = 0, the right-hand side is K * DBL_EPSILON * 1e-13, which exceeds fabs(x-y) = 1e-13 only once K > 1/DBL_EPSILON, i.e. K > about 4.5e15.

So, I'd say you have two options:

  1. Do a simple epsilon computation using your engineering judgement for the value of 'epsilon', as I've suggested. If you are doing graphics and 'zero' is meant to be a 'visible change', then examine your visual assets (images, etc.) and judge what epsilon can be.
  2. Don't attempt any floating point computations until you've read the non-cargo-cult answer's reference (and gotten your Ph.D. in the process), and then use your non-intuitive judgement to select K.

Comments

  • Proud Member
    Proud Member almost 2 years

    I know UIKit uses CGFloat because of the resolution independent coordinate system.

    But every time I want to check if for example frame.origin.x is 0 it makes me feel sick:

    if (theView.frame.origin.x == 0) {
        // do important operation
    }
    

    Isn't CGFloat vulnerable to false positives when comparing with ==, <=, >=, <, >? It is a floating point type, and floating point has imprecision problems: 0.0000000000041, for example.

    Is Objective-C handling this internally when comparing, or can it happen that an origin.x which reads as zero does not compare to 0 as true?

  • Romain
    Romain about 12 years
    One aspect of resolution-independence is that you cannot tell for sure what a "visible shift" is at compile-time. What is invisible on a super-HD screen might very well be obvious on a tiny-ass screen. One should at least make it a function of screen size. Or name it something else.
  • Daniel Fischer
    Daniel Fischer about 12 years
    fabs(x+y) is problematic if x and y (can) have different sign. Still, a good answer against the tide of cargo-cult comparisons.
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE about 12 years
    If x and y have different sign, it's no problem. The right-hand side will be "too small", but since x and y have different sign, they should not compare equal anyway. (Unless they are so small as to be denormal, but then the second case catches it.)
  • Nicolás Ozimica
    Nicolás Ozimica about 12 years
    I'm curious about your statement: "especially for buggy, non-conforming compilers like GCC". Is GCC really buggy and also non-conforming?
  • Nicolás Ozimica
    Nicolás Ozimica about 12 years
    Thanks @R.. for your complete explanations! It's good to know that the option -std=c99 is useful to circumvent this problem.
  • Stephen Canon
    Stephen Canon almost 12 years
    Since the question is tagged iOS, it's worth noting that Apple's compilers (both clang and Apple's gcc builds) have always used FLT_EVAL_METHOD = 0, and attempt to be completely strict about not carrying excess precision. If you find any violations of that, please file bug reports.
  • Stephen Canon
    Stephen Canon almost 12 years
    @R..: Apple never shipped an x86 processor without SSE2. All arithmetic other than long double is done on SSE, even on 32-bit.
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE almost 12 years
    OK that makes sense, but I think your usage of "Apple's compilers" was unclear. I assumed it applied to clang (Apple's compiler) on any target rather than just on Apple systems.
  • Zarel
    Zarel almost 12 years
    This answer doesn't seem to explain how "erring on the side of assuming floating point results have large random fuzz associated with them will get you code that appears to work at first but ends up having large-magnitude errors and broken corner cases."
  • Tom Swirly
    Tom Swirly almost 12 years
    This is Not A Good Answer. First, the whole "object thing" does nothing whatsoever to solve your issue. And second, your actual implementation of "equality" isn't in fact the correct one.
  • John White
    John White over 11 years
    Tom, maybe you'd think again about the "object thing". With real numbers, represented to high precision, equality rarely happens. But one's idea of equality may be tailored if it suits you. It would be nicer if there was an overridable 'approximately equal' operator, but there ain't.
  • Nemo
    Nemo over 11 years
    @R.: You might find this question interesting. I would certainly like to know your opinion on it, if any.
  • Christian Rau
    Christian Rau almost 11 years
    "First of all, floating point values are not "random" in their behavior. Exact comparison can and does make sense in plenty of real-world usages." - Just two sentences and already earned a +1! That's one of the most disturbing misassumptions people make when working with floating points.
  • Pascal Cuoq
    Pascal Cuoq almost 11 years
    Actually, 1.30 - 2*(0.65) is a perfect example of an expression that obviously evaluates to 0.0 if your compiler implements IEEE 754, because the doubles represented as 0.65 and 1.30 have the same significands, and multiplication by two is obviously exact.
  • High Performance Mark
    High Performance Mark over 10 years
    Still getting rep from this one, so I changed the second example snippet.
  • hfossli
    hfossli almost 10 years
    @R.. Would you approve this? gist.github.com/hfossli/4616c778bea3a334f034 I replaced "K" with "accuracy" if that makes sense.
  • supercat
    supercat almost 10 years
    An issue not yet mentioned in this answer's list is that if a computation yields a "not a number" result, that result will compare unequal to itself. That's not an issue when comparing a variable to a constant that isn't a NaN, but can be an issue when e.g. testing whether a floating-point value is in a table.
  • Kartick Vaddadi
    Kartick Vaddadi about 8 years
    What's the absolute error? In other words, what's the smallest value of ERR for which fabs(x-y) < ERR is valid for all x and y?
  • GoZoner
    GoZoner over 7 years
    But at least selecting 'visible shift' is based on easily understood display (or frame) properties - unlike the <correct answer's> K which is difficult and non-intuitive to select.
  • chux - Reinstate Monica
    chux - Reinstate Monica about 7 years
    Casting to int without ensuring theView.frame.origin.x is in/near the range of int leads to undefined behavior (UB) - or in this case, 1/100th the range of int.
  • CyberMew
    CyberMew about 6 years
    So K should be a value more than 1 if we were to make it less accurate? Meaning more margin for error.
  • Albert Renshaw
    Albert Renshaw over 5 years
    Note __FLT_EPSILON__ can be used if you do not desire to import float.h
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE over 5 years
    @AlbertRenshaw: That's not C. It's a gcc feature meant to be used only for implementation internals. float.h is a standard freestanding header and there's no reason not to include it.
  • Sneftel
    Sneftel over 5 years
    There's absolutely no reason to convert to integer like this. As chux said, there's the potential for UB from out-of-range values; and on some architectures this will be significantly slower than just doing the computation in floating point. Lastly, multiplying by 100 like that will compare with 0.01 precision, not 0.1.
  • chux - Reinstate Monica
    chux - Reinstate Monica almost 4 years
    If code is using FLT_EPSILON, implying a float computation, at least in C it makes more sense to use float fabsf(float) than double fabs(double).
  • Maggyero
    Maggyero about 3 years
    Could you explain why you compute fabs(x-y) < K * FLT_EPSILON * fabs(x+y) instead of fabs(x-y) < K * FLT_EPSILON * fabs(x) or fabs(x-y) < K * FLT_EPSILON * fabs(y) which express a relative error?
  • Maggyero
    Maggyero about 3 years
    @DanielFischer Any idea why the author did this?
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE about 3 years
    @Maggyero: Because the form I put it in is symmetric in x and y, which is a property you would normally expect. Assuming x and y are close, x+y is approximately equal to 2*x or 2*y anyway (you could divide out that extra factor of 2 if you like) but doesn't have unexpected behavior from asymmetry.
  • Maggyero
    Maggyero about 3 years
    I like that symmetry argument. But is it correct to say that the three formulas ([…] fabs(x+y), […] fabs(x), […] fabs(y)) are basically equivalent (beside only the first one having the symmetry property), in the sense that they all give the same results in practice? And do you have a reference for the symmetric formula or it is from you (I cannot find it in David Goldberg’s article)?