What is the difference between float and double?

c++ c floating-point precision ieee-754

1,107,589

Solution 1

Huge difference.

As the name implies, a double has roughly 2x the precision of float^[1] (53 significand bits versus 24). In general a double has 15 decimal digits of precision, while float has 7.

Here's how the number of digits is calculated:

double has 52 mantissa bits + 1 hidden bit: log(2⁵³)÷log(10) = 15.95 digits

float has 23 mantissa bits + 1 hidden bit: log(2²⁴)÷log(10) = 7.22 digits

This precision loss can lead to larger rounding errors accumulating when calculations are repeated, e.g.

float a = 1.f / 81;
float b = 0;
for (int i = 0; i < 729; ++ i)
    b += a;
printf("%.7g\n", b); // prints 9.000023

while

double a = 1.0 / 81;
double b = 0;
for (int i = 0; i < 729; ++ i)
    b += a;
printf("%.15g\n", b); // prints 8.99999999999996

Also, the maximum value of float is about 3.4e38, but double is about 1.8e308, so using float can hit "infinity" (i.e. a special floating-point number) much more easily than double for something simple, e.g. computing the factorial of 60.

During testing, maybe a few test cases contain these huge numbers, which may cause your programs to fail if you use floats.

Of course, sometimes even double isn't accurate enough, hence we sometimes have long double^[1] (the above example gives 9.000000000000000066 on Mac). All floating point types suffer from round-off errors, though, so if exactness is very important (e.g. money processing) you should use int or a fraction class.

Furthermore, don't use a plain += loop to sum lots of floating point numbers, as the errors accumulate quickly. If you're using Python, use math.fsum. Otherwise, try to implement the Kahan summation algorithm.

^{[1]: The C and C++ standards do not specify the representation of float, double and long double. It is possible that all three are implemented as IEEE double-precision. Nevertheless, for most compilers and architectures (gcc, MSVC; x86, x64, ARM) float is indeed an IEEE single-precision floating point number (binary32), and double is an IEEE double-precision floating point number (binary64).}

Solution 2

Here is what the C99 (ISO-IEC 9899 6.2.5 §10) and C++2003 (ISO-IEC 14882-2003 3.9.1 §8) standards say:

There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.

The C++ standard adds:

The value representation of floating-point types is implementation-defined.

I would suggest having a look at the excellent What Every Computer Scientist Should Know About Floating-Point Arithmetic, which covers the IEEE floating-point standard in depth. You'll learn about the representation details and you'll realize there is a tradeoff between magnitude and precision. The absolute spacing between adjacent floating point values shrinks as the magnitude decreases (the relative precision stays roughly constant), hence floating point numbers between -1 and 1 are the most densely packed.

Solution 3

Given the quadratic equation x² − 4.0000000 x + 3.9999999 = 0, the exact roots to 10 significant digits are r₁ = 2.000316228 and r₂ = 1.999683772.

Using float and double, we can write a test program:

#include <stdio.h>
#include <math.h>

void dbl_solve(double a, double b, double c)
{
    double d = b*b - 4.0*a*c;
    double sd = sqrt(d);
    double r1 = (-b + sd) / (2.0*a);
    double r2 = (-b - sd) / (2.0*a);
    printf("%.5f\t%.5f\n", r1, r2);
}

void flt_solve(float a, float b, float c)
{
    float d = b*b - 4.0f*a*c;
    float sd = sqrtf(d);
    float r1 = (-b + sd) / (2.0f*a);
    float r2 = (-b - sd) / (2.0f*a);
    printf("%.5f\t%.5f\n", r1, r2);
}

int main(void)
{
    float fa = 1.0f;
    float fb = -4.0000000f;
    float fc = 3.9999999f;
    double da = 1.0;
    double db = -4.0000000;
    double dc = 3.9999999;
    flt_solve(fa, fb, fc);
    dbl_solve(da, db, dc);
    return 0;
}

Running the program gives me:

2.00000 2.00000
2.00032 1.99968

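Note that the numbers aren't large, but still you get cancellation effects using float: 3.9999999 is not representable in single precision and rounds to 4.0f, so the discriminant b² − 4ac comes out as exactly zero.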

(In fact, the above is not the best way of solving quadratic equations using either single- or double-precision floating-point numbers, but the answer remains unchanged even if one uses a more stable method.)

Solution 4

A double is 64 bits and single precision (float) is 32 bits.
The double has a bigger mantissa (the significand, i.e. the bits that hold the significant digits of the number): 52 stored bits versus 23.
As a result, any rounding inaccuracies will be smaller in the double.

Solution 5

I just ran into an error that took me forever to figure out, and it can potentially give you a good example of float precision.

#include <iostream>
#include <iomanip>

int main(){
  for(float t=0;t<1;t+=0.01){
     std::cout << std::fixed << std::setprecision(6) << t << std::endl;
  }
}

The output is

As you can see after 0.83, the precision runs down significantly.

However, if I set up t as double, such an issue won't happen.

It took me five hours to realize this minor error, which ruined my program.


Author by

VaioIsBorn

Updated on December 31, 2021

Comments

VaioIsBorn over 2 years

I've read about the difference between double precision and single precision. However, in most cases, float and double seem to be interchangeable, i.e. using one or the other does not seem to affect the results. Is this really the case? When are floats and doubles interchangeable? What are the differences between them?
R.. GitHub STOP HELPING ICE over 13 years

The usual advice for summation is to sort your floating point numbers by magnitude (smallest first) before summing.
Peter Mortensen about 11 years

For instance, all AVR doubles are floats (four-byte).
Peter Mortensen about 11 years

Actually, for float it is between 7 and 8, 7.225 to be exact.
BlueTrin over 7 years

Just to be sure: the solution to your issue should preferably be to use an int? If you want to iterate 100 times, you should count with an int rather than using a double.
Richard over 6 years

Using double is not a good solution here. You use int to count and do an internal multiplication to get your floating-point value.
user207421 over 6 years

It doesn't mean that at all. It actually means twice as many integral decimal digits, and it is more than double. The relationship between fractional digits and precision is not linear: it depends on the value: e.g. 0.5 is precise but 0.33333333333333333333 is not.
plugwash about 5 years

Note that while C/C++ float and double are nearly always IEEE single and double precision respectively, C/C++ long double is far more variable depending on your CPU, compiler and OS. Sometimes it's the same as double, sometimes it's some system-specific extended format, sometimes it's IEEE quad precision.
InQusitive over 4 years

@R..GitHubSTOPHELPINGICE: why? Could you explain?
R.. GitHub STOP HELPING ICE over 4 years

@InQusitive: Consider for example an array consisting of the value 2^24 followed by 2^24 repetitions of the value 1. Summing in order produces 2^24. Reversing produces 2^25. Of course you can make examples (e.g. make it 2^25 repetitions of 1) where any order ends up being catastrophically wrong with a single accumulator but smallest-magnitude-first is the best among such. To do better you need some kind of tree.
chqrlie over 3 years

@R..GitHubSTOPHELPINGICE: summing is even more tricky if the array contains both positive and negative numbers.