Integers and float precision

Solution 1

In the sum of two floats, is there any precision lost?

If the two floats differ in magnitude and both use the full precision range (about 7 decimal digits), then yes, you will see some loss in the last places.

Why?

This is because floats are stored in the form (sign) (mantissa) × 2^(exponent). If two values have different exponents and you add them, the smaller value is left with fewer digits in the mantissa (because it has to adapt to the larger exponent):

PS> [float]([float]0.0000001 + [float]1)
1
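The same absorption can be reproduced outside PowerShell. The sketch below uses Python, whose built-in float is a 64-bit double, so it relies on a hypothetical helper f32 (not part of any library) that rounds a value to the nearest 32-bit float through the struct module. It shows an addend disappearing entirely once it is smaller than half the spacing between floats near 1.

```python
import struct

def f32(x):
    """Round x to the nearest 32-bit float (hypothetical helper for this sketch)."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]

def add_f32(a, b):
    """Add as 32-bit floats: round both inputs, add, then round the result."""
    return f32(f32(a) + f32(b))

one = f32(1.0)
tiny = f32(1e-8)    # well below the spacing of floats near 1 (about 1.19e-7)

absorbed = add_f32(one, tiny) == one     # True: tiny is lost completely
kept = add_f32(one, 1e-6) != one         # True: 1e-6 is large enough to register
print(absorbed, kept)
```

Rounding the double sum back to 32 bits gives the same result a native 32-bit addition would, because a double carries more than twice the mantissa bits of a float.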

In the sum of a float and an integer, is there any precision lost?

Yes. A normal 32-bit integer can represent values exactly that do not fit exactly into a float. A float can still store approximately the same number, but no longer exactly. Of course, this only applies to numbers that are large enough, i.e. those that need more than 24 significant bits.

Why?

Because float has 24 bits of precision and (32-bit) integers have 32. A float can still retain the magnitude and most of the significant digits, but the last places may differ:

PS> [float]2100000050 + [float]100
2100000100
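The 24-bit limit is easy to probe directly. In this Python sketch the helper f32 is again hypothetical: it rounds a value to a 32-bit float via struct. Every integer up to 2^24 = 16777216 survives the conversion; beyond that, only some do.

```python
import struct

def f32(x):
    """Round x to the nearest 32-bit float (hypothetical helper for this sketch)."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]

limit = 2 ** 24                        # 16777216, the last "safe" integer

print(int(f32(limit)))                 # 16777216   (exact)
print(int(f32(limit + 1)))             # 16777216   (rounded, off by 1)
print(int(f32(2100000050)))            # 2100000000 (rounded, off by 50)
```

Near 2.1 billion, adjacent 32-bit floats are 128 apart, which is why the trailing 50 cannot be kept.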

Solution 2

The precision depends on the magnitude of the original numbers. In floating point, the computer represents the number 312 internally in a form like scientific notation (shown here in decimal for readability; the hardware works in binary):

3.12000000000 * 10 ^ 2

The number of digits in the left-hand part (the mantissa) is fixed. The exponent also has an upper and lower bound. This allows the format to represent very large or very small numbers.

If you try to add two numbers which are the same in magnitude, the result should remain the same in precision, because the decimal point doesn't have to move:

312.0 + 643.0 <==>

3.12000000000 * 10 ^ 2 +
6.43000000000 * 10 ^ 2
-----------------------
9.55000000000 * 10 ^ 2

If you try to add a very big and a very small number, you lose precision because the result must be squeezed into the above format. Consider 312 + 12300000000000. First you have to scale the smaller number to line up with the bigger one, then add:

1.23000000000 * 10 ^ 13 +
0.00000000003 * 10 ^ 13
-----------------------
1.23000000003 * 10 ^ 13 <-- precision lost here! (the trailing "12" of 312 is gone)
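A fixed-width decimal mantissa like this can be imitated with Python's decimal module. The sketch assumes a toy format with 12 significant digits (one before the point, eleven after) and adds 312 first to a number of similar size, then to 1.23 × 10^13; the specific values are chosen only for illustration.

```python
from decimal import Context, Decimal

# Assumption for this sketch: a toy decimal format with 12 significant digits.
ctx = Context(prec=12)

same_scale = ctx.add(Decimal(312), Decimal(643))
mixed_scale = ctx.add(Decimal(312), Decimal(12300000000000))

print(same_scale)     # 955
print(mixed_scale)    # 1.23000000003E+13
```

The exact second sum is 12300000000312; only the leading 3 of 312 fits into the twelve available digits.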

Floating point can handle very large, or very small numbers. But it can't represent both at the same time.

As for ints and doubles being added, the int gets turned into a double immediately, then the above applies.

Solution 3

When adding two floating point numbers, there is generally some error. D. Goldberg's "What Every Computer Scientist Should Know About Floating-Point Arithmetic" describes the effect and the reasons in detail, and also how to calculate an upper bound on the error, and how to reason about the precision of more complex calculations.
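As an illustration of what such an upper bound looks like: for IEEE 754 round-to-nearest arithmetic, barring overflow and underflow, the computed sum of two 32-bit floats is off from the exact sum by a relative error of at most 2^-24. The Python sketch below checks that bound on random inputs with exact rational arithmetic. The helper f32 and the input ranges are assumptions made for the example, not something taken from the paper.

```python
import random
import struct
from fractions import Fraction

def f32(x):
    """Round x to the nearest 32-bit float (hypothetical helper for this sketch)."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]

UNIT_ROUNDOFF = Fraction(1, 2 ** 24)    # half the spacing of 32-bit floats near 1

def relative_error(a, b):
    """Exact relative error of the 32-bit float sum of two 32-bit floats."""
    exact = Fraction(a) + Fraction(b)   # Fraction converts floats without rounding
    computed = Fraction(f32(a + b))     # sum rounded to 32-bit precision
    return abs(computed - exact) / abs(exact)

random.seed(0)
pairs = [(f32(random.uniform(1, 1e6)), f32(random.uniform(1e-3, 1)))
         for _ in range(1000)]
worst = max(relative_error(a, b) for a, b in pairs)
print(float(worst), worst <= UNIT_ROUNDOFF)
```

The error of a single addition is small; it is the accumulation over many operations that makes the analysis of longer calculations worthwhile.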

When adding a float to an integer, the integer is first converted to a float by C++, so two floats are being added and error is introduced for the same reasons as above.

Solution 4

The precision available for a float is limited, so of course there is always the risk that any given operation drops precision.

The answer for both your questions is "yes".

For instance, you will have problems if you add a very large float to a very small one.

The same happens if you add an integer to a float and the integer uses more bits than the float has available for its mantissa.

Solution 5

The short answer: a computer represents a float with a limited number of bits, usually split into a mantissa and an exponent, so only some of the bits hold the significant digits, and the rest record the position of the decimal point.

If you try to add (say) 10^23 and 7, the float cannot represent the result accurately. A similar argument applies when adding a float and an integer: the integer is promoted to a float first.
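The 10^23 case can be checked directly with Python's built-in float, which is a 64-bit double. The sketch assumes Python 3.9 or later for math.ulp, which returns the gap between a value and the next representable one.

```python
import math

big = 1e23                  # nearest double to 10^23
gap = math.ulp(big)         # spacing between adjacent doubles near big

print(gap)                  # 16777216.0
print(big + 7 == big)       # True: 7 is far below half the gap
print(big + gap == big)     # False: a full gap does register
```

Even with 53 bits of mantissa, neighbouring doubles near 10^23 are about 16.8 million apart, so an addend of 7 leaves the value unchanged.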

Author: nunos

Software Engineering / Sound and Music Computing

Updated on June 04, 2022

Comments

  • nunos
    nunos almost 2 years

    This is more of a numerical analysis question than a programming one, but I suppose some of you will be able to answer it.

    In the sum of two floats, is there any precision lost? Why?

    In the sum of a float and an integer, is there any precision lost? Why?

    Thanks.