What is overflow and underflow in floating point
The following is implementation dependent, but if the numbers behave anything like what IEEE-754 specifies, floating point numbers do not overflow and underflow to a wildly incorrect answer the way integers do. For example, multiplying two positive numbers should never give you a negative number.
Instead, overflow means that the result is 'too large to represent'. Depending on the rounding mode, the result is usually either the largest finite float (round toward zero, RTZ) or Inf (round to nearest even, RNE):
0 110 1111 * 0 110 1111 = 0 111 0000
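As a rough illustration of the same behaviour on real hardware, here is a minimal Python sketch. It assumes Python's float is an IEEE-754 double using the default round-to-nearest-even mode, which is the case on common platforms but is not guaranteed by the language:

```python
import math
import sys

largest = sys.float_info.max   # largest finite double, about 1.8e308
product = largest * largest    # the exact result is far too large to represent

print(product)                 # inf
print(product > 0)             # True: the sign is still correct
```

The result is a well-defined "too large" marker with the right sign, not a wrapped-around value.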
(Note that integer overflow as you know it could have been avoided in hardware by applying a similar clamping operation; it's just not the convention to do that.)
When dealing with floating point numbers, the term underflow means that the number is 'too small in magnitude to represent', which usually just results in 0.0:
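To make the contrast concrete, here is a small sketch that simulates an 8-bit signed integer in Python. The helper names `wrap_int8` and `saturate_int8` are made up for this illustration. The first mimics the usual two's-complement wraparound, the second the clamping alternative:

```python
def wrap_int8(value):
    """Keep the low 8 bits and reinterpret them as a two's-complement int8."""
    low = value & 0xFF
    return low - 256 if low >= 128 else low

def saturate_int8(value):
    """Clamp to the int8 range [-128, 127] instead of wrapping."""
    return max(-128, min(127, value))

exact = 100 * 2                # 200 does not fit in an int8

print(wrap_int8(exact))        # -56: two positives gave a negative
print(saturate_int8(exact))    # 127: clamped to the largest int8
```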
0 000 0001 * 0 000 0001 = 0 000 0000
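The same thing can be observed with doubles. This is a sketch that again assumes Python's float is an IEEE-754 double with the default rounding mode, where `5e-324` is the smallest positive subnormal value:

```python
import math

tiny = 5e-324           # smallest positive subnormal double (2**-1074)
product = tiny * tiny   # the exact result is far below the smallest subnormal

print(product)                       # 0.0
print(math.copysign(1.0, product))   # 1.0: the zero is positive
```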
Note that I have also heard the term underflow being used for overflow to a very large negative number, but this is not the best term for it. The following is an example where the result is negative and too large in magnitude to represent, i.e. 'negative overflow':
0 110 1111 * 1 110 1111 = 1 111 0000
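The bit patterns above can be reproduced with a small simulation. This is only a sketch under stated assumptions: the 1-byte layout is 1 sign bit, 3 exponent bits with bias 3 and 4 mantissa bits, subnormals work as in IEEE-754, and rounding is round to nearest, ties to even. The function names (`decode`, `encode`, `multiply`, `show`) are invented for this example, and NaN inputs are not handled:

```python
import math

BIAS = 3  # exponent bias of the toy format: 2**(3 - 1) - 1

def decode(bits):
    """Value of an 8-bit pattern: 1 sign bit, 3 exponent bits, 4 mantissa bits."""
    sign = -1.0 if bits & 0x80 else 1.0
    exponent = (bits >> 4) & 0b111
    mantissa = bits & 0b1111
    if exponent == 0b111:                      # reserved for Inf and NaN
        return sign * math.inf if mantissa == 0 else math.nan
    if exponent == 0:                          # subnormal: implicit leading 0
        return sign * (mantissa / 16) * 2.0 ** (1 - BIAS)
    return sign * (1 + mantissa / 16) * 2.0 ** (exponent - BIAS)

def encode(value):
    """Round to the nearest pattern, ties to even (NaN input not handled)."""
    sign = 0x80 if math.copysign(1.0, value) < 0 else 0
    magnitude = abs(value)
    largest = decode(0x6F)                     # 15.5
    half_step = (largest - decode(0x6E)) / 2   # 0.25
    if magnitude >= largest + half_step:
        return sign | 0x70                     # overflow: signed infinity
    # Patterns 0x00..0x6F are the non-negative finite values in increasing order.
    nearest = min(range(0x70),
                  key=lambda b: (abs(decode(b) - magnitude), b & 1))
    return sign | nearest

def multiply(a, b):
    return encode(decode(a) * decode(b))

def show(bits):
    text = format(bits, "08b")
    return f"{text[0]} {text[1:4]} {text[4:]}"

print(show(multiply(0b01101111, 0b01101111)))  # 0 111 0000  positive overflow
print(show(multiply(0b01101111, 0b11101111)))  # 1 111 0000  negative overflow
print(show(multiply(0b00000001, 0b00000001)))  # 0 000 0000  positive underflow
print(show(multiply(0b00000001, 0b10000001)))  # 1 000 0000  negative underflow
```

The last line shows that a negative result that is too small in magnitude keeps its sign and becomes negative zero.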
Max Koretskyi
Founder of inDepth.dev (@indepth_dev) community. Passionate about Mentorship, TechEd and WebDev. Angular & React contributor.
Updated on June 07, 2022
-
Max Koretskyi almost 2 years
I feel I don't really understand the concept of overflow and underflow. I'm asking this question to clarify this. I need to understand it at its most basic level with bits. Let's work with the simplified floating point representation of 1 byte - 1 bit sign, 3 bits exponent and 4 bits mantissa:
0 000 0000
The max exponent we can store is 111_2=7 minus the bias K=2^2-1=3 which gives 4, and it's reserved for Infinity and NaN. The exponent for max number is 3, which is 110 under offset binary.
So the bit pattern for max number is:
0 110 1111 // positive
1 110 1111 // negative
When the exponent is zero, the number is subnormal and has implicit 0 instead of 1. So the bit pattern for min number is:
0 000 0001 // positive
1 000 0001 // negative
I've found these descriptions for single-precision floating point:
Negative numbers less than −(2−2^−23) × 2^127 (negative overflow)
Negative numbers greater than −2^−149 (negative underflow)
Positive numbers less than 2^−149 (positive underflow)
Positive numbers greater than (2−2^−23) × 2^127 (positive overflow)
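These single-precision limits can be checked against actual bit patterns with Python's `struct` module. This is a sketch that assumes the platform stores `"f"` values as IEEE-754 binary32 and rounds to nearest. The helper `float32_bits` is a name invented for this example:

```python
import struct

def float32_bits(value):
    """Round a Python float to binary32 and return its bit pattern as an int."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

max_float32 = (2 - 2 ** -23) * 2 ** 127   # largest finite binary32 value
min_float32 = 2 ** -149                   # smallest positive subnormal

print(hex(float32_bits(max_float32)))     # 0x7f7fffff
print(hex(float32_bits(min_float32)))     # 0x1
print(hex(float32_bits(2 ** -151)))       # 0x0: positive underflow
print(hex(float32_bits(-(2 ** -151))))    # 0x80000000: negative underflow
```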
Out of them I understand only positive overflow which results in +Infinity, and the example would be like this:
0 110 1111 + 0 110 1111 = 0 111 0000
Can anyone please demonstrate the three other cases for overflow and underflow using the bit patterns I outlined above?
-
Mark Dickinson over 7 years
I've never encountered "underflow" to mean "large and negative", in the context of floating-point. Do you have any links or references?
-
Max Koretskyi over 7 years
Thanks, how do you get negative infinity in overflow and negative zero in underflow?
-
Casperrw over 7 years
No references, but I have heard people refer to it like that, which I think is not great, so let me edit the answer to clarify that.
-
Casperrw over 7 years
Negative zero is used when the fully accurate result of a calculation would be a negative number too small in magnitude to represent. Negative infinity is the last example above, which I'm about to clarify.
-
Max Koretskyi over 7 years
@Casperrw, thanks, so is my understanding correct that overflow occurs for positive or negative numbers over MAX_VALUE, while underflow occurs for positive or negative numbers under MIN_VALUE?
-
Casperrw over 7 years
Sorry for any confusion - I think the terms 'over' and 'under' are ambiguous, but you probably mean 'larger and smaller in magnitude', in which case you are correct. Just to rephrase: positive or negative overflow is when a value larger in absolute magnitude than the positive or negative max value is needed. Underflow is when a nonzero value is smaller in magnitude than the smallest representable nonzero number - either positive or negative. Hope that helps!