Extended (80-bit) double floating point in x87, not SSE2 - we don't miss it?


Solution 1

The biggest problem with x87 is that all register operations are done in 80 bits, whereas most of the time people only use 64-bit floats (i.e. double-precision floats). What happens is, you load a 64-bit float onto the x87 stack and it gets converted to 80 bits. You do some operations on it in 80 bits, then store it back into memory, converting it to 64 bits. You will get a different result than if you had done all the operations in just 64 bits, and with an optimizing compiler it can be very unpredictable how many conversions a value might go through, so it's hard to verify that you're getting the "correct" answer when doing regression tests.
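
To make that concrete, here is a minimal sketch of the discrepancy (the values are taken from Mark Dickinson's comment below; it assumes a typical x86-64 GCC/Clang build where double arithmetic uses SSE2 and long double is the 80-bit x87 format):

#include <stdio.h>

int main(void)
{
    volatile double x = 1e16, y = 2.9999;   /* volatile: keep the compiler from folding */

    double      once  = x + y;              /* one rounding, straight to 53 bits (SSE2)     */
    long double wide  = (long double)x + y; /* first rounding to a 64-bit significand (x87) */
    double      twice = (double)wide;       /* second rounding back to 53 bits              */

    printf("rounded once:  %.1f\n", once);  /* 10000000000000002.0 */
    printf("rounded twice: %.1f\n", twice); /* 10000000000000004.0 */
    return 0;
}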

The other problem, which only matters from the point of view of someone writing assembly (or indirectly writing assembly, in the case of someone writing a code generator for a compiler), is that the x87 uses a register stack, whereas SSE uses individually accessible registers. With x87 you have a bunch of extra instructions to manipulate the stack, and I imagine Intel and AMD would rather make their processors run fast with SSE code than try to make those extra stack-manipulation x87 instructions run fast.

BTW if you are having problems with inaccuracy, you will want to take a look at the article "What Every Computer Scientist Should Know About Floating-Point Arithmetic", and then maybe use an arbitrary-precision math library (e.g. GMP) instead.
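
For instance, a minimal sketch using GMP's mpf interface (the precision and values here are arbitrary choices of ours, not anything the answer prescribes; build with cc demo.c -lgmp):

#include <stdio.h>
#include <gmp.h>

int main(void)
{
    mpf_t a, b, sum;
    mpf_init2(a, 256);              /* 256 bits of working precision */
    mpf_init2(b, 256);
    mpf_init2(sum, 256);

    mpf_set_str(a, "1e16", 10);     /* parse the decimal literals directly */
    mpf_set_str(b, "2.9999", 10);
    mpf_add(sum, a, b);

    gmp_printf("%.4Ff\n", sum);     /* 10000000000000002.9999 */

    mpf_clear(a);
    mpf_clear(b);
    mpf_clear(sum);
    return 0;
}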

Solution 2

To make proper use of extended-precision math, it's necessary that a language support a type which can be used to store the result of intermediate computations, and can be substituted for the expressions yielding those results. Thus, given:

#include <stdio.h>

void print_dist_squared(double x1, double y1, double x2, double y2)
{
  printf("%12.6f", (x2-x1)*(x2-x1)+(y2-y1)*(y2-y1));
}

there should be some type that could be used to capture and replace the common sub-expressions x2-x1 and y2-y1, allowing the code to be rewritten as:

void print_dist_squared(double x1, double y1, double x2, double y2)
{
  some_type dx = x2-x1;
  some_type dy = y2-y1;
  printf("%12.6f", dx*dx + dy*dy);
}

without altering the semantics of the program. Unfortunately, ANSI C failed to specify any type which could be used for some_type on platforms which perform extended-precision calculations, and it became far more common to blame Intel for the existence of extended-precision types than to blame ANSI's botched support.
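
For what it's worth, C99 later added float_t and double_t in <math.h>, which come close to the some_type this answer asks for: on an x87 target where FLT_EVAL_METHOD is 2 they are both long double, while on an SSE2 target where it is 0 they are plain float and double. A sketch of the rewrite using that (our adaptation, not part of the original answer):

#include <math.h>   /* double_t (C99) */
#include <stdio.h>

void print_dist_squared(double x1, double y1, double x2, double y2)
{
  double_t dx = x2 - x1;   /* keeps whatever precision the target evaluates in */
  double_t dy = y2 - y1;
  printf("%12.6f", (double)(dx*dx + dy*dy));   /* cast: double_t may be long double */
}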

In fact, extended-precision types have just as much value on platforms without floating-point units as they do on x87 processors, since on such processors a computation like x+y+z would entail the following steps:

  1. Unpack the mantissa, exponent, and possibly sign of x into separate registers (exponent and sign can often "double-bunk"); a sketch of this unpacking appears after the list.
  2. Unpack y likewise.
  3. Right-shift the mantissa of the value with the lower exponent, if any, and then add or subtract the values.
  4. In case x and y had different signs, left-shift the mantissa until the leftmost bit is 1 and adjust the exponent appropriately.
  5. Pack the exponent and mantissa back into double format.
  6. Unpack that temporary result.
  7. Unpack z.
  8. Right-shift the mantissa of the value with the lower exponent, if any, and then add or subtract the values.
  9. In case the earlier result and z had different signs, left-shift the mantissa until the leftmost bit is 1 and adjust the exponent appropriately.
  10. Pack the exponent and mantissa back into double format.
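
As a concrete illustration of steps 1 and 2, a software-float implementation's unpacking might look like the following sketch (the struct and helper names are hypothetical; it handles only normal and subnormal values, ignoring infinities and NaNs, and holds the mantissa in a single uint64_t for simplicity where a 16- or 32-bit machine would split it across registers as described above):

#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t mant;   /* mantissa, with the implicit leading 1 made explicit */
    int      exp;    /* biased exponent */
    int      sign;   /* 0 = positive, 1 = negative */
} unpacked64;

static unpacked64 unpack(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);          /* well-defined type pun */

    unpacked64 u;
    u.sign = (int)(bits >> 63);
    u.exp  = (int)((bits >> 52) & 0x7FF);
    u.mant = bits & 0x000FFFFFFFFFFFFFULL;   /* 52 stored mantissa bits */
    if (u.exp != 0)
        u.mant |= 1ULL << 52;                /* normal number: implicit leading 1 */
    return u;
}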

Using an extended-precision type will allow steps 4, 5, and 6 to be eliminated. Since a 53-bit mantissa is too large to fit in less than four 16-bit registers or two 32-bit registers, performing an addition with a 64-bit mantissa isn't any slower than using a 53-bit mantissa, so using extended-precision math offers faster computation with no downside in a language which supports a proper type to hold temporary results. There is no reason to fault Intel for providing an FPU which could perform floating-point math in the fashion that was also the most efficient method on non-FPU chips.

Solution 3

The other answer seems to suggest that using 80-bit precision is a bad idea, but it isn't. It performs a sometimes vital role in keeping imprecision at bay; see, e.g., the writings of W. Kahan.

Always use 80-bit intermediate arithmetic if you can get away with it speed-wise. If that means you have to use x87 maths, well, do so. Support for it is ubiquitous and as long as people keep doing the right thing, it will remain ubiquitous.
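
In practice that advice amounts to something like the following sketch (our own example, not from the answer): do the intermediate arithmetic in long double, which is the 80-bit extended format on x87 targets, and round to double exactly once at the end.

#include <stddef.h>

/* Accumulate in extended precision, round once at the end. */
double sum_squares(const double *a, size_t n)
{
    long double acc = 0.0L;              /* 80-bit accumulator on x87 targets */
    for (size_t i = 0; i < n; i++)
        acc += (long double)a[i] * a[i];
    return (double)acc;                  /* single final rounding to 64 bits */
}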


Comments

  • codekaizen
    codekaizen almost 2 years

    I was reading today about researchers discovering that Nvidia's PhysX libraries use x87 FP vs. SSE2. Obviously this will be suboptimal for parallel datasets where speed trumps precision. However, the article author goes on to quote:

    Intel started discouraging the use of x87 with the introduction of the P4 in late 2000. AMD deprecated x87 since the K8 in 2003, as x86-64 is defined with SSE2 support; VIA’s C7 has supported SSE2 since 2005. In 64-bit versions of Windows, x87 is deprecated for user-mode, and prohibited entirely in kernel-mode. Pretty much everyone in the industry has recommended SSE over x87 since 2005 and there are no reasons to use x87, unless software has to run on an embedded Pentium or 486.

    I wondered about this. I know that x87 uses 80-bit extended doubles internally to compute values, and SSE2 doesn't. Does this not matter to anyone? It seems surprising to me. I know when I do computations on points, lines and polygons in a plane, values can be surprisingly wrong when doing subtractions, and areas can collapse and lines alias one another due to lack of precision. Using 80-bit values vs. 64-bit values could help, I would imagine.

    Is this incorrect? If not, what can we use to perform extended double FP operations if x87 is phased out?

  • Joe White
    Joe White almost 14 years
    Optimizing compilers are bad enough, but try a JIT that has the ability to inline small methods (and therefore vary the number of in-memory temps). Sometimes I call this method and get one answer, sometimes I call the same method with the exact same arguments and get a different result, depending on whether the JITter inlined the call or not! That was a fun regression to track down.
  • codekaizen
    codekaizen almost 14 years
    Yes, I see, that does get complicated with compilers making these kinds of choices, more so when JIT compilers do it. As to precision, I currently scale the number to [0..1] and remove common bits to decrease the noise due to bits just cancelling, and just imagined that 80 bits would give me more room. While true, apparently, the side effects are too high a cost. I hope to test it on QP hardware... whenever that shows up.
  • KitsuneYMG
    KitsuneYMG over 13 years
    @Joe White If you're using java and you NEED exactly the same results every time you do floating point math, investigate the use of the strictfp keyword. This forces math to be IEEE 754 and not whatever the native platform does (x87 on 32b intel for instance). en.wikipedia.org/wiki/Strictfp
  • Joe White
    Joe White over 13 years
    @KitsuneYMG, I'm actually using .NET. As far as I'm aware, there's no equivalent there. :(
  • ArchaeaSoftware
    ArchaeaSoftware over 11 years
    It's worth mentioning that the 80-bit precision never was intended for storage. It was deliberately designed to serve as a higher-precision intermediate representation that would be converted back to float or double when the results are being stored.
  • supercat
    supercat over 10 years
    Would anything prevent an x87 compiler from keeping all intermediate results as 80-bit values, whether they fit in registers or not, and specifying that it will do so? Would results from such a compiler not be entirely reproducible with any other compiler that did likewise?
  • phoog
    phoog over 10 years
    @supercat If the x87 compiler complies with the CLI spec then it must truncate higher-precision values when there is an explicit conversion instruction. Even if we're not talking about the CLI, one must define "intermediate result." If a function returns a double, the return value is presumably not intermediate. But what if the function is inlined? Different compilers will presumably make different decisions about inlining. If the return value of an inlined function doesn't need to be truncated, then different compilers can give different results.
  • supercat
    supercat over 10 years
    @phoog: Some machines/compilers used 80-bit math internally, but would arbitrarily convert values to 64-bit double any time they couldn't fit in registers, so if someDouble=f1()*f2()+f3()*f4() was evaluated in left-to-right sequence, it might round f1()*f2() to a double but not round f3()*f4() [since no more function calls would be needed between the time it was computed and the time someDouble was stored]. That sort of behavior is icky and nasty. But if the rules for when things were rounded were independent of what did or did not fit in registers, I wouldn't see a problem.
  • supercat
    supercat over 10 years
    @phoog: Personally, what I'd like to see would be a language with distinct types for e.g. ieee float, fast float, and short real, where the product of two IEEE floats would always be rounded to float while fast float would be rounded or not as convenient. A short real would be a 32-bit floating-point value, but would be converted to the maximum-precision type when performing math with it if such conversion could improve the precision of the result [e.g. conversion would be required when computing f1=f2+f3+f4;, but not f1=f2+f3;].
  • supercat
    supercat over 10 years
    @phoog: Given that floating-point variables are used in a number of different ways, having different types for different usage patterns would allow language designers to provide useful warnings in cases where a programmer who wants strict IEEE single-precision semantics accidentally multiplies by 1.01 rather than 1.01f, while allowing a programmer who wants to multiply a single-precision float by 1.01 as accurately as possible to do so without ugly typecasts.
  • Mark Dickinson
    Mark Dickinson over 8 years
    Though, somewhat ironically, the intermediate 64-bit precision (not 80-bit precision) from use of the 80-bit x87 registers can lead to less accurate results for simple arithmetic operations on regular 53-bit doubles. Assuming the usual round-ties-to-even rounding mode, the operation 1e16 + 2.9999 on IEEE 754 binary64 values gives a correctly-rounded result of 10000000000000002.0 on a machine using SSE2, but an incorrectly-rounded result of 10000000000000004.0 when using x87 with FPU precision not altered from its default of 64-bit precision, thanks to double rounding.
  • supercat
    supercat over 8 years
    There are a few cases where using double-precision to compute x+y would yield a result with a round-off error of 1/2 ulp, while using extended-precision and converting to double would yield a round-off error of 2049/4096 ulp. On the other hand, there are a lot more cases where using extended-precision to compute x+y+z will yield an accurate result, while using "double" will yield a result that's far less accurate, or in some cases Just Plain Wrong.
  • Mark Dickinson
    Mark Dickinson over 8 years
    Right, but I think we can fault Intel for not providing a way to do standards-compliant correctly-rounded basic arithmetic operations (on 64-bit doubles) at all. Yes, you can change the FPU precision to 53 bits instead of 64 bits, but that's clunky, slow, risks interfering with library code that expects the 64-bit precision, and doesn't even solve the problem: while it eliminates double rounding in the normal domain, it doesn't change the exponent range, so still leaves the possibility of double rounding on underflow. SSE(2) is a big improvement in this respect.
  • supercat
    supercat over 8 years
    @MarkDickinson: While there are specialized applications which require bit-consistent floating-point behavior with operations involving shorter types, for most applications it is better to have proper support for extended precision. I see SSE(2) and x87 as serving different purposes, and would have liked to see languages support both eagerly-promoting and strict floating-point types; further, expressions involving strict types should IMHO only be convertible to larger types after "visibly" coercing them to their own type, so if f1 and f2 were strict float types, d1=f1*f2...
  • supercat
    supercat over 8 years
    ...would need to be written as d1=(float)(f1*f2); [not d1=(double)(f1*f2);!]. I would guess that in cases where someone writes d1=f1*f2; there is a very high likelihood that (1) the code would either have been intended to say d1=(double)f1*f2;, (2) a programmer who sees the code thinks it means that, or (3) a programmer who sees the code thinks it was intended to mean that. Requiring the code to be written as d1=(float)(f1*f2); in cases where that behavior is intended would eliminate those dangers.
  • Thaqif Yusoff
    Thaqif Yusoff about 8 years
    but isn't long double an extended precision type?
  • supercat
    supercat about 8 years
    @marcin: It is, and I'd suggest many people's dislike for it is a consequence of languages' poor treatment of it. The design intention of C was that unsuffixed literals be the highest-precision type, and variadic function arguments should promote to the highest-precision type, so code like printf("%9.4f/%9.4f", x, y*Y_SCALE); wouldn't have to worry about the type of Y_SCALE, even if the same value of Y_SCALE was sometimes used in float and double calculations. Having a long double type which isn't interchangeable in printf makes things awkward, as does...
  • supercat
    supercat about 8 years
    ...having a declaration like long double d=0.1; set d to 0.10000000000000000555 rather than 0.10000000000000000000813151629364.
  • Peter Cordes
    Peter Cordes over 7 years
    serialize operation. You mean "parallel operation". Or SIMD operation.
  • Peter Cordes
    Peter Cordes over 7 years
    You still have to use fpu for trigonometric and other transcendental math operations. If you mean x87 FSIN, FYL2X (log2), etc. then no, that's incorrect. Math libraries implement those functions in software, with SSE math.
  • Peter Cordes
    Peter Cordes over 7 years
    Even before x87 was obsolete, good math libraries didn't use FSIN, because the internal value of Pi used for range reduction isn't accurate enough; only 66 bits. Intel isn't able to change this, for backwards compat reasons, but FSIN has large errors near +/- pi/2
  • user6801759
    user6801759 over 7 years
    Yes, sorry, I meant parallel. Emulation is always much, much slower; in fact that's what we did before numeric coprocessors existed. See Kahan's notes on the IEEE 754 design rationale en.wikipedia.org/wiki/Floating_point#IEEE_754_design_rationale: "This Extended format is designed to be used, with negligible loss of speed,.." But for pragmatic reasons (faster machines, larger capacity in everything), I guess no one bothers with what used to be slow and bloated code anymore.
  • user6801759
    user6801759 over 7 years
    About PI, you might see jpl.nasa.gov/edu/news/2016/3/16/… Multiprecision sure is nice, but it's for fun and exercise only.
  • Peter Cordes
    Peter Cordes over 7 years
    Emulating fsin in software is not much slower. The internal implementation is micro-coded with 71-100 uops (on Intel Haswell), with a total latency of 47-106 cycles, and (in this case) doesn't do anything that can't be done with simple x86 instructions that each decode to only a single uop. And re: Pi precision, the article you linked doesn't say anything about catastrophic cancellation or floating point problems. Did you even read Bruce Dawson's article that I linked earlier? Have you heard of catastrophic cancellation?
  • Peter Cordes
    Peter Cordes over 7 years
    BTW, welcome to Stack Overflow. You should edit your correction ("parallel") into the answer.
  • fuz
    fuz over 6 years
    Note that the x87 FPU actually has a control word which allows you to reduce the internal precision to a 53-bit or even 24-bit significand (matching double and float) to get bitwise identical results, but nobody seems to use that (a sketch appears after these comments).
  • Peter Cordes
    Peter Cordes about 5 years
    @fuz: according to Bruce Dawson, MSVC used to reduce to 64-bit (53-bit significand) in its CRT startup. randomascii.wordpress.com/2012/03/21/… And DirectX apparently used to reduce it to float precision for your whole process!
  • Michel de Ruiter
    Michel de Ruiter over 2 years
    11 bits means a factor 2048: more than 3 digits...?
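
For reference, fuz's control-word suggestion above can be exercised like this on glibc targets (a sketch; <fpu_control.h> is glibc-specific, and MSVC exposes the same knob via _controlfp):

#include <fpu_control.h>   /* glibc-specific x87 control-word access */

/* Drop the x87's internal precision from 64-bit to 53-bit significands,
   so register results match plain IEEE double (exponent range aside,
   as Mark Dickinson notes above). */
static void set_x87_double_precision(void)
{
    fpu_control_t cw;
    _FPU_GETCW(cw);
    cw &= ~_FPU_EXTENDED;    /* clear the precision-control field */
    cw |= _FPU_DOUBLE;       /* select 53-bit precision */
    _FPU_SETCW(cw);
}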