Emulate "double" using 2 "float"s


Solution 1

double-float is a technique that uses pairs of single-precision numbers to achieve almost twice the precision of single-precision arithmetic, at the cost of a slight reduction of the single-precision exponent range (due to intermediate underflow and overflow at the far ends of the range). The basic algorithms were developed by T.J. Dekker and William Kahan in the 1970s. Below I list two fairly recent papers that show how these techniques can be adapted to GPUs. Much of the material in these papers applies regardless of platform, so it should be useful for the task at hand.

https://hal.archives-ouvertes.fr/hal-00021443 Guillaume Da Graça, David Defour: "Implementation of float-float operators on graphics hardware", 7th Conference on Real Numbers and Computers (RNC7).

http://andrewthall.org/papers/df64_qf128.pdf Andrew Thall: "Extended-Precision Floating-Point Numbers for GPU Computation".
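
As a rough illustration of the technique, here is a minimal C sketch of float-float addition and comparison. The names `df_t`, `two_sum`, `quick_two_sum`, `df_add` and `df_cmp` are made up for this example and are not taken from either paper. The sketch assumes round-to-nearest IEEE 754 single-precision arithmetic with no excess intermediate precision, and a compiler that does not reorder floating-point expressions (so no `-ffast-math` or similar).

```c
/* Hypothetical float-float type: the value represented is hi + lo,
 * where lo holds the rounding error that did not fit into hi. */
typedef struct { float hi; float lo; } df_t;

/* Error-free sum of two floats: on return, hi + lo == a + b exactly
 * (barring overflow). Works for any ordering of |a| and |b|. */
static df_t two_sum(float a, float b)
{
    float s  = a + b;
    float bv = s - a;                 /* part of b that made it into s */
    float av = s - bv;                /* part of a that made it into s */
    float e  = (a - av) + (b - bv);   /* what was rounded away */
    df_t r = { s, e };
    return r;
}

/* Cheaper variant; only valid when |a| >= |b|. */
static df_t quick_two_sum(float a, float b)
{
    float s = a + b;
    float e = b - (s - a);
    df_t r = { s, e };
    return r;
}

/* Float-float addition: add the high parts and the low parts with
 * error-free sums, then fold the pieces back into a normalized pair. */
static df_t df_add(df_t x, df_t y)
{
    df_t s = two_sum(x.hi, y.hi);
    df_t t = two_sum(x.lo, y.lo);
    s.lo += t.hi;
    s = quick_two_sum(s.hi, s.lo);
    s.lo += t.lo;
    return quick_two_sum(s.hi, s.lo);
}

/* Lexicographic comparison; assumes both operands are normalized
 * (as produced by df_add). Returns -1, 0 or 1. */
static int df_cmp(df_t x, df_t y)
{
    if (x.hi != y.hi) return (x.hi < y.hi) ? -1 : 1;
    if (x.lo != y.lo) return (x.lo < y.lo) ? -1 : 1;
    return 0;
}
```

Note that there is no "base" and no explicit carry here: the low word simply stores whatever the high word's rounding discarded, and the error-free sums recover that quantity using ordinary float operations.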

Solution 2

This is not going to be simple.

A float (IEEE 754 single-precision) has 1 sign bit, 8 exponent bits, and 23 bits of mantissa (well, effectively 24).

A double (IEEE 754 double-precision) has 1 sign bit, 11 exponent bits, and 52 bits of mantissa (effectively 53).

You can use the sign bit and 8 exponent bits from one of your floats, but how are you going to get 3 more exponent bits and 29 bits of mantissa out of the other?

Maybe somebody else can come up with something clever, but my answer is "this is impossible". (Or at least, "no easier than using a 64-bit struct and implementing your own operations")

Solution 3

It depends a bit on what types of operations you want to perform. If you only care about additions and subtractions, Kahan Summation can be a great solution.
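
For reference, a minimal sketch of Kahan summation over an array of floats could look like the following. The function names are illustrative only, and the code assumes the compiler does not apply value-changing floating-point optimizations (for example `-ffast-math`), which can optimize the compensation away.

```c
#include <stddef.h>

/* Plain left-to-right summation, shown only for comparison. */
static float naive_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}

/* Kahan (compensated) summation: c carries the low-order bits that
 * were lost when the previous term was added to sum. */
static float kahan_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    float c   = 0.0f;             /* running compensation */
    for (size_t i = 0; i < n; ++i) {
        float y = x[i] - c;       /* apply the correction to the next term */
        float t = sum + y;        /* low-order bits of y may be lost here */
        c = (t - sum) - y;        /* recover (negated) what was lost */
        sum = t;
    }
    return sum;
}
```

With one large value followed by many small ones, each below half an ulp of the large value, `naive_sum` never moves away from the large value, while `kahan_sum` accumulates the small terms in `c` until they become large enough to affect `sum`.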

Solution 4

If you need both the precision and a wide range, you'll be needing a software implementation of double precision floating point, such as SoftFloat.

(For addition, the basic principle is to break the representation (e.g. 64 bits) of each value into its three constituent parts: sign, exponent and mantissa. You then shift the mantissa of one operand based on the difference in the exponents, add it to or subtract it from the mantissa of the other operand based on the sign bits, and possibly renormalise the result by shifting the mantissa and adjusting the exponent correspondingly. Along the way, there are a lot of fiddly details to account for, in order to avoid unnecessary loss of accuracy and to deal with special values such as infinities, NaNs, and denormalised numbers.)
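
To make the first step concrete, here is a small sketch that splits a 64-bit IEEE 754 double bit pattern into its three fields using only integer operations. `f64_parts` and `f64_unpack` are hypothetical names for this example, not part of SoftFloat's API, and the sketch covers only the unpacking, not alignment, rounding, or special-value handling.

```c
#include <stdint.h>

/* Hypothetical unpacked form of a double stored as a 64-bit integer. */
typedef struct {
    uint32_t sign;      /* 0 or 1 */
    int32_t  exponent;  /* biased exponent, 0..2047 (bias is 1023) */
    uint64_t mantissa;  /* 52 stored bits, plus the implicit leading 1
                           at bit 52 for normal numbers */
} f64_parts;

static f64_parts f64_unpack(uint64_t bits)
{
    f64_parts p;
    p.sign     = (uint32_t)(bits >> 63);
    p.exponent = (int32_t)((bits >> 52) & 0x7FF);
    p.mantissa = bits & 0x000FFFFFFFFFFFFFULL;
    /* Normal numbers have an implicit leading 1; zeros and denormals
     * (exponent 0) and infinities/NaNs (exponent 0x7FF) do not. */
    if (p.exponent != 0 && p.exponent != 0x7FF)
        p.mantissa |= 1ULL << 52;
    return p;
}
```

For example, the bit pattern `0x3FF0000000000000` (the double 1.0) unpacks to sign 0, biased exponent 1023, and a mantissa with only the implicit bit set.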

Solution 5

Given all the constraints for high precision over 23 orders of magnitude, I think the most fruitful method would be to implement a custom arithmetic package.

A quick survey shows Briggs' doubledouble C++ library should address your needs and then some. See this.[*] The default implementation is based on double and achieves about 30 significant figures, but it is readily rewritten to use float, which gives 13 or 14 significant figures. That may be enough for your requirements if you take care to group additions of similar-magnitude values and only add the extremes together in the final operations.

Beware though, the comments mention messing around with the x87 control register. I didn't check into the details, but that might make the code too non-portable for your use.


[*] The C++ source is linked from that article, but only the gzipped tar link was not dead.

Author by

Admin

Updated on June 06, 2022

Comments

  • Admin
    Admin almost 2 years

    I am writing a program for an embedded hardware that only supports 32-bit single-precision floating-point arithmetic. The algorithm I am implementing, however, requires a 64-bit double-precision addition and comparison. I am trying to emulate double datatype using a tuple of two floats. So a double d will be emulated as a struct containing the tuple: (float d.hi, float d.low).

    The comparison should be straightforward using a lexicographic ordering. The addition, however, is a bit tricky because I am not sure which base I should use. Should it be FLT_MAX? And how can I detect a carry?

    How can this be done?


    Edit (Clarity): I need the extra significant digits rather than the extra range.

  • phkahler
    phkahler almost 13 years
    +1 interesting technique I hadn't heard of. Won't help though if his inputs need the extra precision to start with (not sure).
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE almost 13 years
    +1 unlike the other answers this one actually addresses OP's question and gives very good links to the relevant papers.
  • phuclv
    phuclv almost 10 years
    Using the float-float technique he can't achieve double's range or its full precision, but it is still significantly more precise than float and much faster than software double when only hardware float arithmetic is available, as on old CUDA GPUs or ARM CPUs.
  • Slava P
    Slava P over 7 years
    Excellent! I was just looking for a way to make my constexpr functions more precise. And it only took half an hour to implement the functions in the first link.
  • phuclv
    phuclv over 7 years
    It's actually very practical in case one doesn't need the extra range of double and was widely used in old NVIDIA CUDA GPUs when they didn't support double or in some compilers for near-quadruple-precision when hardware support is not available
  • phuclv
    phuclv over 7 years
    in case he doesn't really need arbitrary precision (like 46-bit precision is enough) then MPFR is overkill
  • Don Hatch
    Don Hatch over 3 years
    @phkahler If an input to the summation is already float-float, just feed the two floats in separately.