Real numbers - how to determine whether float or double is required?


Solution 1

For background, see What Every Computer Scientist Should Know About Floating-Point Arithmetic

Unfortunately, I don't think there is any way to automate the decision.

Generally, when people represent numbers in floating point, rather than as strings, the intent is to do arithmetic using the numbers. Even if all the inputs fit in a given floating point type with acceptable precision, you still have to consider rounding error and intermediate results.

In practice, most calculations will have enough precision to give usable results when using a 64-bit type. Many calculations will not get usable results using only 32 bits.

On modern processors, buses and arithmetic units are wide enough to give 32-bit and 64-bit floating point similar performance. The main motivation for using 32 bits is to save space when storing a very large array.

That leads to the following strategy:

If arrays are large enough to justify spending significant effort to halve their size, do analysis and experiments to decide whether a 32-bit type gives good enough results, and if so use it. Otherwise, use a 64-bit type.
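
As a rough illustration of such an experiment (a toy sketch, not part of the original answer; the series being summed and the iteration count are arbitrary choices), you can run the same accumulation in both types and compare:

    #include <cstdio>

    int main() {
        // Sum the same series in 32-bit and 64-bit floating point.
        float  sum_f = 0.0f;
        double sum_d = 0.0;
        const int n = 10000000;   // 10 million terms of 0.1 -> exact answer is 1,000,000

        for (int i = 0; i < n; ++i) {
            sum_f += 0.1f;        // rounding error accumulates quickly in 24 bits
            sum_d += 0.1;         // and far more slowly in 53 bits
        }

        std::printf("float : %.7g\n", sum_f);    // drifts noticeably away from 1e6
        std::printf("double: %.15g\n", sum_d);   // stays very close to 1e6
        return 0;
    }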

Solution 2

I think your question presupposes a way to specify any "real number" to C / C++ (or any other program) without precision loss.

Suppose that you get this real number by specifying it in code or through user input; a way to check whether a float or a double is enough to store it without precision loss is to count the number of significant bits and check that against the significand size and exponent range of float and double.
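
One concrete form of that check, which the comments below discuss, is to round-trip the value through float and see whether anything is lost. A minimal sketch, assuming the value is already held in a double (the helper name fits_in_float is just for illustration):

    #include <iostream>

    // True if d survives a round trip through float, i.e. float's 24-bit
    // significand and narrower exponent range can hold d exactly.
    // (Not meaningful for NaN, which never compares equal to itself.)
    bool fits_in_float(double d) {
        return static_cast<double>(static_cast<float>(d)) == d;
    }

    int main() {
        std::cout << std::boolalpha;
        std::cout << fits_in_float(0.5)        << '\n';  // true: exact power of two
        std::cout << fits_in_float(0.1)        << '\n';  // false: needs more than 24 significand bits
        std::cout << fits_in_float(16777217.0) << '\n';  // false: 2^24 + 1 is one bit too wide for float
        return 0;
    }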

If the number is given as an expression (e.g. 1/7 or sqrt(2)) rather than a literal, you will also need extra machinery to evaluate it and to detect values that cannot be stored exactly in any binary floating-point type.

Moreover, there are numbers, such as 0.9, that float / double cannot represent "exactly" (at least not in our binary computation paradigm) - see Jon Skeet's excellent answer on this.
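
To see this for yourself, print the value with more digits than the type can honour (a throwaway illustration; the outputs in the comments are what a typical IEEE 754 system produces):

    #include <cstdio>

    int main() {
        // Neither literal is exactly 0.9; each is the nearest representable value.
        std::printf("%.25f\n", 0.9f);   // 0.8999999761581420898437500
        std::printf("%.25f\n", 0.9);    // 0.9000000000000000222044605
        return 0;
    }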

Lastly, see additional discussion on float vs. double.

Solution 3

Precision is not very platform-dependent. Although platforms are allowed to be different, float is almost universally IEEE standard single precision and double is double precision.

Single precision assigns 23 bits to the "mantissa", the binary digits after the radix point. Since the bit before the point is always one for normalized numbers, this amounts to a 24-bit significand. Dividing by log2(10) ≈ 3.32, a float gives you about 7.2 decimal digits of precision.

Following the same process for double yields 15.9 digits and long double yields 19.2 (for systems using the Intel 80-bit format).

Apart from one sign bit, the bits besides the mantissa are used for the exponent. The number of exponent bits determines the range of representable numbers: single precision reaches about 10^±38, double about 10^±308.
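
Rather than computing these figures by hand, you can query them from the implementation via std::numeric_limits; a small sketch (the values in the comments are the usual IEEE 754 ones, with an Intel 80-bit long double, and may differ on your platform):

    #include <iostream>
    #include <limits>

    template <typename T>
    void report(const char* name) {
        using L = std::numeric_limits<T>;
        std::cout << name
                  << ": significand bits = " << L::digits         // e.g. 24 / 53 / 64
                  << ", guaranteed decimal digits = " << L::digits10   // e.g. 6 / 15 / 18
                  << ", max exponent ~ 10^" << L::max_exponent10  // e.g. 38 / 308 / 4932
                  << '\n';
    }

    int main() {
        report<float>("float");
        report<double>("double");
        report<long double>("long double");
        return 0;
    }

Note that digits10 is the guaranteed (rounded-down) decimal digit count, slightly less than the 7.2 and 15.9 figures obtained by dividing the significand width by log2(10).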

As for whether you need 7, 16, or 19 digits or if limited-precision representation is appropriate at all, that's really outside the scope of the question. It depends on the algorithm and the application.

Solution 4

A very detailed post that may or may not answer your question.

An entire series on floating-point complexities!

Author: Soham Chakraborty

Updated on August 04, 2022

Comments

  • Soham Chakraborty
    Soham Chakraborty over 1 year

    Given a real value, can we check if a float data type is enough to store the number, or a double is required?

    I know precision varies from architecture to architecture. Is there any C/C++ function to determine the right data type?

  • SChepurin
    SChepurin over 11 years
    Please do not suggest a solution like this one. Float and double are different in many aspects.
  • SChepurin
    SChepurin over 11 years
    @Angew - I leave it to your research. But you can freely disagree with that.
  • Victor K
    Victor K over 11 years
    If you cast a double to float and then back to double, the result is almost(*) never equal to the original value, even if the original value can be represented as a float (up to its precision)
  • Eric Postpischil
    Eric Postpischil over 11 years
    @VictorK: What do you mean that, if the original value can be represented as float, converting to float and back to double almost never produces the original value? If the value in a double is exactly representable as a float, then both conversions produce the exact value; there is no change.
  • Pete Becker
    Pete Becker over 11 years
    Umm, I read the first dozen or so items in the series on floating point complexities, and they're at best oversimplified and at worst downright wrong. For example, "FLT_MIN is not the smallest positive float (FLT_MIN is the smallest positive normalized float)" is true if your hardware does subnormals. Most does, but not all. And that's why std::numeric_limits has a Boolean member named has_denorm.
  • SChepurin
    SChepurin over 11 years
    @Eric Postpischil - Note that the question was about precision. When handling float and double representations of a value, you will most likely have to take care of different formatting, e.g. std::setprecision.
  • Eric Postpischil
    Eric Postpischil over 11 years
    @SChepurin: That statement does not appear to be related to my question.
  • SChepurin
    SChepurin over 11 years
    @Eric Postpischil - Agree:) This is kind of a twisted discussion. Just wanted to provide one of the reasons not to implement this solution.
  • jonathanasdf
    jonathanasdf over 11 years
    That particular article does state that it is talking about the IEEE 754 standard, in which subnormals ARE defined. If your hardware does not happen to be standards compliant, then you can hardly blame an article about the standard for being wrong about your hardware. The articles might be oversimplified, but for someone with no knowledge of the whole floating-point business, I feel it is at the right level of complexity.
  • Pete Becker
    Pete Becker over 11 years
    I only looked at the first page, but I don't see where it says it's about IEEE 754. Regardless, C++ does not require IEEE 754. The problem most people have with floating-point arithmetic is that their view of it is oversimplified; yet another oversimplification doesn't help that.
  • Pascal Cuoq
    Pascal Cuoq over 11 years
    @PeteBecker For a large majority of programmers, assuming that their programming platform provides them with IEEE 754 floating-point arithmetic and understanding what this means (with some of the implications listed on altdevblogaday.com/2012/04/05/floating-point-complexities ) would be a huge improvement.
  • Jakob S.
    Jakob S. over 11 years
    @Eric Postpischil: This is exactly what I had in mind. In all other cases I would say: float is not sufficient, as the number is not exactly representable as float and therefore "something" is lost. Whether you do or do not care about that "something" has to be decided by the developer and not the machine.
  • Pete Becker
    Pete Becker over 11 years
    @PascalCuoq - sure, if it's stated clearly that what's being said applies to IEEE 754 implementations. My objection to the article in question is that it provides cute generalities without supplying that context.
  • Victor K
    Victor K over 11 years
    @Eric Postpischil A double has a 53-bit significand and a float has a 24-bit significand; when you convert a double to float, you lose 29 bits, even if the number is within the min/max values for single-precision float (I didn't say it can be represented exactly; I guess it's my bad choice of words)
  • Eric Postpischil
    Eric Postpischil over 11 years
    @VictorK: The code in this answer is intended to detect whether a double is exactly representable as a float. Given that, the behavior you describe is not a criticism; it supports the purpose of the code: A double that cannot be exactly represented by a float is altered by the round-trip conversions, and a double that can be exactly represented by a float is not altered. That is the intent.
  • Victor K
    Victor K over 11 years
    OK, I agree that it precisely answers the question. It's the question that I find, err... questionable. What's the problem the OP is trying to solve?
  • Potatoswatter
    Potatoswatter over 11 years
    Vector computing (e.g. SSE) may get twice the throughput through the same ALU using single precision vs double, so 64-bit ALUs being commonplace isn't a good argument. Likewise you can fit twice as many 32-bit numbers through a data bus in the same amount of time, regardless of the width of the bus. The motivation for making things smaller is performance. Anyway, some kind of analysis of precision is usually warranted, since without that you can be blindsided by a precision bug in 64-bit just as in 32-bit.
  • Raj
    Raj about 4 years
    For double, shouldn't it be log10(2^53) = 15.95 digits?
  • Potatoswatter
    Potatoswatter about 4 years
    @Raj The implicit leading 1 also counts even though it doesn’t take storage space.
  • Raj
    Raj about 4 years
    52 bits for mantissa and an implicit leading 1?? So total 53. Am I missing something?