Converting from double to float in Java

java floating-point double type-conversion

28,882

Solution 1

From the Java Language Specification, section 5.1.3:

A narrowing primitive conversion from double to float is governed by the IEEE 754 rounding rules (§4.2.4). This conversion can lose precision, but also lose range, resulting in a float zero from a nonzero double and a float infinity from a finite double. A double NaN is converted to a float NaN and a double infinity is converted to the same-signed float infinity.

and section 4.2.4 says:

The Java programming language requires that floating-point arithmetic behave as if every floating-point operator rounded its floating-point result to the result precision. Inexact results must be rounded to the representable value nearest to the infinitely precise result; if the two nearest representable values are equally near, the one with its least significant bit zero is chosen. This is the IEEE 754 standard's default rounding mode known as round to nearest.
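The cases named in the two quoted passages can be seen with a plain cast. This is a minimal sketch: the class name `NarrowingDemo` and the helper `narrow` are illustrative names, not anything from the specification.

```java
public class NarrowingDemo {
    // A plain narrowing cast; the JLS rules quoted above govern the result.
    static float narrow(double d) {
        return (float) d;
    }

    public static void main(String[] args) {
        System.out.println(narrow(0.1));                      // 0.1 (nearest float to the double 0.1)
        System.out.println(narrow(1e-50));                    // 0.0 (range lost: nonzero double becomes zero)
        System.out.println(narrow(1e40));                     // Infinity (range lost: finite double becomes infinite)
        System.out.println(narrow(Double.NaN));               // NaN
        System.out.println(narrow(Double.NEGATIVE_INFINITY)); // -Infinity

        // Tie cases: both doubles lie exactly halfway between two adjacent floats,
        // so the float whose least significant bit is zero is chosen.
        double tieDown = 1.0 + Math.scalb(1.0, -24);     // halfway between 1 and 1 + 2^-23
        double tieUp = 1.0 + 3 * Math.scalb(1.0, -24);   // halfway between 1 + 2^-23 and 1 + 2^-22
        System.out.println(narrow(tieDown) == 1.0f);                        // true
        System.out.println(narrow(tieUp) == 1.0f + 2 * Math.ulp(1.0f));     // true
    }
}
```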

Solution 2

I would suggest that floating-point types are most usefully regarded as representing ranges of values. The reason that 0.1f displays as 0.1 rather than as 0.100000001490116119384765625 is that it really represents the range of numbers from 13421772.5/134217728 to 13421773.5/134217728 (i.e. from 0.0999999977648258209228515625 to 0.1000000052154064178466796875); it wouldn't make sense to add extra digits indicating the number is greater than 0.100 when it might be less, nor to use a string of nines indicating the number is less than 0.100 when it might be greater.
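The figures in the paragraph above can be reproduced with `BigDecimal`, which converts a binary floating-point value to its exact decimal expansion. This is only a sketch; the class and method names (`FloatRangeDemo`, `exact`, `lowerBound`, `upperBound`) are made up for illustration, and the half-ulp bounds are only meaningful for values that are not at a power-of-two boundary.

```java
import java.math.BigDecimal;

public class FloatRangeDemo {
    // Exact decimal expansion of the binary value stored in the float.
    static BigDecimal exact(float f) {
        return new BigDecimal(f); // float widens to double exactly, then converts exactly
    }

    // Half the spacing between adjacent floats near f, as an exact decimal.
    static BigDecimal halfUlp(float f) {
        return new BigDecimal(Math.ulp(f) / 2.0);
    }

    // Lower edge of the range of real numbers that round to f.
    static BigDecimal lowerBound(float f) {
        return exact(f).subtract(halfUlp(f));
    }

    // Upper edge of the range of real numbers that round to f.
    static BigDecimal upperBound(float f) {
        return exact(f).add(halfUlp(f));
    }

    public static void main(String[] args) {
        float f = 0.1f;
        System.out.println(f);                               // 0.1
        System.out.println(exact(f).toPlainString());        // 0.100000001490116119384765625
        System.out.println(lowerBound(f).toPlainString());   // 0.0999999977648258209228515625
        System.out.println(upperBound(f).toPlainString());   // 0.1000000052154064178466796875
    }
}
```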

Casting a double to a float will select the float whose range of values includes the range of values represented by the double. Note that while this operation is non-reversible, the result of the operation will generally be arithmetically correct; the only time it would not be 100% arithmetically correct would be if one were casting to float a double whose range was precisely centered on the boundary between two floats. In that situation, the system would select the float on one side or the other of the double's range; if the double in fact represented a number on the wrong side of the range, the resulting conversion would be slightly inaccurate.

In practice, the tiny imprecision mentioned above is almost never relevant, because the "range of values" represented by a floating-point type is in practice a little larger than indicated above. Performing a calculation (such as addition) on two numbers that have a certain amount of uncertainty will yield a result with more uncertainty, but the system won't keep track of how much uncertainty exists. Nonetheless, unless one performs dozens of operations on a float, or thousands of operations on a double, the amount of uncertainty will usually be small enough not to worry about.

It's important to note that casting a float to a double is actually a far more dangerous operation than casting a double to a float, even though Java allows the former implicitly without a warning but squawks at the latter. Casting a float to a double causes the system to select the double whose range is centered about the center of the float's range. This will almost always result in a value whose actual uncertainty is far greater than would be typical of double-precision numbers. For example, if one casts 0.1f to double, the resulting double will represent a number in the range 0.10000000149011611 to 0.10000000149011613, even though the number it's supposed to be representing (one tenth) is, relatively speaking, nowhere near that range.
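A short sketch of the widening hazard described above. The names `WideningDemo`, `widen`, and `viaDecimalString` are illustrative only, and going through the decimal string is just one possible workaround: it assumes the float was originally meant to hold a short decimal value such as one tenth.

```java
public class WideningDemo {
    // Implicit widening: compiles without a cast or a warning.
    static double widen(float f) {
        return f;
    }

    // Assumption: the float stands for the short decimal that Float.toString prints,
    // so re-parsing that text yields the double nearest to the intended value.
    static double viaDecimalString(float f) {
        return Double.parseDouble(Float.toString(f));
    }

    public static void main(String[] args) {
        double w = widen(0.1f);
        System.out.println(w);                  // 0.10000000149011612
        System.out.println(w == 0.1);           // false: not the double nearest to one tenth
        System.out.println((float) w == 0.1f);  // true: narrowing back recovers the float exactly
        System.out.println(viaDecimalString(0.1f) == 0.1); // true
    }
}
```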


Author by

Franklin

I'm a software engineer. I really like Java. email: franklin [dot] hanner [at] gmail [dot] com website: www.frankhanner.com Cheers!

Updated on July 09, 2022

Comments

Franklin almost 2 years

If I'm working with a double, and I convert it to a float, how does this work exactly? Does the value get truncated so it fits into a float? Or does the value get rounded differently? Sorry if this sounds a bit remedial, but I'm trying to grasp the concept of float and double conversions.
Franklin about 12 years

Thanks for this. I noticed that it mentions the use of IEEE 754 round to nearest. Is there any way to specify a different rounding mode?
Oliver Charlesworth about 12 years

@Franklin: There's a RoundingMode class, but I think that only applies to BigDecimal and BigInteger operations, not to operations on primitives. But I'm not 100% confident on that.
Voo about 12 years

Java supports only one fp rounding mode - there was some talk about adding more years ago (mostly for the HPC community; ie also about handling denorms, etc.), but alas that didn't go anywhere.
Oliver Charlesworth about 12 years

@Voo: Interesting. Do you happen to know of anywhere specific that I could read about that? (as someone in the HPC community...)
Voo about 12 years

@Oli Bloch (I think?) mentioned that as a side note in an interview, as an example of how communities shape a language (i.e. there was not enough support from the overall community to add the feature, because only one subgroup wanted it and features were only added when there was a larger consensus), but there wasn't much more detail than that (pretty sure it was the Seibel book). I do know that Java supports only one rounding mode because I talked with some JVM guy about that and how it simplifies things for them (that was some time ago, so it could have changed, but I doubt it).
Franklin about 12 years

Thanks for the help. I'm essentially trying to truncate the result instead of round to the nearest, so I'm looking at using round toward zero. Any idea how this could be achieved?
Oliver Charlesworth about 12 years

@Franklin: Off the top of my head, the only thing I can think of is messing with the bitwise representation (using e.g. Double.doubleToLongBits()). Hopefully, there are better solutions than that!
Voo about 12 years

@Franklin I can think of only two ways as well. Obviously you can convert the double to long and then implement the rounding mode in software - if it's a rare operation that'll do, but it'll kill performance horribly otherwise. The other one would be to write a JNI function and implement it easily in C. No idea how much the performance cost is these days for crossing that barrier (you just pass a single double so no problems with memory though).
Louis Wasserman about 12 years

@Franklin, just to clarify: you're trying to truncate the double to a float? I think doubleToLongBits is the only realistic way to do that, and it'll be fast, but it might be a bit complicated.
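One possible reading of the doubleToLongBits idea, as a rough sketch rather than a complete solution: clear the 29 low mantissa bits of the double (52 double mantissa bits minus 23 float mantissa bits), so that the remaining value is exactly representable as a float and the cast no longer rounds. The class and method names are hypothetical, and the sketch assumes the magnitude lies in float's normal range (between Float.MIN_NORMAL and Float.MAX_VALUE); subnormal results and overflow are not handled.

```java
public class TruncateBits {
    // The low 29 bits of a double's mantissa have no counterpart in a float (52 - 23 = 29).
    private static final long LOW_BITS = (1L << 29) - 1;

    // Assumption: Float.MIN_NORMAL <= |d| <= Float.MAX_VALUE.
    static float truncate(double d) {
        long bits = Double.doubleToLongBits(d);
        // Clearing low mantissa bits moves the value toward zero for either sign,
        // because the sign is stored separately from the magnitude.
        double chopped = Double.longBitsToDouble(bits & ~LOW_BITS);
        return (float) chopped; // exact: chopped has at most 23 explicit mantissa bits
    }

    public static void main(String[] args) {
        System.out.println((float) 0.1 > 0.1);    // true: the plain cast rounded up
        System.out.println(truncate(0.1) <= 0.1); // true: the truncated value is not above 0.1
    }
}
```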
supercat almost 11 years

Casts from double to float may lose specificity on their range, but casting e.g. 1E40 or 1E140 to a float will correctly yield single-precision positive infinity. The system won't be able to distinguish those numbers from each other, but it will correctly recognize both as being larger than any non-infinite float. Casting that value to double, however, will yield something that erroneously compares greater than 1E+308, i.e. that's off by 260 orders of magnitude. I'd say double->float does a much better job of preserving magnitude than float->double.
supercat over 9 years

@Franklin: The simplest approach is probably to cast to float and compare magnitude. If the float is larger, multiply by the quantity (16777215f/16777216f) to get the next smaller one.
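A sketch of the cast-and-compare approach from the comment above. Note the substitution: instead of multiplying by (16777215f/16777216f), this version steps to the adjacent float with `Math.nextDown` / `Math.nextUp` (available since Java 8), which also covers subnormal and infinite results. The class and method names are hypothetical.

```java
public class TruncateToFloat {
    // Round a double to float toward zero instead of to nearest.
    static float towardZero(double d) {
        float f = (float) d; // round to nearest, per the JLS
        if (Float.isNaN(f)) {
            return f;
        }
        // If rounding moved the value away from zero, step back by one float.
        if (Math.abs((double) f) > Math.abs(d)) {
            f = (f > 0) ? Math.nextDown(f) : Math.nextUp(f);
        }
        return f;
    }

    public static void main(String[] args) {
        System.out.println(towardZero(0.1) <= 0.1);              // true
        System.out.println(towardZero(1e40) == Float.MAX_VALUE); // true: finite input stays finite
        System.out.println(towardZero(0.5));                     // 0.5 (already exact)
    }
}
```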