Is using double faster than float?

27,388

Solution 1

There isn't a single "Intel CPU" (different models optimize different operations relative to one another), but for most of them, at the CPU level (specifically within the FPU), the answer to your question:

are double operations just as fast or faster than float operations for +, -, *, and /?

is "yes" -- within the CPU, except for division and sqrt, which are somewhat slower for double than for float. (Assuming your compiler uses SSE2 for scalar FP math, as all x86-64 compilers do, and as some 32-bit compilers do depending on options. Legacy x87 doesn't have different widths in registers, only in memory (it converts on load/store), so historically even sqrt and division were just as slow for double.)

For example, Haswell has a divsd throughput of one per 8 to 14 cycles (data-dependent), but a divss (scalar single) throughput of one per 7 cycles. x87 fdiv is 8 to 18 cycle throughput. (Numbers from https://agner.org/optimize/. Latency correlates with throughput for division, but is higher than the throughput numbers.)

The float versions of many library functions like logf(float) and sinf(float) will also be faster than log(double) and sin(double), because they have many fewer bits of precision to get right: they can use polynomial approximations with fewer terms to reach full precision for float than for double.


However, taking up twice the memory for each number clearly implies heavier load on the cache(s) and more memory bandwidth to fill and spill those cache lines from/to RAM; the time you care about performance of a floating-point operation is when you're doing a lot of such operations, so the memory and cache considerations are crucial.

@Richard's answer points out that there are also other ways to perform FP operations (the SSE / SSE2 instructions; good old MMX was integers-only), especially suitable for simple ops on lots of data ("SIMD", single instruction / multiple data), where each vector register can pack 4 single-precision floats or only 2 double-precision ones, so this effect will be even more marked.

In the end, you do have to benchmark, but my prediction is that for reasonable (i.e., large;-) benchmarks, you'll find an advantage in sticking with single precision (assuming, of course, that you don't need the extra bits of precision!-).

Solution 2

If all floating-point calculations are performed within the FPU, then, no, there is no difference between a double calculation and a float calculation because the floating point operations are actually performed with 80 bits of precision in the FPU stack. Entries of the FPU stack are rounded as appropriate to convert the 80-bit floating point format to the double or float floating-point format. Moving sizeof(double) bytes to/from RAM versus sizeof(float) bytes is the only difference in speed.

If, however, you have a vectorizable computation, then you can use the SSE extensions to run four float calculations in the same time as two double calculations. Therefore, clever use of the SSE instructions and the XMM registers can allow higher throughput on calculations that only use floats.

Solution 3

Another point to consider is whether you are using a GPU (the graphics card). I work on a project that is numerically intensive, yet we do not need the precision that double offers. We use GPU cards to help speed up the processing further. CUDA GPUs need a special package to support double, and the local RAM on a GPU is quite fast but quite scarce. As a result, using float also doubles the amount of data we can store on the GPU.

Yet another point is memory. Floats take half as much RAM as doubles. If you are dealing with VERY large datasets, this can be a really important factor. If using double means you have to cache to disk versus pure RAM, the difference will be huge.

So for the application I am working with, the difference is quite important.

Solution 4

I just want to add to the already existing great answers that the __m256 family of single-instruction, multiple-data (SIMD) C++ intrinsic types operates on either 4 doubles in parallel (e.g. _mm256_add_pd) or 8 floats in parallel (e.g. _mm256_add_ps).

I'm not sure if this can translate to an actual speed up, but it seems possible to process 2x as many floats per instruction when SIMD is used.

Solution 5

In an experiment that adds 3.3 to a running sum 2,000,000,000 times, the results are:

Summation time in s: 2.82 summed value: 6.71089e+07 // float
Summation time in s: 2.78585 summed value: 6.6e+09 // double
Summation time in s: 2.76812 summed value: 6.6e+09 // long double

So double is slightly faster here, and it is the default in C and C++. It's more portable and the default across all the C and C++ library functions. Also, double has significantly higher precision than float.

Even Stroustrup recommends double over float:

"The exact meaning of single-, double-, and extended-precision is implementation-defined. Choosing the right precision for a problem where the choice matters requires significant understanding of floating-point computation. If you don't have that understanding, get advice, take the time to learn, or use double and hope for the best."

Perhaps the only case where you should use float instead of double is on 64-bit hardware with a modern gcc, and only because float is smaller: double is 8 bytes and float is 4 bytes.

Updated on September 02, 2021

Comments

  • Brent Faust
    Brent Faust over 2 years

    Double values store higher precision and are double the size of a float, but are Intel CPUs optimized for floats?

    That is, are double operations just as fast or faster than float operations for +, -, *, and /?

    Does the answer change for 64-bit architectures?

  • Razor Storm
    Razor Storm almost 14 years
    This would also depend on the cache block size, correct? If your cache retrieves 64-bit or larger blocks, then a double would be just as efficient (if not faster) than a float, at least as far as memory reads/writes are concerned.
  • Peter G.
    Peter G. almost 14 years
    @Razor If you work on exactly as many floats as fit in a cache line, then with doubles the CPU would have to fetch two cache lines. The caching effect I had in mind when reading Alex's answer, however, is this: your set of floats fits in your nth-level cache but the corresponding set of doubles doesn't. In that case you will see a big boost in performance from using floats.
  • Razor Storm
    Razor Storm almost 14 years
    @Peter, yeah that makes sense; say you have a 32-bit cache line, then using doubles would have to fetch twice every time.
  • Alex Martelli
    Alex Martelli almost 14 years
    @Razor, the problem's not really with fetching/storing just one value -- it is, as @Peter's focus correctly indicates, that often you're fetching "several" values to operate on (an array of numbers would be a typical example, and operations on items of such arrays very common in numerical applications). There are counterexamples (e.g., a pointer-connected tree where each node only has one number and a lot of other stuff: then having that number be 4 or 8 bytes will matter pretty little), which is part of why I say that in the end you have to benchmark, but the idea often applies.
  • Razor Storm
    Razor Storm almost 14 years
    @Alex Martelli, I see. That makes sense.
  • Brent Faust
    Brent Faust over 11 years
    +1 for making the effort to do some timings. But Stroustrup doesn't recommend using 'double' because it's faster, but because of the extra precision. Regarding your last comment, if you need that extra precision more than saving memory, then it's quite possible you'd want to use 'double' on 32-bit hardware. And that leads back to the question: Is double faster than float even on 32-bit hardware with a modern FPU that does 64-bit computations?
  • imallett
    imallett almost 9 years
    A few hundredths of a second difference feels like it's still within the realm of experimental error. Especially if there's other stuff too (like maybe a not-unrolled loop . . .).
  • sunside
    sunside about 8 years
    It's quite a stretch to say that Stroustrup is recommending double there when he is actually recommending to RTFM.
  • Peter Cordes
    Peter Cordes over 7 years
    What hardware, what compiler + options, what code? If you timed all 3 in the same program, clock-speed ramp-up time explains the first being slower. Clearly you didn't enable auto-vectorization (impossible for a reduction without -ffast-math or whatever, because FP math isn't strictly associative). So this only proves that there's no speed difference when the bottleneck is scalar FP add latency. The bit about 64-bit hardware makes no sense either: float is always half the size of double on any normal hardware. The only difference on 64-bit hardware is that x86-64 has SSE2 as a baseline.
  • Trevor Boyd Smith
    Trevor Boyd Smith about 6 years
    i did ten iterations of a loop where the loop did std::vector<std::complex<float or double>>, size=10*1000*1000, filled by rand(). const auto p2 = x[i] * x[i]; const auto p4 = p2 * p2; const auto p8 = p4 * p4; y[i] = p8;. the float elapsed time was 0.95 seconds. the double elapsed time was 0.24 seconds.
  • Peter Cordes
    Peter Cordes over 5 years
    double add/sub/mul is as fast as float in modern x86 CPUs, but not div or sqrt. Double has somewhat worse latency and throughput. Floating point division vs floating point multiplication
  • Peter Cordes
    Peter Cordes about 4 years
    Did you choose sizes that make float fit in some level of cache while double doesn't? If you were just bound on memory bandwidth in the same level of cache, you'd expect a simple factor of 2 difference in most. Or are more of those results for a single "vector" of 3 values stored contiguously, not in a SIMD-friendly way, and not amortized over a large array? So what kind of terrible asm did GCC make that led to copy taking a couple cycles for 3 floats but 10x that for 3 doubles?
  • Jedzia
    Jedzia about 4 years
    It's a very good observation, Peter. All theoretical explanations here are valid and good to know. My results are a special case of one setup among many possible solutions. My point isn't how horrible my solution may be, but that in practice there are too many unknowns and you have to test your particular use case to be sure. I appreciate your analysis. This helps me :) But let's focus on the question asked by the OP.
  • Peter Cordes
    Peter Cordes about 4 years
    Ok, that's fair, demoing the fact that compilers can totally suck for no apparent reason when you change float to double is interesting. You should maybe point out that that's what your answer shows, not any fundamental issue or general case.
  • Jedzia
    Jedzia about 4 years
    The guilty one here is me, of course, with my devilish use of "volatile". The compiler has no chance to optimize anything, which was also my goal for this special case. So don't judge GCC too hard :)
  • Jedzia
    Jedzia about 4 years
    To add some backstory: I was just as curious as the OP. Does using double instead of float make a difference? How I read the results: the first ones are too isolated, and only the last two indicate what to expect in a real-world case -> no difference. In my special case. Thanks to Corona I had the time to go down this rabbit hole. This kind of investigation can add many hours, and you have to decide on your own whether it is practical. Let's say for an FPS improvement from 999 to 1177...
  • Peter Cordes
    Peter Cordes about 4 years
    That's definitely getting into "irrelevant" territory, then. You only used volatile on the final result, so GCC could just compile them all to stores of compile-time-constant results. You didn't include the asm, and your code depends on some headers so it's not easy to look at how it compiled on godbolt.org; I guess I could clone your repo and compile it locally if I really wanted to, but IMO it's up to you at this point to demonstrate that your results mean anything.
  • Peter Cordes
    Peter Cordes about 4 years
    Your BM_DoubleCreation (and float) give us a baseline of an empty loop presumably running at 1 cycle per iteration; a volatile with no initializer still optimizes to zero asm instructions with GCC and clang.
  • Peter Cordes
    Peter Cordes about 4 years
    what to expect in a real world case - in many real world cases, you expect a factor of 2 from either memory bandwidth and/or being able to compute twice as many elements per SIMD vector. (addps and addpd have equal throughput for 16 bytes of FP data, but the ps version is 4 elements instead of 2.) So doing anything with SIMD-friendly arrays can usually benefit. See deplinenoise.wordpress.com/2015/03/06/… for more about SIMD-friendly data layout, i.e. arrays of x[], y[], z[], not packed xyz groups. stackoverflow.com/tags/sse/info
  • Jedzia
    Jedzia about 4 years
    Wonderful addition, Compiler Explorer is much more accessible and can provide a quick overview of simpler problems.
  • Peter Cordes
    Peter Cordes about 4 years
    If you weren't aware of Godbolt, see How to remove "noise" from GCC/clang assembly output? for how to get simple readable optimized asm for a small function.
  • Jedzia
    Jedzia about 4 years
    Thanks, I was aware of it and him. Again, my point: "Measure it, then you know it. Here are some tools; use them. It may look like this."
  • Jedzia
    Jedzia about 3 years
    By the way, Peter: you are launching into a tirade of conclusions that may or may not apply to my example. I keep it general, and my point is: measurement is knowledge! And I give an example of how one can measure. That we discovered something wrong with the "volatile" is just the beauty of it. The measured values provide information. Better than guessing, isn't it? I don't think this is irrelevant at all. On the contrary. And that's a personal opinion. To devalue others for it is immature.
  • Jedzia
    Jedzia about 3 years
    The question of the thread covers a huge area of possible CPUs and even raises the question of what it is like with other architectures, such as ARM, MCUs, etc. My answer to that: Don't ask, measure it yourself.
  • Peter Cordes
    Peter Cordes about 3 years
    Better than guessing, isn't it? - yes, but only if you check what the compiler did so you know what you're measuring. Making conclusions based on microbenchmarks that measured something completely different from what you intended can be worse than realizing that something is unknown. But unfortunately that's all too easy when you need compilers to optimize like normal except for still doing some redundant work.