Is multiplication faster than float division?

32,307

Solution 1

Multiplication is faster than division. At university I was taught that division takes six times that of multiplication. The actual timings are architecture dependent but in general multiplication will never be slower or even as slow as division. Always optimize your code towards using multiplication if the rounding errors allow.

So in an example this would typically be slower ...

for (int i=0; i<arraySize; i++) {
    a[i] = b[i] / x;
}

... than this ...

y=1/x;
for (int i=0; i<arraySize; i++) {
    a[i] = b[i] * y;
}

Of course with rounding errors, you'll loose (a little) precision with the second method, but unless you are repeatedly calculating x=1/x; that's unlikely to cause much issue.

Edit:

Just for reference. I've dug up a third party comparison of operation timings by searching on Google.

http://gmplib.org/~tege/x86-timing.pdf

Look at the numbers on MUL and DIV. This indicates differences of between 5 and 10 times depending on the processor.

Solution 2

It is quite likely that the compiler will convert a divide to a multiply in this case, if it "thinks" it's faster. Dividing by 2 in floating point may also be faster than other float divides. If the compiler doesn't convert it, it MAY be faster to use multiply, but not certain - depends on the processor itself.

The gain from manually using multiply instead of divide can be quite large in cases where the compiler can't determine that it's "safe" to do so (e.g. 0.1 can't be stored exactly as 0.1 in a floating point number, it becomes 0.10000000149011612). See below for figures on AMD processors which can be taken as representative for the class.

To tell if your compiler does this well or not, why don't you write a bit of code to experiment. Make sure you write it so that the compiler doesn't just calculate a constant value and discards all the calculation in the loop tho'.

Edit:

AMD's optimisation guide for Family 15h processors, provide figures for fdiv and fmul are 42 and 6 respectively. SSE versions are a little closer, 24 (single) or 27 (double) cycles for DIVPS, DIVPD DIVSS and DIVSD (divide), and 6 cycles for all forms of multiply.

From memory, Intel's figures aren't that far off.

Solution 3

Floating point multiplication usually takes fewer cycles than floating point division. But with literal operands the optimizer is well aware of this kind of micro-optimizations.

Share:
32,307
Admin
Author by

Admin

Updated on March 03, 2020

Comments

  • Admin
    Admin about 4 years

    In C/C++, you can set up the following code:

    double a, b, c;
    ...
    c = (a + b) / 2;
    

    This does the exact same thing as:

    c = (a + b) * 0.5;
    

    I'm wondering which is better to use. Is one operation fundamentally faster than the other?

  • Thomas
    Thomas almost 11 years
    "Can't understand why this has been down voted so much" because on SO any question that even comes remotely close to undefined/unspecified/implementation dependent behaviour is likely immediately downvoted, closed, and swept under the carpet. Apparently pedantry is more relevant than realistic and pragmatic development needs. Slightly tongue-in-cheek but it sometimes feels that way.
  • Eric Postpischil
    Eric Postpischil almost 11 years
    This is a situation where we cannot rely on the optimizer, if we ask about cases other than single example in the question. Division by a power of two and multiplication by the inverse are equivalent in binary floating-point, but divisions by other factors have no equivalent multiplication, because the reciprocal cannot be exactly represented. This prohibits the optimizer from making the transformation. Thus, programmers should be aware of this and should favor multiplication if they know the rounding error in the inverse is acceptable.
  • Mats Petersson
    Mats Petersson almost 11 years
    @EricPostpischil: Does it "help" to give flags for "use fast math" in cases where we don't care too much about precision? Say, forexample in case of 3d games, you'd expect that the last couple of bits of floating point precision doesn't matter.
  • Philip Couling
    Philip Couling almost 11 years
    @Thomas yup. What's galling is that this isn't even ambiguous or undefined. Processor manufacturers usually publish the stats. This is where the stats come from for compilers to perform the optimizations. So here "undefined" means "I haven't read it". aaarrrrrrr. Okay. </rant>
  • Mats Petersson
    Mats Petersson almost 11 years
    What the h*** did I get wrong now, that I get downvoted for providing more information? Please tell me so I can learn!
  • jason
    jason almost 11 years
    I didn't downvote, have no idea why someone else did, but I hope you don't mind that I corrected your statement about 0.1.
  • Stephen Canon
    Stephen Canon almost 11 years
    x/y —> x * (1/y) is licensed by -ffast-math. Specifically, this is licensed by -freciprocal-math, which is one of the “optimizations” included in -ffast-math.
  • josesuero
    josesuero almost 11 years
    It's not quite clear what you mean in the second paragraph (the parenthesis is never closed, and the sentence kind of trails off. But as has been pointed out in the comments to other answers, the compiler typically won't convert divisions to multiplies when dealing with floating-point code.
  • Damon
    Damon almost 11 years
    @Thomas: I would guess that the downvotes come from the fact that the question is pretty trivial and has an exact duplicate that's automatically shown in the "related" column to the right (read as: no research effort), the optimization has a high likelihood of being premature (think out of order execution and memory bandwidth), and the code snippet contains implicit int-double conversions and extra operations that are unrelated to the question. (That's only a guess, of course, I'm not one of the downvoters, so can't tell)
  • Mats Petersson
    Mats Petersson almost 11 years
    @Jalf: yes, I have clarified the second paragraph a little bit.
  • Lightness Races in Orbit
    Lightness Races in Orbit almost 11 years
    @Damon is correct. Guessing at downvoters' motivations may be a fun game but, in this case at least, Thomas lost.
  • Mike Dunlavey
    Mike Dunlavey almost 11 years
    @Thomas: I tend to go easy on newbies. We all make or at least made mistakes at one point or another, and getting scorched for it doesn't make anyone feel welcome. Helping less experienced people is a good thing, and StackOverflow is very good at that.
  • Peter Cordes
    Peter Cordes about 8 years
    For x86, see agner.org/optimize. div and mul have different ratios of throughput to latency. Both are only 1 uop, so can overlap well with other operations. Your ~6x more expensive looks about right for throughput, but the latency ratio these days for FP div is more like 4 (on Intel Haswell). Intel Skylake has a very heavily pipelined FP div unit (throughput of one double result per 4 cycles). (256b SIMD vectors have lower throughput).