Is integer multiplication really done at the same speed as addition on a modern CPU?

c++ performance cpu multiplication addition

35,772

Solution 1

Multiplication of two n-bit numbers can in fact be done in O(log n) circuit depth, just like addition.

Addition in O(log n) is done by splitting the number in half and (recursively) adding the two parts in parallel, where the upper half is solved for both the "0-carry" and "1-carry" case. Once the lower half is added, the carry is examined, and its value is used to choose between the 0-carry and 1-carry case.

Multiplication in O(log n) depth is also done through parallelization, where every sum of 3 numbers is reduced to a sum of just 2 numbers in parallel, and the sums are done in some manner like the above.
I won't explain it here, but you can find reading material on fast addition and multiplication by looking up "carry-lookahead" and "carry-save" addition.

So from a theoretical standpoint, since circuits are obviously inherently parallel (unlike software), the only reason multiplication would be asymptotically slower is the constant factor in the front, not the asymptotic complexity.

Solution 2

Integer multiplication will be slower.

Agner Fog's instruction tables show that when using 32-bit integer registers, Haswell's ADD/SUB take 0.25–1 cycles (depending on how well pipelined your instructions are) while MUL takes 2–4 cycles. Floating-point is the other way around: ADDSS/SUBSS take 1–3 cycles while MULSS takes 0.5–5 cycles.

Solution 3

This is an even more complex answer than simply multiplication versus addition. In reality the answer will most likely NEVER be yes. Multiplication, electronically, is a much more complicated circuit. Most of the reasons why, is that multiplication is the act of a multiplication step followed by an addition step, remember what it was like to multiply decimal numbers prior to using a calculator.

The other thing to remember is that multiplication will take longer or shorter depending on the architecture of the processor you are running it on. This may or may not be simply company specific. While an AMD will most likely be different than an Intel, even an Intel i7 may be different from a core 2 (within the same generation), and certainly different between generations (especially the farther back you go).

In all TECHNICALITY, if multiplies were the only thing you were doing (without looping, counting etc...), multiplies would be 2 to (as ive seen on PPC architectures) 35 times slower. This is more an exercise in understanding your architecture, and electronics.

In Addition: It should be noted that a processor COULD be built for which ALL operations including a multiply take a single clock. What this processor would have to do is, get rid of all pipelining, and slow the clock so that the HW latency of any OPs circuit is less than or equal to the latency PROVIDED by the clock timing.

To do this would get rid of the inherent performance gains we are able to get when adding pipelining into a processor. Pipelining is the idea of taking a task and breaking it down into smaller sub-tasks that can be performed much quicker. By storing and forwarding the results of each sub-task between sub-tasks, we can now run a faster clock rate that only needs to allow for the longest latency of the sub-tasks, and not from the overarching task as a whole.

Picture of time through a multiply:

|--------------------------------------------------| Non-Pipelined

In the above diagram, the non-pipelined circuit takes 50 units of time. In the pipelined version, we have split the 50 units into 5 steps each taking 10 units of time, with a store step in between. It is EXTREMELY important to note that in the pipelined example, each of the steps can be working completely on their own and in parallel. For an operation to be completed, it must move through all 5 steps in order but another of the same operation with operands can be in step 2 as one is in step 1, 3, 4, and 5.

With all of this being said, this pipelined approach allows us to continuously fill the operator each clock cycle, and get a result out on each clock cycle IF we are able to order our operations such that we can perform all of one operation before we switch to another operation, and all we take as a timing hit is the original amount of clocks necessary to get the FIRST operation out of the pipeline.

Mystical brings up another good point. It is also important to look at the architecture from a more systems perspective. It is true that the newer Haswell architectures was built to better the Floating Point multiply performance within the processor. For this reason as the System level, it was architected to allow multiple multiplies to occur in simultaneity versus an add which can only happen once per system clock.

All of this can be summed up as follows:

Each architecture is different from a lower level HW perspective as well as from a system perspective
FUNCTIONALLY, a multiply will always take more time than an add because it combines a true multiply along with a true addition step.
Understand the architecture you are trying to run your code on, and find the right balance between readability and getting truly the best performance from that architecture.

Solution 4

Intel since Haswell has

add performance of 4/clock throughput, 1 cycle latency. (Any operand-size)
imul performance of 1/clock throughput, 3 cycle latency. (Any operand-size)

Ryzen is similar. Bulldozer-family has much lower integer throughput and not-fully-pipelined multiply, including extra slow for 64-bit operand-size multiply. See https://agner.org/optimize/ and other links in https://stackoverflow.com/tags/x86/info

But a good compiler could auto-vectorize your loops. (SIMD-integer multiply throughput and latency are both worse than SIMD-integer add). Or simply constant-propagate through them to just print out the answer! Clang really does know the closed-form Gauss's formula for sum(i=0..n) and can recognize some loops that do that.

You forgot to enable optimization so both loops bottleneck on the ALU + store/reload latency of keeping sum in memory between each of sum += independent stuff and sum++. See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about just how bad the resulting asm is, and why that's the case. clang++ defaults to -O0 (debug mode: keep variables in memory where a debugger can modify them between any C++ statements).

Store-forwarding latency on a modern x86 like Sandybridge-family (including Haswell and Skylake) is about 3 to 5 cycles, depending on timing of the reload. So with a 1-cycle latency ALU add in there, too, you're looking at about two 6-cycle latency steps in the critical path for this loop. (Plenty to hide all the store / reload and calculation based on i, and the loop-counter update).

See also Adding a redundant assignment speeds up code when compiled without optimization for another no-optimization benchmark. In that one, store-forwarding latency is actually reduced by having more independent work in the loop, delaying the reload attempt.

Modern x86 CPUs have 1/clock multiply throughput so even with optimization you wouldn't see a throughput bottleneck from it. Or on Bulldozer-family, not fully pipelined with 1 per 2-clock throughput.

More likely you'd bottleneck on the front-end work of getting all the work issued every cycle.

Although lea does allow very efficient copy-and-add, and doing i + i + 1 with a single instruction. Although really a good compiler would see that the loop only uses 2*i and optimize to increment by 2. i.e. a strength-reduction to do repeated addition by 2 instead of having to shift inside the loop.

And of course with optimization the extra sum++ can just fold into the sum += stuff where stuff already includes a constant. Not so with the multiply.

Solution 5

I came to this thread to get an idea of what the modern processors are doing in regard to integer math and the number of cycles required to do them. I worked on this problem of speeding up 32-bit integer multiplies and divides on the 65c816 processor in the 1990's. Using the method below, I was able to triple the speed of the standard math libraries available in the ORCA/M compilers at the time.

So the idea that multiplies are faster than adds is simply not the case (except rarely) but like people said it depends upon how the architecture is implemented. If there are enough steps being performed available between clock cycles, yes a multiply could effectively be the same speed as an add based on the clock, but there would be a lot of wasted time. In that case it would be nice to have an instruction that performs multiple (dependent) adds / subtracts given one instruction and multiple values. One can dream.

On the 65c816 processor, there were no multiply or divide instructions. Mult and Div were done with shifts and adds.
To perform a 16 bit add, you would do the following:

LDA $0000 - loaded a value into the Accumulator (5 cycles)
ADC $0002 - add with carry  (5 cycles)
STA $0004 - store the value in the Accumulator back to memory (5 cycles)
15 cycles total for an add

If dealing with a call like from C, you would have additional overhead of dealing with pushing and pulling values off the stack. Creating routines that would do two multiples at once would save overhead for example.

The traditional way of doing the multiply is shifts and adds through the entire value of the one number. Each time the carry became a one as it is shifted left would mean you needed to add the value again. This required a test of each bit and a shift of the result.

I replaced that with a lookup table of 256 items so as the carry bits would not need to be checked. It was also possible to determine overflow before doing the multiply to not waste time. (On a modern processor this could be done in parallel but I don't know if they do this in the hardware). Given two 32 bit numbers and prescreened overflow, one of the multipliers is always 16 bits or less, thus one would only need to run through 8 bit multiplies once or twice to perform the entire 32 bit multiply. The result of this was multiplies that were 3 times as fast.

the speed of the 16 bit multiplies ranged from 12 cycles to about 37 cycles

multiply by 2  (0000 0010)
LDA $0000 - loaded a value into the Accumulator  (5 cycles).
ASL  - shift left (2 cycles).
STA $0004 - store the value in the Accumulator back to memory (5 cycles).
12 cycles plus call overhead.

multiply by (0101 1010)
LDA $0000 - loaded a value into the Accumulator  (5 cycles) 
ASL  - shift left (2 cycles) 
ASL  - shift left (2 cycles) 
ADC $0000 - add with carry for next bit (5 cycles) 
ASL  - shift left (2 cycles) 
ADC $0000 - add with carry for next bit (5 cycles) 
ASL  - shift left (2 cycles) 
ASL  - shift left (2 cycles) 
ADC $0000 - add with carry for next bit (5 cycles) 
ASL  - shift left (2 cycles) 
STA $0004 - store the value in the Accumulator back to memory (5 cycles)
37 cycles plus call overhead

Since the databus of the AppleIIgs for which this was written was only 8 bits wide, to load 16 bit values required 5 cycles to load from memory, one extra for the pointer, and one extra cycle for the second byte.

LDA instruction (1 cycle because it is an 8 bit value) $0000 (16 bit value requires two cycles to load) memory location (requires two cycles to load because of an 8 bit data bus)

Modern processors would be able to do this faster because they have a 32 bit data bus at worst. In the processor logic itself the system of gates would have no additional delay at all compared to the data bus delay since the whole value would get loaded at once.

To do the complete 32 bit multiply, you would need to do the above twice and add the results together to get the final answer. The modern processors should be able to do the two in parallel and add the results for the answer. Combined with the overflow precheck done in parallel, it would minimize the time required to do the multiply.

Anyway it is readily apparent that multiplies require significantly more effort than an add. How many steps to process the operation between cpu clock cycles would determine how many cycles of the clock would be required. If the clock is slow enough, then the adds would appear to be the same speed as a multiply.

Regards, Ken

View more solutions

35,772

Author by

exebook

I see questions as impulses, vibrations and currents in my body.

Updated on July 09, 2022

Comments

exebook almost 2 years

I hear this statement quite often, that multiplication on modern hardware is so optimized that it actually is at the same speed as addition. Is that true?

I never can get any authoritative confirmation. My own research only adds questions. The speed tests usually show data that confuses me. Here is an example:

#include <stdio.h>
#include <sys/time.h>

unsigned int time1000() {
    timeval val;
    gettimeofday(&val, 0);
    val.tv_sec &= 0xffff;
    return val.tv_sec * 1000 + val.tv_usec / 1000;
}

int main() {
    unsigned int sum = 1, T = time1000();
    for (int i = 1; i < 100000000; i++) {
        sum += i + (i+1); sum++;
    }
    printf("%u %u\n", time1000() - T, sum);
    sum = 1;
    T = time1000();
    for (int i = 1; i < 100000000; i++) {
        sum += i * (i+1); sum++;
    }
    printf("%u %u\n", time1000() - T, sum);
}

The code above can show that multiplication is faster:

clang++ benchmark.cpp -o benchmark
./benchmark
746 1974919423
708 3830355456

But with other compilers, other compiler arguments, differently written inner loops, the results can vary and I cannot even get an approximation.

Mysticial about 10 years

It's worth nothing that on Haswell, the FP multiply throughput is double that of FP add. That's because both ports 0 and 1 can be used for multiply, but only port 1 can be used for addition. That said, you can cheat with fused-multiply adds since both ports can do them.
trumpetlicks about 10 years

@Mysticial - Good point, also another reason why my answer is more general to talk about architecture of the machine :-) Great Note.
Ben Voigt about 10 years

On the other hand yours was 64-bit multiplication, used to emulate 32-bit division
user541686 about 10 years

@BenVoigt: Wait, is there a "more" 32-bit kind of 32-bit multiplication available than the one the compiler was using?
Ben Voigt about 10 years

There are multiplies that yield 32-bit results, but I'm not sure of any that take arbitrary 32-bit operands and have a 32-bit result.
Dolda2000 about 10 years

"Who told you that?" -- To be fair, perhaps someone who mistook it for FP add/mul. Perhaps an understandable mistake.
Dolda2000 about 10 years

Also, "while MUL takes 2–4 cycles" -- While this is true, that's only the latency. MUL most likely has a throughput of (at least) one instruction per cycle. At least this is the case on Sandy Bridge, and I doubt Haswell does worse. ADDSS/MULSS also have throughputs of one instruction per cycle on Sandy Bridge (but latencies of 3 and 5, resp.).
trumpetlicks about 10 years

@Mysticial - should note though that while the architecture supports 2 FP multipliers, versus a single add, does not mean that when looking at a single multiply operator versus a single add multiplier that it is faster. Again this is HW system level architecture, not direct performance measure of a single add versus a single multiply.
Cory Nelson about 10 years

@Dolda2000 that's what Agner lists as throughput of 32-bit MUL/IMUL using only registers. Interestingly, he lists the 64-bit variant as 1 cycle, not 2 -- I wonder if it's a typo.
Nick Bastin almost 10 years

-1, the implementation outlined here is painfully naive - this is not the "usual" algorithm.
mathreadler over 7 years

It's not in the same speed class because the clock has hit the speed limit. What it means is that the CPU is idle much more of the time during one clock cycle if you do an ADD because as transistors shrunk but the max clock frequency stayed the same the so the signal could pass more gates in one clock.
cmaster - reinstate monica over 7 years

@mathreadler How long it takes the hardware gates to come up with the correct result is entirely irrelevant to the software: The result cannot take effect before the next clock cycle starts. Also, you can bet on the chip designers not to build adders that are significantly faster than they must be to fulfill the requirements of the intended clock frequency: That would mean using precious chip real estate for pointlessly complex carry-look-ahead generators. These circuits are needed because even today it's not possible to ripple a carry through 64 bits in a single cycle.
trumpetlicks over 5 years

I hear a little bit of what you are saying, however look here, and you will see that multiplication is STILL reliant on multiple additions!!! The numbers on slide 53 are direct from AMD. You may be correct, but it is also a die size and cost of implementation versus statistical usage (versus addition alone) problem. cis.upenn.edu/~milom/cis371-Spring08/lectures/04_integer.pdf
user541686 over 5 years

@trumpetlicks: Confused, did I say multiplication isn't dependent on additions? I thought I myself explicitly said multiplication is dependent on additions.
user541686 over 5 years

@trumpetlicks: Wha....? What are you talking about? I didn't even use the words "cycle" or "clock" or "pipeline" etc. in my answer. Nor did I ever claim multiplication is as fast as addition (again, I said the opposite). I don't know what answer you're reading that is correct or incorrect on these points, but it definitely isn't mine...
Shajeel Afzal over 5 years

It is understood that Multiplication is slower than Addition so why we take unit cost for both in Algorithm Analysis?
Nike almost 5 years

"It is true that the newer Haswell architectures was built to better the Floating Point multiply performance within the processor. For this reason as the System level, it was architected to allow multiple multiplies to occur in simultaneity versus an add which can only happen once per system clock." How many multiplies can a single Haswell thread do simultaneously, and how many adds can it do simultaneously?
Nike almost 5 years

@Cory Nelson, I upvoted you, but that link to Agner Fog's page doesn't say a single thing about # of clock cycles does it?
Cory Nelson almost 5 years

@user1271772 the "Latency" column defines the minimum latency, in clock cycles, of each instruction. Note that pipelining means that instructions can overlap so this does not indicate throughput.
Nike almost 5 years

@CoryNelson, what I mean is, that link that you gave, just contains a bunch of URLs to other pages. It doesn't seem to say anything about latency or clock cycles!
Peter Cordes almost 5 years

en.wikipedia.org/wiki/Binary_multiplier#Implementations mentions some modern implementation options, like en.wikipedia.org/wiki/Wallace_tree for example. And no, you don't use O(N) ripple-carry addition in a modern x86-64 ALU! That would be 64 full-adder delays in the critical path for 1-cycle integer add. en.wikipedia.org/wiki/Carry-lookahead_adder is one way around this problem, or there's also carry-select. Fun fact: Pentium 4 really did use 16-bit ripple-carry adders at very high clock speed.
adieux about 4 years

Any estimation on a modern CPU (e.g. Intel's i7) how many cycles an ADD or MUL can take?
Peter Cordes almost 4 years

@user1271772: Agner Fog's instruction tables are available in PDF or spreadsheet form, and the microarch PDF is really useful for understanding what the instruction tables mean. I usually link to agner.org/optimize, not agner.org/optimize/instruction_tables.pdf, when citing data from it. (Although uops.info has even more detailed test results these days, especially for multi-uop instructions with different latencies from different inputs to the output.)
Peter Cordes almost 4 years

@Dolda2000: one-operand mul (32x32 => 64-bit) does have a throughput of one per 2 cycles on Haswell, and costs 3 uops. (2 uops for 64x64=>128-bit, I guess not needing an extra uop to split up the low 64 into two 32-bit halves). The problem is that's not the normal way to multiply. Non-widening multiply is done with imul reg,reg/mem or imul reg, reg/mem, immediate, which is always 1 uop, 1c throughput on Intel SnB-family and on Zen, for any operand-size, only writing one output register.
Nike almost 4 years

Interesting that you found this a year after I wrote my comments here @PeterCordes. I happened to be the same guy that offered the 200 point bounty for the long-integer multiplication Hot Network Question on Matter Modeling Stack Exchange! Now I also see that you commented on my HNQ on Electrical Engineering: electronics.stackexchange.com/q/452181/192433 (which actually arose from this Stack Overflow question).
Nike almost 4 years

@Mysticial Did you notice that your comment became this question? I also don't know if you've had a chance to think about my question in my comment frmo August 2019. (no pressure if you are very busy, I was just very curious).
paperclip optimizer over 2 years

Multiplication in O((log(n) log(log(N))) depth can be accomplished with NTT-based approaches like the done with the Coppersmith-Winograd algorithm which are absolutely impractical for anything less than, say O(1000) bits.
paperclip optimizer over 2 years

I think actual chip implementations use Karatsuba-like schemes which can reduce the surface*cycles from O(N^2) to O(N^1.5), and always with a much higher constant prefactor. O() complexity analysis is useful, but can be often misleading.
user541686 over 2 years

@paperclipoptimizer: You think this based on... what source? This answer says they use Dadda multipliers, which are similar to but slightly faster than Wallace multipliers, which have O(log n) depth.
paperclip optimizer over 2 years

@user541686: Perhaps I interpreted the question slightly differently. From the theoretical standpoint a multiplication is just a convolution with some extra carry steps. This can indeed be carried in O(log N) depth with similar prefactor to carry lookahead addition, but only at O(N^2) transistor count. This is afaik too expensive/impractical for, say, 64 bit (and even lower) integer multiplication in most systems, which is why many chips compromise by using Karatsuba and other fast short convolution techniques at the top level, and shorter-length integer multipliers, which might use Dadda.