Is inline assembly language slower than native C++ code?


Solution 1

Yes, most of the time.

First of all, you start from the wrong assumption that a low-level language (assembly in this case) will always produce faster code than a high-level language (C++ and C in this case). It's not true. Is C code always faster than Java code? No, because there is another variable: the programmer. The way you write code and your knowledge of architecture details greatly influence performance (as you saw in this case).

You can always produce an example where handmade assembly code is better than compiled code, but usually it's a fictional example or a single routine, not a true program of 500,000+ lines of C++ code. I think compilers will produce better assembly code 95% of the time, and sometimes, only some rare times, you may need to write assembly code for a few short, highly used, performance-critical routines, or when you have to access features your favorite high-level language does not expose. Do you want a touch of this complexity? Read this awesome answer here on SO.

Why is this?

First of all, because compilers can do optimizations that we can't even imagine (see this short list), and they will do them in seconds (when we may need days).

When you code in assembly you have to make well-defined functions with a well-defined call interface. Compilers, however, can take into account whole-program optimization and inter-procedural optimization such as register allocation, constant propagation, common subexpression elimination, instruction scheduling and other complex, non-obvious optimizations (the polytope model, for example). On RISC architectures people stopped worrying about this many years ago (instruction scheduling, for example, is very hard to tune by hand), and modern CISC CPUs have very long pipelines too.
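As a rough illustration (the function names here are invented for the example), consider what inlining plus constant propagation and common subexpression elimination can do. A typical optimizing compiler reduces example() below to a single "return 30;", something a hand-written assembly routine behind a fixed call interface can never benefit from:

// Illustrative sketch (names invented): after inlining, constant
// propagation and common-subexpression elimination, a typical
// optimizing compiler collapses example() to "return 30;".
static int scale(int v, int factor) { return v * factor; }

int example() {
    int a = scale(2, 3);     // inlined, constant-propagated to 6
    int b = scale(2, 3);     // common subexpression: reuses the 6
    return (a + b) * 5 / 2;  // folded at compile time: (6+6)*5/2 = 30
}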

For some complex microcontrollers even system libraries are written in C instead of assembly because their compilers produce better (and easier to maintain) final code.

Compilers sometimes can automatically use some MMX/SIMDx instructions by themselves, and if you don't use them you simply can't compare (other answers already reviewed your assembly code very well). Just for loops, this is a short list of loop optimizations commonly checked for by a compiler (do you think you could do it all by yourself when your schedule has been decided for a C# program?). If you write something in assembly, I think you have to consider at least some simple optimizations. The school-book example for arrays is to unroll the loop (its size is known at compile time). Do it and run your test again.
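For instance, a minimal hand-unrolled sketch of your loop might look like this (calcuC_unrolled is an invented name; it assumes, as in your test, that length is a multiple of 4, otherwise a cleanup loop is needed):

static int const TIMES = 100000;

// Manual 4x unrolling sketch; assumes length % 4 == 0
// (true for your 2000-element arrays).
void calcuC_unrolled(int *x, int *y, int length) {
    for (int i = 0; i < TIMES; i++) {
        for (int j = 0; j < length; j += 4) {
            x[j]     += y[j];
            x[j + 1] += y[j + 1];
            x[j + 2] += y[j + 2];
            x[j + 3] += y[j + 3];
        }
    }
}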

These days it's also really uncommon to need to use assembly language, for another reason: the plethora of different CPUs. Do you want to support them all? Each has a specific microarchitecture and some specific instruction sets. They have different numbers of functional units, and assembly instructions should be arranged to keep them all busy. If you write in C you may use PGO, but in assembly you will need great knowledge of that specific architecture (and will have to rethink and redo everything for another architecture). For small tasks the compiler usually does it better, and for complex tasks usually the work isn't repaid (and the compiler may do better anyway).

If you sit down and take a look at your code, probably you'll see that you'll gain more by redesigning your algorithm than by translating it to assembly (read this great post here on SO); there are high-level optimizations (and hints to the compiler) you can effectively apply before you need to resort to assembly language. It's probably worth mentioning that by using intrinsics you will often get the performance gain you're looking for, and the compiler will still be able to perform most of its optimizations.
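As a hedged sketch of that middle road (calcuIntrinsics is an invented name; the SSE2 intrinsics _mm_loadu_si128, _mm_add_epi32 and _mm_storeu_si128 come from <emmintrin.h>), the inner array-add from your test could be written as:

#include <emmintrin.h>  // SSE2 intrinsics

// Sketch only: assumes length is a multiple of 4; the unaligned
// load/store intrinsics tolerate pointers of any alignment.
void calcuIntrinsics(int *x, int *y, int length) {
    for (int j = 0; j < length; j += 4) {
        __m128i vx = _mm_loadu_si128((__m128i *)(x + j)); // four ints from x
        __m128i vy = _mm_loadu_si128((__m128i *)(y + j)); // four ints from y
        _mm_storeu_si128((__m128i *)(x + j), _mm_add_epi32(vx, vy));
    }
}

The compiler still handles register allocation, scheduling and further unrolling around these calls.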

All this said, even when you can produce assembly code that's 5~10 times faster, you should ask your customers whether they prefer to pay for one week of your time or to buy a $50 faster CPU. Extreme optimization, more often than not (and especially in LOB applications), is simply not required of most of us.

Solution 2

Your assembly code is suboptimal and may be improved (a sketch applying the first three fixes follows the list):

  • You are pushing and popping a register (EDX) in your inner loop. This should be moved out of the loop.
  • You reload the array pointers in every iteration of the loop. This should be moved out of the loop.
  • You use the loop instruction, which is known to be dead slow on most modern CPUs (possibly a result of using an ancient assembly book*)
  • You take no advantage of manual loop unrolling.
  • You don't use available SIMD instructions.

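A minimal, untested sketch of those first three fixes, in the same MSVC __asm style as the question (labels and register choices are illustrative only):

void calcuAsm(int *x, int *y, int lengthOfArray)
{
    __asm
    {
        mov edi, TIMES
    outerLoop:
        mov esi, x               ; load pointers once per pass, not per element
        mov edx, y
        mov ecx, lengthOfArray
    innerLoop:
        mov eax, DWORD PTR [edx]
        add DWORD PTR [esi], eax ; x[j] += y[j], no push/pop needed
        add esi, 4
        add edx, 4
        dec ecx                  ; dec/jnz instead of the slow loop instruction
        jnz innerLoop
        dec edi
        jnz outerLoop
    };
}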
So unless you vastly improve your skill-set regarding assembler, it doesn't make sense for you to write assembler code for performance.

*Of course I don't know if you really got the loop instruction from an ancient assembly book. But you almost never see it in real-world code, as every compiler out there is smart enough not to emit loop; you only see it in, IMHO, bad and outdated books.

Solution 3

Even before delving into assembly, there are code transformations that exist at a higher level.

static int const TIMES = 100000;

void calcuC(int *x, int *y, int length) {
  for (int i = 0; i < TIMES; i++) {
    for (int j = 0; j < length; j++) {
      x[j] += y[j];
    }
  }
}

can be transformed, via loop interchange, into:

static int const TIMES = 100000;

void calcuC(int *x, int *y, int length) {
  for (int j = 0; j < length; ++j) {
    for (int i = 0; i < TIMES; ++i) {
      x[j] += y[j];
    }
  }
}

which is much better as far as memory locality goes.

This can be optimized further: doing a += b X times is equivalent to doing a += X * b, so we get:

static int const TIMES = 100000;

void calcuC(int *x, int *y, int length) {
  for (int j = 0; j < length; ++j) {
    x[j] += TIMES * y[j];
  }
}

however it seems my favorite optimizer (LLVM) does not perform this transformation.

[edit] I found that the transformation is performed if we add the restrict qualifier to x and y. Indeed, without this restriction, x[j] and y[j] could alias the same location, which makes the transformation erroneous. [end edit]
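In practice that just means changing the signature (restrict is C99; most C++ compilers accept the __restrict extension):

// With the no-alias promise the compiler may legally strength-reduce
// the TIMES loop into the single multiply shown above.
void calcuC(int *__restrict x, int *__restrict y, int length);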

Anyway, this is, I think, the optimized C version. Already it is much simpler. Based on this, here is my crack at ASM (I let Clang generate it, I am useless at it):

calcuAsm:                               # @calcuAsm
.Ltmp0:
    .cfi_startproc
# BB#0:
    testl   %edx, %edx
    jle .LBB0_2
    .align  16, 0x90
.LBB0_1:                                # %.lr.ph
                                        # =>This Inner Loop Header: Depth=1
    imull   $100000, (%rsi), %eax   # imm = 0x186A0
    addl    %eax, (%rdi)
    addq    $4, %rsi
    addq    $4, %rdi
    decl    %edx
    jne .LBB0_1
.LBB0_2:                                # %._crit_edge
    ret
.Ltmp1:
    .size   calcuAsm, .Ltmp1-calcuAsm
.Ltmp2:
    .cfi_endproc

I am afraid I don't understand where all those instructions come from, but you can always have fun and try to see how it compares... I'd still use the optimized C version rather than the assembly one in real code: it's much more portable.

Solution 4

Short answer: yes.

Long answer: yes, unless you really know what you're doing, and have a reason to do so.

Solution 5

I have fixed my asm code:

  __asm
{
    mov ebx, TIMES
 start:
    mov ecx, lengthOfArray
    mov esi, x
    shr ecx, 1                  ; two 32-bit ints per 8-byte MMX register
                                ; (assumes lengthOfArray is even)
    mov edi, y
 label:
    movq mm0, QWORD PTR [esi]   ; load two ints from x
    paddd mm0, QWORD PTR [edi]  ; packed add of two ints from y
    add edi, 8
    movq QWORD PTR [esi], mm0   ; store the two sums back to x
    add esi, 8
    dec ecx
    jnz label
    dec ebx
    jnz start
    emms                        ; clear MMX state before any later FPU code
};

Results for Release version:

 Function of assembly version: 81
 Function of C++ version: 161

The assembly code in release mode is almost 2 times faster than the C++ version.
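As suggested in the comments below, here is an untested sketch of the same loop with SSE2 xmm registers instead of MMX (assumes lengthOfArray is a multiple of 4; movdqu is used because arrays from new are not guaranteed to be 16-byte aligned, while movdqa would be faster when alignment is guaranteed):

  __asm
{
    mov ebx, TIMES
 start:
    mov ecx, lengthOfArray
    mov esi, x
    shr ecx, 2                     ; four 32-bit ints per 16-byte xmm register
    mov edi, y
 label:
    movdqu xmm0, XMMWORD PTR [esi] ; load four ints from x
    movdqu xmm1, XMMWORD PTR [edi] ; load four ints from y
    paddd xmm0, xmm1               ; packed add of four ints
    add edi, 16
    movdqu XMMWORD PTR [esi], xmm0 ; store the four sums back to x
    add esi, 16
    dec ecx
    jnz label
    dec ebx
    jnz start
};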


Comments

  • user957121
    user957121 almost 2 years

I tried to compare the performance of inline assembly language and C++ code, so I wrote a function that adds two arrays of size 2000, 100000 times. Here's the code:

    #define TIMES 100000
    void calcuC(int *x,int *y,int length)
    {
        for(int i = 0; i < TIMES; i++)
        {
            for(int j = 0; j < length; j++)
                x[j] += y[j];
        }
    }
    
    
    void calcuAsm(int *x,int *y,int lengthOfArray)
    {
        __asm
        {
            mov edi,TIMES
            start:
            mov esi,0
            mov ecx,lengthOfArray
            label:
            mov edx,x
            push edx
            mov eax,DWORD PTR [edx + esi*4]
            mov edx,y
            mov ebx,DWORD PTR [edx + esi*4]
            add eax,ebx
            pop edx
            mov [edx + esi*4],eax
            inc esi
            loop label
            dec edi
            cmp edi,0
            jnz start
        };
    }
    

    Here's main():

    int main() {
        bool errorOccured = false;
        setbuf(stdout,NULL);
        int *xC,*xAsm,*yC,*yAsm;
        xC = new int[2000];
        xAsm = new int[2000];
        yC = new int[2000];
        yAsm = new int[2000];
        for(int i = 0; i < 2000; i++)
        {
            xC[i] = 0;
            xAsm[i] = 0;
            yC[i] = i;
            yAsm[i] = i;
        }
        time_t start = clock();
        calcuC(xC,yC,2000);
    
        //    calcuAsm(xAsm,yAsm,2000);
        //    for(int i = 0; i < 2000; i++)
        //    {
        //        if(xC[i] != xAsm[i])
        //        {
        //            cout<<"xC["<<i<<"]="<<xC[i]<<" "<<"xAsm["<<i<<"]="<<xAsm[i]<<endl;
        //            errorOccured = true;
        //            break;
        //        }
        //    }
        //    if(errorOccured)
        //        cout<<"Error occurs!"<<endl;
        //    else
        //        cout<<"Works fine!"<<endl;
    
        time_t end = clock();
    
        //    cout<<"time = "<<(float)(end - start) / CLOCKS_PER_SEC<<"\n";
    
        cout<<"time = "<<end - start<<endl;
        return 0;
    }
    

Then I ran the program five times to get the processor cycles, which can be seen as time. Each time I called only one of the functions mentioned above.

And here are the results.

    Function of assembly version:

    Debug   Release
    ---------------
    732        668
    733        680
    659        672
    667        675
    684        694
    Average:   677
    

    Function of C++ version:

    Debug     Release
    -----------------
    1068      168
     999      166
    1072      231
    1002      166
    1114      183
    Average:  182
    

    The C++ code in release mode is almost 3.7 times faster than the assembly code. Why?

I guess that the assembly code I wrote is not as effective as that generated by GCC. It's hard for a common programmer like me to write code faster than what a compiler generates. Does that mean I should not trust the performance of assembly language written by my hands, focus on C++, and forget about assembly language?

    • mrivard
      mrivard about 12 years
      Pretty much. Handcoded assembly is appropriate in some circumstances, but care must be taken to ensure that the assembly version is indeed faster than what can be achieved with a higher level language.
    • Paul R
      Paul R about 12 years
      You might find it instructive to study the code generated by the compiler, and try to understand why it's faster than your assembly version.
    • David Heffernan
      David Heffernan about 12 years
      Yeah, looks like the compiler is better at writing asm than you. Modern compilers really are quite good.
    • Chris
      Chris about 12 years
Have you looked at the assembly GCC produced? It's possible GCC used MMX instructions. Your function is very parallel - you could potentially use N processors to compute the sum in 1/N of the time. Try a function where there is no hope for parallelization.
    • PlasmaHH
      PlasmaHH about 12 years
      Hm, I would have expected a good compiler to do this ~100000 times faster...
    • harold
      harold about 12 years
No surprises there. If you're going to do this, at least do it right.
    • R4D4
      R4D4 about 12 years
      It won't matter too much in this application but for your future notice, when you are measuring clock cycles for a program like this where you have no user input you really should have your process set itself to a realtime priority before you start to measure clock cycles to get a much more accurate measurement (although that won't change the conclusion of your results here ;).
    • Matthieu M.
      Matthieu M. about 12 years
@PlasmaHH: actually I was quite surprised, but Clang/LLVM does not optimize the loop over TIMES away. I expected it to be simplified into x[j] += TIMES * y[j] but it did not happen. Even when interchanging the loops manually to make the loop over TIMES the inner one, it still did not. Shocking.
    • PlasmaHH
      PlasmaHH about 12 years
@MatthieuM.: maybe some obscure language rules are preventing this? Or it's time for a bug/enhancement report...
    • Matthieu M.
      Matthieu M. about 12 years
      @PlasmaHH: Actually this becomes possible with the restrict qualifier, however Clang/LLVM fails to interchange the loops automatically and only optimizes this if the loops are interchanged...
    • tylerl
      tylerl about 12 years
      One of the classic college assignments in processor design is to take the unoptimized compiler output and manually tweak the assembly till it runs significantly faster (e.g. at least 100x as fast). It's a fun and instructive exercise.
    • pyCthon
      pyCthon about 12 years
Curious: what compiler flags did you use in both cases?
    • Dax Fohl
      Dax Fohl about 12 years
      Sure you should trust the performance of your assembly; it'll be exactly what you specify! Actually you shouldn't trust the performance of the code generated by the compiler--the result will be much faster than what you think.
    • Matthieu M.
      Matthieu M. about 12 years
      @DaxFohl: assembly performance is not easy to estimate when today's bottleneck is memory (in most cases) and not instruction count. Memory access patterns are most often the critical piece (playing nice with prefetching, avoiding branches) and whether you use C or assembly is not so important.
    • oksayt
      oksayt about 12 years
      Epic answer from another question: stackoverflow.com/a/2685541/372860
    • Ami
      Ami about 12 years
      You should post the assembly code that your compiler generates.
    • Morg.
      Morg. almost 10 years
      The only way you can compare, is taking the assembly from the compiler, improving on that as much as you can, and then benchmark. If you can't, the compiler is better than you and you live in the happy world of "no point checking the ASM ever". If you can, welcome to hell where you can't trust any of your tools.
  • Gunther Piez
    Gunther Piez about 12 years
    Still true, see stackoverflow.com/questions/1396527/…. Not because of the used cycles, but because of the reduced memory footprint.
  • user957121
    user957121 about 12 years
@drhirsch It's heartbreaking to hear from you :), but I know what's my work and what's not. Thank you all!
  • John Alexiou
    John Alexiou about 12 years
    So the compiler is better at writing code than people are. Good to know.
  • Adriano Repetti
    Adriano Repetti about 12 years
Of course not. I think it's better than 95% of people 99% of the time. Sometimes because it's simply too costly (because of complex math) or too time-consuming (then costly again). Sometimes because we simply forgot about optimizations...
  • Mike Baranczak
    Mike Baranczak about 12 years
    @ja72 - no, it's not better at writing code. It's better at optimizing code.
  • Mooing Duck
    Mooing Duck about 12 years
    There's other places for ASM besides those two. Namely, a bignum library will usually be significantly faster in ASM than C, due to having access to carry flags and the upper part of multiplication and such. You can do these things in portable C as well, but they're very slow.
  • Bill K
    Bill K about 12 years
    It's counter-intuitive until you really consider it. In the same way, VM based machines are starting to make runtime optimizations that compilers simply don't have the information to make.
  • Billy ONeal
    Billy ONeal about 12 years
    @Bill Not really true. Compilers have just as much information as VMs do. Tis just that compilers are still usually (minutely) faster than VMs without considering such things. (Any statistical information a VM has about the code can be provided to a compiler if it supports profile guided optimization, as most compilers do nowadays) Whereas it's trivial to make either a compiler or VM that considers more optimizations than would be practical to do by hand.
  • Admin
    Admin about 12 years
    @Billy VMs know the CPU architecture and which instructions it supports, the compiler can't use these instructions because it must be cautious and not use them to support CPUs that don't support them.
  • Billy ONeal
    Billy ONeal about 12 years
@M28: Compilers can use the same instructions. Sure, they pay for it in terms of binary size (because they have to provide a fallback path in the event those instructions aren't supported). Also, for the most part, the "new instructions" that would be added are SIMD instructions anyway, which both VMs and compilers are pretty horrible at utilizing. VMs pay for this feature in that they have to compile the code at startup.
  • Bill K
    Bill K about 12 years
    @Billy a VM can notice that a given use case more often than not selects values that cause a routine not to do any work within a given method call--it can then optimize that call out completely. If a compiler could do that, it could only be through a runtime (and then it has a runtime and falls on the other side of the discussion). A runtime-compiled language can also adopt new optimizations without a manual recompile step--say for a new CPU instruction (or a new cpu)... I'm just saying that in many cases "better" is not as obvious as one would think.
  • Billy ONeal
    Billy ONeal about 12 years
    @BillK: PGO does the same thing for compilers.
  • Casey Rodarmor
    Casey Rodarmor about 12 years
I think this depends on the language and compiler. I can imagine an extremely inefficient C compiler whose output could easily be beaten by a human writing straightforward assembly. GCC, not so much.
  • Gunther Piez
    Gunther Piez about 12 years
    @BillK You are a java guy, aren't you? Sometimes it shows ;-) Ok, now I need to shut up ;-)
  • Bill K
    Bill K about 12 years
I don't particularly try to hide it, but that's a pretty strange way to categorize the world into "Java people" and "Not Java People". You could also call me a "Not holding lit stick of dynamite" guy - which is even more accurate, because although I spent years developing professionally in C and C++ (and various other languages) I never held anything bigger than a lit firecracker in my hands.
  • Sjoerd
    Sjoerd about 12 years
    -1: I don't see any OO feature being used. Your argument is the same as "assembly could also be faster if your compiler adds a million NOPs."
  • Olof Forshell
    Olof Forshell about 12 years
I was unclear; this is actually a C question. If you write C code for a C++ compiler you aren't writing C++ code and you won't get any OO stuff. Once you start writing in real C++, using OO stuff, you have to be very knowledgeable to get the compiler to not generate OO support code.
  • Matthieu M.
    Matthieu M. about 12 years
@BillK: I think drhirsch was just making fun of your insistence that VMs are the performance killers, when they are just using traditional compilation/optimization techniques. Regarding "avoiding the manual recompile", it's of little interest to most people really interested in performance. Whether you start from some binary intermediate language or from the source code to produce the new optimized binary (whether native or not) is the same principle as far as compilers go, and people really interested in performance will use it, even if it incurs a manual operation.
  • user957121
    user957121 about 12 years
Thanks for your answer. Well, it's a little confusing that when I took the class named "Compiler principles", I learned that the compiler will optimize our code by many means. Does that mean we need to optimize our code manually? Can we do a better job than the compiler? That's the question that always confuses me.
  • Matthieu M.
    Matthieu M. about 12 years
    @user957121: we can optimize it better when we have more information. Specifically here what hinders the compiler is the possible aliasing between x and y. That is, the compiler cannot be sure that for all i,j in [0, length) we have x + i != y + j. If there is overlap, then optimization is impossible. The C language introduced the restrict keyword to tell the compiler that two pointers cannot alias, however it does not work for arrays because they can still overlap even if they don't exactly alias.
  • fortran
    fortran about 12 years
@MooingDuck That might be considered as accessing hardware features that are not directly available in the language... But as long as you are just translating your high level code to assembly by hand, the compiler will beat you.
  • Mooing Duck
    Mooing Duck about 12 years
It is that, but it is not kernel programming, nor vendor specific. Though with slight wording changes, it could easily fall into either category. I'd guess ASM when you want the performance of processor instructions that have no C mapping.
  • Mooing Duck
    Mooing Duck about 12 years
So your answer isn't about the question? (Also, clarifications go in the answer, not comments. Comments can be deleted anytime with no notice, notification, or history.)
  • Olof Forshell
    Olof Forshell about 12 years
I think it is worth noting that especially for a modern x86 processor it is exceptionally difficult to write efficient assembly code, due to the presence of pipelines, multiple execution units and other gimmicks inside every core. Writing code that balances the usage of all these resources in order to get the highest execution speed will often result in code with unstraightforward logic that "shouldn't" be fast according to "conventional" assembly wisdom. But for less complex CPUs it is my experience that the C compiler's code generation can be bettered significantly.
  • josesuero
    josesuero about 12 years
The C compiler's code can usually be bettered, even on a modern x86 CPU. But you have to understand the CPU well, which is harder to do with a modern x86 CPU. That's my point. If you don't understand the hardware you're targeting, then you won't be able to optimize for it, and then the compiler will likely do a better job.
  • Bill K
    Bill K about 12 years
    I understand and I wasn't upset or anything--I just find it kind of amusing, when I was young the argument was that C could never be as fast as assembly, now VMs can never be as fast as compiled, I wonder what it will look like in 10 years. Stuff isn't always intuitive.
  • leftaroundabout
    leftaroundabout about 12 years
Not sure what exactly you mean by OO "support code". Of course, if you use a lot of RTTI and suchlike, the compiler will have to create lots of extra instructions to support those features - but any problem that's sufficiently high-level to justify use of RTTI is too complex to be feasibly writable in assembly. What you can do, of course, is write only the abstract outside interface as OO, dispatching to performance-optimized pure procedural code where it's critical. But, depending on the application, C, Fortran, CUDA or simply C++ without virtual inheritance might be better than assembly here.
  • ssube
    ssube about 12 years
    With C/++ compilers being such an undertaking, and only 3 major ones around, they tend to be rather good at what they do. It's still (very) possible in certain circumstances that hand-written assembly will be faster; a lot of math libraries drop to asm to better handle multiple/wide values. So while guaranteed is a bit too strong, it is likely.
  • vsz
    vsz about 12 years
    @peachykeen: I did not mean that assembly is guaranteed to be slower than C++ in general. I meant that "guarantee" in the case where you have a C++ code and blindly translate it line by line to assembly. Read the last paragraph of my answer too :)
  • user957121
    user957121 about 12 years
Yeah, my code really needs to be optimized. Good work, and thanks!
  • Gunther Piez
    Gunther Piez about 12 years
It is four times faster because you only do a quarter of the work :-) The shr ecx,2 is superfluous, because the array length is already given in ints and not in bytes. So you basically achieve the same speed. You could try the paddd from harold's answer, this will really be faster.
  • Gunther Piez
    Gunther Piez about 12 years
    Now if you start using SSE instead of MMX (register name is xmm0 instead of mm0), you will get another speedup by a factor of two ;-)
  • sasha
    sasha about 12 years
I changed it and got 41 for the assembly version. It is 4 times faster :)
  • Billy ONeal
    Billy ONeal about 12 years
@BillK: Of course a VM can be as fast as compiled; after all, most VMs are JIT compilers. It's just that most VMs aren't often willing to pay the optimization time cost to get there. Some of the modern VMs are really really good at placing optimization bets though. My point above is just that there are always tradeoffs, and that compilers and VMs will (generally speaking) be comparable with similarly written code. The bigger overhead with most "VM" languages is garbage collection rather than JIT compilation. (Though GC has perf advantages too; e.g. allocation is really fast)
  • sasha
    sasha about 12 years
You can also get up to 5% more if you use all the xmm registers.
  • Hawken
    Hawken about 12 years
I believe one of Michael Abrash's books is the Graphics Programming Black Book. But he's not the only one to use assembly; Chris Sawyer wrote the first two RollerCoaster Tycoon games in assembly by himself.
  • Hawken
    Hawken about 12 years
@leftaroundabout the feasibility of writing something in assembly will also be highly dependent on the programmer.
  • Mark Mullin
    Mark Mullin almost 12 years
    and then only if you've run an assembly level profiling tool like vtune for intel chips to see where you may be able to improve on things
  • Hawken
    Hawken over 11 years
@fortran You're basically just saying that if you don't optimize your code, it won't be as fast as the code the compiler optimized. The optimization is the reason one would write assembly in the first place. If you mean translate then optimize, there is no reason the compiler will beat you, unless you aren't good at optimizing assembly. So to beat the compiler you have to optimize in ways the compiler cannot. It's pretty self-explanatory. The only reason to write assembly is if you are better than a compiler/interpreter. That's always been the practical reason to write assembly.
  • Hawken
    Hawken over 11 years
    And if you really want to blow the compiler away you have to be creative and optimize in ways the compiler cannot. It's a tradeoff for time/reward that's why C is a scripting language for some and intermediate code for a higher level language for others. For me though, assembly is more for the fun :). much like grc.com/smgassembly.htm
  • fortran
    fortran over 11 years
    @Hawken What I'm saying is that in 99.9999% (one in a million and I might be short) of the cases you aren't good at optimizing assembly. Plain simple, get over it.
  • Zane
    Zane over 11 years
    No. At least not very likely. There is a thing in C++ called the zero overhead rule, and this applies most of the time. Learn more about OO - you will find out that in the end it improves readability of your code, improves code quality, increases coding speed, increases robustness. Also for embedded - but use C++ as it gives you more control, embedded+OO the Java way will cost you.
  • phuclv
    phuclv almost 10 years
    compilers may still emit loop (and many "deprecated" instructions) if you optimize for size
  • gnasher729
    gnasher729 almost 10 years
    Just saying: Clang has access to the carry flags, 128 bit multiplication and so on through built-in functions. And it can integrate all these into its normal optimisation algorithms.
  • Calimo
    Calimo almost 10 years
    Now if you think about the time it actually took you: assembly, about 10 hours or so? C++, a few minutes I guess? There's a clear winner here, unless it is performance-critical code.
  • Navin
    Navin over 8 years
    This technically answers the question but is also completely useless. A -1 from me.
  • Basile Starynkevitch
    Basile Starynkevitch over 8 years
    Wrong. No relation to OO. Compilers can optimize better than human programmers.
  • Olof Forshell
    Olof Forshell over 8 years
    @Basile Starynkevitch: compilers and their optimization strategies were written by programmers so there are some programmers that can optimize better than compilers. Since OO, generally speaking, is less WYSIWYG than, say, C it is not a bad idea to learn what OO constructs affect performance negatively and utilize other constructs instead.
  • vsz
    vsz over 8 years
    And unless a is volatile, there is a good chance that the compiler will just do int a = 13; from the very beginning.
  • Johan
    Johan about 8 years
    I think the complex addressing is slowing your code down, if you change the code to mov ecx, length, lea ecx,[ecx*4], mov eax,16... add ecx,eax and then just use [esi+ecx] everywhere you'll avoid 1 cycle stall per instruction speeding up the loop lots. (If you have the latest Skylake then this does not apply). The add reg,reg just makes the loop tighter, which may or may not help.
  • harold
    harold about 8 years
    @Johan that shouldn't be a stall, just an extra cycle latency, but sure it can't hurt to not have it.. I wrote this code for Core2 which didn't have that issue. Isn't r+r also "complex" btw?
  • Peter Cordes
    Peter Cordes over 7 years
    When you just want to beat the compiler, it's usually easier to take its asm output for your function and turn that into a stand-alone asm function that you tweak. Using inline asm is a bunch of extra work to get the interface between C++ and asm correct and check that it's compiling to optimal code. (But at least when just doing it for fun, you don't have to worry about it defeating optimizations like constant-propagation when the function inlines into something else. gcc.gnu.org/wiki/DontUseInlineAsm).
  • Peter Cordes
    Peter Cordes over 7 years
    See also the Collatz-conjecture C++ vs. hand-written asm Q&A for more on beating the compiler for fun :) And also suggestions on how to use what you learn to modify the C++ to help the compiler make better code.
  • madoki
    madoki over 7 years
    @PeterCordes So what you're saying is you agree.
  • Peter Cordes
    Peter Cordes over 7 years
    Yes, asm is fun, except that inline asm is usually the wrong choice even for playing around. This is technically an inline-asm question, so it would be good to at least address this point in your answer. Also, this is really more of a comment than an answer.
  • madoki
    madoki over 7 years
OK, agreed. I used to be an asm-only guy, but that was the '80s.
  • Peter Cordes
    Peter Cordes over 7 years
    These days compilers can sometimes make near-optimal asm if you hand-hold them to it by modifying your C source. And sometimes they spot an optimization you didn't while you're doing that. But this can get un-fun quickly in cases where they don't "see" the asm trick you want them to use.
  • Tommylee2k
    Tommylee2k about 5 years
    Very long answer: "Yes, unless you feel like changing your whole code whenever a new(er) CPU is used. Pick the best algorithm, but let the compiler do the optimization"
  • Peter Cordes
    Peter Cordes about 5 years
    Current GCC and Clang auto-vectorize (after checking for non-overlap if you omit __restrict). SSE2 is baseline for x86-64, and with shuffling SSE2 can do 2x 32-bit multiplies at once (producing 64-bit products, hence the shuffling to put the results back together). godbolt.org/z/r7F_uo. (SSE4.1 is needed for pmulld: packed 32x32 => 32-bit multiply). GCC has a neat trick of turning constant integer multipliers into shift/add (and/or subtract), which is good for multipliers with few bits set. Clang's shuffle-heavy code is going to bottleneck on shuffle throughput on Intel CPUs.
  • Peter Cordes
    Peter Cordes almost 5 years
    A good compiler will already auto-vectorize with paddd xmm (after checking for overlap between x and y, because you didn't use int *__restrict x). For example gcc does that: godbolt.org/z/c2JG0-. Or after inlining into main, it shouldn't need to check for overlap because it can see the allocation and prove they're non-overlapping. (And it would get to assume 16-byte alignment on some x86-64 implementations, too, which isn't the case for the stand-alone definition.) And if you compile with gcc -O3 -march=native, you can get 256-bit or 512-bit vectorization.
  • hanshenrik
    hanshenrik almost 5 years
You can always produce an example where handmade assembly code is better than compiled code - in theory, that's not quite true. There's no reason a compiler could not emit the absolutely quickest possible assembly code (it's just very unlikely), and likewise, when writing in assembly it's also theoretically possible to write the absolutely quickest possible assembly code (it's just very unlikely; I think these guys tried, though).
  • Adriano Repetti
    Adriano Repetti almost 5 years
    @hans I agree, it's possible, indeed, and extremely unlikely (given a finite compilation time and finite resources for writing the optimizer). Interesting code BTW, it would be nice to see a comparison with good C code but in this case I admit that assembly might even be more portable across compilers than intrinsics in C.
  • IGR94
    IGR94 over 4 years
    @phuclv well yes, but original question was exactly about speed, not size.
  • supercat
    supercat over 2 years
Getting clang and gcc to produce optimal machine code for platforms like the Cortex-M0 is hard, even when performing tasks which should be simple (e.g. for (int i=0; i<n; i+=2) foo[i] += 0x12345678;). Hand-optimizing the assembly to five instructions per loop, or 14 per 4x-unrolled loop, is pretty simple. Coaxing clang to generate a five-instruction loop is hard, and I can't figure out any way to write the code so gcc will do so, though curiously enough I could manage a six-instruction loop in -O0 which was better than the eight-instruction loop gcc produced from the same code...
  • supercat
    supercat over 2 years
...with optimizations enabled.
  • Spencer
    Spencer almost 2 years
    Finally, a correct use of "plethora"!