Using Assembly Language in C/C++

c++ c optimization compiler-optimization assembly

17,054

Solution 1

The only time it's useful to revert to assembly language is when

- the CPU instructions don't have functional equivalents in C++ (e.g. single-instruction-multiple-data instructions, BCD or decimal arithmetic operations)
  - AND the compiler doesn't provide extra functions to wrap these operations (e.g. C++11 Standard has atomic operations including compare-and-swap, <cstdlib> has div/ldiv et al for getting quotient and remainder efficiently)
  - AND there isn't a good third-party library (e.g. http://mitpress.mit.edu/catalog/item/default.asp?tid=3952&ttype=2)

OR

- for some inexplicable reason, the optimiser is failing to use the best CPU instructions

...AND...

- the use of those CPU instructions would give some significant and useful performance boost to bottleneck code.
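
As a rough illustration of the "compiler already wraps it" case, here is a minimal sketch that uses only standard-library facilities: std::div for quotient and remainder, and a C++11 compare-and-swap loop. The helper names quot_rem and atomic_store_max are hypothetical and exist only for this example.

```cpp
#include <atomic>
#include <cstdlib>

// Quotient and remainder from a single standard-library call.
inline std::div_t quot_rem(int numerator, int denominator) {
    return std::div(numerator, denominator);
}

// Hypothetical helper: raise `current_max` to `candidate` if `candidate`
// is larger, using a compare-and-swap retry loop instead of inline assembly.
inline void atomic_store_max(std::atomic<int>& current_max, int candidate) {
    int observed = current_max.load();
    while (observed < candidate &&
           !current_max.compare_exchange_weak(observed, candidate)) {
        // On failure, `observed` now holds the latest value; the loop retries.
    }
}
```

Calling quot_rem(17, 5) yields a quotient of 3 and a remainder of 2. Neither helper needs any platform-specific code.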

Simply using inline assembly to do an operation that can easily be expressed in C++ - like adding two values or searching in a string - is actively counterproductive, because:

- the compiler knows how to do this equally well
  - to verify this, look at its assembly output (e.g. gcc -S) or disassemble the machine code
- you're artificially restricting its choices regarding register allocation, CPU instructions etc., so it may take longer to prepare the CPU registers with the values needed to execute your hardcoded instruction, then longer to get back to an optimal allocation for future instructions
  - compiler optimisers can choose between equivalent-performance instructions specifying different registers to minimise copying between them, and may choose registers in such a way that a single core can process multiple instructions during one cycle, whereas forcing everything through specific registers would serialise it
    - in fairness, GCC has ways to express needs for specific types of registers without constraining the CPU to an exact register, still allowing such optimisations, but it's the only inline assembly I've ever seen that addresses this
- if a new CPU model comes out next year with another instruction that's 1000% faster for that same logical operation, then the compiler vendor is more likely to update their compiler to use that instruction, and hence your program to benefit once recompiled, than you are (or whoever's maintaining the software then is)
- the compiler will select an optimal approach for the target architecture it's told about: if you hardcode one solution then it will need to be a lowest-common-denominator or #ifdef-ed for your platforms
- assembly language isn't as portable as C++, both across CPUs and across compilers, and even if you seemingly port an instruction, it's possible to make a mistake re registers that are safe to clobber, argument passing conventions etc.
- other programmers may not know or be comfortable with assembly
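
To make the first point concrete, here is a small sketch of the two operations mentioned above (adding two values and searching in a string) written in plain C++. The function names and the file name example.cpp are assumptions made for this illustration.

```cpp
#include <cstddef>
#include <string>

// Plain C++ addition. An optimising compiler typically reduces this to a
// single arithmetic instruction, so there is nothing left to hand-tune.
int add_values(int a, int b) {
    return a + b;
}

// Plain C++ string search via the standard library.
// Returns the index of the first match, or std::string::npos if absent.
std::size_t find_in_string(const std::string& haystack,
                           const std::string& needle) {
    return haystack.find(needle);
}

// To inspect what the compiler generated (assuming GCC and that this
// code is saved as example.cpp):
//   g++ -O2 -S example.cpp -o example.s
```

Reading example.s before reaching for inline assembly shows whether there is anything left to improve by hand.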

One perspective that I think's worth keeping in mind is that when C was introduced it had to win over a lot of hardcore assembly language programmers who fussed over the machine code generated. Machines had less CPU power and RAM back then and you can bet people fussed over the tiniest thing. Optimisers became very sophisticated and have continued to improve, whereas the assembly languages of processors like the x86 have become increasingly complicated, as have their execution pipelines, caches and other factors involved in their performance. You can't just add values from a table of cycles-per-instruction any more. Compiler writers spend time considering all those subtle factors (especially those working for CPU manufacturers, but that ups the pressure on other compilers too). It's now impractical for assembly programmers to average - over any non-trivial application - significantly better efficiency of code than that generated by a good optimising compiler, and they're overwhelmingly likely to do worse. So, use of assembly should be limited to times it really makes a measurable and useful difference, worth the coupling and maintenance costs.

Solution 2

First of all, you need to profile your program. Then you optimize the most used paths in C or C++ code. Unless advantages are clear you don't rewrite in assembler. Using assembler makes your code harder to maintain and much less portable - it is not worth it except in very rare situations.

Solution 3

(1) Yes, the easiest way to try this out is to use inline assembly. The syntax is compiler dependent; in MSVC-style compilers it looks something like this:

__asm
{
    mov eax, ebx    ; copy the value of EBX into EAX
}

(2) This is highly subjective

(3) Because you might be able to write more effective assembly code than the compiler generates.
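
For comparison with the block in (1), here is a minimal sketch of the GNU C extended asm form that GCC and Clang accept, as recommended in the comments below. It assumes an x86 target and the default AT&T operand order, and it falls back to plain C++ elsewhere. The helper name add_with_asm is hypothetical.

```cpp
// Hypothetical helper: returns a + b.
// On x86 with a GNU-compatible compiler it wraps a single add instruction
// in extended inline asm; on any other target it uses plain C++.
inline int add_with_asm(int a, int b) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    // AT&T syntax: "addl source, destination".
    // %0 is `a`: "+r" means read-write, in any general-purpose register.
    // %1 is `b`: "r" means input, in any general-purpose register.
    // "cc" tells the compiler that the flags register is modified.
    __asm__("addl %1, %0" : "+r"(a) : "r"(b) : "cc");
    return a;
#else
    return a + b;
#endif
}
```

The constraints leave register selection to the compiler, so the instruction can be slotted into the surrounding code without forcing values through fixed registers. For something this simple, plain a + b remains the better choice.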

Solution 4

You should read the classic book Zen of Code Optimization and the followup Zen of Graphics Programming by Michael Abrash.

In summary, in the first book he explained how to push assembly programming to its limits. In the followup he explained that programmers should rather use a higher level language like C and only try to optimize very specific spots using assembly, if necessary at all.

One motivation for this change of mind was that he saw that programs highly optimized for one generation of processor could become (somewhat) slow on the next generation of the same processor family, compared to code compiled from a high level language (for instance because the compiler uses new instructions, or because the performance and behavior of existing ones change from one processor generation to another).

Another reason is that compilers are quite good and optimize aggressively nowadays; there is usually much more performance to gain by working on algorithms than by converting C code to assembly. Even GPU (graphics card processor) programming can be done in C using CUDA or OpenCL.

There are still some (rare) cases when you should/have to use assembly, usually to get very fine control on the hardware. But even in OS kernel code it's usually very small parts and not that much code.

Solution 5

I don't think you specified the processor, and the answer differs depending on the processor and the environment. The general answer is yes, it is still done; it is certainly not archaic. The general reason is the compilers: sometimes they do a good job at optimizing in general but not really well for specific targets. Some are really good at one target and not so good at others. Most of the time it is good enough; most of the time you want portable C code and not non-portable assembler. But you will still find that C libraries hand optimize memcpy and other routines, because the compiler simply cannot figure out that there is a very fast way to implement them. In part that is because the corner case is not worth spending time on making the compiler optimize for: just solve it in assembler, and the build system has a lot of "if this target then use C, if that target use C, if that target use asm, if that target use asm". So it still occurs, and I argue must continue forever in some areas.

x86 is its own beast with a lot of history. We are at a point where you really cannot, in a practical manner, write one blob of assembler that is always faster. You can definitely optimize routines for a specific processor on a specific machine on a specific day and outperform the compiler, but other than for some specific cases it is generally futile: educational, but overall not worth the time. Also note the processor is no longer the bottleneck, so a sloppy generic C compiler is good enough; find the performance elsewhere.

Other platforms often means embedded: arm, mips, avr, msp430, pic, etc. You may or may not be running an operating system, and you may or may not be running with a cache or other such things that your desktop has, so the weaknesses of the compiler will show. Also note that programming languages continue to evolve away from processors instead of toward them. Even C, considered perhaps to be a low level language, doesn't match the instruction set. There will always be times where you can produce segments of assembler that outperform the compiler. Not necessarily the segment that is your bottleneck, but across the entire program you can often make improvements here and there. You still have to check the value of doing that. In an embedded environment it can and does make the difference between success and failure of a product. Suppose your product has $25 per unit invested in more power hungry, higher speed processors that take more board real estate so you don't have to use assembler, but your competitor spends $10 or less per unit and is willing to mix asm with C to use smaller memories, less power, cheaper parts, etc. Then, so long as the NRE is recovered, the mixed-with-asm solution will win in the long run.

True embedded is a specialized market with specialized engineers. Another embedded market (your embedded linux roku, tivo, etc., embedded phones, etc.) needs portable operating systems to survive, because you need third party developers. So the platform has to be more like a desktop than an embedded system. Buried in the C library as mentioned, or in the operating system, there may be some assembler optimizations, but as with the desktop you want to throw more hardware at it so the software can be portable instead of hand optimized. And your product line or embedded operating system will fail if assembler is required for third party success.

The biggest concern I have is that this knowledge is being lost at an alarming rate, because nobody inspects the assembler, because nobody writes in assembler, etc. Nobody is noticing that the compilers have not been improving when it comes to the code being produced. Developers often think they have to buy more hardware instead of realizing that by either knowing the compiler or knowing how to program better they can improve their performance by 5 to several hundred percent with the same compiler, sometimes with the same source code; 5-10% is usual with the same source code and compiler. gcc 4 does not always produce better code than gcc 3, and I keep both around because sometimes gcc 3 does better. Target specific compilers can (not always do) run circles around gcc; you can sometimes see a few hundred percent improvement with the same source code and a different compiler. Where does all of this come from? The folks that still bother to look at and/or use assembler. Some of those folks work on the compiler backends. The front end and middle are fun and educational certainly, but the backend is where you make or break the quality and performance of the resulting program. Even if you never write assembler but only look at the output from the compiler from time to time (gcc -O2 -S myprog.c), it will make you a better high level programmer and will retain some of this knowledge. If nobody is willing to know and write assembler then by definition we have given up on writing and maintaining compilers for high level languages, and software in general will cease to exist.

Understand that with gcc for example the output of the compiler is assembly that is passed to an assembler which turns it into object code. The C compiler does not normally produce binaries. The objects when combined into the final binary, are done by the linker, yet another program that is called by the compiler and not part of the compiler. The compiler turns C or C++ or ADA or whatever into assembler then the assembler and linker tools take it the rest of the way. Dynamic recompilers, like tcc for example, must be able to generate binaries on the fly somehow, but I see that as the exception not the rule. LLVM has its own runtime solution as well as quite visibly showing the high level to internal code to target code to binary path if you use it as a cross compiler.

So back to the point: yes it is done, more often than you think. It mostly has to do with the language not mapping directly to the instruction set, and then the compiler not always producing fast enough code. If you can get, say, dozens of times improvement on heavily used functions like malloc or memcpy, or want to have an HD video player on your phone without hardware support, balance the pros and cons of assembler. Truly embedded markets still use assembler quite a bit; sometimes it is all C but sometimes the software is completely coded in assembler. For desktop x86, the processor is not the bottleneck. The processors are microcoded. Even if you make beautiful looking assembler on the surface, it won't run really fast on all families of x86 processors; sloppy, good enough code is more likely to run about the same across the board.

I highly recommend learning assembler for non-x86 ISAs like arm, thumb/thumb2, mips, msp430, avr. Targets that have compilers, particularly ones with gcc or llvm compiler support. Learn the assembler, learn to understand the output of the C compiler, and prove that you can do better by actually modifying that output and testing it. This knowledge will help make your desktop high level code much better without assembler, faster and more reliable.


Srikar Appalaraju


Updated on December 17, 2020

Comments

Srikar Appalaraju over 3 years
I remember reading somewhere that to really optimize & speed up certain section of the code, programmers write that section in Assembly language. My questions are -
1. Is this practice still done? and How does one do this?
2. Isn't writing in Assembly Language a bit too cumbersome & archaic?
3. When we compile C code (with or without -O3 flag), the compiler does some code optimization & links all libraries & converts the code to binary object file. So when we run the program it is already in its most basic form i.e. binary. So how does inducing 'Assembly Language' help?
I am trying to understand this concept & any help or links is much appreciated.

UPDATE: Rephrasing point 3 as requested by dbemerlin: because you might be able to write more effective assembly code than the compiler generates, but unless you are an assembler expert your code will probably run slower, because often the compiler optimizes the code better than most humans can.
- Armen Tsirunyan over 13 years
  
  Nice question, correctly phrased. +1
- Stephan B. over 13 years
  
  This is probably one of only five questions on SO where using "C/C++" makes sense.
Srikar Appalaraju over 13 years

profile my program? You mean this would help me decide if I want to use Assembly?
In silico over 13 years

@MovieYoda: No, it helps you figure out where the bottleneck is. That way, you don't waste your time trying to optimize a piece of code that isn't even a major factor in performance. Generally, writing assembly in C or C++ code should be done only as a very last resort. Often, just using different algorithms or data structures will speed up code.
andrewmu over 13 years

Yes, as it will tell you where your program is spending most of its time and would benefit from optimisation. However, you should look to see if your code would benefit from a better algorithm rather than brute force assembler.
sharptooth over 13 years

@MovieYoda: Yes, you might find such dumb pieces of code that just rewriting them (still in C or C++) will give a tremendous boost. For example, if you call strlen() in a loop while the string length doesn't change, rewriting that in assembler is a waste of time - you just use a temporary variable to store the length and (magic!) your program likely runs noticeably faster.
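
The strlen() case described in this comment can be sketched as follows. Both function names are hypothetical, and an optimising compiler may or may not hoist the call on its own, so the second version simply makes the intent explicit.

```cpp
#include <cstddef>
#include <cstring>

// Re-evaluates strlen() in the loop condition on every iteration, which
// can turn a linear scan into quadratic work if the call is not hoisted.
std::size_t count_char_slow(const char* text, char target) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < std::strlen(text); ++i) {
        if (text[i] == target) ++count;
    }
    return count;
}

// Stores the length in a temporary once, as the comment suggests.
std::size_t count_char_fast(const char* text, char target) {
    const std::size_t length = std::strlen(text);
    std::size_t count = 0;
    for (std::size_t i = 0; i < length; ++i) {
        if (text[i] == target) ++count;
    }
    return count;
}
```

Both versions return the same count. The change is purely in C++, with no assembler involved.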
andrewmu over 13 years

I think games programmers are the only people who use ASM in programming nowadays.
Srikar Appalaraju over 13 years

so you mean compiler is good enough but if compiler fails to optimize certain sections then use assembly?
In silico over 13 years

@VJo: Note that the article covers optimization of math-intensive routines via the processor's instruction set. In that specific case, writing assembly may be a benefit, but not in the general case.
sharptooth over 13 years

@MovieYoda: No compiler will help against really dumb code - first profile the program and try to optimize it without assembler.
In silico over 13 years

@MovieYoda: For some very special cases, one may be able to take advantage of the available hardware. Generally, however, writing inline assembly in C++ is not done often as the compiler does a good enough job of optimizing code (assuming non-WTF code), and the smarter ones may sometimes optimize better than by hand since optimizations can be very counterintuitive.
Srikar Appalaraju over 13 years

cool! So this practice is called 'inline assembly'! Nice... So basically, this practice is severely hardware & platform dependent? because each hardware & platform have small variations in their instruction set?
Morfildur over 13 years

You might want to change (3) to: Because you might be able to write more effective assembly code than the compiler generates, but unless you are an assembler expert your code will probably run slower, because often the compiler optimizes the code better than most humans can.
Srikar Appalaraju over 13 years

@sharptooth got it. Loving the links you all are sharing.
Andreas Brinck over 13 years

I think "might" covers it, I don't think you can be more quantitative than that.
Srikar Appalaraju over 13 years

aaah! now I remember where I read about this practice. It was written about the game "Need for Speed" using assembly. Naturally I was stunned
CB Bailey over 13 years

I disagree with (1). The easiest way is usually with 'out of line' assembly source files. This way you get proper syntax highlighting and can use an assembler designed for humans with useful features such as more powerful macros. I usually recommend yasm.
BЈовић over 13 years

Writing assembly code better than what the compiler produces is hard, and should be done only after lots of profiling on a big data set.
Andreas Brinck over 13 years

@Charles I meant easy as in easy to try out. I agree that if you're going to do a lot of assembly coding you're better off with an external assembler.
kriss over 13 years

@graham.reeds: it was true a few years back, but with GPU layers like CUDA I'm not sure it's still true for game programmers. There are still some small spots for kernel or driver programming, or some embedded devices.
Goz over 13 years

@Kriss: Assembler will ALWAYS be used in games development. Regardless of that using assembler is incredibly useful on any platform including PCs. I had some audio convolution code that I re-wrote in assembler and got a 5 times speed up of the convolution over straight C.
flacs over 13 years

If someone actually did write assembly code with the intention of optimizing a certain code fragment for speed, he would have to know what CPU the code will be running on and how this one particular CPU works internally. Most modern CPUs are able to execute multiple instructions simultaneously (in one core) by analyzing which instructions don't depend on the result of others plus a whole lot of other means of speeding up program execution.
Benoît over 13 years

FMI, do you manage multiple targets ? or using standard x86 instructions (not SSE or others) ?
steabert over 13 years

Just wanted to add that for this kind of mathematically intensive stuff, there are libraries like Intel MKL or AMD ACML which contain highly optimized functions that use assembly kernels. nag.co.uk/industryArticles/HighPerformanceMathLibraries.pdf
Gunther Piez over 13 years

And there are people (most likely not the ones who ask this kind of question) who know the internals of CPUs, write different code paths for different kinds of CPUs and are actually able to produce faster code than any compiler. See agner.org/optimize for some interesting stuff. So replacing "anyone" in (3) of your answer by "most people" would be more correct.
daramarak over 13 years

I think you overestimate the compiler's ability to understand the program. Yes, the compiler would know how to shuffle instructions to optimize the use of the pipeline. But it knows very little about which variables/functions depend on each other. Because of this you are able to beat it, and I and others in my team have written assembler code that outperforms the compiler.
kriss over 13 years

@Goz: I'm old enough to be quite sceptical when I hear "always". It may still be useful to use assembler for a while, but you should not bet on the future. For now, even in games, very few people work on game engines (where assembly is useful), and the same engines are used in many games. When optimization becomes hard enough, you get the Duke Nukem Forever effect. You haven't yet finished optimizing when the next hardware generation arrives, and you have to restart from scratch because everything changed and your old optimized code is now less efficient than compiled code on new hardware...
kriss over 13 years

@Goz: and also 5 times is not much. In the past say 15 years ago, I used to optimize low level game code to assembly. And I rarely got code less than 10 times faster, quite often 50 times faster. That may show how compilers evolved from that time. Also being aware of all cache, behavior, reordering, instruction fetching effects, specialized instruction set, etc. is not easy.
daramarak over 13 years

@dbemerlin you do not need to be an expert to optimize compiler generated code. You just need to find the right spot, and know something that the compiler does not take into consideration. Looking at the generated code is the best. Often you will find that the compiler adds safeguards where no such safeguarding is necessary. Skipping one load in the core of a loop might do marvels in the right spot in the code.
Goz over 13 years

@Kriss: Perhaps always is a bit much but if I ever come across a vectorising compiler that can vectorise better than I can, I'll eat my hat (The chocolate one ;)).
Mike Dunlavey over 13 years

@MovieYoda: Here's a piece I did (stackoverflow.com/questions/926266/…) showing how to find code that is actually worth optimizing, and cycle-squeezing (like writing asm) is almost never what is needed.
Srikar Appalaraju over 13 years

Well, I wasn't looking for any specific processor. I wanted to understand this practice & the reasons why one would take this approach. Just updating my knowledge...
Peter Cordes almost 5 years

It's not just using new instructions that makes a difference. Tuning choices like whether / how much to unroll, which instructions to use (loop vs. dec/jnz, sub/mov vs. push) changed immensely between 8086 and 686. And 586 in-order superscalar pentium was an outlier where it could pipeline simple instructions, making it worth it to use more simpler instructions vs. fewer complex instructions. Later CPUs can decode complex ones to multiple uops, but 586 couldn't and would just stall the pipeline.
Peter Cordes almost 5 years

Also, tuning for 8086 = usually minimize code size because instruction fetch was the major bottleneck. Tuning for modern x86 = minimize uop count, and latency of dependency chains. Anyway yes, unless you need to tune the hell out of one hot loop for a limited set of CPU microarchitectures, you don't need hand-written asm. Compilers are pretty good, but certainly do have missed optimizations all over the place. But usually pretty minor, especially if you're running on modern x86 with wide pipelines to eat up wasted instructions so you still bottleneck mostly on memory.
Peter Cordes almost 5 years

MSVC inline assembly only works on 32-bit x86, so it's pretty bad and pretty much obsolete. GNU C inline assembly might be a better example because it can efficiently wrap a single instruction without forcing the compiler to bounce the input through memory. gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html. and stackoverflow.com/tags/inline-assembly/info. Of course gcc.gnu.org/wiki/DontUseInlineAsm if you can get the compiler to produce as good asm (for current CPUs) using intrinsics or pure C; that will be more future proof.
Peter Cordes almost 5 years

The main reason for knowing asm is to tweak your C to compile more efficiently, or to be a compiler developer. Not to actually write asm.
Kaz almost 5 years

The easiest way is definitely not inline assembly. Firstly, the syntax varies from compiler to compiler. Secondly, your code is inserted into surrounding compiler-generated code where you have no clue what registers you are allowed to use (or else you have to learn a special language for using compiler-allocated registers, like with GCC). By far the easiest way to use assembly from C is to write separate functions in an assembly language file, interfacing with C using the platform's documented calling conventions.
GoodJuJu over 3 years

Welcome to SO! Please read the tour and How to Answer a question. The question was asked 10 years ago and has accepted answers.
Peter Cordes over 3 years

GNU C Basic asm (with just a string of instructions, no input/output/clobbers constraints) is obsolete and dangerous, and can't safely be used for much of anything. See gcc.gnu.org/wiki/ConvertBasicAsmToExtended. e.g. referencing global variables by symbol name is not safe, neither is modifying any registers, and (in x86-64) neither is using any stack space (unless you skip the red zone). Never use it inside a function. See stackoverflow.com/tags/inline-assembly/info for guides to GNU C Extended asm. (e.g. asm ("add %1, %0" : "+r"(var) : "r"(var)))