When is assembly faster than C?


Solution 1

Here is a real world example: Fixed point multiplies on old compilers.

These don't only come in handy on devices without floating point; they shine when it comes to precision, as they give you 32 bits of precision with a predictable error (float only has 23 bits, and it's harder to predict precision loss there), i.e. uniform absolute precision over the entire range, instead of the close-to-uniform relative precision of float.


Modern compilers optimize this fixed-point example nicely, so for more modern examples that still need compiler-specific code, see the later sections of this answer.


C doesn't have a full-multiplication operator (2N-bit result from N-bit inputs). The usual way to express it in C is to cast the inputs to the wider type and hope the compiler recognizes that the upper bits of the inputs aren't interesting:

// on a 32-bit machine, int can hold 32-bit fixed-point integers.
int inline FixedPointMul (int a, int b)
{
  long long a_long = a; // cast to 64 bit.

  long long product = a_long * b; // perform multiplication

  return (int) (product >> 16);  // shift by the fixed point bias
}
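
For context, a minimal usage sketch (not from the original answer), assuming a Q16.16 format, i.e. 16 fraction bits to match the shift by 16:

int a = (int)(1.5 * 65536.0);  // 1.5 in Q16.16:  98304
int b = (int)(2.0 * 65536.0);  // 2.0 in Q16.16: 131072
int p = FixedPointMul(a, b);   // 3.0 in Q16.16: 196608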

The problem with this code is that we do something that can't be directly expressed in the C language. We want to multiply two 32-bit numbers and get a 64-bit result, of which we return the middle 32 bits. However, in C this multiply does not exist. All you can do is promote the integers to 64 bits and do a 64 × 64 = 64 multiply.

x86 (and ARM, MIPS, and others) can, however, do the multiply in a single instruction. Some compilers used to ignore this fact and generate code that called a runtime library function to do the multiply. The shift by 16 was also often done by a library routine, even though x86 can do such shifts in a single instruction.

So we're left with one or two library calls just for a multiply. This has serious consequences: not only is the library-routine shift slower, registers must be preserved across the function calls, and it doesn't help inlining or loop unrolling either.

If you rewrite the same code in (inline) assembler you can gain a significant speed boost.

In addition to this: using ASM is not even the best way to solve the problem. Most compilers let you use some assembler instructions in intrinsic form when you can't express them in C. The VS.NET 2008 compiler, for example, exposes the 32*32 = 64-bit mul as __emul and the 64-bit shift as __ll_rshift.

Using intrinsics, you can rewrite the function in a way that gives the C compiler a chance to understand what's going on. This allows the code to be inlined and register-allocated, and lets common subexpression elimination and constant propagation happen as well. You get a huge performance improvement over the hand-written assembler code that way.

For reference: The end-result for the fixed-point mul for the VS.NET compiler is:

int inline FixedPointMul (int a, int b)
{
    return (int) __ll_rshift(__emul(a,b),16);
}

The performance difference for fixed-point divides is even bigger. I got improvements of up to a factor of 10 in division-heavy fixed-point code by writing a couple of lines of asm.
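
For illustration, the plain-C version of a fixed-point divide looks like this (a sketch, not the asm from the original answer); it's the widening of the dividend to 64 bits that old compilers turned into an expensive library call:

int inline FixedPointDiv (int a, int b)
{
    return (int) (((long long) a << 16) / b);  // 64/32 => 32-bit divide
}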


Using Visual C++ 2013 gives the same assembly code for both ways.

GCC 4.1 from 2007 also optimizes the pure C version nicely. (The Godbolt compiler explorer doesn't have any earlier versions of gcc installed, but presumably even older GCC versions could do this without intrinsics.)

See source + asm for x86 (32-bit) and ARM on the Godbolt compiler explorer. (Unfortunately it doesn't have any compilers old enough to produce bad code from the simple pure C version.)


Modern CPUs can do things C doesn't have operators for at all, like popcnt or bit-scan to find the first or last set bit. (POSIX has a ffs() function, but its semantics don't match x86 bsf / bsr. See https://en.wikipedia.org/wiki/Find_first_set).

Some compilers can sometimes recognize a loop that counts the number of set bits in an integer and compile it to a popcnt instruction (if enabled at compile time), but it's much more reliable to use __builtin_popcount in GNU C, or _mm_popcnt_u32 from <immintrin.h> on x86, if you're only targeting hardware with SSE4.2.

Or in C++, assign to a std::bitset<32> and use .count(). (This is a case where the language has found a way to portably expose an optimized implementation of popcount through the standard library, in a way that will always compile to something correct, and can take advantage of whatever the target supports.) See also https://en.wikipedia.org/wiki/Hamming_weight#Language_support.
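
A quick sketch of both options (the function names are illustrative, not standard):

#include <bitset>

// GNU C builtin: compiles to a single popcnt instruction when the target
// supports it, or to a small bit-twiddling routine otherwise.
unsigned popcount32(unsigned x)
{
    return __builtin_popcount(x);
}

// Portable C++ alternative: library implementations typically optimize
// std::bitset::count() to the same thing.
unsigned popcount32_portable(unsigned x)
{
    return (unsigned) std::bitset<32>(x).count();
}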

Similarly, ntohl can compile to bswap (x86 32-bit byte swap for endian conversion) on some C implementations that have it.
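
GNU C also exposes the byte swap directly as a builtin; a minimal sketch (the helper name is illustrative):

#include <stdint.h>

uint32_t swap_to_big_endian(uint32_t x)
{
    return __builtin_bswap32(x);   // a single bswap instruction on x86
}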


Another major area for intrinsics or hand-written asm is manual vectorization with SIMD instructions. Compilers are not bad with simple loops like dst[i] += src[i] * 10.0;, but often do badly or don't auto-vectorize at all when things get more complicated. For example, you're unlikely to get anything like "How to implement atoi using SIMD?" generated automatically by the compiler from scalar code.
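
To give a taste of what manual vectorization looks like, here is a sketch of the simple loop above using SSE intrinsics - assuming float data rather than double, and n being a multiple of 4:

#include <xmmintrin.h>

void scale_accumulate(float *dst, const float *src, int n)
{
    const __m128 ten = _mm_set1_ps(10.0f);
    for (int i = 0; i < n; i += 4) {
        __m128 s = _mm_loadu_ps(src + i);        // load 4 floats
        __m128 d = _mm_loadu_ps(dst + i);
        d = _mm_add_ps(d, _mm_mul_ps(s, ten));   // dst += src * 10
        _mm_storeu_ps(dst + i, d);               // store 4 results
    }
}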

Solution 2

Many years ago I was teaching someone to program in C. The exercise was to rotate a graphic through 90 degrees. He came back with a solution that took several minutes to complete, mainly because he was using multiplies and divides etc.

I showed him how to recast the problem using bit shifts, and the time to process came down to about 30 seconds on the non-optimizing compiler he had.
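
The original code isn't shown, but a typical recast of this kind replaces a pixel-address multiply with a shift, assuming a power-of-two image width:

int offset = (y << 9) + x;   // same as y * 512 + x for a 512-wide image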

I had just got an optimizing compiler and the same code rotated the graphic in < 5 seconds. I looked at the assembly code that the compiler was generating, and from what I saw decided there and then that my days of writing assembler were over.

Solution 3

Pretty much any time the compiler sees floating-point code, a hand-written version will be quicker if you're using an old, bad compiler. (2019 update: this is not true in general for modern compilers. Especially when compiling for anything other than x87, compilers have an easier time with SSE2 or AVX for scalar math, or with any non-x86 target that has a flat FP register set, unlike x87's register stack.)

The primary reason is that the compiler can't perform robust optimisations on floating point. See this article from MSDN for a discussion of the subject. Here's an example where the assembly version is twice the speed of the C version (compiled with VS2K5):

#include "stdafx.h"
#include <windows.h>

float KahanSum(const float *data, int n)
{
   float sum = 0.0f, C = 0.0f, Y, T;   // C holds the running compensation

   for (int i = 0 ; i < n ; ++i) {
      Y = *data++ - C;     // corrected next term
      T = sum + Y;         // new sum; low-order bits of Y may be lost here
      C = T - sum - Y;     // recover what was just lost
      sum = T;
   }

   return sum;
}

float AsmSum(const float *data, int n)
{
  float result = 0.0f;

  _asm
  {
    mov esi,data          ; esi = input pointer
    mov ecx,n             ; ecx = element count
    fldz                  ; running sum = 0
    fldz                  ; compensation C = 0 (st(0) = C, st(1) = sum)
l1:
    fsubr [esi]           ; st(0) = Y = data[i] - C
    add esi,4
    fld st(0)             ; duplicate Y
    fadd st(0),st(2)      ; st(0) = T = Y + sum
    fld st(0)             ; duplicate T
    fsub st(0),st(3)      ; st(0) = T - sum
    fsub st(0),st(2)      ; st(0) = new C = (T - sum) - Y
    fstp st(2)            ; stack is now T, C, sum
    fstp st(2)            ; sum = T; stack is now C, sum
    loop l1
    fstp result           ; pop C (overwritten next)
    fstp result           ; result = sum
  }

  return result;
}

int main (int, char **)
{
  int count = 1000000;

  float *source = new float [count];

  for (int i = 0 ; i < count ; ++i) {
    source [i] = static_cast <float> (rand ()) / static_cast <float> (RAND_MAX);
  }

  LARGE_INTEGER start, mid, end;

  float sum1 = 0.0f, sum2 = 0.0f;

  QueryPerformanceCounter (&start);

  sum1 = KahanSum (source, count);

  QueryPerformanceCounter (&mid);

  sum2 = AsmSum (source, count);

  QueryPerformanceCounter (&end);

  cout << "  C code: " << sum1 << " in " << (mid.QuadPart - start.QuadPart) << endl;
  cout << "asm code: " << sum2 << " in " << (end.QuadPart - mid.QuadPart) << endl;

  delete [] source;

  return 0;
}

And some numbers from my PC running a default release build:

  C code: 500137 in 103884668
asm code: 500137 in 52129147

Out of interest, I swapped the loop instruction for dec/jnz and it made no difference to the timings - sometimes quicker, sometimes slower. I guess the memory-limited aspect dwarfs other optimisations. (Editor's note: more likely the FP latency chain is enough to hide the extra cost of the loop instruction. Doing two Kahan summations in parallel for the odd/even elements, and adding them at the end, could maybe speed this up by a factor of 2.)

Whoops, I was running a slightly different version of the code, and it printed the numbers the wrong way round (i.e., C was faster!). Fixed and updated the results.

Solution 4

Without giving any specific example or profiler evidence: you can write better assembler than the compiler when you know more than the compiler does.

In the general case, a modern C compiler knows much more about how to optimize the code in question: it knows how the processor pipeline works, it can try instruction orderings faster than a human can, and so on. It's basically the same as a computer being as good as or better than the best human player at board games, simply because it can search the problem space faster than most humans. Although you theoretically could perform as well as the computer in a specific case, you certainly can't do it at the same speed, making it infeasible for more than a few cases (i.e., the compiler will most certainly outperform you if you try to write more than a few routines in assembler).

On the other hand, there are cases where the compiler does not have as much information - I'd say primarily when working with different forms of external hardware, of which the compiler has no knowledge. The primary example is device drivers, where assembler combined with a human's intimate knowledge of the hardware in question can yield better results than a C compiler could.

Others have mentioned special-purpose instructions, which is what I'm talking about in the paragraph above - instructions of which the compiler might have limited or no knowledge, making it possible for a human to write faster code.

Solution 5

In my job, there are three reasons for me to know and use assembly. In order of importance:

  1. Debugging - I often get library code that has bugs or incomplete documentation. I figure out what it's doing by stepping in at the assembly level. I have to do this about once a week. I also use it as a tool to debug problems in which my eyes don't spot the idiomatic error in C/C++/C#. Looking at the assembly gets past that.

  2. Optimizing - the compiler does fairly well in optimizing, but I play in a different ballpark than most. I write image processing code that usually starts with code that looks like this:

    for (int y=0; y < imageHeight; y++) {
        for (int x=0; x < imageWidth; x++) {
           // do something
        }
    }
    

    the "do something part" typically happens on the order of several million times (ie, between 3 and 30). By scraping cycles in that "do something" phase, the performance gains are hugely magnified. I don't usually start there - I usually start by writing the code to work first, then do my best to refactor the C to be naturally better (better algorithm, less load in the loop etc). I usually need to read assembly to see what's going on and rarely need to write it. I do this maybe every two or three months.

  3. Doing something the language won't let me. This includes getting at the processor architecture and specific processor features, and accessing flags in the CPU that the language doesn't expose (man, I really wish C gave you access to the carry flag; see the sketch after this list). I do this maybe once every year or two.
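
As a modern aside (not part of the original answer): GNU C now exposes carry/overflow through builtins such as __builtin_add_overflow, so this particular itch can sometimes be scratched without asm. A sketch, with an illustrative helper name:

#include <stdint.h>

uint32_t add_with_carry_out(uint32_t a, uint32_t b, unsigned *carry)
{
    uint32_t sum;
    *carry = __builtin_add_overflow(a, b, &sum);   // 1 if the add carried out
    return sum;
}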
