Using SSE instructions

c++ optimization assembly processor sse

37,976

Solution 1

SSE instructions are processor specific. You can look up which processor supports which SSE version on wikipedia.
If SSE code will be faster or not depends on many factors: The first is of course whether the problem is memory-bound or CPU-bound. If the memory bus is the bottleneck SSE will not help much. Try simplifying your integer calculations, if that makes the code faster, it's probably CPU-bound, and you have a good chance of speeding it up.
Be aware that writing SIMD-code is a lot harder than writing C++-code, and that the resulting code is much harder to change. Always keep the C++ code up to date, you'll want it as a comment and to check the correctness of your assembler code.
Think about using a library like the IPP, that implements common low-level SIMD operations optimized for various processors.

Solution 2

SIMD, of which SSE is an example, allows you to do the same operation on multiple chunks of data. So, you won't get any advantage to using SSE as a straight replacement for the integer operations, you will only get advantages if you can do the operations on multiple data items at once. This involves loading some data values that are contiguous in memory, doing the required processing and then stepping to the next set of values in the array.

Problems:

1 If the code path is dependant on the data being processed, SIMD becomes much harder to implement. For example:

a = array [index];
a &= mask;
a >>= shift;
if (a < somevalue)
{
  a += 2;
  array [index] = a;
}
++index;

is not easy to do as SIMD:

a1 = array [index] a2 = array [index+1] a3 = array [index+2] a4 = array [index+3]
a1 &= mask         a2 &= mask           a3 &= mask           a4 &= mask
a1 >>= shift       a2 >>= shift         a3 >>= shift         a4 >>= shift
if (a1<somevalue)  if (a2<somevalue)    if (a3<somevalue)    if (a4<somevalue)
  // help! can't conditionally perform this on each column, all columns must do the same thing
index += 4

2 If the data is not contigous then loading the data into the SIMD instructions is cumbersome

3 The code is processor specific. SSE is only on IA32 (Intel/AMD) and not all IA32 cpus support SSE.

You need to analyse the algorithm and the data to see if it can be SSE'd and that requires knowing how SSE works. There's plenty of documentation on Intel's website.

Solution 3

This kind of problem is a perfect example of where a good low level profiler is essential. (Something like VTune) It can give you a much more informed idea of where your hotspots lie.

My guess, from what you describe is that your hotspot will probably be branch prediction failures resulting from min/max calculations using if/else. Therefore, using SIMD intrinsics should allow you to use the min/max instructions, however, it might be worth just trying to use a branchless min/max caluculation instead. This might achieve most of the gains with less pain.

Something like this:

inline int 
minimum(int a, int b)
{
  int mask = (a - b) >> 31;
  return ((a & mask) | (b & ~mask));
}

Solution 4

If you use SSE instructions, you're obviously limited to processors that support these. That means x86, dating back to the Pentium 2 or so (can't remember exactly when they were introduced, but it's a long time ago)

SSE2, which, as far as I can recall, is the one that offers integer operations, is somewhat more recent (Pentium 3? Although the first AMD Athlon processors didn't support them)

In any case, you have two options for using these instructions. Either write the entire block of code in assembly (probably a bad idea. That makes it virtually impossible for the compiler to optimize your code, and it's very hard for a human to write efficient assembler).

Alternatively, use the intrinsics available with your compiler (if memory serves, they're usually defined in xmmintrin.h)

But again, the performance may not improve. SSE code poses additional requirements of the data it processes. Mainly, the one to keep in mind is that data must be aligned on 128-bit boundaries. There should also be few or no dependencies between the values loaded into the same register (a 128 bit SSE register can hold 4 ints. Adding the first and the second one together is not optimal. But adding all four ints to the corresponding 4 ints in another register will be fast)

It may be tempting to use a library that wraps all the low-level SSE fiddling, but that might also ruin any potential performance benefit.

I don't know how good SSE's integer operation support is, so that may also be a factor that can limit performance. SSE is mainly targeted at speeding up floating point operations.

Solution 5

If you intend to use Microsoft Visual C++, you should read this:

http://www.codeproject.com/KB/recipes/sseintro.aspx

View more solutions

37,976

Author by

Kiran Kumar

Software developer from Bangalore, my area of interest is mainly C++. 33rd to C++ Gold Badge

Updated on February 28, 2020

Comments

Kiran Kumar about 4 years

I have a loop written in C++ which is executed for each element of a big integer array. Inside the loop, I mask some bits of the integer and then find the min and max values. I heard that if I use SSE instructions for these operations it will run much faster compared to a normal loop written using bitwise AND , and if-else conditions. My question is should I go for these SSE instructions? Also, what happens if my code runs on a different processor? Will it still work or these instructions are processor specific?