Should I use SIMD or vector extensions or something else?

c++ gcc sse simd

18,142

Solution 1

Well, if you want to use SIMD extensions, a good approach is to use SSE intrinsics (of course stay by all means away from inline assembly, but fortunately you didn't list it as alternative, anyway). But for cleanliness you should encapsulate them in a nice vector class with overloaded operators:

struct aligned_storage
{
    //overload new and delete for 16-byte alignment
};

class vec4 : public aligned_storage
{
public:
    vec4(float x, float y, float z, float w)
    {
         data_[0] = x; ... data_[3] = w; //don't use _mm_set_ps, it will do the same, followed by a _mm_load_ps, which is unneccessary
    }
    vec4(float *data)
    {
         data_[0] = data[0]; ... data_[3] = data[3]; //don't use _mm_loadu_ps, unaligned just doesn't pay
    }
    vec4(const vec4 &rhs)
        : xmm_(rhs.xmm_)
    {
    }
    ...
    vec4& operator*=(const vec4 v)
    {
         xmm_ = _mm_mul_ps(xmm_, v.xmm_);
         return *this;
    }
    ...

private:
    union
    {
        __m128 xmm_;
        float data_[4];
    };
};

Now the nice thing is, due to the anonymous union (UB, I know, but show me a platform with SSE where this doesn't work) you can use the standard float array whenever neccessary (like operator[] or initialization (don't use _mm_set_ps)) and only use SSE when appropriate. With a modern inlining compiler the encapsulation comes at probably no cost (I was rather surprised how well VC10 optimized the SSE instructions for a bunch of computations with this vector class, no fear of unneccessary moves into temporary memory variables, as VC8 seemed to like even without encapsulation).

The only disadvantage is, that you need to take care of proper alignment, as unaligned vectors don't buy you anything and may even be slower than non-SSE. But fortunately the alignment requirement of the __m128 will propagate into the vec4 (and any surrounding class) and you just need to take care of dynamic allocation, which C++ has good means for. You just need to make a base class whose operator new and operator delete functions (in all flavours of course) are overloaded properly and from which your vector class will derive. To use your type with standard containers you of course also need to specialize std::allocator (and maybe std::get_temporary_buffer and std::return_temporary_buffer for the sake of completeness), as it will use the global operator new otherwise.

But the real disadvantage is, that you need to also care for the dynamic allocation of any class that has your SSE vector as member, which may be tedious, but can again be automated a bit by also deriving those classes from aligned_storage and putting the whole std::allocator specialization mess into a handy macro.

JamesWynn has a point that those operations often come together in some special heavy computation blocks (like texture filtering or vertex transformation), but on the other hand using those SSE vector encapsulations doesn't introduce any overhead over a standard float[4]-implementation of a vector class. You need to get those values from memory into registers anyway (be it the x87 stack or a scalar SSE register) in order to do any computations, so why not take em all at once (which should IMHO not be any slower than moving a single value if properly aligned) and compute in parallel. Thus you can freely switch out an SSE-inplementation for a non-SSE one without inducing any overhead (correct me if my reasoning is wrong).

But if the ensuring of alignment for all classes having vec4 as member is too tedious for you (which is IMHO the only disadvantage of this approach), you can also define a specialized SSE-vector type which you use for computations and use a standard non-SSE vector for storage.

EDIT: Ok, to look at the overhead argument, that goes around here (and looks quite reasonable at first), let's take a bunch of computations, which look very clean, due to overloaded operators:

#include "vec.h"
#include <iostream>

int main(int argc, char *argv[])
{
    math::vec<float,4> u, v, w = u + v;
    u = v + dot(v, w) * w;
    v = abs(u-w);
    u = 3.0f * w + v;
    w = -w * (u+v);
    v = min(u, w) + length(u) * w;
    std::cout << v << std::endl;
    return 0;
}

and see what VC10 thinks about it:

...
; 6   :     math::vec<float,4> u, v, w = u + v;

movaps  xmm4, XMMWORD PTR _v$[esp+32]

; 7   :     u = v + dot(v, w) * w;
; 8   :     v = abs(u-w);

movaps  xmm3, XMMWORD PTR __xmm@0
movaps  xmm1, xmm4
addps   xmm1, XMMWORD PTR _u$[esp+32]
movaps  xmm0, xmm4
mulps   xmm0, xmm1
haddps  xmm0, xmm0
haddps  xmm0, xmm0
shufps  xmm0, xmm0, 0
mulps   xmm0, xmm1
addps   xmm0, xmm4
subps   xmm0, xmm1
movaps  xmm2, xmm3

; 9   :     u = 3.0f * w + v;
; 10   :    w = -w * (u+v);

xorps   xmm3, xmm1
andnps  xmm2, xmm0
movaps  xmm0, XMMWORD PTR __xmm@1
mulps   xmm0, xmm1
addps   xmm0, xmm2

; 11   :    v = min(u, w) + length(u) * w;

movaps  xmm1, xmm0
mulps   xmm1, xmm0
haddps  xmm1, xmm1
haddps  xmm1, xmm1
sqrtss  xmm1, xmm1
addps   xmm2, xmm0
mulps   xmm3, xmm2
shufps  xmm1, xmm1, 0

; 12   :    std::cout << v << std::endl;

mov edi, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
mulps   xmm1, xmm3
minps   xmm0, xmm3
addps   xmm1, xmm0
movaps  XMMWORD PTR _v$[esp+32], xmm1
...

Even without thoroughly analyzing every single instruction and its use, I'm pretty confident to say that there aren't any unneccessary loads or stores, except the ones at the beginning (Ok, I left them uninitialized), which are neccessary anyway to get them from memory into computing registers, and at the end, which is neccessary as in the following expression v is gonna be put out. It didn't even store anything back into u and w, since they are only temporary variables which I don't use any further. Everything is perfectly inlined and optimized out. It even managed to seamlessly shuffle the result of the dot product for the following multiplication, without it leaving the XMM register, although the dot function returns a float using an actual _mm_store_ss after the haddpss.

So even I, being usually a bit oversuspicios of the compiler's abilities, have to say, that handcrafting your own intrinsics into special functions doesn't really pay compared to the clean and expressive code you gain by encapsulation. Though you may be able to create killer examples where handrafting the intrinics may indeed spare you some few instructions, but then again you first have to outsmart the optimizer.

EDIT: Ok, Ben Voigt pointed out another problem of the union besides the (most probably not problematic) memory layout incompatibility, which is that it is violating strict aliasing rules and the compiler may optimize instructions accessing different union members in a way that makes the code invalid. I haven't thought about that yet. I don't know if it makes any problems in practice, it certainly needs investigation.

If it really is a problem, we unfortunately need to drop the data_[4] member and use the __m128 alone. For initialization we now have to resort to _mm_set_ps and _mm_loadu_ps again. The operator[] gets a bit more complicated and might need some combination of _mm_shuffle_ps and _mm_store_ss. But for the non-const version you have to use some kind of proxy object delegating an assignment to the corresponding SSE instructions. It has to be investigated in which way the compiler can optimize this additional overhead in the specific situations then.

Or you only use the SSE-vector for computations and just make an interface for conversion to and from non-SSE vectors at a whole, which is then used at the peripherals of computations (as you often don't need to access individual components inside lengthy computations). This seems to be the way glm handles this issue. But I'm not sure how Eigen handles it.

But however you tackle it, there is still no need to handcraft SSE instrisics without using the benefits of operator overloading.

Solution 2

I suggest that you learn about expression templates (custom operator implementations that use proxy objects). In this way, you can avoid doing performance-killing load/store around each individual operation, and do them only once for the entire computation.

Solution 3

I would suggest using the naked simd code in a tightly controlled function. Since you won't be using it for your primary vector multiplication because of the overhead, this function should probably take the list of Vector3 objects that need to be manipulated, as per DOD. Where there's one, there is many.

18,142

pearcoding

Hi, my name is Ömercan Yazici and I am a student from Germany. I am the founder of the Pear3DEngine project, the The great Journey game and some other little projects. My hobbies include programming, reading, cycling, reading again, walking, thinking, writing, and again reading, physics, mathematics, biology, philosophy, politics and of course reading. I also write novels... See my webpage or visit me on Twitter ;) #include <iostream> #include <string> int main() { std::string name = "Ömercan Yazici"; std::cout << "Hello World! Here is " << name << "! :D" << std::endl; return 0; }

Updated on September 15, 2022

Comments

pearcoding over 1 year
I'm currently develop an open source 3D application framework in c++ (with c++11). My own math library is designed like the XNA math library, also with SIMD in mind. But currently it is not really fast, and it has problems with memory alignes, but more about that in a different question.

Some days ago I asked myself why I should write my own SSE code. The compiler is also able to generate high optimized code when optimization is on. I can also use the "vector extension" of GCC. But this all is not really portable.

I know that I have more control when I use my own SSE code, but often this control is unnessary.

One big problem of SSE is the use of dynamic memory which is, with the help of memory pools and data oriented design, as much as possible limited.

Now to my question:
- Should I use naked SSE? Perhaps encapsulated.
```
__m128 v1 = _mm_set_ps(0.5f, 2, 4, 0.25f);
__m128 v2 = _mm_set_ps(2, 0.5f, 0.25f, 4);

__m128 res = _mm_mul_ps(v1, v2);
```
- Or should the compiler do the dirty work?
```
float v1 = {0.5f, 2, 4, 0.25f};
float v2 = {2, 0.5f, 0.25f, 4};

float res[4];
res[0] = v1[0]*v2[0];
res[1] = v1[1]*v2[1];
res[2] = v1[2]*v2[2];
res[3] = v1[3]*v2[3];
```
- Or should I use SIMD with additional code? Like a dynamic container class with SIMD operations, which needs additional load and store instructions.
```
Pear3D::Vector4f* v1 = new Pear3D::Vector4f(0.5f, 2, 4, 0.25f);
Pear3D::Vector4f* v2 = new Pear3D::Vector4f(2, 0.5f, 0.25f, 4);

Pear3D::Vector4f res = Pear3D::Vector::multiplyElements(*v1, *v2);
```
  The above example use a imaginary class with uses float[4] internal and uses store and load in each methods like multiplyElements(...). The methods uses SSE internal.
I don't want to use another library, because I want to learn more about SIMD and large scale software design. But library examples are welcome.

PS: This is not a real problem more a design question.
- Vlad almost 12 years
  
  Why not be lazy and let the compiler do the optimizations, if it is possible?
- Christian Rau almost 12 years
  
  Well, definitely not the 3rd one, at least not with dynamically allocated memory for something that small as vec4 (this is C++ and not Java). You may encapsulate the __m128 into a class to propagate its alignment restrictions (of course you have to take care of dynamic allocation by overloading operator new and specializing std::allocator), but don't ever use dynamic memory allocation for something that simple as a single vec4. This will outweight any possible gain from SSE by a factor of two billion (exaggeration intended).
- BЈовић almost 12 years
  
  @Vlad The compiler doesn't always do it properly. That is when you need to use sseX functions
- Walter almost 12 years
  
  my experience is: the compilers cannot be relied upon, in particular they cannot change your data layout (with its immediate implications on its usability with SSE instructions), so you must design your application carefully and avoid un-aligned data. I prefer method 1, but possibly encapsulated properly.
- Mysticial almost 12 years
  
  I will go further to say that current compilers are almost useless for auto-vectorization. They fail to vectorize many things. And whatever they can vectorize tend to be very simple loops which are likely memory bound. Part of the problem is that they don't have the "big picture" information to do the necessary transformations for vectorization.
pearcoding almost 12 years

+1 So I can use a function like pear3d_multiply3fv(float* vs, int count) :) But it is very hard to center many multiplications on one point... well I should rethink the design but with DOD it shouldn't be very hard.
pearcoding almost 12 years

+1 thanks for the answer. I already use intrinsics but don't use real encapsulation class with overloaded operators. In one book (I don't really remember, maybe Real Time Rendering by Tomas Akenine-Möller, Eric Haines, and Naty Hoffman) this was not really suggested because of the much overhead. I think I should redesign the application and center many math-heavy algorithms in one point :)
Christian Rau almost 12 years

@omercan1993 Check it out. Make some of those vectors and do a bunch of computations with them. I would be surprised if it wouldn't result in just a bunch of SSE instructions without any unneccessary loads or stores (let aside function calls). I don't think a recent enough gcc is in any way worse than VC10 in this regard. Of course at the start of those computation blocks you probably have some loads, but those are there for non-SSE vectors or hand-written intrinsics anyway.
pearcoding almost 12 years

I will try it and post the result here :)
Ben Voigt almost 12 years

@omercan1993: Yes, that's exactly what I'm talking about.
Christian Rau almost 12 years

Well, the compiler (my VC10 at least) is actually quite good at inlining and optimizing those computations to the bare SSE instructions without incurring unneccessary loads and stores (see my answer for a small (and maybe simple, but IMHO rather common) example). Though it may still be possible to create some killing example computations (for which you need a very good ET implementation, too).
Christian Rau almost 12 years

JamesWynn There isn't really any ovrhead when doing all computations with the SSE-implemented vector (see my answer), so I don't think moving everything into a seperate function with handcrafted intrinsics really pays. But you're right in that the major use of SSE code is indeed in some special compuation-heavy blocks.
Ben Voigt almost 12 years

@ChristianRau: I don't know whether your code works in practice or not, but you're well into the land of undefined behavior by using a union to setup your MMX variables instead of the load and store intrinsics.
Ben Voigt almost 12 years

You're violating strict aliasing here... can you provide any documentation that allows this particular case? The problem isn't that the memory layout may be incompatible (also it could be on some unusual platform), it's that the compiler is free to optimize each field of the union separately and perform reordering that breaks your code.
Christian Rau almost 12 years

@BenVoigt Yes, I know it's UB, but c'mon, we're deep in the bowels of hardware with SSE. Just show me an SSE-enabled platforms where this union-trick doesn't work. Of course you're on the safe side of platform-independence when completely dropping SSE anyway, but well.
Ben Voigt almost 12 years

@Christian: Say you read data_[0] and store it into an integer. Then do some MMX operations, and convert data_[0] to an integer again. Because you haven't written to any float variable between the two reads, the optimizer is perfectly Standard-compliant if it reuses the result of the original conversion, instead of generating a second access to data_[0] and converting the new value.
Christian Rau almost 12 years

@BenVoigt Ok, haven't thought about aliasing yet, I have to admit. Needs further investigation. But ok, the data_ alias is usually only used at the peripherals anyway and could be removed. For setting one may need to retreat to _mm_set_ps, though getting is more problematic, probably a _mm_shuffle_ps followed by an _mm_store_ss. But the fast computation stays.