Load constant floats into SSE registers

Solution 1

If you want to force it to a single load, you could try (gcc):

__attribute__((aligned(16))) float vec[4] = { 1.0f, 1.1f, 1.2f, 1.3f };
__m128 v = _mm_load_ps(vec); // no "&" needed: vec already decays to an address

If you have Visual C++, use __declspec(align(16)) to request the proper constraint.

On my system, compiling this with gcc -m32 -msse -O2 produces the following assembly (gcc / AT&T syntax). Compiling with no optimization at all clutters the code but still ends in the single movaps:

    andl    $-16, %esp
    subl    $16, %esp
    movl    $0x3f800000, (%esp)
    movl    $0x3f8ccccd, 4(%esp)
    movl    $0x3f99999a, 8(%esp)
    movl    $0x3fa66666, 12(%esp)
    movaps  (%esp), %xmm0

Note that it aligns the stack pointer before allocating stack space and storing the constants there. Leaving the __attribute__((aligned)) out may, depending on your compiler, produce incorrect code that skips this alignment, so beware, and check the disassembly.

Additionally:
Since you asked how to put the constants into the code itself, simply try the above with a static qualifier on the float array. That produces the following assembly:

    movaps  vec.7330, %xmm0
    ...
vec.7330:
    .long   1065353216
    .long   1066192077
    .long   1067030938
    .long   1067869798
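
As a minimal sketch of the static variant described above, the array can be wrapped in a small function. The names ALIGN16, k_vec and load_constants are hypothetical and chosen only for illustration; the macro simply selects between the gcc and Visual C++ alignment spellings mentioned earlier.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps */

/* ALIGN16 is a hypothetical helper macro, not part of any library. */
#if defined(_MSC_VER)
#define ALIGN16 __declspec(align(16))
#else
#define ALIGN16 __attribute__((aligned(16)))
#endif

/* A static const array is placed in a data section once, instead of
   being rebuilt on the stack every time the function runs. */
static const ALIGN16 float k_vec[4] = { 1.0f, 1.1f, 1.2f, 1.3f };

/* Loads all four constants with one aligned load.
   _mm_load_ps requires its argument to be 16-byte aligned. */
__m128 load_constants(void)
{
    return _mm_load_ps(k_vec);
}
```

With optimization enabled this would be expected to compile to a single movaps from the constant's address, but check the disassembly for your own compiler.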

Solution 2

First off, what optimization level are you compiling at? It's not uncommon to see that sort of codegen at -O0 or -O1, but I would be quite surprised to see it with -O2 or higher in most compilers.

Second, there are no immediate loads in SSE. You can load an immediate into a GPR and then move that value to an SSE register, but you cannot conjure other values without an actual load (ignoring certain special values like 0 or (int)-1, which can be produced via logical operations).
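
The two special values mentioned above can be sketched with intrinsics as follows. The function names make_zero and make_all_ones are hypothetical; the exact instructions emitted depend on the compiler, though xorps/pxor and pcmpeqd are the usual choices.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_cmpeq_epi32 */

/* All bits clear. Compilers typically emit a register-only
   xorps/pxor for this, with no memory access. */
__m128 make_zero(void)
{
    return _mm_setzero_ps();
}

/* All bits set, i.e. (int)-1 in every 32-bit lane. Comparing a
   register with itself for equality is true in every lane. */
__m128i make_all_ones(void)
{
    __m128i z = _mm_setzero_si128();
    return _mm_cmpeq_epi32(z, z);
}
```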

Finally, if the bad code is being generated with optimizations turned on and in a performance-critical location, please file a bug against your compiler.

Solution 3

Normally constants such as this would be loaded prior to any loops or "hot" parts of the code, so performance should not be that important. But if you can't avoid doing this kind of thing inside a loop then I would try _mm_set_ps first and see what that generates. Also try ICC rather than gcc, as it tends to generate better code.
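
For reference, a small sketch of the _mm_set_ps approach. One detail that is easy to get wrong is the argument order: _mm_set_ps takes the highest lane first, while _mm_setr_ps takes the elements in memory order. The function names consts_set and consts_setr are hypothetical.

```c
#include <xmmintrin.h>  /* SSE: _mm_set_ps, _mm_setr_ps */

/* _mm_set_ps lists elements from the highest lane down to the
   lowest, so the last argument ends up in lane 0. */
__m128 consts_set(void)
{
    return _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
}

/* _mm_setr_ps ("reversed") lists elements in memory order, so the
   first argument ends up in lane 0. */
__m128 consts_setr(void)
{
    return _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
}
```

Both functions produce the same register contents; what code the compiler generates for them (a single load or several scalar moves) is exactly what should be checked in the disassembly.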

Solution 4

Generating constants is much simpler (and quicker) if the four float constants are the same. For example, the bit pattern for 1.0f is 0x3f800000. One way to generate it using SSE2 is:

        register __m128i onef;
        __asm__ ( "pcmpeqb %0, %0" : "=x" ( onef ) );
        onef = _mm_slli_epi32( onef, 25 );
        onef = _mm_srli_epi32( onef, 2 );

Another approach, with SSE4.1, is:

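        register uint32_t t = 0x3f800000;
        register __m128i onef;  /* integer vector type, as required by _mm_shuffle_epi32 */
        /* gcc inline asm defaults to AT&T syntax: immediate first, destination last */
        __asm__ ( "pinsrd $0, %1, %0" : "=x" ( onef ) : "r" ( t ) );
        onef = _mm_shuffle_epi32( onef, 0 );  /* broadcast lane 0 to all four lanes */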

Note that I'm not positive that this version is any faster than the SSE2 one; I have not profiled it, only tested that the result was correct.

If the values of each of the four floats must be different, then each of the constants can be generated and shuffled or blended together.

Whether or not this is useful depends on whether a cache miss is likely; otherwise, loading the constant from memory is quicker. Tricks like this are very helpful on VMX/AltiVec, but the large caches on most PCs may make them less useful for SSE.

There is a good discussion of this in Agner Fog's Optimization Manual, book 2, section 13.4, http://www.agner.org/optimize/.

Final note: the use of inline assembler above is gcc-specific. The reason for it is to allow the use of uninitialized variables without generating a compiler warning. With VC, you may or may not need to first initialize the variables with _mm_setzero_ps(), then hope that the optimizer can remove this.
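
An intrinsics-only variant of the SSE2 trick, which avoids inline assembler by starting from an explicitly zeroed register, might look like the sketch below. The function name splat_one is hypothetical, and this has not been profiled against a plain load.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics and casts */

/* Builds 1.0f (bit pattern 0x3f800000) in all four lanes
   without reading a constant from memory in the source. */
__m128 splat_one(void)
{
    __m128i z = _mm_setzero_si128();     /* defined starting value  */
    __m128i v = _mm_cmpeq_epi32(z, z);   /* 0xFFFFFFFF in each lane */
    v = _mm_slli_epi32(v, 25);           /* 0xFE000000              */
    v = _mm_srli_epi32(v, 2);            /* 0x3F800000 == 1.0f      */
    return _mm_castsi128_ps(v);          /* reinterpret, no convert */
}
```

Note that an optimizing compiler is free to fold this whole sequence into a constant and load it from memory anyway, so the disassembly is the only reliable way to see what was generated.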

Author: coderdave

Professional console game developer

Updated on May 20, 2022

Comments

  • coderdave
    coderdave over 1 year

    I'm trying to figure out an efficient way to load compile time constant floats into SSE(2/3) registers. I've tried doing simple code like this,

    const __m128 x = { 1.0f, 2.0f, 3.0f, 4.0f }; 
    

    but that generates 4 movss instructions from memory!

    movss       xmm0,dword ptr [__real@3f800000 (14048E534h)] 
    movss       xmm1,dword ptr [__real@40000000 (14048E530h)] 
    movaps      xmm6,xmm12 
    shufps      xmm6,xmm12,0C6h 
    movss       dword ptr [rsp],xmm0 
    movss       xmm0,dword ptr [__real@40400000 (14048E52Ch)] 
    movss       dword ptr [rsp+4],xmm1 
    movss       xmm1,dword ptr [__real@40a00000 (14048E528h)] 
    

    which load the scalars in and out of memory... (?!?!)

    Doing this though..

    float Align(16) myfloat4[4] = { 1.0f, 2.0f, 3.0f, 4.0f, }; // out in global scope
    

    generates.

    movaps      xmm5,xmmword ptr [::myarray4 (140512050h)]
    

    Ideally, if I have constants, there would be a way to avoid touching memory at all and just do it with immediate-style instructions (e.g. the constants compiled into the instruction itself).

    Thanks

  • coderdave
    coderdave over 12 years
    I'm using visual studio and _mm_set_ps is generating more movss. I think the visual studio compiler is just pretty terrible.
  • coderdave
    coderdave over 12 years
    I am most certainly compiling at -O2, so it would seem that Visual Studio's code generation is bad. As I do more research, it seems this is the consensus, and most people don't use VC for SSE and just use assembly or another compiler.
  • Stephen Canon
    Stephen Canon over 12 years
    @coderdave: Please file a bug against VS, then. The only way that MS will know that they should devote resources to the problem is if people complain about it.
  • Paul R
    Paul R over 12 years
    @coderdave: yes Visual Studio generates pretty bad SSE code - it's also a pain to use for SSE as it has all sorts of stupid ABI restrictions and other annoyances - use gcc or better yet ICC if you can
  • FrankH.
    FrankH. over 12 years
    While SSE doesn't do immediate loads (except for tricks like pxor for zeros or pcmpeq for ones) it does loads from memory, _mm_load_ps(), so there's nothing wrong with creating an array on the stack and loading the SSE register from there.