AVX2 what is the most efficient way to pack left based on a mask?

Solution 1

AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64bit builds.)

We can use AVX2 vpermps (_mm256_permutevar8x32_ps) (or the integer equivalent, vpermd) to do a lane-crossing variable-shuffle.

We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we need.

Beware that pdep/pext are very slow on AMD CPUs before Zen 3, like 6 uops / 18 cycle latency and throughput on Ryzen Zen 1 and Zen 2. This implementation will perform horribly on those AMD CPUs. For AMD, you might be best with 128-bit vectors using a pshufb or vpermilps LUT, or some of the AVX2 variable-shift suggestions discussed in comments. Especially if your mask input is a vector mask (not an already packed bitmask from memory).

AMD before Zen2 only has 128-bit vector execution units anyway, and 256-bit lane-crossing shuffles are slow. So 128-bit vectors are very attractive for this on Zen 1. But Zen 2 has 256-bit load/store and execution units. (And still slow microcoded pext/pdep.)


For integer vectors with 32-bit or wider elements: either 1) use _mm256_movemask_ps(_mm256_castsi256_ps(compare_mask)),
or 2) use _mm256_movemask_epi8 and then change the first PDEP constant from 0x0101010101010101 to 0x0F0F0F0F0F0F0F0F to scatter blocks of 4 contiguous bits. Change the multiply by 0xFFU into expanded_mask |= expanded_mask<<4; or expanded_mask *= 0x11; (not tested; see the sketch below). Either way, use the shuffle mask with VPERMD instead of VPERMPS.
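A minimal sketch of option 2 (my code, untested as noted above; it assumes the compare mask is a full vector with all-ones elements for the lanes to keep, and a 64-bit build for _pdep_u64):

#include <stdint.h>
#include <immintrin.h>

// Option 2: drive the shuffle-mask generation from vpmovmskb instead of vmovmskps.
__m256i compress256_epi32(__m256i src, __m256i compare_mask)
{
  unsigned int mask8 = _mm256_movemask_epi8(compare_mask);        // 4 identical bits per 32-bit element
  uint64_t expanded_mask = _pdep_u64(mask8, 0x0F0F0F0F0F0F0F0F);  // scatter each 4-bit block into the low nibble of a byte
  expanded_mask *= 0x11;              // 0000abcd -> abcdabcd: fill each byte with its element's mask bits
  const uint64_t identity_indices = 0x0706050403020100;           // identity shuffle, one index per byte
  uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);
  __m128i bytevec  = _mm_cvtsi64_si128(wanted_indices);
  __m256i shufmask = _mm256_cvtepu8_epi32(bytevec);
  return _mm256_permutevar8x32_epi32(src, shufmask);              // vpermd
}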

For 64-bit integer or double elements, everything still Just Works; The compare-mask just happens to always have pairs of 32-bit elements that are the same, so the resulting shuffle puts both halves of each 64-bit element in the right place. (So you still use VPERMPS or VPERMD, because VPERMPD and VPERMQ are only available with immediate control operands.)

For 16-bit elements, you might be able to adapt this with 128-bit vectors.

For 8-bit elements, see Efficient sse shuffle mask generation for left-packing byte elements for a different trick, storing the result in multiple possibly-overlapping chunks.


The algorithm:

Start with a constant of packed 3 bit indices, with each position holding its own index. i.e. [ 7 6 5 4 3 2 1 0 ] where each element is 3 bits wide. 0b111'110'101'...'010'001'000.

Use pext to extract the indices we want into a contiguous sequence at the bottom of an integer register. e.g. if we want indices 0 and 2, our control-mask for pext should be 0b000'...'111'000'111. pext will grab the 010 and 000 index groups that line up with the 1 bits in the selector. The selected groups are packed into the low bits of the output, so the output will be 0b000'...'010'000. (i.e. [ ... 2 0 ])
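As a standalone sanity check of that example (a snippet of mine, not part of the original answer; compile with BMI2 enabled):

#include <assert.h>
#include <stdint.h>
#include <immintrin.h>   // _pext_u32 (BMI2)

int main(void) {
    uint32_t identity = 0xFAC688;   // 0b111'110'101'100'011'010'001'000: 3-bit groups 7..0
    uint32_t selector = 0x1C7;      // 0b111'000'111: groups 2 and 0 selected
    assert(_pext_u32(identity, selector) == 0x10);   // 0b010'000, i.e. [ 2 0 ]
    return 0;
}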

See the commented code for how to generate the 0b111000111 input for pext from the input vector mask.

Now we're in the same boat as the compressed-LUT: unpack up to 8 packed indices.

By the time you put all the pieces together, there are three total pext/pdeps. I worked backwards from what I wanted, so it's probably easiest to understand it in that direction, too. (i.e. start with the shuffle line, and work backward from there.)

We can simplify the unpacking if we work with indices one per byte instead of in packed 3-bit groups. Since we have 8 indices, this is only possible with 64bit code.

See this and a 32bit-only version on the Godbolt Compiler Explorer. I used #ifdefs so it compiles optimally with -m64 or -m32. gcc wastes some instructions, but clang makes really nice code.

#include <stdint.h>
#include <immintrin.h>

// Uses 64bit pdep / pext to save a step in unpacking.
__m256 compress256(__m256 src, unsigned int mask /* from movmskps */)
{
  uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101);  // unpack each bit to a byte
  expanded_mask *= 0xFF;    // mask |= mask<<1 | mask<<2 | ... | mask<<7;
  // ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte

  const uint64_t identity_indices = 0x0706050403020100;    // the identity shuffle for vpermps, packed to one index per byte
  uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);

  __m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
  __m256i shufmask = _mm256_cvtepu8_epi32(bytevec);

  return _mm256_permutevar8x32_ps(src, shufmask);
}

This compiles to code with no loads from memory, only immediate constants. (See the godbolt link for this and the 32bit version).

    # clang 3.7.1 -std=gnu++14 -O3 -march=haswell
    mov     eax, edi                   # just to zero extend: goes away when inlining
    movabs  rcx, 72340172838076673     # The constants are hoisted after inlining into a loop
    pdep    rax, rax, rcx              # ABC       -> 0000000A0000000B....
    imul    rax, rax, 255              # 0000000A0000000B.. -> AAAAAAAABBBBBBBB..
    movabs  rcx, 506097522914230528
    pext    rax, rcx, rax
    vmovq   xmm1, rax
    vpmovzxbd       ymm1, xmm1         # 3c latency since this is lane-crossing
    vpermps ymm0, ymm1, ymm0
    ret

(Later clang compiles like GCC, with mov/shl/sub instead of imul, see below.)

So, according to Agner Fog's numbers and https://uops.info/, this is 6 uops (not counting the constants, or the zero-extending mov that disappears when inlined). On Intel Haswell, it's 16c latency (1 for vmovq, 3 for each pdep/imul/pext / vpmovzx / vpermps). There's no instruction-level parallelism. In a loop where this isn't part of a loop-carried dependency, though, (like the one I included in the Godbolt link), the bottleneck is hopefully just throughput, keeping multiple iterations of this in flight at once.

This can maybe manage a throughput of one per 4 cycles, bottlenecked on port1 for pdep/pext/imul plus popcnt in the loop. Of course, with loads/stores and other loop overhead (including the compare and movmsk), total uop throughput can easily be an issue, too.

e.g. the filter loop in my godbolt link is 14 uops with clang, with -fno-unroll-loops to make it easier to read. It might sustain one iteration per 4c, keeping up with the front-end, if we're lucky.

clang 6 and earlier created a loop-carried dependency with popcnt's false dependency on its output, so it will bottleneck on 3/5ths of the latency of the compress256 function. clang 7.0 and later use xor-zeroing to break the false dependency (instead of just using popcnt edx,edx or something like GCC does :/).

gcc (and later clang) does the multiply by 0xFF with multiple instructions, using a left shift by 8 and a sub, instead of imul by 255. This takes 3 total uops vs. 1 for the front-end, but the latency is only 2 cycles, down from 3. (Haswell handles mov at register-rename stage with zero latency.) Most significantly for this, imul can only run on port 1, competing with pdep/pext/popcnt, so it's probably good to avoid that bottleneck.


Since all hardware that supports AVX2 also supports BMI2, there's probably no point providing a version for AVX2 without BMI2.

If you need to do this in a very long loop, the LUT is probably worth it if the initial cache-misses are amortized over enough iterations with the lower overhead of just unpacking the LUT entry. You still need to movmskps, so you can popcnt the mask and use it as a LUT index, but you save a pdep/imul/pext.

You can unpack LUT entries with the same integer sequence I used, but @Froglegs's set1() / vpsrlvd / vpand is probably better when the LUT entry starts in memory and doesn't need to go into integer registers in the first place. (A 32bit broadcast-load doesn't need an ALU uop on Intel CPUs). However, a variable-shift is 3 uops on Haswell (but only 1 on Skylake).

Solution 2

See my other answer for AVX2+BMI2 with no LUT.

Since you mention a concern about scalability to AVX512: don't worry, there's an AVX512F instruction for exactly this:

VCOMPRESSPS — Store Sparse Packed Single-Precision Floating-Point Values into Dense Memory. (There are also versions for double (vcompresspd), and for 32 or 64-bit integer elements (vpcompressd / vpcompressq), but not for byte or word (16-bit) elements). It's like BMI2 pdep / pext, but for vector elements instead of bits in an integer reg.

The destination can be a vector register or a memory operand, while the source is a vector and a mask register. With a register dest, it can merge or zero the upper bits. With a memory dest, "Only the contiguous vector is written to the destination memory location".

To figure out how far to advance your pointer for the next vector, popcnt the mask.

Let's say you want to filter out everything but values >= 0 from an array:

#include <stdint.h>
#include <immintrin.h>
size_t filter_non_negative(float *__restrict__ dst, const float *__restrict__ src, size_t len) {
    const float *endp = src+len;
    float *dst_start = dst;
    do {
        __m512      sv  = _mm512_loadu_ps(src);
        __mmask16 keep = _mm512_cmp_ps_mask(sv, _mm512_setzero_ps(), _CMP_GE_OQ);  // true for src >= 0.0, false for unordered and src < 0.0
        _mm512_mask_compressstoreu_ps(dst, keep, sv);   // clang is missing this intrinsic, which can't be emulated with a separate store

        src += 16;
        dst += _mm_popcnt_u64(keep);   // popcnt_u64 instead of u32 helps gcc avoid a wasted movsx, but is potentially slower on some CPUs
    } while (src < endp);
    return dst - dst_start;
}

This compiles (with gcc4.9 or later) to (Godbolt Compiler Explorer):

 # Output from gcc6.1, with -O3 -march=haswell -mavx512f.  Same with other gcc versions
    lea     rcx, [rsi+rdx*4]             # endp
    mov     rax, rdi
    vpxord  zmm1, zmm1, zmm1             # vpxor  xmm1, xmm1,xmm1 would save a byte, using VEX instead of EVEX
.L2:
    vmovups zmm0, ZMMWORD PTR [rsi]
    add     rsi, 64
    vcmpps  k1, zmm0, zmm1, 29           # AVX512 compares have mask regs as a destination
    kmovw   edx, k1                      # There are some insns to add/or/and mask regs, but not popcnt
    movzx   edx, dx                      # gcc is dumb and doesn't know that kmovw already zero-extends to fill the destination.
    vcompressps     ZMMWORD PTR [rax]{k1}, zmm0
    popcnt  rdx, rdx
    ## movsx   rdx, edx         # with _popcnt_u32, gcc is dumb.  No casting can get gcc to do anything but sign-extend.  You'd expect (unsigned) would mov to zero-extend, but no.
    lea     rax, [rax+rdx*4]             # dst += ...
    cmp     rcx, rsi
    ja      .L2

    sub     rax, rdi
    sar     rax, 2                       # address math -> element count
    ret

Performance: 256-bit vectors may be faster on Skylake-X / Cascade Lake

In theory, a loop that loads a bitmap and filters one array into another should run at 1 vector per 3 clocks on SKX / CSLX, regardless of vector width, bottlenecked on port 5. (kmovb/w/d/q k1, eax runs on p5, and vcompressps into memory is 2p5 + a store, according to IACA and to testing by http://uops.info/).

@ZachB reports in comments that in practice, that a loop using ZMM _mm512_mask_compressstoreu_ps is slightly slower than _mm256_mask_compressstoreu_ps on real CSLX hardware. (I'm not sure if that was a microbenchmark that would allow the 256-bit version to get out of "512-bit vector mode" and clock higher, or if there was surrounding 512-bit code.)

I suspect misaligned stores are hurting the 512-bit version. vcompressps probably effectively does a masked 256 or 512-bit vector store, and if that crosses a cache line boundary then it has to do extra work. Since the output pointer is usually not a multiple of 16 elements, a full-line 512-bit store will almost always be misaligned.

Misaligned 512-bit stores may be worse than cache-line-split 256-bit stores for some reason, as well as happening more often; we already know that 512-bit vectorization of other things seems to be more alignment sensitive. That may just be from running out of split-load buffers when they happen every time, or maybe the fallback mechanism for handling cache-line splits is less efficient for 512-bit vectors.

It would be interesting to benchmark vcompressps into a register, with separate full-vector overlapping stores. That's probably the same uops, but the store can micro-fuse when it's a separate instruction. And if there's some difference between masked stores vs. overlapping stores, this would reveal it.
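A sketch of that experiment (my code, untested; it assumes dst has at least 64 bytes of writable padding past the true end, since every iteration stores a full vector):

#include <stdint.h>
#include <immintrin.h>

size_t filter_non_negative_overlap(float *__restrict__ dst, const float *__restrict__ src, size_t len) {
    const float *endp = src + len;
    float *dst_start = dst;
    do {
        __m512    sv     = _mm512_loadu_ps(src);
        __mmask16 keep   = _mm512_cmp_ps_mask(sv, _mm512_setzero_ps(), _CMP_GE_OQ);
        __m512    packed = _mm512_maskz_compress_ps(keep, sv);   // vcompressps into a register
        _mm512_storeu_ps(dst, packed);    // full-width store; the stale tail is overwritten by the next iteration
        src += 16;
        dst += _mm_popcnt_u64(keep);
    } while (src < endp);
    return dst - dst_start;
}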


Another idea discussed in comments below was using vpermt2ps to build up full vectors for aligned stores. This would be hard to do branchlessly, and branching when we fill a vector will probably mispredict unless the bitmask has a pretty regular pattern, or big runs of all-0 and all-1.

A branchless implementation with a loop-carried dependency chain of 4 or 6 cycles through the vector being constructed might be possible, with a vpermt2ps and a blend or something to replace it when it's "full". With an aligned vector store every iteration, but only moving the output pointer when the vector is full.

This is likely slower than vcompressps with unaligned stores on current Intel CPUs.

Solution 3

If you are targeting AMD Zen (before Zen 3), this method may be preferred, due to the very slow pdep and pext on those Ryzen CPUs (18 cycles each).

I came up with this method, which uses a compressed LUT of 768 (+1 padding) bytes instead of 8k. It requires broadcasting a single scalar value, which is then shifted by a different amount in each lane so that each lane's 3-bit index (0-7) ends up in its low bits (the explicit mask to 3 bits can be dropped, since vpermps only reads the low 3 bits).

Here is the intrinsics version, along with code to build LUT.

#include <cstdint>
#include <immintrin.h>

typedef std::uint8_t  u8;    // the answer's shorthand integer types
typedef std::uint32_t u32;

extern u8 g_pack_left_table_u8x3[256 * 3 + 1];   // defined below, filled by BuildPackMask()

//Generate Move mask via: _mm256_movemask_ps(_mm256_castsi256_ps(mask)); etc
__m256i MoveMaskToIndices(u32 moveMask) {
    u8 *adr = g_pack_left_table_u8x3 + moveMask * 3;
    __m256i indices = _mm256_set1_epi32(*reinterpret_cast<u32*>(adr));  // the lower 24 bits hold our LUT entry

    // Original version: shift left, then right, to isolate each 3-bit index:
    // __m256i m = _mm256_sllv_epi32(indices, _mm256_setr_epi32(29, 26, 23, 20, 17, 14, 11, 8));
    // __m256i shufmask = _mm256_srli_epi32(m, 29);

    // Simplified version suggested by wim:
    // shift each lane right so the desired 3 bits are at the bottom.
    // There is leftover data in the lane, but _mm256_permutevar8x32_ps only examines the low 3 bits, so this is ok.
    __m256i shufmask = _mm256_srlv_epi32(indices, _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21));
    return shufmask;
}

u32 get_nth_bits(int a) {
    u32 out = 0;
    int c = 0;
    for (int i = 0; i < 8; ++i) {
        auto set = (a >> i) & 1;
        if (set) {
            out |= (i << (c * 3));
            c++;
        }
    }
    return out;
}
u8 g_pack_left_table_u8x3[256 * 3 + 1];

void BuildPackMask() {
    for (int i = 0; i < 256; ++i) {
        *reinterpret_cast<u32*>(&g_pack_left_table_u8x3[i * 3]) = get_nth_bits(i);
    }
}
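A possible way to tie it together (my sketch, not part of the original post): call BuildPackMask() once at startup, then do the movemask / LUT lookup / vpermps:

__m256 LeftPackLUT(__m256 val, __m256 maskv) {
    u32 moveMask = _mm256_movemask_ps(maskv);
    __m256i shufmask = MoveMaskToIndices(moveMask);
    return _mm256_permutevar8x32_ps(val, shufmask);
}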

Here is the assembly generated by MSVC:

  lea ecx, DWORD PTR [rcx+rcx*2]
  lea rax, OFFSET FLAT:unsigned char * g_pack_left_table_u8x3 ; g_pack_left_table_u8x3
  vpbroadcastd ymm0, DWORD PTR [rcx+rax]
  vpsrlvd ymm0, ymm0, YMMWORD PTR __ymm@00000015000000120000000f0000000c00000009000000060000000300000000
  

Solution 4

This adds more information to a great answer from @PeterCordes: https://stackoverflow.com/a/36951611/5021064.

I used it to implement std::remove from the C++ standard library for integer types. The algorithm, once you can do compress, is relatively simple: load a register, compress, store. First I'm going to show the variations and then the benchmarks.

I ended up with two meaningful variations on the proposed solution:

  1. __m128i registers, any element type, using _mm_shuffle_epi8 instruction
  2. __m256i registers, element type of at least 4 bytes, using _mm256_permutevar8x32_epi32

When the types are smaller than 4 bytes for a 256-bit register, I split them into two 128-bit registers and compress/store each one separately.

Link to the compiler explorer where you can see the complete assembly (there is a using type and width (in elements per pack) at the bottom, which you can plug in to get different variations): https://gcc.godbolt.org/z/yQFR2t

NOTE: my code is in C++17 and uses custom simd wrappers, so I do not know how readable it is. If you want to read my code, most of it is behind the link in the top include on godbolt. Alternatively, all of the code is on github.

Implementations of @PeterCordes answer for both cases

Note: together with the mask, I also compute the number of elements remaining using popcount. Maybe there is a case where it's not needed, but I have not seen it yet.

Mask for _mm_shuffle_epi8

  1. Write an index for each byte into a half byte: 0xfedcba9876543210
  2. Get pairs of indexes into 8 shorts packed into __m128i
  3. Spread them out using (x << 4 | x) & 0x0f0f

Example of spreading the indexes. Let's say the 7th and 6th elements are picked. It means that the corresponding short would be 0x00fe. After << 4 and | we'd get 0x0ffe. And then we clear out the second f with & 0x0f0f, which leaves 0x0f0e.

Complete mask code:

// helper namespace
namespace _compress_mask {

// mmask - result of `_mm_movemask_epi8`, 
// `uint16_t` - there are at most 16 bits with values for __m128i. 
inline std::pair<__m128i, std::uint8_t> mask128(std::uint16_t mmask) {
    const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x1111111111111111) * 0xf;

    const std::uint8_t offset = 
        static_cast<std::uint8_t>(_mm_popcnt_u32(mmask));  // To compute how many elements were selected

    const std::uint64_t compressed_idxes = 
        _pext_u64(0xfedcba9876543210, mmask_expanded); // Do the @PeterCordes answer

    const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0...0|compressed_indexes
    const __m128i as_16bit = _mm_cvtepu8_epi16(as_lower_8byte);         // From bytes to shorts over the whole register
    const __m128i shift_by_4 = _mm_slli_epi16(as_16bit, 4);             // x << 4
    const __m128i combined = _mm_or_si128(shift_by_4, as_16bit);        // | x
    const __m128i filter = _mm_set1_epi16(0x0f0f);                      // 0x0f0f
    const __m128i res = _mm_and_si128(combined, filter);                // & 0x0f0f

    return {res, offset};
}

}  // namespace _compress_mask

template <typename T>
std::pair<__m128i, std::uint8_t> compress_mask_for_shuffle_epi8(std::uint32_t mmask) {
     auto res = _compress_mask::mask128(mmask);
     res.second /= sizeof(T);  // bit count to element count
     return res;
}

Mask for _mm256_permutevar8x32_epi32

This is almost one-for-one @PeterCordes' solution - the only difference is the _pdep_u64 step (he suggests this variant as a note).

The mask that I chose is 0x5555'5555'5555'5555. The idea is: I have 32 bits of mmask, 4 bits for each of the 8 integers. I have 64 bits that I want to get => I need to convert each bit of the 32 into 2 => therefore 0101b = 5. The multiplier also changes from 0xff to 3, because I will get 0x55 for each selected integer, not 1.
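As a quick check of that arithmetic (my snippet, assuming BMI2):

#include <cassert>
#include <cstdint>
#include <immintrin.h>

int main() {
    // One selected 32-bit element contributes 4 set bits of mmask.
    // pdep against 0x5555... spreads them over a byte (0x55); *3 then fills the gaps (0xFF).
    assert(_pdep_u64(0xF, 0x5555555555555555) == 0x55);
    assert(0x55 * 3 == 0xFF);
    return 0;
}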

Complete mask code:

// helper namespace
namespace _compress_mask {

// mmask - result of _mm256_movemask_epi8
inline std::pair<__m256i, std::uint8_t> mask256_epi32(std::uint32_t mmask) {
    const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x5555'5555'5555'5555) * 3;

    const std::uint8_t offset = static_cast<std::uint8_t>(_mm_popcnt_u32(mmask));  // To compute how many elements were selected

    const std::uint64_t compressed_idxes = _pext_u64(0x0706050403020100, mmask_expanded);  // Do the @PeterCordes answer

    // Every index was one byte => we need to make them into 4 bytes
    const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes);  // 0000|compressed indexes
    const __m256i expanded = _mm256_cvtepu8_epi32(as_lower_8byte);  // spread them out
    return {expanded, offset};
}

}  // namespace _compress_mask

template <typename T>
std::pair<__m256i, std::uint8_t> compress_mask_for_permutevar8x32(std::uint32_t mmask) {
    static_assert(sizeof(T) >= 4);  // You cannot permute shorts/chars with this.
    auto res = _compress_mask::mask256_epi32(mmask);
    res.second /= sizeof(T);  // bit count to element count
    return res;
}
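For context, here is a hypothetical load/compress/store loop using the 256-bit variant (my own sketch, not the author's wrapper code; the same headers as above are assumed). For in-place std::remove-style use, the full-vector store is safe because the write pointer never passes the read pointer:

int* remove_zeroes_epi32(int* first, int* last) {
    int* out = first;
    for (; last - first >= 8; first += 8) {
        const __m256i x = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(first));
        // bits set for the bytes of elements we want to keep (the non-zero elements)
        const std::uint32_t mmask = ~_mm256_movemask_epi8(_mm256_cmpeq_epi32(x, _mm256_setzero_si256()));
        const auto [shuffle, count] = compress_mask_for_permutevar8x32<int>(mmask);
        const __m256i packed = _mm256_permutevar8x32_epi32(x, shuffle);
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), packed);  // writes 8 elements; only `count` are meaningful
        out += count;
    }
    for (; first != last; ++first)   // scalar tail
        if (*first != 0) *out++ = *first;
    return out;
}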

Benchmarks

Processor: Intel Core i7 9700K (a modern consumer level CPU, no AVX-512 support)
Compiler: clang, build from trunk near the version 10 release
Compiler options: --std=c++17 --stdlib=libc++ -g -Werror -Wall -Wextra -Wpedantic -O3 -march=native -mllvm -align-all-functions=7
Micro-benchmarking library: google benchmark

Controlling for code alignment:
If you are not familiar with the concept, read this or watch this
All functions in the benchmark's binary are aligned to a 128 byte boundary. Each benchmarking function is duplicated 64 times, with a different noop slide at the beginning of the function (before entering the loop). The main numbers I show are the min for each measurement. I think this works since the algorithm is inlined. This is also supported by the fact that I get very different results for different alignments. At the very bottom of the answer I show the impact of code alignment.
Note: benchmarking code. BENCH_DECL_ATTRIBUTES is just noinline

The benchmark removes some percentage of 0s from an array. I test arrays with {0, 5, 20, 50, 80, 95, 100} percent of zeroes.
I test 3 sizes: 40 bytes (to see if this is usable for really small arrays), 1000 bytes and 10'000 bytes. I group by size because SIMD performance depends on the size of the data and not the number of elements. The element count can be derived from the element size (1000 bytes is 1000 chars but 500 shorts and 250 ints). Since the time the non-simd code takes depends mostly on the element count, the wins should be bigger for chars.

Plots: x - percentage of zeroes, y - time in nanoseconds. 'padding: min' indicates that this is the minimum among all alignments.

40 bytes worth of data, 40 chars

(plot: 40 bytes, chars)

For 40 bytes this does not make sense even for chars - my implementation gets about 8-10 times slower than non-simd code when using 128 bit registers. So, for example, a compiler should be careful about doing this for very small arrays.

1000 bytes worth of data, 1000 chars

(plot: 1000 bytes, chars)

Apparently the non-simd version is dominated by branch prediction: when we get a small amount of zeroes we get a smaller speed-up: for no 0s - about 3 times, for 5% zeroes - about a 5-6 times speed-up. When the branch predictor can't help the non-simd version, there is about a 27 times speed-up. It's an interesting property of simd code that its performance tends to be much less dependent on the data. Using 128 vs 256 bit registers shows practically no difference, since most of the work is still split into 2 128-bit registers.

1000 bytes worth of data, 500 shorts

(plot: 1000 bytes, shorts)

Similar results for shorts, except with a much smaller gain - up to 2 times. I don't know why shorts do that much better than chars for non-simd code: I'd expect shorts to be only two times faster, since there are only 500 shorts, but the difference is actually up to 10 times.

1000 bytes worth of data, 250 ints

(plot: 1000 bytes, ints)

For 1000 bytes of ints, only the 256-bit version makes sense - a 20-30% win, excluding the case with no 0s to remove whatsoever (perfect branch prediction, no removing for the non-simd code).

10'000 bytes worth of data, 10'000 chars

(plot: 10'000 bytes, chars)

The same order of magnitude wins as for 1000 chars: from 2-6 times faster when the branch predictor is helpful to 27 times when it's not.

Same plots, only simd versions:

(plot: 10'000 chars, simd versions only)

Here we can see about a 10% win from using 256 bit registers over splitting the work into 2 128 bit ones. In size the code grows from 88 to 129 instructions, which is not a lot, so it might make sense depending on your use-case. For the base-line, the non-simd version is 79 instructions (as far as I know these are smaller than the SIMD ones though).

10'000 bytes worth of data, 5'000 shorts

(plot: 10'000 bytes, shorts)

From a 20% to a 9 times win, depending on the data distribution. Not showing the comparison between 256 and 128 bit registers - it's almost the same assembly as for chars, and the same win of about 10% for the 256 bit one.

10'000 bytes worth of data, 2'500 ints

(plot: 10'000 bytes, ints)

It seems to make a lot of sense to use 256 bit registers: this version is about 2 times faster compared to 128 bit registers. When comparing with non-simd code - from a 20% win with perfect branch prediction to 3.5-4 times as soon as it's not.

Conclusion: when you have a sufficient amount of data (at least 1000 bytes), this can be a very worthwhile optimisation for a modern processor without AVX-512.

PS:

On percentage of elements to remove

On one hand it's uncommon to filter half of your elements. On the other hand a similar algorithm can be used in partition during sorting => that is actually expected to have ~50% branch selection.

Code alignment impact

The question is: how much is it worth if the code happens to be poorly aligned (generally speaking, there is very little one can do about it)?
I'm only showing this for 10'000 bytes.
The plots have two lines, for min and for max, for each percentage point (meaning it's not one best/worst code alignment - it's the best code alignment for a given percentage).

Code alignment impact - non-simd

Chars: (plot: code alignment, chars)

From 15-20% for poor branch prediction to 2-3 times when branch prediction helped a lot. (The branch predictor is known to be affected by code alignment.)

Shorts: (plot: code alignment, shorts)

For some reason the 0 percent case is not affected at all. It can be explained by std::remove first doing a linear search to find the first element to remove. Apparently linear search for shorts is not affected. Other than that - from 10% to 1.6-1.8 times worse.

Ints: (plot: code alignment, ints)

Same as for shorts - no 0s is not affected. As soon as we go into the remove part it goes from 1.3 times to 5 times worse than the best-case alignment.

Code alignment impact - simd versions

Not showing shorts and ints 128, since it's almost the same assembly as for chars

Chars, 128 bit register (plot: code alignment, chars - 128): about 1.2 times slower

Chars, 256 bit register (plot: code alignment, chars - 256): about 1.1 - 1.24 times slower

Ints, 256 bit register (plot: code alignment, int - 256): about 1.25 - 1.35 times slower

We can see that for simd version of the algorithm, code alignment has significantly less impact compared to non-simd version. I suspect that this is due to practically not having branches.

Solution 5

In case anyone is interested, here is a solution for SSE2 which uses an instruction LUT instead of a data LUT, i.e. a jump table. With AVX this would need 256 cases though.

Each time you call LeftPack_SSE2 below it uses essentially three instructions: jmp, shufps, jmp. Five of the sixteen cases don't need to modify the vector.

static inline __m128 LeftPack_SSE2(__m128 val, int mask)  {
  switch(mask) {
  case  0:
  case  1: return val;
  case  2: return _mm_shuffle_ps(val,val,0x01);
  case  3: return val;
  case  4: return _mm_shuffle_ps(val,val,0x02);
  case  5: return _mm_shuffle_ps(val,val,0x08);
  case  6: return _mm_shuffle_ps(val,val,0x09);
  case  7: return val;
  case  8: return _mm_shuffle_ps(val,val,0x03);
  case  9: return _mm_shuffle_ps(val,val,0x0c);
  case 10: return _mm_shuffle_ps(val,val,0x0d);
  case 11: return _mm_shuffle_ps(val,val,0x34);
  case 12: return _mm_shuffle_ps(val,val,0x0e);
  case 13: return _mm_shuffle_ps(val,val,0x38);
  case 14: return _mm_shuffle_ps(val,val,0x39);
  case 15: return val;
  }
}

__m128 foo(__m128 val, __m128 maskv) {
  int mask = _mm_movemask_ps(maskv);
  return LeftPack_SSE2(val, mask);
}

Comments

  • Froglegs
    Froglegs almost 2 years

    If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2?

    I've seen in SSE where it was done like this: (From: https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf)

    __m128i LeftPack_SSSE3(__m128 mask, __m128 val)
    {
     // Move 4 sign bits of mask to 4-bit integer value.
     int bitmask = _mm_movemask_ps(mask);
     // Select shuffle control data (shufmasks is the 16-entry LUT of byte-shuffle controls)
     __m128i shuf_ctrl = _mm_load_si128(&shufmasks[bitmask]);
     // Permute to move valid values to front of SIMD register
     __m128i packed = _mm_shuffle_epi8(_mm_castps_si128(val), shuf_ctrl);
     return packed;
    }
    

    This seems fine for SSE which is 4 wide, and thus only needs a 16 entry LUT, but for AVX which is 8 wide, the LUT becomes quite large (256 entries, each 32 bytes, or 8k). A sketch of such a table is below.
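    A sketch (mine, not from the question) of what that brute-force 8k table would look like: 256 entries of 8 dword indices for _mm256_permutevar8x32_ps, built once at startup; the names are hypothetical.

    #include <stdint.h>

    alignas(32) static uint32_t g_pack_left_lut[256][8];

    static void BuildPackLeftLUT() {
        for (int m = 0; m < 256; ++m) {
            int k = 0;
            for (int i = 0; i < 8; ++i)
                if (m & (1 << i)) g_pack_left_lut[m][k++] = i;   // left-pack the indices of set mask bits
            while (k < 8) g_pack_left_lut[m][k++] = 0;           // don't-care tail
        }
    }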

    I'm surprised that AVX doesn't appear to have an instruction for simplifying this process, such as a masked store with packing.

    I think with some bit shuffling to count the # of sign bits set to the left you could generate the necessary permutation table, and then call _mm256_permutevar8x32_ps. But this is also quite a few instructions I think..

    Does anyone know of any tricks to do this with AVX2? Or what is the most efficient method?

    Here is an illustration of the Left Packing Problem from the above document:

    (illustration: the Left Packing Problem)

    Thanks

  • Froglegs
    Froglegs about 8 years
    that makes it much easier, avx512 looks very nice!
  • Peter Cordes
    Peter Cordes about 8 years
    It looks like you're inventing your own function names for a lot of stuff. e.g. is shift_right_zero_extend_v supposed to be _mm_srlv_epi32? I don't really like Intel's intrinsic names (too much typing, their asm mnemonics are much nicer), but it's clearer. Using auto doesn't even make it clear that the result is a vector, esp. since m & 7 is taking advantage of GNU C vector syntax where a scalar can be implicitly broadcast when used with a vector.
  • Peter Cordes
    Peter Cordes about 8 years
    Anyway, nice job with using a smaller LUT, I think, but the code might as well be pseudo-code. (Or maybe it's intended that way?) Also, you don't show the popcnt of the mask, and this function doesn't return the movmskps result. So I guess the caller will also have to movemask, and then it's up to the compiler to CSE that away? Poor design IMO: have this function take an unsigned int mask.
  • Froglegs
    Froglegs about 8 years
    It is real C++ code that compiles actually-- I don't use the intrinsics manually, I have my own wrappers. This was just a test function, for usage I'd have to take the movemask out of the body of the function, you are correct. The m & 7 bit broadcasts 7 to a register and calls _mm256_and_si256.
  • Peter Cordes
    Peter Cordes about 8 years
    My point was that writing it the boring / annoying way with Intel's really long function names will make it a better answer, since it makes it clearer exactly what steps are taken. I think your LUT has shuffle masks packed into 3 bytes. And you decompress with pmovzx or something, then vpsrlv, then mask away high garbage in each element? Or are broadcasting one 32b element and then using a variable shift to extract eight 3b elements? I think the latter. Feel free to copy/paste my text description of what you do.
  • Froglegs
    Froglegs about 8 years
    Ya, perhaps I should post it with raw intrinsics then, I'll convert it over and post it again. I can post the table gen code also
  • Froglegs
    Froglegs about 8 years
    I posted the raw intrinsics code and LUT gen code. Yeah I broadcast 1 32bit integer, but only use the lower 24 bits of it. Each 3 bits contains the index to load from(0-7).
  • Peter Cordes
    Peter Cordes about 8 years
    Your packed LUT entries gave me an idea for generating them on the fly, and thus avoiding the LUT. See my 2nd answer that should work well on existing CPUs with AVX2. It has fairly poor latency, but good throughput.
  • EOF
    EOF about 8 years
    You can get rid of the sign-extension, by using _mm_popcnt_u64().
  • Peter Cordes
    Peter Cordes about 8 years
    @EOF: Good point. Hopefully AMD's AVX2 CPUs will run 64bit popcnt at the same speed at 32bit popcnt (unlike Steamroller where it has one per 4c throughput instead of one per 2c). Agner's tables don't report popcnt_u64 being slower on other CPUs, just Bulldozer-family.
  • Froglegs
    Froglegs about 8 years
    I tested it on haswell and it works, nice job! The only issue is that for some reason on MSVC _pdep_u64 and _mm_cvtsi64_si128 are only available if compiling for x64. They get defined out in 32bit builds.
  • Froglegs
    Froglegs about 8 years
    I dunno why, but msvc also doesn't provide popcnt_u64 unless compiling in x64(disabled in 32 bit builds)
  • Peter Cordes
    Peter Cordes about 8 years
    @Froglegs: of course it doesn't. It's a 64bit integer instruction! Stop wasting your time with 32bit builds, IMO.
  • Peter Cordes
    Peter Cordes about 8 years
    @Froglegs: woot! I love it when code works the first time after getting it to compile. How did the performance compare with your LUT version? In theory the throughput should be great, but in practice CPUs are complicated and can trip on unexpected bottlenecks that uop-counting doesn't find. And like I said for popcnt, of course this doesn't work in 32bit. If you need a 32bit version, unpack with vmovd / vpbroadcastd / vpsrlvd / vpand (i.e. your LUT-unpack code, but starting with an integer register instead of a memory value).
  • Froglegs
    Froglegs about 8 years
    but I can use 64 bit integers in 32 bit builds, so I don't get why it is disabled, ya I should probably use 64 bit builds
  • Peter Cordes
    Peter Cordes about 8 years
    @Froglegs: look at the asm for adding two int64_ts in 32bit code (gcc -m32): it's two 32bit instructions, add and adc. If you used std::bitset<64>::count() in 32bit code, you'd get a 32bit popcnt of each half, and an add. But _popcnt_u64 is an intrinsic specifically for the popcnt r64, r/m64 instruction, not for population-count by any method available. If you're not using gcc, you should use _popcnt_u32, but you should still make 64bit builds so your code runs faster!
  • Froglegs
    Froglegs about 8 years
    Hey Peter, I put together a little performance test, the performance of your method and my LUT are very similar, with the LUT slightly faster. Your method x64 (_pdep_u64) vs x86 (using vector shuffle): the x64 approach is just a hair faster
  • Froglegs
    Froglegs about 8 years
    I like your method since it doesn't require wasting L1 memory on a LUT ;) My little performance test wasn't really realistic since all it was doing was calculating shuffle masks, real code I'd probably prefer your method.
  • Froglegs
    Froglegs about 8 years
    Use this code when targeting x86: __m256i indices = _mm256_set1_epi32(wanted_indices); __m256i m = _mm256_sllv_epi32(indices, _mm256_setr_epi32(29, 26, 23, 20, 17, 14, 11, 8)); __m256i shufmask = _mm256_srli_epi32(m, 29);
  • Peter Cordes
    Peter Cordes about 8 years
    @Froglegs: good idea using 2 shifts instead of shift+and to knock off the garbage bits. On Haswell, srli runs on port0 only, so it doesn't compete with pext/pdep or the shuffle. And yeah, set1 is the right approach to broadcasting into a vector.
  • Peter Cordes
    Peter Cordes about 8 years
    @Froglegs: talking about 64bit pdep got me thinking: maybe we can use 64b constants and skip the unpack from 3b to 8b. It worked; I saved an insn. Now the 64bit version should have a more significant advantage over 32bit. I also updated the code in the godbolt link significantly, so it works for -m32, using fallback code for that. (If you actually use it that way, now you need to unit-test separately in 32bit and 64bit.)
  • Froglegs
    Froglegs about 8 years
    I tested it out, runs about the same speed, but 1 less instruction is good:) MSVC produces the same output as clang, except it uses mov instead of movabs
  • Peter Cordes
    Peter Cordes about 8 years
    @Froglegs: It's the same instruction. The GNU toolchain uses movabs for mov r64, imm64 and mov [abs64], rax even in Intel-syntax mode, to distinguish them from mov r/m64, imm32, but other assemblers/disassemblers don't. If you're bottlenecked on memory, you won't measure a difference, but it does reduce the latency of the sequence from 19 to 16 cycles on Haswell. (or 18 to 15, depending on how the compiler implements the multiply by 0xFF). And of course fewer insns / uops helps throughput.
  • Z boson
    Z boson about 8 years
    Congrats on getting this right without having the hardware. I am surprised you have not received more than two (from the OP and me) votes. I added an answer using an instruction LUT. What do you think of this solution? Maybe it's a bad idea.
  • Peter Cordes
    Peter Cordes about 8 years
    If you're going to branch on the mask, you might as well hard-code the popcnt in each case. Return it in an int * parameter or something. (popcnt came after pshufb, so if you have to fall back to an SSE2 version, you don't have hardware popcnt either.) If SSSE3 pshufb is available, a (data) LUT of shuffle masks may be better if the data is unpredictable.
  • Peter Cordes
    Peter Cordes about 8 years
    Since the pshufb masks have a known relationship inside each group of 4B, they could be compressed from [ D+3 D+2 D+1 D | C+3 ... ] down to just 4B [ D C B A ], and unpacked with punpcklbw same,same / punpcklwd same,same / paddb x, [ 3 2 1 0 | 3 2 1 0 | ... ]. That's 3 shuffles and an add instead of just one pshufb, though. Or unpack the mask with a pshufb, so it's 2 shuffles and a paddb. Anyway, that makes the LUT only 16 * 4B = 64B = one cache line, at the cost of needing two other 16B constants in registers, or as memory operands.
  • Peter Cordes
    Peter Cordes about 8 years
    @Zboson: re: votes. I know, right? The question got a surprising number of votes, so I expected more people would be interested in these answers.
  • Z boson
    Z boson about 8 years
    @PeterCordes, good point about popcnt. I thought of that but the OP did not request it even though it seems like something the OP would need. When I look at the jump table in GCC I can't make out yet how GCC created it. I would expect it to jump in the order I listed the codes 0,1,2,3...15 (or maybe reverse order) but it seems random. It's 0xe, 0x34, 0x1, 0x38, 0x2, 0x8, 0x9, 0x39, 0x3, 0xc, 0xd. If you look at the codes above you can see that that's not at all in order. Do you know why this is?
  • Z boson
    Z boson about 8 years
    @PeterCordes, well your answer is I think a bit hard to follow but I think that has more to do with your x86 knowledge being well beyond mine and probably most other people's now than anything else. This makes it harder to parse to determine if it's correct (which is why I asked you if you tested it). But the OP seems happy and that's very important. You can probably pick up a Haswell Intel NUC for pretty cheap now on eBay or something (Broadwell and Skylake are exceedingly boring so I will hold off longer).
  • Peter Cordes
    Peter Cordes about 8 years
    If you do understand the data movement in my answer now, can you think of a better way to introduce it? Maybe step by step with ASCII diagrams? How readable do you find the comments in the code?
  • Peter Cordes
    Peter Cordes about 8 years
    @Zboson: I'm shopping for a Skylake system. The built-in GPU is supposed to be significantly better than Haswell. NUCs don't have room for my RAID array :/ Also, 32GB of DDR4 is not very expensive :) I wish they'd make a desktop Skylake with eDRAM, though. Also, my desktop setup needs 2 video outputs (at least one being DVI), and S/PDIF audio output. Also I'd like to be able to throw in a PCIe video card for games if I want. I have a perfectly good Radeon 6870 that idles at about 80 Watts with the open source drivers ...
  • Peter Cordes
    Peter Cordes about 8 years
    Maybe it started to order it for a decision-tree of branches before deciding on a jump-table strategy. It amuses me that when making PIC code, it decided on a table of 4B displacements that it loads with movsx. If it's going to movsx anyway, might as well use 1B displacements for a smaller table. It also doesn't know that the input will always be 0..15, so it checks for outside that range and returns zero :/
  • Froglegs
    Froglegs about 8 years
    The part I do not understand is how _pext_u32/64(identity_indices, expanded_mask) manages to produce the desired indices--
  • Froglegs
    Froglegs about 8 years
    I don't know why more people don't like it, seems pretty useful, but maybe i'm biased--
  • Peter Cordes
    Peter Cordes about 8 years
    @Froglegs: Thanks for the feedback on which step wasn't obvious. I had only linked the wikipedia article which had a diagram of pext in operation. Does this update help? I worked through an example of how pext grabs bits and packs them to the low bits of the output.
  • Froglegs
    Froglegs about 8 years
    Ah I was not looking closely enough, I didn't notice the expanded_mask was the selector, had it reversed in my mind. Makes more sense now--!
  • Peter Cordes
    Peter Cordes about 8 years
    @Froglegs: yeah, things get tricky when you use one shuffle to create a control-mask for another shuffle. When I first wrote the answer, I just gave up and let the verbose variable-names in the code speak for themselves, because it was getting pretty hairy to explain in words. (Other shuffles where both inputs are variable also get used both ways: pshufb can be used as a 4bit LUT with variable shuffle mask and constant data being shuffled. Or it can be used as a shuffle with constant control inputs)
  • Z boson
    Z boson about 8 years
    @PeterCordes, what is the .L4[0+rax*8]] .L4: .quad .L1 stuff? Is there someway to get GCC to ouput the hex values rather than decimal values? I used objdump instead of GCC -S because objdump shows the hex values.
  • Z boson
    Z boson about 8 years
    @PeterCordes, maybe this link is useful to you.
  • Z boson
    Z boson about 8 years
    I find the comments in your assembly code very useful.
  • Peter Cordes
    Peter Cordes about 8 years
    .L4: is the table. .quad .L1 is the first entry, as a 64bit pointer. [.L4 + rax*8] is an addressing mode that indexes the table with rax. (In AT&T syntax, .L4(%rax), which is probably why gcc's Intel syntax likes to put labels outside the []. Or maybe it's just copying MASM like it does for other aspects of the syntax.)
  • Peter Cordes
    Peter Cordes about 8 years
    Thanks for the link, but it has some massive flaws. It ends up telling you that the correct way to increment a memory operand is to use it as an input operand with a "memory" clobber. (The actual correct way is as a "+m" read-write output operand, without the optimization-defeating "memory" clobber. And he should have been using a "+rm" constraint, not just one or the other.) Also, the asm doesn't always match the code: it looked like the asm just returned the value, instead of passing it to printf. That's minor, though.
  • Peter Cordes
    Peter Cordes about 8 years
    re: hex: you mean like this Godbolt feature-request? Having gcc do it internally would probably be ideal, maybe submitting a patch to gcc would be better than having godbolt post-process the output. Esp. because it would be useful outside of godbolt.org!
  • Z boson
    Z boson about 8 years
    @PeterCordes, to be honest, I only read the link for about 2 minutes. It was in the top 30 links on news.ycombinator.com yesterday so I thought it might be good. Here is the discussion.
  • Christoph Diegelmann
    Christoph Diegelmann over 7 years
    In filter_non_negative_avx2 and filter_non_negative_avx512 you src += 16;. I think the avx2 version should only add 8. In this special case we could also drop the compare if we don't care about negative zero (movemask extracts the sign bit). Please correct me if I'm wrong.
  • Peter Cordes
    Peter Cordes over 7 years
    @Christoph: Good catch on the loop increment, thanks. I probably copied it from the avx512 version. Also a good point about sign bits. If you're willing to consider -0.0 as negative, you can even do stuff like XOR floats together and then use that as the control for a variable blend to do something based on whether their signs are equal or not (e.g. while checking if a point is inside a bounding box), instead of multiplying them or doing two compares. But note that NaNs can have the sign-bit set or not, so this isn't safe wrt to NaNs.
  • wim
    wim over 7 years
    @PeterCordes , Very nice answer! I think for the Skylake platform a tiny improvement is possible if the _mm256_movemask_ps is replaced by _mm256_movemask_epi8. Then we don't need the (slow) pdep and imul instructions, because in that case we already have a 8x4-bit mask which is suitable for the 32 bits pdep instruction with identity_indices = 0x76543210 Instead, we need a vpsrlvd and a vpand at the end, but these have better latency and throughput on Skylake than pdep and imul.
  • wim
    wim over 7 years
    As far as I can see there is no bypass delay for both vpmovmskb and vmovmskps, at least on Skylake. Another minor advantage is that this code should run in both 32 bit and 64 bit mode.
  • wim
    wim over 7 years
    Code: __m256 compress256_v3(__m256 src, __m256 mask256){unsigned int mask = _mm256_movemask_epi8(_mm256_castps_si256(mask256)); const uint32_t identity_indices = 0x76543210; uint32_t wanted_indices = _pext_u32(identity_indices, mask); __m256i fourbitvec = _mm256_set1_epi32(wanted_indices); __m256i shufmask = _mm256_srlv_epi32(fourbitvec,_mm256_set_epi32(28,24,20,16,12,8,4,0)); shufmask = _mm256_and_si256(shufmask,_mm256_set1_epi32(0x0F)); return _mm256_permutevar8x32_ps(src, shufmask);} . Godbolt link
  • Christoph Diegelmann
    Christoph Diegelmann about 7 years
    @wim Using _mm256_movemask_epi8 we can get rid of the multiply in @PeterCordes original version. First we _pext_u32 to get wanted_indices and then _pdep_u64 with 0x0F0F0F0F0F0F0F0F to scale them from nybbles to bytes. Godbolt link
  • wim
    wim about 7 years
    @Christoph : Your code is nice and a bit shorter, but my code might be slightly faster. Aside from loading the constands, which are likely to be hoisted outside the loop after inling, your code uses vmovq+vpmovzxbd instead of vmovd+vpbroadcastd in my code, which is equally efficient. However, instead of pdep in your code, with latency 3 and throughput 1 on Skylake, my code uses vpsrlvd+vpand, which both have latency 1 and throughput 1/2. Moreover, your code only works in 64 bit mode.
  • wim
    wim about 7 years
    @Christoph : Correction: On Skylake vpand has latency 1 and throughput 1/3. Note that vpsrlvd is very slow on Haswell: latency 2 and throughput 2. Therefore, on Haswell your solution will be faster.
  • Peter Cordes
    Peter Cordes about 7 years
    @wim: AMD's new Zen I think still has 128b vector execution units (so 256b ops have half throughput). Doing more in scalar integer will be a win there, if pdep is fast on Zen. (It is supported, but I don't think there are latency numbers yet). I think overall throughput should be more important than latency here, since the loop-carried dependency is only on popcnt and its input. Thanks for the vpmovmskb idea; I'll update my answer with that sometime. (Or feel free to add a paragraph and a godbolt link to the answer yourself; I might not get back to this very soon).
  • wim
    wim about 7 years
    @PeterCordes : This webpage lists latency and throughput numbers for the AMD Ryzen/Zen CPU. The numbers are quite interesting. For example: The latency and throughput of the vpand instruction with ymm (256 bit) operands is 1c and 0.5c, which is quite amazing for a processor without 256 bit execution units, I think. On the other hand, the pext and pdep instructions both have L=18c and T=18c.... The vpsrlvd instruction: L=T=4c.
  • wim
    wim about 7 years
    Probably they focused mainly on optimizing the most common instructions, when they were developing the micro architecture.
  • Peter Cordes
    Peter Cordes about 7 years
    @wim: Thanks for the link. Yeah, they must be microcoding pext/pdep instead of including a big hardware execution unit. Looks like Zen has pretty impressive vector throughput for a lot of 256b integer AVX2 instructions. it apparently has two 128b shuffle units, so it can do two PSHUFB xmm per clock or one PSHUFB ymm. It does really well on x265 benchmarks, so apparently the micro-op cache and 6-wide issue is enough to handle splitting 256b ops into two. Unfortunately, it looks much weaker at cross-lane shuffles. vpermps is 5c lat and 4c tput, which is bad for this algo.
  • Peter Cordes
    Peter Cordes about 7 years
    Zen still has slow bsr/bsf, but fast lzcnt/tzcnt. :( And slow shld/shrd. They did speed up integer multiply, though, to be a lot more like Intel SnB-family (3c latency, 1c throughput for the forms that don't produce a high-half result, even 64-bit. Big speedup over Bulldozer/Piledriver/Steamroller's 6c lat, 4c throughput for 64-bit imul). Overall very impressive (especially given the starting-point), and apparently the cache hierarchy doesn't suck.
  • wim
    wim about 7 years
    Yes, they have made a big step forward! Some instructions even have better timings than on Skylake.
  • Christoph Diegelmann
    Christoph Diegelmann almost 7 years
    Due to the very slow pdep and pext on Ryzen (18 cycles each), @Froglegs' answer is probably the better option if we're unsure about the target CPU. Maybe we should add this as a note to either @Froglegs' or your answer.
  • wim
    wim about 5 years
    @PeterCordes: The vpand in my earlier comment can be omitted, because only the 3 least significant bits of shufmask are relevant for vpermps. This leads to more efficient code on Intel Skylake. Or with clang.
  • Peter Cordes
    Peter Cordes about 5 years
    @wim: I haven't gotten back to this answer to really look at the different options people have suggested with variable shifts instead of pdep, and so on. It would probably be a good idea to post an answer of your own (or feel free to edit in an AMD-friendly section in this answer, and add a caveat about pdep on AMD to mine.) I may get around to that at some point, but I have a chromium window with probably 30 tabs of answers I've been meaning to finish writing or get back to update...
  • wim
    wim about 5 years
    @Froglegs: I think you can use a single _mm256_srlv_epi32 instead of _mm256_sllv_epi32, and _mm256_srli_epi32, since you only need the 3 bits (per element) at the right position, because _mm256_permutevar8x32_ps doesn't care about garbage in the upper 29 bits.
  • Froglegs
    Froglegs about 5 years
    hi wim, thanks for the tip. You are correct that only the lower 3 bits matter, I've updated the post so it shows your suggestion.
  • wim
    wim about 5 years
    @Zboson: Note that since gcc 8.1 it is a good idea to add a default: __builtin_unreachable(); in the switch. This leads to slightly more efficient code, with one cmp/ja less than without the default case.
  • ZachB
    ZachB about 5 years
    Your AVX2 version benchmarks slightly (~3%) faster than this version on CSL with GCC8.2. Impressive work there. (The AVX2 version also runs ~4.52x faster than the SSE2 LUT version.)
  • Peter Cordes
    Peter Cordes about 5 years
    @ZachB: Cool. There are some suggestions in comments for alternatives with variable shifts that might be worth trying, I forget if anyone came up with anything that was likely better on Intel as well as AMD (where pdep is slow.) Or did you mean to say AVX512, since you're commenting under this answer and getting ~4x the SSE2 version?
  • ZachB
    ZachB about 5 years
    @PeterCordes as far as I understood the comments on the AVX2 version, they all rely on the mask starting out expanded (__m256) instead of being a compressed bit mask (int). I'm starting from the latter and didn't see a way to adapt those suggestions. Not sure what you're asking in last sentence; I tried this AVX512 version on both SKL and CSL but don't have perf counters on either. I was hoping VCOMPRESSPS would magically be faster esp. since it's wider. Agner only has timings for that on KNL.
  • Peter Cordes
    Peter Cordes about 5 years
    @ZachB: Ok, I'm surprised at a 4x speedup only doubling the vector width. Agner has SKX numbers, in the spreadsheet tab next to SKL. (Skylake-X and Cascade-Lake X CLX are the server versions with AVX512. SKL often means specifically the client chips without AVX512). vcompressps with a vector destination is 2 uops for port 5 so there's a bit of a bottleneck. Presumably as a store it's 3 or 4 uops.
  • Peter Cordes
    Peter Cordes about 5 years
    Probably worth trying AVX512 with 256-bit vectors; using 512-bit vectors disables the vector ALUs on port 1, and reduces max turbo significantly on server chips. Unless you already use 512-bit vectors before and after. Or maybe you run into a memory bottleneck. I would have expected vcompressps to be faster than imul + pdep + pext (3 uops for port 1 = bottleneck) as well as 2x shuffles. Or maybe the same speed since it's also 2 port-5 uops.
  • ZachB
    ZachB about 5 years
    Sorry for the unclear comments. On SKL your AVX2 pdep/pext/shuf is ~4.5x faster than @ZBoson's SSE2 LUT version. On SKX and CLX this 512-bit vcompressps version was ~3% slower than pdep/pext/shuf run on the same chips. Since the pdep/pext/shuf version was slightly faster, I think that means it's not mem-bottlenecked. I don't have PMU access on SKX/CLX tho. On CLX, 256-bit vcompressps is ~10% faster than 512-bit vcompressps; ~6% faster than pdep/pex/shuf.
  • ZachB
    ZachB about 5 years
    I still can't find a vcompressps entry in Agner's tables for anything but KNL btw. Are you assuming vPcompress has same timings as vcompress?
  • Peter Cordes
    Peter Cordes about 5 years
    @ZachB: are you sure you have the latest version? I'm looking at the .ods spreadsheet from his 2018-sep .zip. The row is VCOMPRESPS/PD v{k},v 2 2 p5 3 2 between the VMASKMOVPS/D and VEXPANDPS/PD rows, in the SKX tab (not SKL). Between SKL and Pentium 4.
  • ZachB
    ZachB about 5 years
    @PeterCordes gah, Agner's table has VCOMPRESPS/PD with only one S. The Intel intrinsics guide and arch manual has vcompressps with two Ss. /facepalm.
  • Peter Cordes
    Peter Cordes about 5 years
    @ZachB: Oh crap, I didn't even notice until you pointed it out! I happened to be searching on vcompr to find it. I randomly decided that was a good probably-unique prefix to find the entry in LibreOffice search.
  • Peter Cordes
    Peter Cordes about 5 years
    @ZachB: I sent Agner a message about that mistake via his blog (agner.org/optimize/blog/read.php?i=962), so it should be fixed in the next revision of the tables. uops.info/html-lat/SKX/… has SKX latency from vector to result (3c) and from mask to result (6c), as well as actual measurements + IACA output in their table. Memory-destination vcompressps is 4 uops like I guessed, no micro-fusion of the store.
  • ZachB
    ZachB about 5 years
    When this is used in a loop, one sorta nifty way to do the tail would be with vmaskmovps. Instead of making an 8x256b LUT for the mask, you could make it on the fly with uint64_t popc = _mm_popcnt_u64(mask); uint32_t sm1 = _bzhi_u32(-1, popc); __m256i sm2 = _mm256_set1_epi32(sm1); sm2 = _mm256_sllv_epi32(sm2, <vec of 24,25,...,31>); and then _mm256_maskstore_ps(dest, sm2, packed);. Maybe more efficient to just loop backwards with full 256b stores for the last few vecs, (overlapping work) but sort of a hassle to calc those indexes.
  • Peter Cordes
    Peter Cordes about 5 years
    @ZachB: you don't need an 8x 32-byte LUT, you only need 60-byte array and take a sliding window. (Or compressed to 15 bytes with pmovsx loads). Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all. But yes, you can calculate it instead of loading it with variable shifts, because variable shifts saturate the count.
  • Peter Cordes
    Peter Cordes about 5 years
    @ZachB: You only need this at all if your destination buffer doesn't have enough padding at the end, e.g. because it was allocated based on a popcnt of the mask array without leaving an extra 32 bytes at the end. Depending on allocator support, you can realloc it down after allocating an array as large as might possibly be needed, and count on the fly. But IDK if there is an aligned_realloc. (Hmm, or count on the fly as you calculate the mask bitmaps, although doing that efficiently could be tricky.)
  • ZachB
    ZachB about 5 years
    Good idea on pmovsx. Even if dest is % 32 in len, compress256 writes garbage to the dont-care bytes, so another use case would be keeping the end untouched. (Darwin's allocator is aggressive about moving realloced mem when shrinking btw.)
  • Peter Cordes
    Peter Cordes about 5 years
    @ZachB: the whole point of don't-care bytes as padding is that you are allowed to step on them with stores :P But sure, if you specifically want zero padding that you can count on for horizontal sums or whatever, you could maybe do one masked store to re-zero the padding. (And BTW, you potentially need dest length to be popcount(mask) + 32, not for the buffer to end at an alignment boundary. The final 32-byte store could be from an all-zeros bitmask mask, so all 32 bytes are past the true end regardless of how that's aligned.)
  • Peter Cordes
    Peter Cordes about 5 years
    I was going to comment the same thing as wim, but your intrinsics version already has the optimization. Looks like you forgot to update the asm, which is what I was looking at. (godbolt.org has MSVC installed). With a different choice of types, you can probably also avoid the movsxd sign-extension. (It may go away after inlining, though.)
  • Froglegs
    Froglegs about 5 years
    I updated the asm -- also, as Peter suggested, making the movemask unsigned made the movsxd go away.
  • Peter Cordes
    Peter Cordes about 5 years
    @ZachB: I think some of the AVX2 suggestions for using variable-shifts do work for mask bitmaps, not vector compare masks. You can go from bitmap to vector cheaply with a broadcast + variable shift, e.g. _mm256_set1_epi32(mask[i]) and then variable-shift to put the appropriate bit as the high bit of each element. Or with AVX512, vpmovm2d. But then you need each chunk of the mask in a k register, and loads into k registers are expensive. Cheaper to broadcast-load 32 bits of mask and then shift multiple ways.
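As a rough illustration of that broadcast + variable-shift idea (a sketch only; the lane ordering of the shift counts is my assumption):

    #include <immintrin.h>
    #include <stdint.h>

    // Turn the low 8 bits of 'bits' into a per-lane AVX2 mask: bit i -> lane i.
    static inline __m256i bits_to_vecmask(uint32_t bits)
    {
        __m256i b      = _mm256_set1_epi32((int)bits);
        __m256i shifts = _mm256_setr_epi32(31,30,29,28,27,26,25,24);
        __m256i top    = _mm256_sllv_epi32(b, shifts);   // bit i becomes lane i's sign bit
        return _mm256_srai_epi32(top, 31);               // broadcast the sign bit: 0 or -1
    }

The sign bit alone is enough for blendvps or maskstore; the final arithmetic shift turns it into a full compare-style mask.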
  • Peter Cordes
    Peter Cordes about 5 years
    @ZachB: I'm also surprised that vcompressps isn't doing better. Maybe the compiler is doing narrow loads into k registers, or not unrolling, or something. Agner measured kmovw k1, [rdi] at 3 uops on SKX, and that's what IACA says (load + p0156 + p5), but uops.info/table.html says it's only 2 uops. (load + p5). But since we need to popcnt as well, I guess we'd want to movzx eax, word [rdi] load into an integer register and popcnt, and kmovw k1, eax. vcompressps should go at 1 vector per 3 clocks, bottlenecked on port 5, including 2p5 for it + p5 for kmov.
  • ZachB
    ZachB about 5 years
    @PeterCordes oh, good idea -- I'm actually using that broadcast+variable shift technique to make the mask for vmaskmovps in the last iterations, but didn't think of applying it to the cases in the earlier comments. -- On vcompressps, I'm using 256b ops b/c it's marginally faster than 512b; so movzx eax, byte [rdi] / kmovb k1, eax. godbolt.org/z/BUw7XL is the fastest I've gotten for AVX2 and AVX512. Unrolling 2x or 4x hasn't helped with AVX2; it remains bottlenecked on p1 and p5. Don't have PMU access on CLX/SKX, but no measurable time difference there either.
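A minimal sketch of what that 256-bit vcompressps loop might look like (function and variable names are mine, not from the linked Godbolt; assumes one 8-bit mask per 8-float chunk and enough padding at the end of dst):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    // Requires AVX-512F + AVX-512VL.
    float *compress_chunks(const float *src, float *dst,
                           const uint8_t *mask, size_t nchunks)
    {
        for (size_t i = 0; i < nchunks; i++) {
            __m256 v   = _mm256_loadu_ps(src + 8*i);
            __mmask8 k = (__mmask8)mask[i];              // movzx load + kmovb in asm
            _mm256_mask_compressstoreu_ps(dst, k, v);    // left-pack + masked store
            dst += _mm_popcnt_u32(mask[i]);              // advance by the kept count
        }
        return dst;
    }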
  • Peter Cordes
    Peter Cordes about 5 years
    @ZachB: I wonder if misaligned stores are hurting the 512-bit version. It probably generates a mask and uses it for a masked 256- or 512-bit store, and if that crosses a cache-line boundary then it has to get processed separately. And perhaps misaligned 512-bit stores are worse than cache-line-split 256-bit stores for some reason. We already know that 512-bit vectorization of other things seems to be more alignment-sensitive than we might expect just from every access being a cache-line split (e.g. when you'd expect a memory bottleneck to hide it), so maybe something similar is going on here.
  • Peter Cordes
    Peter Cordes about 5 years
    @ZachB: If you're finding that AVX2-only pdep is faster than vcompressps but you're still targeting AVX512VL, you could use vcompressps for the cleanup to keep it simple. And BTW, see is there an inverse instruction to the movemask instruction in intel avx2? for more ideas about bitmap -> vector options.
  • ZachB
    ZachB about 5 years
    @PeterCordes if it's slowed down heavily by unaligned stores, I wonder if using vpermt2ps to build up full, 64B-aligned vectors before storing to mem would help.
  • Peter Cordes
    Peter Cordes about 5 years
    @ZachB: Neat idea, yeah that may be possible. It will create an arbitrarily long dependency chain of shuffles until you have a full vector. Branching once you have a full vector would break the dep chain but could mispredict. Probably anything branchless (re-storing the same vector and using cmov to update the dst pointer) would create a serial dependency through a chain of vector shuffles for the entire array. But it's only 1 shuffle per input vector, so 3-cycle latency; preparing a control vector is independent per iteration.
  • Peter Cordes
    Peter Cordes about 5 years
    But the shuffle control you build requires accounting for the current length of the vector you're building, not just how many new elements are in the current source vector. And you have to deal with left-over elements that didn't fit into a full vector when you do complete one. I think that's going to be a problem. Maybe we can branch on it, but doing it branchlessly probably adds at least another cycle to blend into the "accumulator", or 3 cycles to shuffle into it with another vpermt2ps whose control vec either keeps it or packs in the leftovers.
  • ZachB
    ZachB about 5 years
    @PeterCordes arghh, I was dumb and benching with arrays that hit RAM. New numbers on CLX where src, dst and mask together fit in L2, in arbitrary ops/sec: pdep/pext/shuf 200, 256b vcompressps 271, 512b vcompressps 305; so yes, good call on vcompressps: 1.35x or 1.52x faster. Also tried aligned stores (advancing dst by 32 or 64B instead of by popcnt), but whatever improvement that may have given was negated by the greater memory access. When hitting RAM or L3, 512b vcompressps is consistently ~3% slower than the 256b version. (Edited to fix the L3 comment.)
  • ZachB
    ZachB about 5 years
    In an unrelated but also very simple case (vpshufb in a loop on 64B-aligned data on CLX or SKX), the 512b version was ~28% faster than the 256b version when the data fit in L1; but 512b was 11% slower than 256b for L2, and 15% slower for L3. If it was just a matter of down-clocking from using the full 512b execution units, I would have expected L1 to suffer as well. Not sure why this happens.
  • Peter Cordes
    Peter Cordes about 5 years
    @ZachB: We know that aligned 512-bit access to L1d is efficient, and 512-bit downclocking isn't a factor of 2. Or at least that 512-bit loads can execute at 2 per clock. IDK about store throughput for commit from the store buffer to L1d. I'd have expected a speedup closer to the factor of 2 from the vector width, with downclocking bringing it down to like 70 or 80% faster at worst. Were you doing a masked store, or anything unusual that could cause a loop-carried dependency? If 512-bit vectors made the CPU run more than twice as slow, they'd be mostly pointless.
  • wim
    wim almost 5 years
    @ZachB: I wouldn't be surprised if a _mm512_maskz_compress_ps followed by a full 512-bit overlapping store performs better than the masked store of _mm512_mask_compressstoreu_ps. _mm512_mask_compressstoreu_ps has high latency, and two subsequent masked stores to the same cache line probably hurt each other. Note that _mm512_maskz_compress_ps has to load data from memory, modify it with the compressed data, and store it back. So, at least, overlapping stores are fewer micro-ops (no load needed).
  • wim
    wim almost 5 years
    ....(continued): I don't have AVX-512 hardware, but with AVX2 I once found in a particular case that overlapping stores were more efficient than masked stores.
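A sketch of wim's suggestion as I understand it (untested; assumes dst has at least 64 bytes of slack past the true end, and the helper name is mine):

    #include <immintrin.h>
    #include <stdint.h>

    static inline float *compress_overlap_store(float *dst, __mmask16 k, __m512 v)
    {
        __m512 packed = _mm512_maskz_compress_ps(k, v);   // left-pack, zero the rest
        _mm512_storeu_ps(dst, packed);                    // full-width store; the zeroed tail
                                                          // is overwritten by the next store
        return dst + _mm_popcnt_u32((uint32_t)k);         // advance by the number of kept lanes
    }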
  • Peter Cordes
    Peter Cordes almost 5 years
    @wim: _mm512_maskz_compress_ps (or vcompressps in asm) decodes to uops that don't include a load. It merges into cache not by loading, but by leaving some bytes unmodified when committing from the store buffer. (At least that's my understanding. Possibly there's a RMW cycle inside the cache write path, but if so it's invisible and effectively atomic. Unlikely though.) Skylake and later have hardware support for masked stores, because they're first-class citizens in AVX512. I wouldn't be surprised if overlapping stores are even more efficient, but you can't always do that.
  • wim
    wim almost 5 years
    @PeterCordes: You are right, I misinterpreted the uops.info. But it would be nice to see if there is actually a performance difference between the two options.
  • ZachB
    ZachB almost 5 years
    @PeterCordes nothing unusual, no loop-carried deps: godbolt.org/z/9SQYCy. (Same source on both sides, just with Vec256 or Vec512 as the template param.) Worth opening a separate question about why this could be?
  • ZachB
    ZachB almost 5 years
    @wim interesting idea, but that benchmarks the same on CLX.
  • Peter Cordes
    Peter Cordes almost 5 years
    @ZachB: yeah, definitely would make an interesting new question. @wim's idea of using vcompressps zmm{k1}, zmm and then a separate regular 512-bit store is interesting (but needs a separate cleanup loop, using vcompressps with a memory destination, once you get < 64 bytes from the end).
  • Peter Cordes
    Peter Cordes almost 5 years
    @wim: yes, definitely worth testing in case there's something funny, like maybe the store buffer doing better at coalescing unmasked stores (which it hopefully can if there are no other stores in between). IDK if we need to avoid loads between stores; if coalescing happens post-retirement then loads won't matter. vcompressps zmm{k1}, zmm and then a separate store should be fewer fused-domain uops for the front-end, because a separate store can micro-fuse the store-address + store-data. So you have 2+1 instead of 4.
  • BeeOnRope
    BeeOnRope over 4 years
    It turns out that pdep/pext are worse than just "very slow" on AMD (18-19 cycles as you say). They are that, plus 8 uops and roughly 4.5 cycles for every set bit in the mask, so "comically slow". For fully set masks, they can clock in at over 500 uops and close to 300 cycles. That's kernel-call territory! See e.g. here. It turns out the reported figures were apparently always measured with a zero mask; only the one test on uops.info with memory operands captured the full behavior, by fluke.
  • BeeOnRope
    BeeOnRope over 4 years
    @PeterCordes I know, right? You'd almost be better off with some kind of emulation that you wrote yourself. This is a better link than the one I included above.
  • Denis Yaroshevskiy
    Denis Yaroshevskiy about 4 years
    Huge thanks for this! Was able to use this to do compress for bytes (though haven't measured anything yet): github.com/DenisYaroshevskiy/algorithm_dumpster/blob/…
  • Denis Yaroshevskiy
    Denis Yaroshevskiy about 4 years
    @PeterCordes - you might be curious - I did measurements for using this to remove chars, shorts and bytes. Added as an answer to this question.
  • Peter Cordes
    Peter Cordes about 4 years
    I have a wild guess about the scalar char results being so much slower than short: clang is often reckless with false dependencies when using 8-bit integers, e.g. mov al, [mem] merging into RAX instead of movzx eax, byte [mem] to zero-extend with no dependency on the old contents. Intel since Haswell or so doesn't rename AL separately from RAX (instead merging) so this false dependency can create a loop-carried dependency chain. Maybe with short it's avoiding 16-bit operand-size by using movzx or movsx loads. I haven't checked the asm yet.
  • Peter Cordes
    Peter Cordes about 4 years
    Re: code alignment: i7-9700k is Coffee Lake, which has a working loop buffer (LSD), unlike earlier Skylake-based microarchitectures where microcode updates disabled the LSD. So I guess the loop is too big to fit in the LSD, except for special cases like when std::remove is just doing a linear search for any elements to keep; that tight loop presumably runs from the LSD even if clang unrolls it.
  • Peter Cordes
    Peter Cordes about 4 years
    Hmm, a mixed scalar / SIMD strategy could be good for that sparse case, using branchless SIMD to scan the next 16 or 32 bytes for a non-matching element. (vpcmpeqb / vpmovmskb / tzcnt). But that creates a dependency chain that couples into the next load address so it's potentially horrible. Hmm, maybe looping over the set bits in the mask would be better, blsr to reset the lowest set bit, tzcnt to find that offset, and scalar copy into *dst++ ...
  • Peter Cordes
    Peter Cordes about 4 years
    ... With software pipelining of the outer loop, you could be loading and comparing to get the mask for the next loop before doing the current inner loop, so that work can be in flight when the loop branch in this loop-over-mask-bits mispredicts on loop exit. And you can combine masks into a 64-bit integer so you stay in that inner loop longer. So you might have one mispredict per 64 input elements, however many output elements that is. And consistent patterns might make that predictable.
  • Denis Yaroshevskiy
    Denis Yaroshevskiy about 4 years
    @PeterCordes - not sure if I follow everything. 1) That's an interesting guess; I didn't realise that was a thing. Seems like clang generates movzx [byte]: gcc.godbolt.org/z/xzy_-v 2) From what I heard, the LSD should kick in after 64 iterations, and after that the alignment should not matter. However, I haven't seen it happen (not for remove, nor for other algorithms). std::remove is quite small: github.com/llvm/llvm-project/blob/… Do you maybe know why alignment is still this much of an issue?
  • Denis Yaroshevskiy
    Denis Yaroshevskiy about 4 years
    3) Just clarifying what you mean: mostly zeroes, one check for multiple elements, and then bit operations to compute the offsets of the elements to store? Nit.
  • Peter Cordes
    Peter Cordes about 4 years
    The LSD can only hold 64 uops; the IDQ that feeds the issue stage can lock down those uops when it notices a loop, turning the IDQ into the loop buffer. If you look at the lsd.uops perf counter, out of the total uops_issued.any, you can see how many uops came from the loop buffer.
  • Peter Cordes
    Peter Cordes about 4 years
    3) yeah, for a case where most elements get removed, keeping only a few, I guess you'd invert the mask so the elements you wanted to keep were the 1 bits. And yeah, then you iterate mask &= mask-1 (BLSR) to loop over just the set bits. With BMI1 that has single-cycle latency as a loop-carried dependency. In each iteration, you do *dst++ = srcptr[tzcnt(mask)];, where srcptr is the start of the 64-element chunk that mask was derived from. So the scalar work is BLSR / jnz (loop-carried), plus the non-loop-carried work: TZCNT, a mov load with scaled-index addressing, a mov store, and dst++.
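That inner loop might look roughly like this (a sketch under the assumption of a 64-bit combined mask; the function name is mine):

    #include <immintrin.h>
    #include <stdint.h>

    // srcptr points at the start of the 64-element chunk the mask was built from.
    static inline float *copy_set_bits(const float *srcptr, float *dst, uint64_t mask)
    {
        while (mask) {
            unsigned idx = (unsigned)_tzcnt_u64(mask);   // index of the lowest set bit
            *dst++ = srcptr[idx];                        // copy that element
            mask = _blsr_u64(mask);                      // BLSR: clear the lowest set bit
        }
        return dst;
    }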
  • Denis Yaroshevskiy
    Denis Yaroshevskiy about 4 years
    @PeterCordes on code alignment: I ran perf for best/worst cases for remove/int/10'000. Both show 0 for LSD.UOPS. They show though a difference in mispredicted branches: 0.13% for the best case vs 1.08% for the worst case. If you can have a look, here is the output from perf (using it for the second time ever - most likely I made a mistake): gist.github.com/DenisYaroshevskiy/…
  • Peter Cordes
    Peter Cordes about 4 years
    Exactly zero lsd.uops for the whole program for all cases sounds like it hasn't actually been re-enabled in your Coffee Lake CPU, contrary to what en.wikichip.org/wiki/intel/microarchitectures/… says :( You're using perf stat -e correctly. BTW, you can set /proc/sys/kernel/perf_event_paranoid to 0 so you don't need sudo for perf to work. (Or some higher level to still stop kernel profiling but allow user-space.)
  • Peter Cordes
    Peter Cordes about 4 years
    Even without the LSD, the uop cache can often keep up with the front-end. Although with up-to-date microcode to fix the recent JCC erratum, you might see more impact for code alignment than normal if GCC isn't aligning JCC instructions to avoid legacy decode for a given 32-byte block. Hmm, I wonder if maybe Intel disabled the LSD on Coffee Lake because of that, if they had it working previously. 32-byte aligned routine does not fit the uops cache / phoronix.com/…
  • Dave Dopson
    Dave Dopson over 3 years
    You mention "I think clang failed to account for popcnt's dependency on its output register" ... yes, that's a known bug that should be fixed in later versions of clang. I've hit this before. For the curious, the false dependency exists because for instructions like TZCNT, if the input register is all 0's then the output register is unchanged (and thus depends on the original value of the output register). POPCNT uses the same chip circuitry and inherits the limitation. Chalk it up to poor ISA design on Intel's part.
  • Peter Cordes
    Peter Cordes over 3 years
    @DaveDopson: Yup, updated; more recent gcc and clang know about it since Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations with _mm_popcnt_u64 on Intel CPUs discovered it. Why does breaking the "output dependency" of LZCNT matter? has more / better detail. It's not bad ISA design, just a microarchitectural performance "bug" which got fixed for lzcnt/tzcnt in Skylake, and for popcnt in Cannon/Ice Lake. In fact providing tzcnt/lzcnt to make it possible to avoid bsf/bsr false deps is good.
  • Peter Cordes
    Peter Cordes over 3 years
    But thanks for the reminder, updated my answer to mention newer compilers and link the canonical Q&As about it.
  • Peter Cordes
    Peter Cordes almost 3 years
    Is that faster than a lookup table of shuffle control vectors for _mm_shuffle_epi8? Like __m128i shuffles[16] = ... which you index with the _mm_movemask_ps result? If you're only doing 4 elements per vector, the lookup table is small enough to be usable and fast. I guess maybe if you only have to do this a couple of times, not in a long-running loop, then spending 9 instructions per vector (with 3 of them being blendv, which is multi-uop on Intel) might be ok to avoid the possibility of a cache miss on the LUT.
  • Peter Cordes
    Peter Cordes almost 3 years
    Can the _mm256_permute4x64_pd(v, 0xF9) shuffles be replaced with different shuffles of val to shorten the dependency chain a bit, making it easier for out-of-order exec to hide the latency? Or do they all need to shuffle the previous blend result?
  • icdae
    icdae almost 3 years
    I tested with a LUT, similar to Z boson's reply but with _mm_shuffle_epi8, and yes, it is significantly faster (at least in my current usage; always profile for your specific case). There will be no out-of-order execution across the final three permutations, since each result relies on the previous instruction. I'm certain there should be a way to avoid, or at least reduce, the dependency chain. If I find one then I'll definitely post it.
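For reference, a minimal sketch of the kind of 16-entry shuffle-control LUT discussed above (building the table at startup is just a compact way to write it out; all names here are hypothetical):

    #include <immintrin.h>
    #include <stdint.h>

    static __m128i pack_lut[16];   // one pshufb control vector per 4-bit mask

    static void init_pack_lut(void)
    {
        for (int m = 0; m < 16; m++) {
            uint8_t ctrl[16];
            int out = 0;
            for (int e = 0; e < 4; e++)              // kept 32-bit elements go first
                if (m & (1 << e))
                    for (int b = 0; b < 4; b++)
                        ctrl[out++] = (uint8_t)(4*e + b);
            while (out < 16) ctrl[out++] = 0x80;     // zero the trailing lanes
            pack_lut[m] = _mm_loadu_si128((const __m128i *)ctrl);
        }
    }

    static inline __m128 left_pack4(__m128 v, __m128 cmp)
    {
        int m = _mm_movemask_ps(cmp);                // 4-bit mask from the compare result
        return _mm_castsi128_ps(
            _mm_shuffle_epi8(_mm_castps_si128(v), pack_lut[m]));
    }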