Fastest way to do horizontal SSE vector sum (or other reduction)
Solution 1
In general for any kind of vector horizontal reduction, extract / shuffle high half to line up with low, then vertical add (or min/max/or/and/xor/multiply/whatever); repeat until a there's just a single element (with high garbage in the rest of the vector).
If you start with vectors wider than 128-bit, narrow in half until you get to 128 (then you can use one of the functions in this answer on that vector). But if you need the result broadcast to all elements at the end, then you can consider doing full-width shuffles all the way.
Related Q&As for wider vectors, and integers, and FP
-
__m128
and__m128d
This answer (see below) -
__m256d
with perf analysis for Ryzen 1 vs. Intel (showing whyvextractf128
is vastly better thanvperm2f128
) Get sum of values stored in __m256d with SSE/AVX -
Intel AVX: 256-bits version of dot product for double precision floating point variables of single vectors.
-
Dot product of arrays (not just a single vector of 3 or 4 elements): do vertical mul/add or FMA into multiple accumulators, and hsum at the end. Complete AVX+FMA array dot-product example, including an efficient hsum after the loop. (For the simple sum or other reduction of an array, use that pattern but without the multiply part, e.g. add instead of fma). Do not do the horizontal work separately for each SIMD vector; do it once at the end.
How to count character occurrences using SIMD as an integer example of counting
_mm256_cmpeq_epi8
matches, again over a whole array, only hsumming at the end. (Worth special mention for doing some 8-bit accumulation then widening 8 -> 64-bit to avoid overflow without doing a full hsum at that point.)
Integer
-
__m128i
32-bit elements: this answer (see below). 64-bit elements should be obvious: only one pshufd/paddq step. -
__m128i
8-bit unsigneduint8_t
elements without wrapping/overflow:psadbw
against_mm_setzero_si128()
, then hsum the two qword halves (or 4 or 8 for wider vectors). Fastest way to horizontally sum SSE unsigned byte vector shows 128-bit with SSE2. Summing 8-bit integers in __m512i with AVX intrinsics has an AVX512 example. How to count character occurrences using SIMD has an AVX2__m256i
example.(For
int8_t
signed bytes you can XOR set1_epi8(0x80) to flip to unsigned before SAD, then subtract the bias from the final hsum; see details here, also showing an optimization for doing only 9 bytes from memory instead of 16). -
16-bit unsigned:
_mm_madd_epi16
with set1_epi16(1) is a single-uop widening horizontal add: SIMD: Accumulate Adjacent Pairs. Then proceed with a 32-bit hsum. -
__m256i
and__m512i
with 32-bit elements. Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2. For AVX512, Intel added a bunch of "reduce" inline functions (not hardware instructions) that do this for you, like_mm512_reduce_add_ps
(and pd, epi32, and epi64). Also reduce_min/max/mul/and/or. Doing it manually leads to basically the same asm. -
horizontal max (instead of add): Getting max value in a __m128i vector with SSE?
Main answer to this question: mostly float and __m128
Here are some versions tuned based on Agner Fog's microarch guide's microarch guide and instruction tables. See also the x86 tag wiki. They should be efficient on any CPU, with no major bottlenecks. (e.g. I avoided things that would help one uarch a bit but be slow on another uarch). Code-size is also minimized.
The common SSE3 / SSSE3 2x hadd
idiom is only good for code-size, not speed on any existing CPUs. There are use-cases for it (like transpose and add, see below), but a single vector isn't one of them.
I've also included an AVX version. Any kind of horizontal reduction with AVX / AVX2 should start with a vextractf128
and a "vertical" operation to reduce down to one XMM (__m128
) vector. In general for wide vectors, your best bet is to narrow in half repeatedly until you're down to a 128-bit vector, regardless of element type. (Except for 8-bit integer, then vpsadbw
as a first step if you want to hsum without overflow to wider elements.)
See the asm output from all this code on the Godbolt Compiler Explorer. See also my improvements to Agner Fog's C++ Vector Class Library horizontal_add
functions. (message board thread, and code on github). I used CPP macros to select optimal shuffles for code-size for SSE2, SSE4, and AVX, and for avoiding movdqa
when AVX isn't available.
There are tradeoffs to consider:
- code size: smaller is better for L1 I-cache reasons, and for code fetch from disk (smaller binaries). Total binary size mostly matters for compiler decisions made repeatedly all over a program. If you're bothering to hand-code something with intrinsics, it's worth spending a few code bytes if it gives any speedup for the whole program (be careful of microbenchmarks that make unrolling look good).
- uop-cache size: Often more precious than L1 I$. 4 single-uop instructions can take less space than 2
haddps
, so this is highly relevant here. - latency: Sometimes relevant
- throughput (back-end ports): usually irrelevant, horizontal sums shouldn't be the only thing in an innermost loop. Port pressure matters only as part of the whole loop that contains this.
- throughput (total front-end fused-domain uops): If surrounding code doesn't bottleneck on the same port that the hsum uses, this is a proxy for the impact of the hsum on the throughput of the whole thing.
When a horizontal add is infrequent:
CPUs with no uop-cache might favour 2x haddps
if it's very rarely used: It's slowish when it does run, but that's not often. Being only 2 instructions minimizes the impact on the surrounding code (I$ size).
CPUs with a uop-cache will probably favour something that takes fewer uops, even if it's more instructions / more x86 code-size. Total uops cache-lines used is what we want to minimize, which isn't as simple as minimizing total uops (taken branches and 32B boundaries always start a new uop cache line).
Anyway, with that said, horizontal sums come up a lot, so here's my attempt at carefully crafting some versions that compile nicely. Not benchmarked on any real hardware, or even carefully tested. There might be bugs in the shuffle constants or something.
If you're making a fallback / baseline version of your code, remember that only old CPUs will run it; newer CPUs will run your AVX version, or SSE4.1 or whatever.
Old CPUs like K8, and Core2(merom) and earlier only have 64bit shuffle units. Core2 has 128bit execution units for most instructions, but not for shuffles. (Pentium M and K8 handle all 128b vector instructions as two 64bit halves).
Shuffles like movhlps
that move data in 64-bit chunks (no shuffling within 64-bit halves) are fast, too.
Related: shuffles on new CPUs, and tricks for avoiding 1/clock shuffle throughput bottleneck on Haswell and later: Do 128bit cross lane operations in AVX512 give better performance?
On old CPUs with slow shuffles:
-
movhlps
(Merom: 1uop) is significantly faster thanshufps
(Merom: 3uops). On Pentium-M, cheaper thanmovaps
. Also, it runs in the FP domain on Core2, avoiding the bypass delays from other shuffles. -
unpcklpd
is faster thanunpcklps
. -
pshufd
is slow,pshuflw
/pshufhw
are fast (because they only shuffle a 64bit half) -
pshufb mm0
(MMX) is fast,pshufb xmm0
is slow. -
haddps
is very slow (6uops on Merom and Pentium M) -
movshdup
(Merom: 1uop) is interesting: It's the only 1uop insn that shuffles within 64b elements.
shufps
on Core2(including Penryn) brings data into the integer domain, causing a bypass delay to get it back to the FP execution units for addps
, but movhlps
is entirely in the FP domain. shufpd
also runs in the float domain.
movshdup
runs in the integer domain, but is only one uop.
AMD K10, Intel Core2(Penryn/Wolfdale), and all later CPUs, run all xmm shuffles as a single uop. (But note the bypass delay with shufps
on Penryn, avoided with movhlps
)
Without AVX, avoiding wasted movaps
/movdqa
instructions requires careful choice of shuffles. Only a few shuffles work as a copy-and-shuffle, rather than modifying the destination. Shuffles that combine data from two inputs (like unpck*
or movhlps
) can be used with a tmp variable that's no longer needed instead of _mm_movehl_ps(same,same)
.
Some of these can be made faster (save a MOVAPS) but uglier / less "clean" by taking a dummy arg for use as a destination for an initial shuffle. For example:
// Use dummy = a recently-dead variable that vec depends on,
// so it doesn't introduce a false dependency,
// and the compiler probably still has it in a register
__m128d highhalf_pd(__m128d dummy, __m128d vec) {
#ifdef __AVX__
// With 3-operand AVX instructions, don't create an extra dependency on something we don't need anymore.
(void)dummy;
return _mm_unpackhi_pd(vec, vec);
#else
// Without AVX, we can save a MOVAPS with MOVHLPS into a dead register
__m128 tmp = _mm_castpd_ps(dummy);
__m128d high = _mm_castps_pd(_mm_movehl_ps(tmp, _mm_castpd_ps(vec)));
return high;
#endif
}
SSE1 (aka SSE):
float hsum_ps_sse1(__m128 v) { // v = [ D C | B A ]
__m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); // [ C D | A B ]
__m128 sums = _mm_add_ps(v, shuf); // sums = [ D+C C+D | B+A A+B ]
shuf = _mm_movehl_ps(shuf, sums); // [ C D | D+C C+D ] // let the compiler avoid a mov by reusing shuf
sums = _mm_add_ss(sums, shuf);
return _mm_cvtss_f32(sums);
}
# gcc 5.3 -O3: looks optimal
movaps xmm1, xmm0 # I think one movaps is unavoidable, unless we have a 2nd register with known-safe floats in the upper 2 elements
shufps xmm1, xmm0, 177
addps xmm0, xmm1
movhlps xmm1, xmm0 # note the reuse of shuf, avoiding a movaps
addss xmm0, xmm1
# clang 3.7.1 -O3:
movaps xmm1, xmm0
shufps xmm1, xmm1, 177
addps xmm1, xmm0
movaps xmm0, xmm1
shufpd xmm0, xmm0, 1
addss xmm0, xmm1
I reported a clang bug about pessimizing the shuffles. It has its own internal representation for shuffling, and turns that back into shuffles. gcc more often uses the instructions that directly match the intrinsic you used.
Often clang does better than gcc, in code where the instruction choice isn't hand-tuned, or constant-propagation can simplify things even when the intrinsics are optimal for the non-constant case. Overall it's a good thing that compilers work like a proper compiler for intrinsics, not just an assembler. Compilers can often generate good asm from scalar C that doesn't even try to work the way good asm would. Eventually compilers will treat intrinsics as just another C operator as input for the optimizer.
SSE3
float hsum_ps_sse3(__m128 v) {
__m128 shuf = _mm_movehdup_ps(v); // broadcast elements 3,1 to 2,0
__m128 sums = _mm_add_ps(v, shuf);
shuf = _mm_movehl_ps(shuf, sums); // high half -> low half
sums = _mm_add_ss(sums, shuf);
return _mm_cvtss_f32(sums);
}
# gcc 5.3 -O3: perfectly optimal code
movshdup xmm1, xmm0
addps xmm0, xmm1
movhlps xmm1, xmm0
addss xmm0, xmm1
This has several advantages:
-
doesn't require any
movaps
copies to work around destructive shuffles (without AVX):movshdup xmm1, xmm2
's destination is write-only, so it createstmp
out of a dead register for us. This is also why I usedmovehl_ps(tmp, sums)
instead ofmovehl_ps(sums, sums)
. -
small code-size. The shuffling instructions are small:
movhlps
is 3 bytes,movshdup
is 4 bytes (same asshufps
). No immediate byte is required, so with AVX,vshufps
is 5 bytes butvmovhlps
andvmovshdup
are both 4.
I could save another byte with addps
instead of addss
. Since this won't be used inside inner loops, the extra energy to switch the extra transistors is probably negligible. FP exceptions from the upper 3 elements aren't a risk, because all elements hold valid FP data. However, clang/LLVM actually "understands" vector shuffles, and emits better code if it knows that only the low element matters.
Like the SSE1 version, adding the odd elements to themselves may cause FP exceptions (like overflow) that wouldn't happen otherwise, but this shouldn't be a problem. Denormals are slow, but IIRC producing a +Inf result isn't on most uarches.
SSE3 optimizing for code-size
If code-size is your major concern, two haddps
(_mm_hadd_ps
) instructions will do the trick (Paul R's answer). This is also the easiest to type and remember. It is not fast, though. Even Intel Skylake still decodes each haddps
to 3 uops, with 6 cycle latency. So even though it saves machine-code bytes (L1 I-cache), it takes up more space in the more-valuable uop-cache. Real use-cases for haddps
: a transpose-and-sum problem, or doing some scaling at an intermediate step in this SSE atoi()
implementation.
AVX:
This version saves a code byte vs. Marat's answer to the AVX question.
#ifdef __AVX__
float hsum256_ps_avx(__m256 v) {
__m128 vlow = _mm256_castps256_ps128(v);
__m128 vhigh = _mm256_extractf128_ps(v, 1); // high 128
vlow = _mm_add_ps(vlow, vhigh); // add the low 128
return hsum_ps_sse3(vlow); // and inline the sse3 version, which is optimal for AVX
// (no wasted instructions, and all of them are the 4B minimum)
}
#endif
vmovaps xmm1,xmm0 # huh, what the heck gcc? Just extract to xmm1
vextractf128 xmm0,ymm0,0x1
vaddps xmm0,xmm1,xmm0
vmovshdup xmm1,xmm0
vaddps xmm0,xmm1,xmm0
vmovhlps xmm1,xmm1,xmm0
vaddss xmm0,xmm0,xmm1
vzeroupper
ret
Double-precision:
double hsum_pd_sse2(__m128d vd) { // v = [ B | A ]
__m128 undef = _mm_undefined_ps(); // don't worry, we only use addSD, never touching the garbage bits with an FP add
__m128 shuftmp= _mm_movehl_ps(undef, _mm_castpd_ps(vd)); // there is no movhlpd
__m128d shuf = _mm_castps_pd(shuftmp);
return _mm_cvtsd_f64(_mm_add_sd(vd, shuf));
}
# gcc 5.3.0 -O3
pxor xmm1, xmm1 # hopefully when inlined, gcc could pick a register it knew wouldn't cause a false dep problem, and avoid the zeroing
movhlps xmm1, xmm0
addsd xmm0, xmm1
# clang 3.7.1 -O3 again doesn't use movhlps:
xorpd xmm2, xmm2 # with #define _mm_undefined_ps _mm_setzero_ps
movapd xmm1, xmm0
unpckhpd xmm1, xmm2
addsd xmm1, xmm0
movapd xmm0, xmm1 # another clang bug: wrong choice of operand order
// This doesn't compile the way it's written
double hsum_pd_scalar_sse2(__m128d vd) {
double tmp;
_mm_storeh_pd(&tmp, vd); // store the high half
double lo = _mm_cvtsd_f64(vd); // cast the low half
return lo+tmp;
}
# gcc 5.3 -O3
haddpd xmm0, xmm0 # Lower latency but less throughput than storing to memory
# ICC13
movhpd QWORD PTR [-8+rsp], xmm0 # only needs the store port, not the shuffle unit
addsd xmm0, QWORD PTR [-8+rsp]
Storing to memory and back avoids an ALU uop. That's good if shuffle port pressure, or ALU uops in general, are a bottleneck. (Note that it doesn't need to sub rsp, 8
or anything because the x86-64 SysV ABI provides a red-zone that signal handlers won't step on.)
Some people store to an array and sum all the elements, but compilers usually don't realize that the low element of the array is still there in a register from before the store.
Integer:
pshufd
is a convenient copy-and-shuffle. Bit and byte shifts are unfortunately in-place, and punpckhqdq
puts the high half of the destination in the low half of the result, opposite of the way movhlps
can extract the high half into a different register.
Using movhlps
for the first step might be good on some CPUs, but only if we have a scratch reg. pshufd
is a safe choice, and fast on everything after Merom.
int hsum_epi32_sse2(__m128i x) {
#ifdef __AVX__
__m128i hi64 = _mm_unpackhi_epi64(x, x); // 3-operand non-destructive AVX lets us save a byte without needing a mov
#else
__m128i hi64 = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2));
#endif
__m128i sum64 = _mm_add_epi32(hi64, x);
__m128i hi32 = _mm_shufflelo_epi16(sum64, _MM_SHUFFLE(1, 0, 3, 2)); // Swap the low two elements
__m128i sum32 = _mm_add_epi32(sum64, hi32);
return _mm_cvtsi128_si32(sum32); // SSE2 movd
//return _mm_extract_epi32(hl, 0); // SSE4, even though it compiles to movd instead of a literal pextrd r32,xmm,0
}
# gcc 5.3 -O3
pshufd xmm1,xmm0,0x4e
paddd xmm0,xmm1
pshuflw xmm1,xmm0,0x4e
paddd xmm0,xmm1
movd eax,xmm0
int hsum_epi32_ssse3_slow_smallcode(__m128i x){
x = _mm_hadd_epi32(x, x);
x = _mm_hadd_epi32(x, x);
return _mm_cvtsi128_si32(x);
}
On some CPUs, it's safe to use FP shuffles on integer data. I didn't do this, since on modern CPUs that will at most save 1 or 2 code bytes, with no speed gains (other than code size/alignment effects).
Solution 2
SSE2
All four:
const __m128 t = _mm_add_ps(v, _mm_movehl_ps(v, v));
const __m128 sum = _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));
r1+r2+r3:
const __m128 t1 = _mm_movehl_ps(v, v);
const __m128 t2 = _mm_add_ps(v, t1);
const __m128 sum = _mm_add_ss(t1, _mm_shuffle_ps(t2, t2, 1));
I've found these to be about same speed as double HADDPS
(but I haven't measured too closely).
Solution 3
You can do it in two HADDPS
instructions in SSE3:
v = _mm_hadd_ps(v, v);
v = _mm_hadd_ps(v, v);
This puts the sum in all elements.
Solution 4
I would definitely give SSE 4.2 a try. If you are doing this multiple times (I assume you are if performance is an issue), you can pre-load a register with (1,1,1,1), and then do several dot4(my_vec(s), one_vec) on it. Yes, it does a superfluous multiply, but those are fairly cheap these days and such an op is likely to be dominated by the horizontal dependencies, which may be more optimized in the new SSE dot product function. You should test to see if it outperforms the double horizontal add Paul R posted.
I also suggest comparing it to straight scalar (or scalar SSE) code - strangely enough it is often faster (usually because internally it is serialized but tightly pipelined using register bypass, where special horizontal instructions may not be fast pathed (yet)) unless you are running SIMT-like code, which it sounds like you are not (otherwise you would do four dot products).
Solution 5
Often the question of fastest possible way presupposes a task that needs to be done multiple times, in time critical loop.
Then it's possible, that the fastest method can be an iterative method working pairwise, which amortizes some of the work between iterations.
The total cost of reduction by splitting a vector to low/high parts is O(log2(N)), while the amortised cost by splitting a vector to even/odd sequences is O(1).
inline vec update(vec context, vec data) {
vec even = get_evens(context, data);
vec odd = get_odds(context, data);
return vertical_operation(even, odd);
}
void my_algo(vec *data, int N, vec_element_type *out) {
vec4 context{0,0,0,0};
context = update(context, data[0]);
int i;
for (int i = 0; i < N-1; i++) {
context = update(context, data[i+1]);
output[i] = extract_lane(context, 1);
}
context = update(context, anything);
output[N-1] = extract_lane(context, 1);
}
The wanted sum will be found from the second element (index 1) of the accumulator (after 1 iteration) while the first element will contain the total reduction of all elements so far.
Reduct = [ -- ][ -- ][ -- ][ -- ]
New input = [i0 ][ i1 ][ i2 ][ i3 ]
evens = [ -- ][ -- ][ i0 ][ i2 ]
odds = [ -- ][ -- ][ i1 ][ i3 ]
------- vertical arithmetic reduction ----
Reduct = [ -- ][ -- ][ 01 ][ 23 ]
input = [ 4 ][ 5 ][ 6 ][ 7 ]
evens = [ -- ][ 01 ][ 4 ][ 6 ]
odds = [ -- ][ 23 ][ 5 ][ 7 ]
Reduct = [ -- ][ 0123 ][ 45 ][ 67 ]
New input: [ 8 ] [ 9 ] [ a ] [ b ]
evens = [ -- ][ 45 ][ 8 ][ a ]
odds = [0123][ 67 ][ 9 ][ b ]
------------------------------
Reduct = [0123][4567][ 89 ][ ab ]
I have doubts, if this would prove to be faster for a vector length of 3 or 4 than presented by Mr Cordes, however for 16 or 8 bit data this method should prove to be worthwhile. Then of course one needs to perform 3 or 4 rounds respectively before the result can be acquired.
If the horizontal operation happens to be sum -- then one can actually use just a single hadd
per iteration.
FeepingCreature
Updated on January 10, 2022Comments
-
FeepingCreature over 2 years
Given a vector of three (or four) floats. What is the fastest way to sum them?
Is SSE (movaps, shuffle, add, movd) always faster than x87? Are the horizontal-add instructions in SSE3 worth it?
What's the cost to moving to the FPU, then faddp, faddp? What's the fastest specific instruction sequence?
"Try to arrange things so you can sum four vectors at a time" will not be accepted as an answer. :-) e.g. for summing an array, you can use multiple vector accumulators for vertical sums (to hide addps latency), and reduce down to one after the loop, but then you need to horizontally sum that last vector.
-
Paul R over 12 yearsIf horizontal adds are performance-critical for you then you may well be approaching SIMD coding in a less than optimal way - post some code that shows how and where you need to do this.
-
FeepingCreature over 12 yearsDot product for angles between vectors, mainly. Note the last sentence.
-
Paul R over 12 yearsI read the last sentence, but I still think there may be a better way.
-
FeepingCreature over 12 yearsI know there's a better way, and it's "execute loops four elements at a time so you can parallelize everything". The question is, what's the best we can do excluding that way (which is complicated and obfuscating)?
-
Paul R over 12 yearsThere may be more than one "better way" though - but if you don't post any code then it's hard to give specific help.
-
FeepingCreature over 12 years@PaulR let us continue this discussion in chat
-
Paul R over 12 yearsOK - I'll keep an eye on chat...
-
Stephen Canon over 12 yearsThere is no "fastest way ... on x86". Different x86 processors have different execution characteristics. What processor are you targeting? Is your "vector of three floats" in memory initially, or contiguously in an SSE register, or somewhere else?
-
-
Jens Björnhager over 12 yearsDoesn't the sum end up in all elements?
-
Paul R over 12 years@Jens: yes, thanks - I think you're right - I'll update my answer.
-
FeepingCreature over 12 yearsFor a 3-vector sum, I'd need to set the fourth component to zero first. What's the fastest way to do that? I'm tending towards "load mask, andps" - is there a fast way to mask out an element?
-
Paul R over 12 yearsI don't see any faster way than
ANDPS
, which is one instruction (the mask being constant of course). -
awdz9nld over 10 years@FeepingCreature
__m128 vector3 = _mm_castps_si128(_mm_castsi128_ps(_mm_srli_si128(vector4, 4)));
- this may be faster than masking depending on whether your mask is already loaded from memory -
Peter Cordes about 8 yearsEven in Skylake, one
dpps
is 4 uops, 13c latency. (But one per 1.5c throughput).haddps
is 3uops, 6c latency. (one per 2c throughput). Store and scalar is not too bad because it doesn't cost many uops, but it's pretty bad for latency compared to Kornel's answer. Scalar ops have the same latency as vector ops, though. Your "tightly pipelined using register bypass" speculation isn't correct. Everything except div is fully pipelined, but you're right that horizontal instructions aren't fast-pathed. They're decoded to internal shuffle uops. -
plasmacel over 7 yearsWith SSE2 the remaining
movaps
before theshufps
also can be eliminated if you usepshufd
by changing_mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
to_mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(v), _MM_SHUFFLE(2, 3, 0, 1)));
. However that maybe adds some latency. godbolt.org/g/0trqRY -
Peter Cordes over 7 years@plasmacel: on many CPUs, including Intel SnB-family, there's extra bypass-delay latency to forward the result of an FP instruction to an integer shuffle, and from PSHUFD to ADDPS. It's great if you care about throughput and uop count but not latency. (SHUFPS between integer instructions has no penalty on SnB-family (unlike Nehalem), but the reverse is not true.)
-
Peter Cordes over 7 years@plasmacel: keep in mind that these functions really need to inline to be useful. And yes, clang pessimizes the shuffles sometimes. That's especially bad for first-gen Core2 and other slow-shuffle CPUs where SHUFPS is far worse than MOVHLPS :( If you enable sse3 (godbolt.org/g/1qbNXw), though, it can use MOVSHDUP for Kornel's first shuffle, which is excellent. Anyway, if you are using clang, use whatever happens to coax clang into making nice asm after inlining. You could even write a version which takes a dummy arg to use as a target for movhlps (when AVX isn't available).
-
Peter Cordes over 7 yearsIf you have a specific microarchitecture and compiler in mind, you can and should make a version that's more optimal for that. This answer tries to be optimal (latency, throughput and code-size) for modern CPUs like Haswell, while sucking at little as possible on old CPUs. i.e. my SSE1 / SSE2 versions don't do anything that's worse on Haswell just to run faster on an old SlowShuffle CPU like Merom. For Merom, PSHUFD might be a win because it and SHUFPS both run in flt->int domain.
-
plasmacel over 7 yearsIs it a win to use
vpermilps
instead ofmovsldup
,movshdup
,movhlps
andmovlhps
when AVX is available? It is a win overshufps
and looks like clang also tries to emit it instead of the mentioned ones. -
Peter Cordes over 7 years@plasmacel: no, unless your vector was in memory to start with, since VPERMILPS can load+shuffle. You get smaller code-size from using the AVX versions of older instructions, because you don't need an immediate, and they only need the 2-byte VEX prefix (
C5 ..
instead ofC4 .. ..
). Two-source shuffles like VSHUFPS and VMOVHLPS aren't any slower than one-source shuffles like VPSHUFD or VPERMILPS. If there's a difference in energy consumption, it's probably negligible. -
Peter Cordes over 7 years@plasmacel: As my answer points out, my SSE3 version compiles optimally with AVX, but clang pessimises it to VPERMILPD: godbolt.org/g/ZH88wH. gcc's version is four 4B instructions (not counting the RET). clang's version is 2 bytes longer, and the same speed. What makes you think VPERMILPS is a win over SHUFPS? AFAIK, clang is wrong to favour it for immediate shuffles where the source is already in a register. Agner Fog's tables show no difference. It's useful for load+shuffle, and for variable-shuffles, and maybe easier for compilers since it's a 1-input instruction, but not faster
-
Peter Cordes over 7 years@plasmacel: fun fact: on Knight's Landing (Xeon Phi = modified silvermont + AVX512), VPERMILPS (3c lat, 1c rtput) is more efficient than VSHUFPS (4c lat, 2c rtput), which does outweight the instruction-length difference for that architecture. I assume that's from being a 1-input shuffle vs 2-input. Agner Fog updated his stuff for KNL. :)
-
plasmacel over 7 yearsThanks for all the provided info here. It's time to detect the
__AVX512F__
macro. :) -
arrowd over 7 years@PeterCordes Thank you for great answer. Don't you have a typo in
SSE1 (aka SSE)
section at line_mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); // [ C D | B A ]
? I guess, you meant[ C D | A B ]
? -
Royi about 7 yearsHi, How does it compare to @Peter Cordes SSE3 Solution? Thank You.
-
Royi about 7 years@PeterCordes, How does your SSE3 solution compares to @PaulR solution -
v = _mm_hadd_ps(v, v); v = _mm_hadd_ps(v, v);
? Thank You. -
Paul R about 7 years@Royi: see Peter's comments in his answer, under the heading "SSE3 optimizing for code-size".
-
Peter Cordes about 7 years@Royi: There are already a couple sections in my answer that discuss the fact that _mm_hadd_ps is slow.
-
jww over 6 years@PeterCordes - Out of curiosity, have you written any books on x86, assembly and intrinsics. I've been looking for a good book (with recipes) for several years now. The intrinsics are important because they are cross-platform. They work on Clang, GCC, MSVC, SunCC, etc. We can write them once and they run everywhere (unlike ASM for GNU's GAS).
-
Peter Cordes over 6 years@jww: No. I wouldn't want to set anything in stone that I couldn't come back and edit if/when I realize my advice wasn't optimal after all. I did get an email once asking me if I wanted to be part of writing an asm book, but I never got back to them >.< Anyway, collecting up links to the more useful SO answers with "recipes" that I and others have written would be a good project, if I ever got around to it.
-
Nawaz almost 4 years@PeterCordes.. Wow this answer. Will spend my weekend to fully understand it. I'm trying to compute dot product and compiling with clang 10, on my Mac Mojave, and I've used
-O3 -march=native
but it does not generate instructions likevmulpd
,vaddpd
like this: godbolt.org/z/5j4bPq, instead it generatesmulpd
andaddpd
. Any idea/recommendation how I generate the former? I assumevmulpd
is a faster instruction thanmulpd
? -
Peter Cordes almost 4 years@Nawaz: no, it's just the VEX encoding of the same instruction, requiring AVX. If
-march=native
generates the old SSE encoding, then your CPU doesn't support AVX. Godbolt runs on Skylake-avx512 servers so-march=native
there is-march=skylake-avx512
. -
Nawaz almost 4 yearsHmm. Seems like
vmulpd
is not faster; it's just a different variant ofmulpd
which stores the result in one of the operand itself andvmulpd
stores in different register. Somulpd
is likea *= b
andvmulpd
is likec = a * b
.. please let me know if my understanding is right? felixcloutier.com/x86/mulpd -
Peter Cordes almost 4 years@Nawaz: That's literally what I just said, same instruction. And yes, the VEX encoding just adds a non-destructive destination. Except
mulpd
of course is*
not+
. And yes, Intel's asm manuals are pretty clear. See also uops.info/table.html for performance info, and agner.org/optimize to understand what the numbers mean. (more links in stackoverflow.com/tags/x86/info) -
Nawaz almost 4 yearsOops fixed that
-
Nawaz almost 4 yearsthen your CPU doesn't support AVX. ..
sysctl -a | grep machdep.cpu.features | rg -i avx
lists this:FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
... which hasAVX1.0
.. That means my CPU supports AVX? -
Peter Cordes almost 4 years@Nawaz: then
-march=native
should be using VEX versions of SSE instructions. GCC and clang both work that way. Post a question (not comments) if it's still happening after you double check that you're actually passing that compiler option correctly, and you made sure you're looking at the correct output file, etc. i.e. that it's not just a problem in your build script. -
Nawaz almost 4 yearsah. You're right. I was looking at an older generated
.s
file. Haha. Now it does generatevmulpd
instructions. However, it seems to be slow (or maybe my machine right now is loaded too much). Do you thinkvmulpd
in general is faster thanmulpd
? Or it depends and cannot be said without looking at the code? I'll post a question if I face any specific issue though. -
Peter Cordes almost 4 years@Nawaz: It's generally equal, except when it defeats micro-fusion of an indexed addressing mode on Haswell and later (Micro fusion and addressing modes). But you only have AVX1, not AVX2, so it's pre-Haswell and
mulpd
with an indexed addressing mode would unlaminate as well. (TL:DR: it depends). If it's all compiler-generated, no hand-written asm, you at least don't have to worry about AVX / SSE transition stalls. (Why is this SSE code 6 times slower without VZEROUPPER on Skylake?)