AVX 512 vs AVX2 performance for simple array processing loops


Solution 1

This seems too broad, but there are actually some microarchitectural details worth mentioning.

Note that AVX512-VL (Vector Length) lets you use new AVX512 instructions (like packed uint64_t <-> double conversion, mask registers, etc.) on 128- and 256-bit vectors. Modern compilers typically auto-vectorize with 256-bit vectors when tuning for Skylake-AVX512, aka Skylake-X (e.g. gcc -march=native or gcc -march=skylake-avx512), unless you override the tuning options to set the preferred vector width to 512 for code where the tradeoffs are worth it. See @zam's answer.
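
For illustration, here is a minimal sketch (my own example, not code from the question or answers) of AVX512VL instructions operating on 256-bit ymm vectors: a packed uint64_t -> double conversion and a masked add, neither of which is available in plain AVX2:

```cpp
// Sketch: AVX512VL lets 256-bit (ymm) code use new AVX512 instructions.
// Needs AVX512F + AVX512VL + AVX512DQ, e.g. g++ -O3 -march=skylake-avx512
#include <immintrin.h>

// Convert 4 packed uint64_t to 4 doubles (vcvtuqq2pd ymm); AVX2 has no
// single instruction for this.
__m256d u64_to_double(__m256i v) {
    return _mm256_cvtepu64_pd(v);
}

// Masked add on 256-bit vectors using a k mask register: lanes where the
// mask bit is 0 keep the corresponding element of src.
__m256 masked_add(__m256 src, __mmask8 k, __m256 a, __m256 b) {
    return _mm256_mask_add_ps(src, k, a, b);
}
```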


Some major things that apply with 512-bit vectors (as opposed to 256-bit vectors using AVX512 instructions, like vpxord ymm30, ymm29, ymm10) on Skylake-X are:

  • Aligning your data to the vector width matters more than with AVX2: with 64-byte vectors, every unaligned load crosses a cache-line boundary, instead of every other one while looping over an array. In practice it makes a bigger difference. I don't remember the exact results of something I tested a while ago, but maybe a 20% slowdown from misalignment vs. under 5% with 256-bit vectors.

  • Running 512-bit uops shuts down the vector ALU on port 1. (But not the integer execution units on port 1). Some Skylake-X CPUs (e.g. Xeon Bronze) only have 1 per clock 512-bit FMA throughput, but i7 / i9 Skylake-X CPUs, and the higher-end Xeons, have an extra 512-bit FMA unit on port 5 that powers up for AVX512 "mode".

    So plan accordingly: you won't get double speed from widening to AVX512, and the bottleneck in your code might now be in the back-end.

  • Running 512-bit uops also limits your max Turbo, so wall-clock speedups can be lower than core-clock-cycle speedups. There are two levels of Turbo reduction: any 512-bit operation at all, and then heavy 512-bit, like sustained FMAs.

  • The FP divide execution unit for vsqrtps/pd zmm and vdivps/pd zmm is not full width; it's only 128-bit wide, so the ratio of div/sqrt vs. multiply throughput is worse by about another factor of 2. See Floating point division vs floating point multiplication. SKX throughput for vsqrtps xmm/ymm/zmm is one per 3/6/12 cycles. Double precision has the same ratios, but with worse throughput and latency.

    Up to 256-bit YMM vectors, the latency is the same as XMM (12 cycles for sqrt), but for 512-bit ZMM the latency goes up to 20 cycles, and it takes 3 uops. (https://agner.org/optimize/ for instruction tables.)

    If you bottleneck on the divider and can't get more other instructions into the mix, VRSQRT14PS is worth considering, even if you need a Newton iteration to get enough precision. (Note that AVX512's approximate 1/sqrt(x) does have more guaranteed-accuracy bits than the AVX/SSE version.) See the intrinsics sketch right after this list.
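
Here is a minimal sketch of that approximate-rsqrt idea (my own illustration; the helper name and the single Newton-Raphson refinement step are assumptions, not part of the original answer):

```cpp
// Sketch: approximate 1/sqrt(x) with VRSQRT14PS plus one Newton-Raphson
// step, instead of queueing up on the not-full-width divide/sqrt unit.
// Build e.g. with: g++ -O3 -march=skylake-avx512 -mprefer-vector-width=512
#include <immintrin.h>

static inline __m512 rsqrt_nr(__m512 x) {
    __m512 y = _mm512_rsqrt14_ps(x);   // ~14 bits of relative accuracy
    // Newton-Raphson step: y = y * 0.5f * (3.0f - x*y*y), roughly
    // doubling the number of correct bits.
    const __m512 half  = _mm512_set1_ps(0.5f);
    const __m512 three = _mm512_set1_ps(3.0f);
    __m512 xyy = _mm512_mul_ps(_mm512_mul_ps(x, y), y);
    return _mm512_mul_ps(_mm512_mul_ps(half, y),
                         _mm512_sub_ps(three, xyy));
}
```

One step from the 14-bit estimate typically gets close to full float precision; whether that's enough depends on your accuracy requirements.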


As far as auto-vectorization goes, if any shuffles are required, compilers might do a worse job with wider vectors. For simple purely-vertical stuff, compilers can do OK with AVX512.

Your previous question had a sin function, and if the compiler / SIMD math library only has a 256-bit version of that, it may not auto-vectorize with AVX512.

If AVX512 doesn't help, maybe you're bottlenecked on memory bandwidth. Profile with performance counters and find out. Or try more repeats of smaller buffer sizes and see if it speeds up significantly when your data is hot in cache. If so, try to cache-block your code, or increase computational intensity by doing more in one pass over the data.

AVX512 does double the theoretical max FMA throughput on an i9 (and integer multiply, and many other things that run on the same execution unit), making the mismatch between DRAM and the execution units twice as big. So there's twice as much to gain from making better use of L2 / L1d cache.

Working with data while it's already loaded in registers is good.
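
As a sketch of what "doing more in one pass" means (hypothetical helper names, scalar source for clarity; the auto-vectorizer takes care of the SIMD): fusing two simple DSP-style operations into one loop means each element is loaded and stored once, and the second operation works on a value that is already in a register:

```cpp
#include <cstddef>

// Two passes: the array is streamed through the cache hierarchy twice.
void scale_then_offset(float *a, std::size_t n, float s, float o) {
    for (std::size_t i = 0; i < n; ++i) a[i] *= s;
    for (std::size_t i = 0; i < n; ++i) a[i] += o;
}

// One fused pass: each element is loaded once, both operations happen while
// the value is live in a register, and it is stored once (one FMA per element).
void scale_and_offset(float *a, std::size_t n, float s, float o) {
    for (std::size_t i = 0; i < n; ++i) a[i] = a[i] * s + o;
}
```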

Solution 2

How did you compile your code (i.e. how did you enable AVX-512) in the case of ICL or GCC? There are two "operating modes" for AVX-512 code:

  1. With a fresh Intel Compiler (starting from 18.0 / 17.0.5), using [Qa]xCORE-AVX512 effectively gives you only AVX-512VL, which basically means the AVX-512 ISA but with 256-bit-wide operands. This also seems to be the default behavior for GCC.
  2. Otherwise, if (a) you use an older Intel Compiler, or (b) you use [Qa]xCOMMON-AVX512, or (c) you use the special new flag [Q/q]opt-zmm-usage=high, you'll get the full AVX-512 ISA with 512-bit-wide operands (the somewhat involved flag logic is described here). This mode can also be enabled with -mprefer-vector-width=512 in the case of GCC 8 or newer. See the sketch after this list for a quick way to check which mode you ended up in.
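
A quick way to check which mode you actually got (a sketch; the function is just a stand-in): compile a trivial loop with the flags above and look at whether the generated assembly uses ymm or zmm registers, e.g. on godbolt:

```cpp
// check_width.cpp -- compile to asm and inspect the loop body:
//   mode (1): icc -O3 -xCORE-AVX512 -S check_width.cpp
//             g++ -O3 -march=skylake-avx512 -S check_width.cpp
//             -> expect ymm registers (256-bit operands)
//   mode (2): icc -O3 -xCORE-AVX512 -qopt-zmm-usage=high -S check_width.cpp
//             g++ -O3 -march=skylake-avx512 -mprefer-vector-width=512 -S check_width.cpp
//             -> expect zmm registers (512-bit operands)
void axpy(float *x, const float *y, float a, int n) {
    for (int i = 0; i < n; ++i)
        x[i] = a * x[i] + y[i];
}
```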

If your code is "AVX512-friendly" (long sequences of well-vectorized code without scalar pieces "interrupting" the stream of vector instructions), mode (2) is far preferable, and you have to enable it explicitly (it is not the default).

Otherwise, if your code is not very AVX512-friendly (many non-vectorized pieces of code in between the vector code), then due to SKX "frequency throttling" AVX512VL can sometimes be more beneficial (at least until you vectorize more of the code), so you should make sure you are operating in mode (1). The landscape of frequencies vs. ISA is described, for example, in Dr. Lemire's blog posts (although the picture given there is a bit over-pessimistic compared to reality): https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/ and https://lemire.me/blog/2018/08/13/the-dangers-of-avx-512-throttling-myth-or-reality/


Comments

  • Vojtěch Melda Meluzín
    Vojtěch Melda Meluzín almost 2 years

    I'm currently working on some optimizations and comparing vectorization possibilities for DSP applications, that seem ideal for AVX512, since these are just simple uncorrelated array processing loops. But on a new i9 I didn't measure any reasonable improvements when using AVX512 compared to AVX2. Any pointers? Any good results? (btw. I tried MSVC/CLANG/ICL, no noticeable difference, many times AVX512 code actually seems slower)

  • Vojtěch Melda Meluzín
    Vojtěch Melda Meluzín over 5 years
    Thanks! I'm generally testing arrays of 8 to 1024 items (powers of 2), 64-byte aligned, so I'd think memory shouldn't be the problem. I'm also trying various functions containing multiplies, additions etc., just the stuff needed for DSP, so not only sines. I'll keep digging ;). I still have a bit of an issue when it comes to measuring the CPU cycles, seems varying way too much, but it's not that bad now.
  • Peter Cordes
    Peter Cordes over 5 years
    @VojtěchMeldaMeluzín: if that "etc." includes division, sqrt, or sin, that's definitely the most important part.
  • BeeOnRope
    BeeOnRope over 5 years
    About alignment, I think you can characterize it fairly exactly, at least for "throughput": a line-crossing load has a throughput of 1 per cycle, versus 2 per cycle for other loads. So if your algorithm is simple enough that it could achieve 2 loads per cycle, the 256-bit aligned case may be 1.5x faster than the misaligned case (64 bytes/cycle vs 42.6), whereas the 512-bit aligned case could be 2x as fast (128 bytes/cycle vs 64). At least if AVX-512 works the same way (I'll check). Well, only for L1-resident loads of course!
  • Peter Cordes
    Peter Cordes over 5 years
    @BeeOnRope: the extra latency for cache-line crossing reduces the ability of OoO exec to hide long dep chains by keeping multiple iterations in flight. It's a real effect that I've seen in code with a polynomial approximation to log(x) as part of asinh(x). You can get an effect even when you're nowhere near saturating L1d bandwidth.
  • BeeOnRope
    BeeOnRope over 5 years
    @PeterCordes good point, was the case you say with a "short" loop, or have you seen it also "steady state" with large loops? This was L1-resident? Once you are not L1-resident then it becomes less clear to me how it works (i.e., now you have to involve the "split line" fill buffers and I don't know how they related to normal fill buffers). I did go ahead and add AVX-512 tests for load/store L1D throughput, and AVX-512 behaves "as expected", results here.
  • Peter Cordes
    Peter Cordes over 5 years
    @BeeOnRope: This was maybe 20 to 30 uops or so, with fairly long non-loop-carried dep chains (like 30 cycles at least) but some ILP within each iteration. Input was maybe hot in L3, and not sure about the output, I forget. It's been over a year.
  • BeeOnRope
    BeeOnRope over 5 years
    @PeterCordes - makes sense. I realized I didn't know much about split-load latency so I added some tests. You can see the results on SKL and SKX in the same gist as before. It looks like 11 cycles latency for split loads in L1 on Intel. AMD on the other hand seems like just 1 cycle penalty (to 5 cycles total) in this scenario.
  • BeeOnRope
    BeeOnRope over 5 years
    For other cache levels there is still a large penalty, such as 22-24 total cycles for L2 (Intel), and for larger sizes that only fit in memory the penalty was more than 3x on SKX and also large on SKL! This is kind of weird since you'd expect both lines to be fetched in parallelism and for this to hide most of the penalty, but that seems not to be the case. On the AMD Zen the large region penalty is much less but the overall timing is about as slow as the split case on Intel so either Intel is really good at non-split loads or perhaps something is off with the tests.
  • BeeOnRope
    BeeOnRope over 5 years
    Anyways, I withdraw my "can characterize it fairly exactly" statement since it's mostly untrue except in special cases.
  • Peter Cordes
    Peter Cordes over 5 years
    @BeeOnRope: oh wow, I didn't know anything about AMD's split-load latency penalty. Interesting it's so low; I guess they just see it coming and do 2 loads instead of taking a slow path. There'd still be throughput consequences. And yeah, 11c total latency for a split load on Intel sounds about right from what I remember timing. Yup, that's exactly what I found, too: How can I accurately benchmark unaligned access speed on x86_64
  • BeeOnRope
    BeeOnRope over 5 years
    Yeah it seems like maybe Intel does the split loads back-to-back, at least to the L2, not in parallel but that AMD does them in parallel. AMD has a throughput penalty too, the same as Intel more-or-less for loads except that it applies at 32B boundaries not just 64B (cache line) boundaries. 256-bit loads are different of course since AMD splits 256-bit ops, but the rest is about the same (1 cycle per crossing loads). Stores are worse on AMD.
  • BeeOnRope
    BeeOnRope over 5 years
    FWIW the L1D split load latency penalty seems to have increased since Haswell, I get around 9 cycles there (actually measures as 9.36 consistently in this particular test which I can't really explain). This is consistent with various things having changed in the L1D load path in Skylake client, perhaps in preparation for 64-byte loads in SKX. I also see that in the split cases there are 2 p23 uops issued per load, so it's not just one uop that handles the whole thing: so maybe the scheduler sends the split load down as usual, and then the split load comes back and says ...
  • BeeOnRope
    BeeOnRope over 5 years
    ... now do the second half and that takes the other uop. Such a mechanism is already used for L1 misses: you see 2x uops for every L1-miss, L2-hit for example.
  • Peter Cordes
    Peter Cordes over 5 years
    Good point, updated my answer to clearly say that the microarchitectural effects only apply when using 512-bit vectors, not with EVEX-encoded AVX512 instructions in general (e.g. on 256-bit vectors).
  • Z boson
    Z boson about 5 years
    I found a more up to date link -xCore-AVX512 -qopt-zmm-usage=high colfaxresearch.com/skl-avx512
  • Z boson
    Z boson about 5 years
    From godbolt, if you look at the assembly ICC does not generate gather except with -qopt-zmm-usage=high. But GCC uses 256-bit gather except with -mprefer-vector-width=512 in which case it uses 512-bit gather. At least in this case ICC did not even use 256-bit AVX512 operations.
  • zam
    zam about 5 years
    The Colfax documentation given is probably obsolete (and likely produced before the ICC 18 release). GCC and ICC should be relatively consistent, and that's what you partially confirm. It is not 512 bits per se that drives the maximum-frequency behavior, so the low vs. high (and the inherited CORE vs. COMMON, etc.) differentiation is likely finer-grained, case by case, than simply 512 vs. 256.