error: inlining failed to call always_inline
GCC will only let you use intrinsics for instruction sets that are enabled for the compiler to use. e.g. a related question about an AVX1 intrinsic: inlining failed in call to always_inline '__m256d _mm256_broadcast_sd(const double*)'
These are _mask_ versions of 256-bit intrinsics; they require AVX512VL.
(My comments under the question about -mavx were wrong: I didn't notice the _mask in the name or args, just the _mm256.)
You're probably compiling on KNL (Knight's Landing / Xeon Phi) on your server, which has AVX512F but not AVX512VL. So -march=native will set -mavx512f but not -mavx512vl. (Unlike Skylake-AVX512, which does have AVX512VL, allowing use of cool new AVX512 stuff like masked instructions with narrower vectors.)
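One way to see what -march=native actually enabled is to look at the feature macros the compiler defines. The following is only a sketch: the function name enabled_simd_macros is made up for illustration, while the macros themselves (__AVX__, __AVX2__, __AVX512F__, __AVX512VL__) are the ones GCC and clang define for the corresponding target options.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from tensor.hpp): lists which of the feature
// macros discussed here were defined when this file was compiled.
// The result depends on the -march= / -mavx... options, not on the CPU
// the binary later runs on.
std::string enabled_simd_macros() {
    std::string out;
#ifdef __AVX__
    out += "AVX ";
#endif
#ifdef __AVX2__
    out += "AVX2 ";
#endif
#ifdef __AVX512F__
    out += "AVX512F ";
#endif
#ifdef __AVX512VL__
    out += "AVX512VL ";
#endif
    return out.empty() ? std::string("none") : out;
}
```

Printing the returned string from a build with -march=native on each machine should show whether AVX512VL is in the list on that machine.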
And you've found a bug in your tensor.hpp, where you use AVX512VL intrinsics after only checking for __AVX512F__ instead of __AVX512VL__. AVX512-anything implies 512F, so it doesn't need to check both.
#ifdef __AVX512F__ // should be __AVX512VL__
Tensor<T> Tensor::addAVX512(_param_){
res = _mm256_mask_add_pd(tmp, 0xFF, _mm256_mask_loadu_pd(tmp, 0xFF, &elements[i]), _mm256_mask_loadu_pd(tmp, 0xFF, &a.elements[i]));
}
#endif
This is just pointless: you don't need to use the masked versions of these intrinsics if you're going to use constant all-ones masks. Use _mm256_add_pd like a normal person and only check for __AVX__. Or use _mm512_add_pd.
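A minimal sketch of the unmasked version, assuming the data is a contiguous array of double. The free function add_elements and the use of std::vector are stand-ins for illustration, not the real Tensor class from the question. It only needs __AVX__ and falls back to scalar code otherwise:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef __AVX__
#include <immintrin.h>
#endif

// Hypothetical stand-in for the question's add function: element-wise
// a + b using plain AVX1 intrinsics, no masking.
std::vector<double> add_elements(const std::vector<double>& a,
                                 const std::vector<double>& b) {
    assert(a.size() == b.size());
    std::vector<double> res(a.size());
    std::size_t i = 0;
#ifdef __AVX__
    // 4 doubles per 256-bit vector; unaligned loads/stores are fine here.
    for (; i + 4 <= a.size(); i += 4) {
        __m256d va = _mm256_loadu_pd(&a[i]);
        __m256d vb = _mm256_loadu_pd(&b[i]);
        _mm256_storeu_pd(&res[i], _mm256_add_pd(va, vb));
    }
#endif
    // Scalar tail (and the whole loop when AVX is not enabled).
    for (; i < a.size(); ++i) {
        res[i] = a[i] + b[i];
    }
    return res;
}
```

With this shape the same source compiles with or without -mavx, which avoids the "target specific option mismatch" error on machines where the wider extensions are not enabled.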
I thought at first this was from TensorFlow, but (from your comments) that doesn't make sense. And it can't be that badly written. Merge-masking into 3 copies of the same tmp with an all-true mask just makes no sense; it looks like a silly way to introduce a false dependency if the compiler can't optimize away the mask=all-ones into an unmasked load.
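To make the merge-masking point concrete, here is a scalar model of what a merge-masked add does per lane. It is not the intrinsic itself; Vec4 and mask_add_model are names invented for this sketch. Each lane takes a[i] + b[i] when its mask bit is set and keeps src[i] otherwise, so with all relevant mask bits set the src operand never contributes to the result:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec4 = std::array<double, 4>;  // models the 4 lanes of a __m256d

// Scalar model (illustration only) of merge-masked addition:
// lane i = mask bit i set ? a[i] + b[i] : src[i].
// Only the low 4 bits of k matter for 4 lanes.
Vec4 mask_add_model(const Vec4& src, std::uint8_t k,
                    const Vec4& a, const Vec4& b) {
    Vec4 r{};
    for (int i = 0; i < 4; ++i) {
        r[i] = ((k >> i) & 1) ? a[i] + b[i] : src[i];
    }
    return r;
}
```

With k = 0xFF every lane is a + b, which is exactly what the unmasked add computes; the masked form only adds an extra input that the hardware may have to wait for.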
And also terrible C++ style: you have a variable called __m256d tmp as a global or class member?? It's not even a local dummy variable, it may exist somewhere the compiler can't fully optimize it away.
Clebo Sevic
Updated on June 04, 2022
Comments
-
Clebo Sevic almost 2 years
I am trying to work on some source files, some of which contain SIMD calls. I have compiled this code on a server running basically the same OS as my machine, yet I can't compile it.
This is the error:
make
g++ main.cpp -march=native -o main -fopenmp
In file included from /usr/lib/gcc/x86_64-linux-gnu/7/include/immintrin.h:53:0,
from tensor.hpp:9,
from main.cpp:4:
/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512vlintrin.h: In function ‘_ZN6TensorIdE8add_avx2ERKS0_._omp_fn.5’:
/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512vlintrin.h:447:1: error: inlining failed in call to always_inline ‘__m256d _mm256_mask_add_pd(__m256d, __mmask8, __m256d, __m256d)’: target specific option mismatch
_mm256_mask_add_pd (__m256d __W, __mmask8 __U, __m256d __A,
^~~~~~~~~~~~~~~~~~
In file included from main.cpp:4:0:
tensor.hpp:228:33: note: called from here
res = _mm256_mask_add_pd(tmp, 0xFF, _mm256_mask_loadu_pd(tmp, 0xFF, &elements[i]), _mm256_mask_loadu_pd(tmp, 0xFF, &a.elements[i]));
~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/7/include/immintrin.h:53:0,
from tensor.hpp:9,
from main.cpp:4:
/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512vlintrin.h:610:1: error: inlining failed in call to always_inline ‘__m256d _mm256_mask_loadu_pd(__m256d, __mmask8, const void*)’: target specific option mismatch
_mm256_mask_loadu_pd (__m256d __W, __mmask8 __U, void const *__P)
^~~~~~~~~~~~~~~~~~~~
In file included from main.cpp:4:0:
tensor.hpp:228:33: note: called from here
res = _mm256_mask_add_pd(tmp, 0xFF, _mm256_mask_loadu_pd(tmp, 0xFF, &elements[i]), _mm256_mask_loadu_pd(tmp, 0xFF, &a.elements[i]));
~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/7/include/immintrin.h:53:0,
from tensor.hpp:9,
from main.cpp:4:
/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512vlintrin.h:610:1: error: inlining failed in call to always_inline ‘__m256d _mm256_mask_loadu_pd(__m256d, __mmask8, const void*)’: target specific option mismatch
_mm256_mask_loadu_pd (__m256d __W, __mmask8 __U, void const *__P)
^~~~~~~~~~~~~~~~~~~~
In file included from main.cpp:4:0:
tensor.hpp:228:33: note: called from here
res = _mm256_mask_add_pd(tmp, 0xFF, _mm256_mask_loadu_pd(tmp, 0xFF, &elements[i]), _mm256_mask_loadu_pd(tmp, 0xFF, &a.elements[i]));
~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Makefile:7: recipe for target 'main' failed
make: *** [main] Error 1
Googling the problem didn't really help, as all the answers pointed out things I already do or have tried.
Can somebody provide some background as to why it doesn't work?
EDIT:
int main(){
#ifdef __AVX512F___
auto tt = createTensor();
auto tt2 = createTensor();
auto res = tt.addAVX512(tt2);
#endif
}
//This is in tensor.hpp
#ifdef __AVX512F__
Tensor<T> Tensor::addAVX512(_param_){
res = _mm256_mask_add_pd(tmp, 0xFF, _mm256_mask_loadu_pd(tmp, 0xFF, &elements[i]), _mm256_mask_loadu_pd(tmp, 0xFF, &a.elements[i]));
}
#endif
This is the gist of what happens ... I have encased all SIMD calls in #ifdefs, etc.
-
Barmar about 5 years
@bruno That question is about cmake.
-
Barmar about 5 years
It's also about C, not C++.
-
Peter Cordes about 5 years
@Barmar: It's about leaving out -msse4.1 when compiling code using SSE4.1 intrinsics. Or in this case, leaving out -mavx or -march=haswell when compiling AVX intrinsics.
-
Clebo Sevic about 5 years
@bruno no, I already found that one and it does not help
-
Clebo Sevic about 5 years
@PeterCordes then please tell me what I have to do and don't just say it's a duplicate ... I don't really get the other post
-
Peter Cordes about 5 years
@bruno: If the OP hadn't been using -march=native, "inlining failed in call to always_inline '__m256d _mm256_broadcast_sd(const double*)'" would be an exact duplicate: -mavx is the relevant option for these intrinsics. But for this case, it would just let the OP make a binary they couldn't run. Either their server is very old, or it's using crappy virtualization that doesn't enable AVX for guests, or it's running on a Pentium / Celeron CPU (even Skylake Pentium disables AVX, presumably so they can sell chips with defects in the upper 128 bits of FMA u
-
Clebo Sevic about 5 years
@PeterCordes yeah, uhm, my PC runs on an Intel i7 7700k, so not really old, and I'm pretty sure it supports even AVX2 (I implemented all functions in SSE, AVX2 and AVX512)
-
Jörn Horstmann about 5 years
@CleboSevic the 7700k does not have avx512 (ark.intel.com/products/97129/…)
-
Clebo Sevic about 5 years
@JörnHorstmann I know, I am coding for a Skylake CPU, but work on my home desktop ... I only need to know if it works, but as I said, these functions are not the important part
-
-
Clebo Sevic about 5 years
Thanks, it seems I used AVX512 instructions in an AVX2 block, I'll look into that ... but this fixed it (commenting them out, as my main focus right now is someplace else)
-
Peter Cordes about 5 years
@CleboSevic: see my update: the block from tensor.hpp that you quoted is using masked intrinsics for no reason or benefit.
-
Clebo Sevic about 5 years
Actually I was kind of mistaken in my code snippet. The function call is not in an AVX512 block but in an AVX2 block, which never really caused trouble, since the server from before is a Skylake CPU ... anyway, I have to look for an AVX2 equivalent to the addition to get it running again, but thanks anyway :thumbs_up:
-
Peter Cordes about 5 years
@CleboSevic: I already suggested in my answer that you use AVX1 _mm256_add_pd / _mm256_loadu_pd, just remove the _mask part. Masking is a new feature with AVX512. But in general see Intel's intrinsics finder: software.intel.com/sites/landingpage/IntrinsicsGuide