error: inlining failed to call always_inline

c++ gcc makefile simd avx

11,181

GCC will only let you use intrinsics for instruction sets that are enabled for the compiler to use. e.g. a related question about an AVX1 intrinsic: inlining failed in call to always_inline '__m256d _mm256_broadcast_sd(const double*)'

These are _mask_ versions of 256-bit intrinsics; they require AVX512VL.

(My comments under the question about -mavx were wrong, I didn't notice the _mask in the name or args, just the _mm256.)

You're probably compiling on KNL (Knights Landing / Xeon Phi) on your server, which has AVX512F but not AVX512VL. So -march=native will set -mavx512f but not -mavx512vl. (Unlike Skylake-AVX512, which does have AVX512VL, allowing use of cool new AVX512 stuff like masked instructions with narrower vectors.)
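To check what -march=native actually enables on a given machine, the predefined feature macros can be inspected from code. The following is only a minimal sketch: `enabled_avx_macros` is a hypothetical helper name, and it only looks at the four macros relevant to this question.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the question's code): returns the names of
// the AVX-related feature macros the compiler defined for this translation
// unit, which depends on the -m / -march options used.
std::vector<std::string> enabled_avx_macros() {
    std::vector<std::string> out;
#ifdef __AVX__
    out.push_back("__AVX__");
#endif
#ifdef __AVX2__
    out.push_back("__AVX2__");
#endif
#ifdef __AVX512F__
    out.push_back("__AVX512F__");
#endif
#ifdef __AVX512VL__
    out.push_back("__AVX512VL__");   // needed for _mm256_mask_* intrinsics
#endif
    return out;
}
```

Printing the returned names from a build that uses the same flags as the failing one should show whether `__AVX512VL__` is present.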

And you've found a bug in your tensor.hpp, where you use AVX512VL intrinsics after only checking for __AVX512F__ instead of __AVX512VL__. AVX512-anything implies 512F, so it doesn't need to check both.

#ifdef __AVX512F__    // should be __AVX512VL__
Tensor<T> Tensor::addAVX512(_param_){
   res = _mm256_mask_add_pd(tmp, 0xFF, _mm256_mask_loadu_pd(tmp, 0xFF, &elements[i]), _mm256_mask_loadu_pd(tmp, 0xFF, &a.elements[i]));
}
#endif

This is just pointless: you don't need to use the masked versions of these intrinsics if you're going to use constant all-ones masks. Use _mm256_add_pd like a normal person and only check for __AVX__. Or use _mm512_add_pd, which only needs AVX512F.
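A minimal sketch of the unmasked version, assuming the data is a contiguous array of double. `add_pd` is a hypothetical free function standing in for the Tensor member (whose real signature isn't shown in the question), and the scalar loop is added here so the code still compiles when AVX isn't enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical stand-in for the Tensor member function: element-wise sum
// of two equally sized arrays of doubles.
std::vector<double> add_pd(const std::vector<double>& a,
                           const std::vector<double>& b) {
    assert(a.size() == b.size());
    std::vector<double> res(a.size());
    std::size_t i = 0;
#if defined(__AVX__)
    // AVX1 path: unmasked unaligned loads, add, store; 4 doubles per step.
    // The temporaries are locals, so the compiler can keep them in registers.
    for (; i + 4 <= a.size(); i += 4) {
        __m256d va = _mm256_loadu_pd(&a[i]);
        __m256d vb = _mm256_loadu_pd(&b[i]);
        _mm256_storeu_pd(&res[i], _mm256_add_pd(va, vb));
    }
#endif
    // Scalar loop: handles the tail, or everything if AVX is not enabled.
    for (; i < a.size(); ++i) res[i] = a[i] + b[i];
    return res;
}
```

Because the guard is `__AVX__` rather than an AVX512 macro, this should build with -march=native on any CPU with AVX, including one without AVX512VL.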

I thought at first this was from TensorFlow, but (from your comments) that doesn't make sense. And it can't be that badly written. Merge-masking into 3 copies of the same tmp with an all-true mask just makes no sense; it looks like a silly way to introduce a false dependency if the compiler can't optimize away the mask=all-ones into an unmasked load.

And also terrible C++ style: you have a variable called __m256d tmp as a global or class member?? It's not even a local dummy variable, it may exist somewhere the compiler can't fully optimize it away.
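Going back to the first point: an instruction set doesn't have to be enabled for the whole file. GCC and Clang also accept a per-function target attribute, which can be combined with a runtime CPU check. This is only a sketch assuming GCC or Clang on x86; `add4`, `add4_avx` and `add4_scalar` are hypothetical names, not part of the question's code.

```cpp
#include <cstddef>
#include <immintrin.h>

// The target attribute enables AVX for this one function only, so the
// intrinsics can be inlined here even without -mavx on the command line.
__attribute__((target("avx")))
static void add4_avx(const double* a, const double* b, double* out) {
    _mm256_storeu_pd(out, _mm256_add_pd(_mm256_loadu_pd(a), _mm256_loadu_pd(b)));
}

// Portable fallback for CPUs without AVX.
static void add4_scalar(const double* a, const double* b, double* out) {
    for (std::size_t i = 0; i < 4; ++i) out[i] = a[i] + b[i];
}

// Adds 4 doubles, choosing the implementation at run time.
void add4(const double* a, const double* b, double* out) {
    if (__builtin_cpu_supports("avx")) {
        add4_avx(a, b, out);
    } else {
        add4_scalar(a, b, out);
    }
}
```

The runtime check matters: enabling an instruction set the CPU doesn't have only moves the failure from compile time to an illegal-instruction fault at run time.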

Author by

Clebo Sevic

Updated on June 04, 2022

Comments

Clebo Sevic almost 2 years

I am trying to work on some code files, some of which contain SIMD calls. I have compiled this code on a server running basically the same OS as my machine, yet I can't compile it on my machine.

This is the error:

make
g++ main.cpp -march=native -o main -fopenmp
In file included from /usr/lib/gcc/x86_64-linux-gnu/7/include/immintrin.h:53:0,
                 from tensor.hpp:9,
                 from main.cpp:4:
/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512vlintrin.h: In function ‘_ZN6TensorIdE8add_avx2ERKS0_._omp_fn.5’:
/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512vlintrin.h:447:1: error: inlining failed in call to always_inline ‘__m256d _mm256_mask_add_pd(__m256d, __mmask8, __m256d, __m256d)’: target specific option mismatch
 _mm256_mask_add_pd (__m256d __W, __mmask8 __U, __m256d __A,
 ^~~~~~~~~~~~~~~~~~
In file included from main.cpp:4:0:
tensor.hpp:228:33: note: called from here
         res = _mm256_mask_add_pd(tmp, 0xFF, _mm256_mask_loadu_pd(tmp, 0xFF, &elements[i]), _mm256_mask_loadu_pd(tmp, 0xFF, &a.elements[i]));
               ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/7/include/immintrin.h:53:0,
                 from tensor.hpp:9,
                 from main.cpp:4:
/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512vlintrin.h:610:1: error: inlining failed in call to always_inline ‘__m256d _mm256_mask_loadu_pd(__m256d, __mmask8, const void*)’: target specific option mismatch
 _mm256_mask_loadu_pd (__m256d __W, __mmask8 __U, void const *__P)
 ^~~~~~~~~~~~~~~~~~~~
In file included from main.cpp:4:0:
tensor.hpp:228:33: note: called from here
         res = _mm256_mask_add_pd(tmp, 0xFF, _mm256_mask_loadu_pd(tmp, 0xFF, &elements[i]), _mm256_mask_loadu_pd(tmp, 0xFF, &a.elements[i]));
               ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/7/include/immintrin.h:53:0,
                 from tensor.hpp:9,
                 from main.cpp:4:
/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512vlintrin.h:610:1: error: inlining failed in call to always_inline ‘__m256d _mm256_mask_loadu_pd(__m256d, __mmask8, const void*)’: target specific option mismatch
 _mm256_mask_loadu_pd (__m256d __W, __mmask8 __U, void const *__P)
 ^~~~~~~~~~~~~~~~~~~~
In file included from main.cpp:4:0:
tensor.hpp:228:33: note: called from here
         res = _mm256_mask_add_pd(tmp, 0xFF, _mm256_mask_loadu_pd(tmp, 0xFF, &elements[i]), _mm256_mask_loadu_pd(tmp, 0xFF, &a.elements[i]));
               ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Makefile:7: recipe for target 'main' failed
make: *** [main] Error 1

Googling the problem didn't really help, as all the answers pointed out things I already do or have tried.

Can somebody provide some background as to why it doesn't work?

EDIT:

int main(){
#ifdef __AVX512F___
    auto tt = createTensor();
    auto tt2 = createTensor();
    auto res = tt.addAVX512(tt2);
#endif
}

//This is in tensor.hpp
#ifdef __AVX512F__
Tensor<T> Tensor::addAVX512(_param_){
   res = _mm256_mask_add_pd(tmp, 0xFF, _mm256_mask_loadu_pd(tmp, 0xFF, &elements[i]), _mm256_mask_loadu_pd(tmp, 0xFF, &a.elements[i]));
}
#endif

This is the gist of what happens ... I have wrapped all SIMD calls in #ifdefs, etc.

Barmar about 5 years

@bruno That question is about cmake.
Barmar about 5 years

It's also about C, not C++.
Peter Cordes about 5 years

@Barmar: It's about leaving out -msse4.1 when compiling code using SSE4.1 intrinsics. Or in this case, leaving out -mavx or -march=haswell when compiling AVX intrinsics.
Clebo Sevic about 5 years

@bruno no, I already found that one and it does not help
Clebo Sevic about 5 years

@PeterCordes then please tell me what I have to do and don't just say it's a duplicate ... I don't really get the other post
Peter Cordes about 5 years

@bruno: If the OP hadn't been using -march=native, inlining failed in call to always_inline '__m256d _mm256_broadcast_sd(const double*)' would be an exact duplicate: -mavx is the relevant option for these intrinsics. But for this case, it would just let the OP make a binary they couldn't run. Either their server is very old, or it's using crappy virtualization that doesn't enable AVX for guests, or it's running on a Pentium / Celeron CPU (even Skylake Pentium disables AVX, presumably so they can sell chips with defects in the upper 128 bits of FMA u
Clebo Sevic about 5 years

@PeterCordes yeah, uhm, my PC runs on an Intel i7 7700k, so not really old, and I'm pretty sure it supports even AVX2 (I implemented all functions in SSE, AVX2 and AVX512)
Jörn Horstmann about 5 years

@CleboSevic the 7700k does not have avx512 (ark.intel.com/products/97129/…)
Clebo Sevic about 5 years

@JörnHorstmann I know, I am coding for a Skylake CPU, but work on my home desktop ... I only need to know if it works, but as I said, these functions are not the important part

Clebo Sevic about 5 years

Thanks, it seems I used AVX512 instructions in an AVX2 block, I'll look into that ... but this fixed it (commenting them out, as my main focus right now is someplace else)
Peter Cordes about 5 years

@CleboSevic: see my update: the block from tensor.hpp that you quoted is using masked intrinsics for no reason or benefit.
Clebo Sevic about 5 years

Actually I was kind of mistaken in my code snippet. The function call is not in an AVX512 block but an AVX2 block, which never really caused trouble, since the server from before is a Skylake CPU ... anyway, I have to look for an AVX2 equivalent to the addition to get it running again, but thanks anyway
Peter Cordes about 5 years

@CleboSevic: I already suggested in my answer that you use AVX1 _mm256_add_pd / _mm256_loadu_pd, just remove the _mask part. Masking is a new feature with AVX512. But in general see Intel's intrinsics finder: software.intel.com/sites/landingpage/IntrinsicsGuide