Why is strcmp not SIMD optimized?
Solution 1
In an SSE2 implementation, how should the compiler make sure that no memory accesses happen past the end of the string? It would have to know the length first, and that requires scanning the string for the terminating zero byte.
If you scan for the length of the string you have already done most of the work of a strcmp function, so there is little benefit to using SSE2.
However, Intel added instructions for string handling in the SSE4.2 instruction set. These handle the terminating-zero-byte problem. For a nice write-up on them, read this blog post:
http://www.strchr.com/strcmp_and_strlen_using_sse_4.2
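To make the idea concrete, here is a hedged sketch (my own, not from the blog post) of how a PCMPISTRI-based strcmp can look with SSE4.2 intrinsics. It assumes both strings live in buffers padded so that a full 16-byte load never faults — the over-read caveat discussed below — and the name strcmp_sse42 is made up for illustration:

```c
#include <stddef.h>
#include <nmmintrin.h>  /* SSE4.2 intrinsics */

/* Sketch only: assumes reading full 16-byte blocks past the terminator is
 * safe (padded buffers), which a production strcmp must not assume. */
__attribute__((target("sse4.2")))
static int strcmp_sse42(const char *s1, const char *s2)
{
    /* unsigned bytes, equal each, negative polarity: result bit i is set
     * where the strings differ; CF = mismatch found, ZF = terminator seen */
    enum { MODE = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                  _SIDD_NEGATIVE_POLARITY | _SIDD_LEAST_SIGNIFICANT };
    for (size_t i = 0; ; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(s1 + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(s2 + i));
        int idx = _mm_cmpistri(a, b, MODE);   /* index of first difference */
        if (_mm_cmpistrc(a, b, MODE)) {       /* some byte differed */
            unsigned char c1 = (unsigned char)s1[i + idx];
            unsigned char c2 = (unsigned char)s2[i + idx];
            return c1 < c2 ? -1 : 1;
        }
        if (_mm_cmpistrz(a, b, MODE))         /* hit the terminator, no diff */
            return 0;
    }
}
```

Note how the instruction itself detects the terminating zero, which is what removes the need for a separate strlen pass.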
Solution 2
GCC in this case is using a builtin strcmp. If you want it to use the version from glibc, use -fno-builtin. But you should not assume that GCC's builtin version of strcmp or glibc's implementation of strcmp is efficient. I know from experience that GCC's builtin memcpy and glibc's memcpy are not as efficient as they could be.
I suggest you look at Agner Fog's asmlib. He has optimized several of the standard library functions in assembly. See the file strcmp64.asm. This has two versions: a generic version for CPUs without SSE4.2 and a version for CPUs with SSE4.2. Here is the main loop of the SSE4.2 version:
compareloop:
add rax, 16 ; increment offset
movdqu xmm1, [rs1+rax] ; read 16 bytes of string 1
pcmpistri xmm1, [rs2+rax], 00011000B ; unsigned bytes, equal each, invert. returns index in ecx
jnbe compareloop ; jump if not carry flag and not zero flag
For the generic version he writes
This is a very simple solution. There is not much gained by using SSE2 or anything complicated
Here is the main loop of the generic version:
_compareloop:
mov al, [ss1]
cmp al, [ss2]
jne _notequal
test al, al
jz _equal
inc ss1
inc ss2
jmp _compareloop
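For reference, the byte-at-a-time loop above corresponds to roughly this C (a sketch; the real asmlib routine differs in register use and calling convention):

```c
/* C equivalent of the generic byte-at-a-time compare loop sketched above. */
int strcmp_bytewise(const char *s1, const char *s2)
{
    while (*s1 != '\0' && *s1 == *s2) {  /* walk while equal and not at end */
        s1++;
        s2++;
    }
    /* compare as unsigned char, as the standard requires */
    return (unsigned char)*s1 - (unsigned char)*s2;
}
```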
I would compare the performance of GCC's builtin strcmp
, GLIBC's strcmp
and the asmlib strcmp
. You should look at the disassembly to make sure that you get the builtin code. For example GCC's memcpy
does not use the builtin version from sizes larger than 8192.
Edit: In regards to the string length, Agner's SSE4.2 version reads up to 15 bytes beyond the end of the string. He argues this is rarely a problem since nothing is written. It's not a problem for stack-allocated arrays. For statically allocated arrays it could be a problem at memory page boundaries. To get around this he adds 16 bytes to the .bss section after the .data section. For more details see the section 1.7 String instructions and safety precautions in the manual of the asmlib.
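The page-boundary concern can be made concrete. On typical 4 KiB pages, an unaligned 16-byte load faults only if it spills into an unmapped next page; here is a sketch of that check (the 4 KiB page size is my assumption):

```c
#include <stdint.h>

/* Does an unaligned 16-byte load starting at p cross into the next 4 KiB
 * page? If not, it can only fault when the string's own page is unmapped. */
static int crosses_page(const void *p)
{
    uintptr_t offset_in_page = (uintptr_t)p & 0xFFF;
    return offset_in_page > 0xFF0;  /* bytes [p, p+15] spill past the page */
}
```

An aligned 16-byte load can never cross, which is why implementations that align first can over-read safely.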
Solution 3
When the standard library for C was designed, the implementations of string.h methods that were most efficient when dealing with large amounts of data would be reasonably efficient for small amounts, and vice versa. While there may be some string-comparison scenarios where sophisticated use of SIMD instructions could yield better performance than a "naive implementation", in many real-world scenarios the strings being compared will differ in the first few characters. In such situations, the naive implementation may yield a result in less time than a "more sophisticated" approach would spend deciding how the comparison should be performed. Note that even if SIMD code is able to process 16 bytes at a time and stop when a mismatch or end-of-string condition is detected, it would still have to do additional work equivalent to using the naive approach on the last 16 characters scanned. If many groups of 16 bytes match, being able to scan through them quickly may benefit performance. But in cases where the first 16 bytes don't match, it would be more efficient to just start with the character-by-character comparison.
Incidentally, another potential advantage of the "naive" approach is that it would be possible to define it inline as part of the header (or a compiler might regard itself as having special "knowledge" about it). Consider:
int strcmp(const char *p1, const char *p2)
{
    int t1, t2;
    do
    {
        t1 = (unsigned char)*p1; t2 = (unsigned char)*p2;
        if (t1 != t2)
        {
            if (t1 > t2) return 1;
            return -1;
        }
        if (!t1)
            return 0;
        p1++; p2++;
    } while (1);
}
...invoked as:
if (strcmp(p1,p2) > 0) action1();
if (strcmp(p3,p4) != 0) action2();
While the method would be a little big to inline, inlining could in the first case allow a compiler to eliminate the code that checks whether the returned value is greater than zero, and in the second eliminate the code that checks whether t1 is greater than t2. Such optimization would not be possible if the method were dispatched via an indirect jump.
Solution 4
Making an SSE2 version of strcmp was an interesting challenge for me.
I don't really like compiler intrinsic functions because of code bloat, so I decided to take an auto-vectorization approach. My approach is based on templates and approximates a SIMD register as an array of words of different sizes.
I tried to write an auto-vectorizing implementation and test it with GCC and MSVC++ compilers.
So, what I learned is:
1. GCC's auto-vectorizer is good (awesome?)
2. MSVC's auto-vectorizer is worse than GCC's (it doesn't vectorize my packing function)
3. All compilers declined to generate the PMOVMSKB instruction, which is really sad
Results:
The version compiled by online GCC gains ~40% with SSE2 auto-vectorization. On my Windows machine with a Bulldozer-architecture CPU the auto-vectorized code is faster than the online compiler's, and the results match the native implementation of strcmp. But the best thing about the idea is that the same code can be compiled for any SIMD instruction set, at least on ARM & x86.
Note:
If anyone finds a way to make the compiler generate the PMOVMSKB instruction, overall performance should get a significant boost.
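For what it's worth, that instruction is exposed directly as the _mm_movemask_epi8 intrinsic; here is a hedged hand-written sketch (not the templated code from this post) of the 16-byte test it would enable:

```c
#include <emmintrin.h>  /* SSE2 */

/* Test one 16-byte block: returns a 16-bit mask with bit i set where the
 * strings differ or where s1 has its terminating zero. PMOVMSKB is
 * generated by the _mm_movemask_epi8 calls. Assumes the loads are safe. */
static int block_diff_or_end(const char *s1, const char *s2)
{
    __m128i a    = _mm_loadu_si128((const __m128i *)s1);
    __m128i b    = _mm_loadu_si128((const __m128i *)s2);
    __m128i eq   = _mm_cmpeq_epi8(a, b);                    /* 0xFF where equal */
    __m128i zero = _mm_cmpeq_epi8(a, _mm_setzero_si128());  /* 0xFF where s1==0 */
    return (~_mm_movemask_epi8(eq) & 0xFFFF) | _mm_movemask_epi8(zero);
}
```

A zero mask means "all 16 bytes equal, no terminator": skip to the next block. Otherwise __builtin_ctz of the mask gives the interesting byte index in a couple of instructions — exactly the step the auto-vectorized code ended up emulating with stores and byte loads.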
Command-line options for GCC: -std=c++11 -O2 -m64 -mfpmath=sse -ftree-vectorize -msse2 -march=native -Wall -Wextra
Links:
Source code compiled by Coliru online compiler
Assembly + Source code (Compiler Explorer)
@PeterCordes, thanks for the help.
Solution 5
It depends on your implementation. On MacOS X, functions like memcpy, memmove and memset have implementations that are optimised depending on the hardware you are using (the same call will execute different code depending on the processor, set up at boot time). These implementations use SIMD, and for big amounts (megabytes) they use some rather fancy tricks to optimise cache usage. There is nothing for strcpy and strcmp, as far as I know.
Convincing the C++ standard library to use that kind of code is difficult.
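The boot-time/first-call dispatch idea is simple to sketch in GNU C (the function names here are mine; a real library would install an SSE4.2 or AVX2 version instead of the generic one):

```c
/* Generic fallback: plain byte-at-a-time strcmp. */
static int strcmp_generic(const char *s1, const char *s2)
{
    unsigned char c1, c2;
    do {
        c1 = (unsigned char)*s1++;
        c2 = (unsigned char)*s2++;
    } while (c1 == c2 && c1 != 0);
    return (c1 > c2) - (c1 < c2);
}

static int strcmp_dispatch(const char *s1, const char *s2);

/* Calls go through the function pointer; only the first one pays for the
 * CPU check, after which the pointer is patched to the chosen version. */
static int (*strcmp_ptr)(const char *, const char *) = strcmp_dispatch;

static int strcmp_dispatch(const char *s1, const char *s2)
{
    if (__builtin_cpu_supports("sse4.2"))
        strcmp_ptr = strcmp_generic;  /* a real lib would install strcmp_sse42 */
    else
        strcmp_ptr = strcmp_generic;
    return strcmp_ptr(s1, s2);
}
```

glibc achieves the same effect with IFUNC resolvers, so the indirection is folded into the dynamic linker's normal PLT dispatch.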
user1095108
Updated on June 06, 2022

Comments
-
user1095108 almost 2 years
I've tried to compile this program on an x64 computer:
#include <cstring>

int main(int argc, char* argv[])
{
    return ::std::strcmp(argv[0],
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really long string"
    );
}
I compiled it like this:
g++ -std=c++11 -msse2 -O3 -g a.cpp -o a
But the resulting disassembly is like this:
0x0000000000400480 <+0>:  mov    (%rsi),%rsi
0x0000000000400483 <+3>:  mov    $0x400628,%edi
0x0000000000400488 <+8>:  mov    $0x22d,%ecx
0x000000000040048d <+13>: repz cmpsb %es:(%rdi),%ds:(%rsi)
0x000000000040048f <+15>: seta   %al
0x0000000000400492 <+18>: setb   %dl
0x0000000000400495 <+21>: sub    %edx,%eax
0x0000000000400497 <+23>: movsbl %al,%eax
0x000000000040049a <+26>: retq
Why is no SIMD used? I suppose it could be to compare, say, 16 chars at once. Should I write my own SIMD strcmp, or is it a nonsensical idea for some reason? -
user1095108 over 9 years Can you show some optimized mac disassembly for strcmp? -
user1095108 over 9 years I've tried immediately to compile with -msse4.2 and the same machine code is generated. Good to know I wasn't completely wrong, that it could be done, though. -
user1095108 over 9 years I've used lex and bison before and they generate a lot of code. It is (usually) valid C++, but efficient? I somehow doubt it. Anyhow, parsing was not the relevant topic in my question.
-
phuclv over 9 years Using a length-prefix string type like Pascal or std::string you can determine if it has passed the string termination
-
Surt over 9 years Better performance => lower energy use => longer battery duration => happy environmentally conscious customer.
-
harold over 9 years It doesn't have to ensure that no memory accesses happen over the end of the string. It only has to ensure that no memory accesses happen on pages that the string is not at least partly on, and that's much easier.
-
Surt over 9 years @harold, the limits are even looser: you just have to ensure that there is anything at all after the last address. Using x-aligned memory for the strings solves the problem without checks.
-
harold over 9 years @Surt fair enough, but that doesn't change much, does it? It only really knows about that one string, unless it's in statically allocated memory.
-
Surt over 9 years Comparing 16 bytes at a time is a real win, even if you have to check for both null and the actual string. The beauty of it is that once you have made the code you can use it "forever" and benefit in new programs.
-
Surt over 9 years @harold, if your allocator always returns aligned memory for strings the check becomes superfluous. (And I was agreeing with you :))
-
Z boson over 9 years @Surt, can you say a little more about what you mean? Are you saying that the generic version can be done better, or just that the SSE4.2 version is the right idea?
-
Z boson over 9 years Yeah, according to Agner Fog when he looked into this the MacOS X versions were pretty good but the Linux and Windows versions were inefficient. So probably the MacOS X assembly is interesting.
-
Matthieu M. over 9 years @Surt: For plain char const* though, you don't know their alignment up front; it does not matter though, the first step will simply be to align on an x-byte boundary before performing the SSE specialized algorithm -
Z boson over 9 years The size of the string is not necessary to know. Reading 15 extra bytes is not a problem for stack-allocated arrays. For statically allocated arrays there is a way to get around this by adding a 16-byte .bss section after the .data section.
-
Z boson over 9 years It's only limited by memory bandwidth if the array does not fit in the cache. You could use memcpy or strcmp in a tight inner loop not limited by memory bandwidth. -
DevSolar over 9 years @Zboson: And how do you know what you're comparing is not heap-allocated, and at the very end of the page, making any access beyond the '\0' an immediate pagefault? How do you know the system is, and ever will be, using page-based memory protection in the first place? -
Z boson over 9 years @DevSolar, I don't know. See Section "1.7 String instructions and safety precautions" in the asmlib manual. Agner discusses this in detail. For the heap he writes "It is recommended to allocate sufficient heap space if you are using dynamically allocated strings. A safer and more efficient solution is to allocate a memory pool of sufficient size and store multiple strings in the same memory pool". So GCC, to be safe, can't assume this, but that does not mean faster solutions don't exist which can be used in most cases. You need to know when to be careful.
-
harold over 9 years @DevSolar you can safely make those assumptions, and you can't go onto the next page with aligned reads.
-
supercat over 9 years I would think the non-SSE loop could be improved by subtracting one pointer from the other before the start and then using an indexed addressing mode, or does that optimization not help on newer processors?
-
Z boson over 9 years Although what you say sounds reasonable, you don't provide any evidence that "in cases where the first 16 bytes don't match, it would be more efficient to just start with the character-by-character comparison." In fact, when I look at the preamble and epilogue of the main loops of the SSE4.2 and the generic function in the asmlib, they are almost identical. I would not be surprised if the SSE4.2 version is never slower than the generic version, even for arrays less than or equal to 16 bytes.
-
Z boson over 9 years @supercat, I don't know. In theory one would have to look at the latency and throughput and how many cycles it takes. For example, if one port always needs two cycles and the rest one, then adding another instruction to another port might not make a difference. This could be checked with IACA. But I have never profiled strcmp, so I only know what I read. -
phuclv over 9 years std::string may be inefficient in some ways, but it's good for determining string length
-
supercat over 9 years @Zboson: I wasn't aware of the SSE 4.2 additions; string-compare code could benefit from earlier SSE functions but not as greatly. If library code were being compiled for 4.2 only, I could see the string functions as a potential "unconditional win", but if code had to check for their availability before executing them such a check would be a 100% loss on processors which don't support them (which I understand are still quite numerous) and even on processors which do might not come out ahead when strings differ in the first character.
-
Z boson over 9 years Using a CPU dispatcher, the strcmp function only has to check the CPUID the first time it's called. Every call after that will jump directly to the SSE4.2 or the generic version. This is how the asmlib does it. So there is only an overhead on the first call. You could also call an init function for a library with all the functions you are replacing, which makes all the function pointers point to the appropriate version. -
supercat over 9 years @Zboson: A CPU dispatcher would add an indirect jump; I know that's a much smaller penalty than it used to be, but I don't know how small. Further, strcmp() is small enough that an aggressive in-lining optimizer might possibly tackle it (if a header were defined such that it could do so). See edit above.
-
Z boson over 9 years That's a good point. Actually, GCC already inlines strcmp in the OP's example. Like I said in my answer, it would be interesting to profile and compare the inline, glibc, and asmlib methods. I'm not going to do it anytime soon though... -
kuroi neko over 9 years The usual crowd of C++ zealots is barking at the blasphemator... Surt's comment is irrelevant in the extreme; you should stay in your marketing office and let the programmers do the talking, pal. As for you, Mr. OP, if you're afraid of the code generated by lex, you should be terrified by SIMD instructions. They are much more difficult to master (I mean using them is easy enough, but writing code that actually runs better is another story entirely).
-
Peter Cordes about 9 years www-ssl.intel.com/content/www/us/en/processors/… (The optimization manual) spends a fair bit of time on memcpy implementation techniques, and performance on Ivy Bridge and later CPUs where REP STOSB triggers a microcoded high-performance memset, but has higher startup overhead than a 128b or 256b wide SSE/AVX implementation. See section 3.7.5, or search for memcpy. There's a CPUID feature bit for ERMSB (Enhanced Rep Movsb and StosB). -
Paul R over 8 years While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.
-
Peter Cordes almost 8 years You can and should make the godbolt link go directly to your code on godbolt, instead of expecting people to copy/paste for you. SSE2 is portable to any modern x86. It's only a mess if you write messy code.
-
Roman almost 8 years @PeterCordes I've tried to make a short link to my code, but I just couldn't make one. I always ended up with a 200-symbol link that would take up half of the post, and Google's URL shortener didn't help either; it just doesn't recognize the resulting URL as valid.
-
Peter Cordes almost 8 years Only use godbolt short-links for comments, because full links can't rot. When there's no relevant char-limit, just make full links that don't display in the text of your post. (godbolt shortlinks already use goo.gl internally. Oh that's weird, I think the shortlink button might be broken. I hardly ever use it.)
-
Peter Cordes almost 8 years It doesn't look like g++ actually auto-vectorized your code at -O3 (which includes -ftree-vectorize). I only see a pcmpeqd, which it uses to generate a vector of all-ones for storing to the stack. I see a lot of byte-at-a-time copying and comparing, and testing bitfields. If that's faster than the builtin strcmp, I wonder just how bad it is... -
Roman almost 8 years @PeterCordes Yep, gcc starts to mess up the code badly at -O3.
-
Peter Cordes almost 8 years Wow, that's weird. I've never seen gcc auto-vectorize something with -O2 -ftree-vectorize, but not at -O3. That's not a good sign for the quality / efficiency of your code. There weren't any high-level comments in it documenting the overall approach, so I didn't really try to read all that mass of templated C++. The vectorized asm output looks like it's emulating PMOVMSKB after the vector pcmpeqb by storing and then doing byte loads + shifts. That's insanely horrible :( -
Roman almost 8 years @PeterCordes I have updated the link to Compiler Explorer in the post with my command line, please check it out.
-
Peter Cordes almost 8 years Why did you leave out -Wall -Wextra? But yeah, that's the asm I was commenting on in my last comment. Vectorized compare, but then packing a bit from each compare-result byte one at a time into a GP register. -
Roman almost 8 years Let us continue this discussion in chat.
-
Roman almost 8 years @PeterCordes I implemented a slightly different micro-benchmark and changed the way registers were compared to zero. Now the main vectorized loop consists of SSE instructions almost entirely (a few test instructions present). Could you check the new code, please?
-
Peter Cordes almost 8 years It looks ok, but the horizontal OR down to a single byte takes a lot of steps. If you could horizontal OR down to 4 or 8 bytes, and check those as an int64_t, the compiler could shorten that chain of vpsrldq / vpor dramatically (just one, then vmovq rax,xmm0). (BTW, on godbolt you should probably use -march=haswell or something instead of -march=native, because that will change depending on what hardware godbolt.org is running on. Note that at the moment you're targeting AVX2 because of -march=native. It doesn't look like your code uses any 256b integer instructions, though.) -
Roman almost 8 years @PeterCordes '-march=haswell' doesn't change anything in the assembly. The auto-vectorizer works on arrays; because my "register" is a fixed-size 16-byte array it can't use AVX. I tried to compact the result into one register and then do a single compare, but it ended in massive code bloat. You can check out the results here, modified version on the right part: Screenshot
-
Peter Cordes almost 8 years Yes, godbolt currently runs on a Haswell server, so -march=haswell is the same as -march=native now. But in 5 years, it might include AVX512. And yes, your code is using the 3-operand non-destructive AVX versions of SSE2 instructions. That's why vpsrldq can put its output in a different register, instead of shifting in-place (avoiding a lot of movdqa insns). You might want to use -mtune=haswell to tune for Haswell without making code that doesn't work on older CPUs. -
Peter Cordes almost 8 years That sucks when gcc decides to widen byte vectors to quadwords with pmovzx. :/ I'm not sure how best to write code that would get gcc to emit one vpsrldq / vpor and then a vmovq rax,xmm0 / test rax,rax, but that would get the job done more efficiently than ORing all the way down to a single byte and extracting it with vpextrb. Maybe use a union { char b[16]; int64_t q[2]; }; or something, so you can access a range of elements? -
Roman almost 8 years @PeterCordes I have inserted a type cast to interpret the simd register as an array of CPU words (64-bit words for a 64-bit CPU). The compiler did a pretty neat code cleanup: only 3 instructions to test for zero now, and zero bit-shifting. Updated the links to the code, so you can check this out.
-
Peter Cordes almost 8 years Is it faster? The vector loop is now huge, AFAICT, from .L5 to the jmp .L5. There's some scalar stuff inside that it jumps over, and maybe 2 unrolled iterations (one just after .L5 and one just after .L7). It's doing the 64-bit OR with a movq and a store/reload to L1. I think it's doing so much storing/reloading that it's going to be a problem for throughput. (The latency isn't very important for strings with the only difference very far from the front, since checking each block of 16B should be a separate dep chain. The extra latency does make the final branch mispredict worse.) -
Peter Cordes almost 8 years Also, be careful with type-punning by reinterpret_casting a pointer. The language's aliasing rules do guarantee that char* can alias anything, which is why it works in this case. This is why I recommended a union (because that is guaranteed to always work between any types in GNU C++, although still not in ISO C++. I think the ISO rules are the same in C and C++, and the only "safe" way to type-pun is with memcpy (or other char*-based methods). But nobody wants that, so people use unions in practice.) -
Roman almost 8 years @PeterCordes honestly, I'm struggling with the benchmark now, always getting strange results. The new version seems to be a little bit faster. But I think we need a better micro-benchmark to see real results.
-
Peter Cordes almost 8 years Your benchmark is still meaningless because you only time one call to each. Warm-up effects will dominate, especially for the library function (because of lazy dynamic linking). Even cache misses and the chrono::nanoseconds overhead are probably significant. You need to create a loop which actually makes multiple calls to the same compare function with the same args many times. I'm not interested enough to do it for you or dig up a benchmark framework you could use, but you should watch this presentation to learn more about benchmarking / profiling: youtube.com/watch?v=nXaxk27zwlk -
Roman almost 8 years @PeterCordes I actually did what you wrote; I'm not so careless as to benchmark functions once without a reason. It is much harder to make a decent micro-benchmark with multiple calls, since a loop that calls the same function with the same arguments gets optimized away. I tried to solve that with a new variable, a bitset that contains the results of all of the 32 runs, but then the results are wrong: my function gets 20-30 times slower than native strcmp and its analogs. Then I added a flush_cache() call after each iteration of the loop. After that all functions had equal run time.
-
Peter Cordes almost 8 years I was commenting based on the main() that's still included in your Godbolt link. I tried it on my desktop and got ~300 vs. ~900, IIRC. I didn't look at coliru because running it on an online IDE is no good when you're measuring wall-clock time with chrono::high_resolution_clock (rather than "user" CPU time) on a non-idle system. -
Peter Cordes over 6 years This is a bogus argument. An optimized C library strcmp can use SIMD. It's harder to do safely with two pointers that could be misaligned relative to each other, though. See Is it safe to read past the end of a buffer within the same page on x86 and x64? Answer: yes in asm, dicey in C because of UB for reading outside objects. The compiler is emitting asm directly so it knows what is safe. -
Peter Cordes over 6 years repz cmpsb is not terrible for very short strings, and saves a call, but I don't think it's a good choice if one of the strings is known to be very long. -
Peter Cordes over 6 years A byte-at-a-time loop can't be optimal! glibc's SSE2 strcmp checks for page-crossing and then uses unaligned loads. It is of course fully safe, never reading from a page that doesn't contain any part of either string, because anything else would be unacceptable in a standard library. Having even less overhead is certainly possible if you can arrange to skip those checks, but it looks pretty reasonable. It uses pcmpeqb and pminub to check for non-match or 0 (end of string). -
Peter Cordes almost 2 years BTW, Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled? is basically the same GCC bug, inlining slow repe scasb at -O1, which leads to disaster for long strings. That bug is now fixed: GCC calls the libc functions. (Which use pcmpeqb or the AVX2 version, not pcmpistri. pcmpistri is not faster than SSE2 for strlen or strcmp.)