Why is strcmp not SIMD optimized?
Solution 1
In an SSE2 implementation, how should the compiler make sure that no memory accesses happen past the end of the string? It would have to know the length first, and that requires scanning the string for the terminating zero byte.
If you scan for the length of the string you have already done most of the work of a strcmp function, so there is little benefit to using SSE2.
However, Intel added instructions for string handling in the SSE4.2 instruction set. These handle the terminating-zero-byte problem. For a nice write-up on them, read this blog post:
http://www.strchr.com/strcmp_and_strlen_using_sse_4.2
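To make the idea concrete, here is a hedged sketch (my own, not from the blog post) of how a PCMPISTRI-based strcmp can look with SSE4.2 intrinsics. It assumes both strings live in buffers padded so that a full 16-byte load never faults — the over-read caveat discussed below — and the name strcmp_sse42 is made up for illustration:

```c
#include <stddef.h>
#include <nmmintrin.h>  /* SSE4.2 intrinsics */

/* Sketch only: assumes reading full 16-byte blocks past the terminator is
 * safe (padded buffers), which a production strcmp must not assume. */
__attribute__((target("sse4.2")))
static int strcmp_sse42(const char *s1, const char *s2)
{
    /* unsigned bytes, equal each, negative polarity: result bit i is set
     * where the strings differ; CF = mismatch found, ZF = terminator seen */
    enum { MODE = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                  _SIDD_NEGATIVE_POLARITY | _SIDD_LEAST_SIGNIFICANT };
    for (size_t i = 0; ; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(s1 + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(s2 + i));
        int idx = _mm_cmpistri(a, b, MODE);   /* index of first difference */
        if (_mm_cmpistrc(a, b, MODE)) {       /* some byte differed */
            unsigned char c1 = (unsigned char)s1[i + idx];
            unsigned char c2 = (unsigned char)s2[i + idx];
            return c1 < c2 ? -1 : 1;
        }
        if (_mm_cmpistrz(a, b, MODE))         /* hit the terminator, no diff */
            return 0;
    }
}
```

Note how the instruction itself detects the terminating zero, which is what removes the need for a separate strlen pass.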
Solution 2
GCC in this case is using a builtin strcmp. If you want it to use the version from glibc, use -fno-builtin. But you should not assume that GCC's builtin version of strcmp or glibc's implementation of strcmp is efficient. I know from experience that GCC's builtin memcpy and glibc's memcpy are not as efficient as they could be.
I suggest you look at Agner Fog's asmlib. He has optimized several of the standard library functions in assembly. See the file strcmp64.asm. This has two versions: a generic version for CPUs without SSE4.2 and a version for CPUs with SSE4.2. Here is the main loop of the SSE4.2 version:
compareloop:
add rax, 16 ; increment offset
movdqu xmm1, [rs1+rax] ; read 16 bytes of string 1
pcmpistri xmm1, [rs2+rax], 00011000B ; unsigned bytes, equal each, invert. returns index in ecx
jnbe compareloop ; jump if not carry flag and not zero flag
For the generic version he writes
This is a very simple solution. There is not much gained by using SSE2 or anything complicated
Here is the main loop of the generic version:
_compareloop:
mov al, [ss1]
cmp al, [ss2]
jne _notequal
test al, al
jz _equal
inc ss1
inc ss2
jmp _compareloop
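For reference, the byte-at-a-time loop above corresponds to roughly this C (a sketch; the real asmlib routine differs in register use and calling convention):

```c
/* C equivalent of the generic byte-at-a-time compare loop sketched above. */
int strcmp_bytewise(const char *s1, const char *s2)
{
    while (*s1 != '\0' && *s1 == *s2) {  /* walk while equal and not at end */
        s1++;
        s2++;
    }
    /* compare as unsigned char, as the standard requires */
    return (unsigned char)*s1 - (unsigned char)*s2;
}
```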
I would compare the performance of GCC's builtin strcmp
, GLIBC's strcmp
and the asmlib strcmp
. You should look at the disassembly to make sure that you get the builtin code. For example GCC's memcpy
does not use the builtin version from sizes larger than 8192.
Edit: In regards to the string length, Agner's SSE4.2 version reads up to 15 bytes beyond the end of the string. He argues this is rarely a problem since nothing is written. It's not a problem for stack-allocated arrays. For statically allocated arrays it could be a problem at memory page boundaries. To get around this he adds 16 bytes to the .bss section after the .data section. For more details see the section 1.7 String instructions and safety precautions in the manual of the asmlib.
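The page-boundary concern can be made concrete. On typical 4 KiB pages, an unaligned 16-byte load faults only if it spills into an unmapped next page; here is a sketch of that check (the 4 KiB page size is my assumption):

```c
#include <stdint.h>

/* Does an unaligned 16-byte load starting at p cross into the next 4 KiB
 * page? If not, it can only fault when the string's own page is unmapped. */
static int crosses_page(const void *p)
{
    uintptr_t offset_in_page = (uintptr_t)p & 0xFFF;
    return offset_in_page > 0xFF0;  /* bytes [p, p+15] spill past the page */
}
```

An aligned 16-byte load can never cross, which is why implementations that align first can over-read safely.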
Solution 3
When the standard library for C was designed, the implementations of string.h methods that were most efficient when dealing with large amounts of data would be reasonably efficient for small amounts, and vice versa. While there may be some string-comparison scenarios where sophisticated use of SIMD instructions could yield better performance than a "naive implementation", in many real-world scenarios the strings being compared will differ in the first few characters. In such situations, the naive implementation may yield a result in less time than a "more sophisticated" approach would spend deciding how the comparison should be performed. Note that even if SIMD code is able to process 16 bytes at a time and stop when a mismatch or end-of-string condition is detected, it would still have to do additional work equivalent to using the naive approach on the last 16 characters scanned. If many groups of 16 bytes match, being able to scan through them quickly may benefit performance. But in cases where the first 16 bytes don't match, it would be more efficient to just start with the character-by-character comparison.
Incidentally, another potential advantage of the "naive" approach is that it would be possible to define it inline as part of the header (or a compiler might regard itself as having special "knowledge" about it). Consider:
int strcmp(const char *p1, const char *p2)
{
    int t1, t2;
    do
    {
        t1 = (unsigned char)*p1; t2 = (unsigned char)*p2;
        if (t1 != t2)
        {
            if (t1 > t2) return 1;
            return -1;
        }
        if (!t1)
            return 0;
        p1++; p2++;
    } while (1);
}
...invoked as:
if (strcmp(p1,p2) > 0) action1();
if (strcmp(p3,p4) != 0) action2();
While the method would be a little big to inline, inlining could in the first case allow a compiler to eliminate the code that checks whether the returned value is greater than zero, and in the second eliminate the code that checks whether t1 is greater than t2. Such optimization would not be possible if the method were dispatched via an indirect jump.
Solution 4
Making an SSE2 version of strcmp was an interesting challenge for me.
I don't really like compiler intrinsic functions because of code bloat, so I decided to take an auto-vectorization approach. My approach is based on templates and approximates a SIMD register as an array of words of different sizes.
I tried to write an auto-vectorizing implementation and test it with GCC and MSVC++ compilers.
So, what I learned is:
1. GCC's auto-vectorizer is good (awesome?)
2. MSVC's auto-vectorizer is worse than GCC's (it doesn't vectorize my packing function)
3. All compilers declined to generate the PMOVMSKB instruction, which is really sad
Results:
The version compiled by online GCC gains ~40% with SSE2 auto-vectorization. On my Windows machine with a Bulldozer-architecture CPU the auto-vectorized code is faster than the online compiler's, and the results match the native implementation of strcmp. But the best thing about the idea is that the same code can be compiled for any SIMD instruction set, at least on ARM & x86.
Note:
If anyone finds a way to make the compiler generate the PMOVMSKB instruction, overall performance should get a significant boost.
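For what it's worth, that instruction is exposed directly as the _mm_movemask_epi8 intrinsic; here is a hedged hand-written sketch (not the templated code from this post) of the 16-byte test it would enable:

```c
#include <emmintrin.h>  /* SSE2 */

/* Test one 16-byte block: returns a 16-bit mask with bit i set where the
 * strings differ or where s1 has its terminating zero. PMOVMSKB is
 * generated by the _mm_movemask_epi8 calls. Assumes the loads are safe. */
static int block_diff_or_end(const char *s1, const char *s2)
{
    __m128i a    = _mm_loadu_si128((const __m128i *)s1);
    __m128i b    = _mm_loadu_si128((const __m128i *)s2);
    __m128i eq   = _mm_cmpeq_epi8(a, b);                    /* 0xFF where equal */
    __m128i zero = _mm_cmpeq_epi8(a, _mm_setzero_si128());  /* 0xFF where s1==0 */
    return (~_mm_movemask_epi8(eq) & 0xFFFF) | _mm_movemask_epi8(zero);
}
```

A zero mask means "all 16 bytes equal, no terminator": skip to the next block. Otherwise __builtin_ctz of the mask gives the interesting byte index in a couple of instructions — exactly the step the auto-vectorized code ended up emulating with stores and byte loads.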
Command-line options for GCC: -std=c++11 -O2 -m64 -mfpmath=sse -ftree-vectorize -msse2 -march=native -Wall -Wextra
Links:
Source code compiled by Coliru online compiler
Assembly + Source code (Compiler Explorer)
@PeterCordes, thanks for the help.
Solution 5
It depends on your implementation. On MacOS X, functions like memcpy, memmove and memset have implementations that are optimised depending on the hardware you are using (the same call will execute different code depending on the processor, set up at boot time). These implementations use SIMD, and for big amounts (megabytes) they use some rather fancy tricks to optimise cache usage. There is nothing for strcpy and strcmp, as far as I know.
Convincing the C++ standard library to use that kind of code is difficult.
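The boot-time/first-call dispatch idea is simple to sketch in GNU C (the function names here are mine; a real library would install an SSE4.2 or AVX2 version instead of the generic one):

```c
/* Generic fallback: plain byte-at-a-time strcmp. */
static int strcmp_generic(const char *s1, const char *s2)
{
    unsigned char c1, c2;
    do {
        c1 = (unsigned char)*s1++;
        c2 = (unsigned char)*s2++;
    } while (c1 == c2 && c1 != 0);
    return (c1 > c2) - (c1 < c2);
}

static int strcmp_dispatch(const char *s1, const char *s2);

/* Calls go through the function pointer; only the first one pays for the
 * CPU check, after which the pointer is patched to the chosen version. */
static int (*strcmp_ptr)(const char *, const char *) = strcmp_dispatch;

static int strcmp_dispatch(const char *s1, const char *s2)
{
    if (__builtin_cpu_supports("sse4.2"))
        strcmp_ptr = strcmp_generic;  /* a real lib would install strcmp_sse42 */
    else
        strcmp_ptr = strcmp_generic;
    return strcmp_ptr(s1, s2);
}
```

glibc achieves the same effect with IFUNC resolvers, so the indirection is folded into the dynamic linker's normal PLT dispatch.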
user1095108
Updated on June 06, 2022

Comments
-
user1095108 almost 2 years
I've tried to compile this program on an x64 computer:
#include <cstring>

int main(int argc, char* argv[])
{
    return ::std::strcmp(argv[0],
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really long string"
    );
}
I compiled it like this:
g++ -std=c++11 -msse2 -O3 -g a.cpp -o a
But the resulting disassembly is like this:
0x0000000000400480 <+0>:  mov    (%rsi),%rsi
0x0000000000400483 <+3>:  mov    $0x400628,%edi
0x0000000000400488 <+8>:  mov    $0x22d,%ecx
0x000000000040048d <+13>: repz cmpsb %es:(%rdi),%ds:(%rsi)
0x000000000040048f <+15>: seta   %al
0x0000000000400492 <+18>: setb   %dl
0x0000000000400495 <+21>: sub    %edx,%eax
0x0000000000400497 <+23>: movsbl %al,%eax
0x000000000040049a <+26>: retq
Why is no SIMD used? I suppose it could be to compare, say, 16 chars at once. Should I write my own SIMD strcmp, or is it a nonsensical idea for some reason? -
user1095108 over 9 years Can you show some optimized mac disassembly for strcmp? -
user1095108 over 9 years I've tried immediately to compile with -msse4.2 and the same machine code is generated. Good to know I wasn't completely wrong, that it could be done, though. -
user1095108 over 9 years I've used lex and bison before and they generate a lot of code. It is (usually) valid C++, but efficient? I somehow doubt it. Anyhow, parsing was not the relevant topic in my question.
-
phuclv over 9 years Using a length-prefix string type like Pascal or std::string you can determine if it has passed the string termination
-
Surt over 9 years Better performance => lower energy use => longer battery duration => happy environmentally conscious customer.
-
harold over 9 years It doesn't have to ensure that no memory accesses happen over the end of the string. It only has to ensure that no memory accesses happen on pages that the string is not at least partly on, and that's much easier.
-
Surt over 9 years @harold, the limits are even looser: you just have to ensure that there is anything at all after the last address. Using x-aligned memory for the strings solves the problem without checks.
-
harold over 9 years @Surt fair enough, but that doesn't change much, does it? It only really knows about that one string, unless it's in statically allocated memory.
-
Surt over 9 years Comparing 16 bytes at a time is a real win, even if you have to check for both null and the actual string. The beauty of it is that once you have made the code you can use it "forever" and benefit in new programs.
-
Surt over 9 years @harold, if your allocator always returns aligned memory for strings the check becomes superfluous. (And I was agreeing with you :))
-
Z boson over 9 years @Surt, can you say a little more about what you mean? Are you saying that the generic version can be done better, or just that the SSE4.2 version is the right idea?
-
Z boson over 9 years Yeah, according to Agner Fog when he looked into this the MacOS X versions were pretty good but the Linux and Windows versions were inefficient. So probably the MacOS X assembly is interesting.
-
Matthieu M. over 9 years @Surt: For plain char const* though, you don't know their alignment up front; it does not matter though, the first step will simply be to align on an x-byte boundary before performing the SSE specialized algorithm -
Z boson over 9 years The size of the string is not necessary to know. Reading 15 extra bytes is not a problem for stack-allocated arrays. For statically allocated arrays there is a way to get around this by adding a 16-byte .bss section after the .data section.
-
Z boson over 9 years It's only limited by memory bandwidth if the array does not fit in the cache. You could use memcpy or strcmp in a tight inner loop not limited by memory bandwidth. -
DevSolar over 9 years @Zboson: And how do you know what you're comparing is not heap-allocated, and at the very end of the page, making any access beyond the '\0' an immediate pagefault? How do you know the system is, and ever will be, using page-based memory protection in the first place? -
Z boson over 9 years @DevSolar, I don't know. See Section "1.7 String instructions and safety precautions" in the asmlib manual. Agner discusses this in detail. For the heap he writes "It is recommended to allocate sufficient heap space if you are using dynamically allocated strings. A safer and more efficient solution is to allocate a memory pool of sufficient size and store multiple strings in the same memory pool". So GCC, to be safe, can't assume this, but that does not mean faster solutions don't exist which can be used in most cases. You need to know when to be careful.
-
harold over 9 years @DevSolar you can safely make those assumptions, and you can't go onto the next page with aligned reads.
-
supercat over 9 years I would think the non-SSE loop could be improved by subtracting one pointer from the other before the start and then using an indexed addressing mode, or does that optimization not help on newer processors?
-
Z boson over 9 years Although what you say sounds reasonable, you don't provide any evidence that "in cases where the first 16 bytes don't match, it would be more efficient to just start with the character-by-character comparison." In fact, when I look at the preamble and epilogue of the main loops of the SSE4.2 and the generic function in the asmlib, they are almost identical. I would not be surprised if the SSE4.2 version is never slower than the generic version, even for arrays less than or equal to 16 bytes.
-
Z boson over 9 years @supercat, I don't know. In theory one would have to look at the latency and throughput and how many cycles it takes. For example, if one port always needs two cycles and the rest one, then adding another instruction to another port might not make a difference. This could be checked with IACA. But I have never profiled strcmp, so I only know what I read. -
phuclv over 9 years std::string may be inefficient in some ways, but it's good for determining string length
-
supercat over 9 years @Zboson: I wasn't aware of the SSE 4.2 additions; string-compare code could benefit from earlier SSE functions but not as greatly. If library code were being compiled for 4.2 only, I could see the string functions as a potential "unconditional win", but if code had to check for their availability before executing them such a check would be a 100% loss on processors which don't support them (which I understand are still quite numerous) and even on processors which do might not come out ahead when strings differ in the first character.
-
Z boson over 9 years Using a CPU dispatcher, the strcmp function only has to check the CPUID the first time it's called. Every call after that will jump directly to the SSE4.2 or the generic version. This is how the asmlib does it. So there is only an overhead on the first call. You could also call an init function for a library with all the functions you are replacing, which makes all the function pointers point to the appropriate version. -
supercat over 9 years @Zboson: A CPU dispatcher would add an indirect jump; I know that's a much smaller penalty than it used to be, but I don't know how small. Further, strcmp() is small enough that an aggressive in-lining optimizer might possibly tackle it (if a header were defined such that it could do so). See edit above.
-
Z boson over 9 years That's a good point. Actually, GCC already inlines strcmp in the OP's example. Like I said in my answer, it would be interesting to profile and compare the inline, glibc, and asmlib methods. I'm not going to do it anytime soon though... -
kuroi neko over 9 years The usual crowd of C++ zealots is barking at the blasphemator... Surt's comment is irrelevant in the extreme; you should stay in your marketing office and let the programmers do the talking, pal. As for you, Mr. OP, if you're afraid of the code generated by lex, you should be terrified by SIMD instructions. They are much more difficult to master (I mean using them is easy enough, but writing code that actually runs better is another story entirely).
-
Peter Cordes about 9 years www-ssl.intel.com/content/www/us/en/processors/… (The optimization manual) spends a fair bit of time on memcpy implementation techniques, and performance on Ivy Bridge and later CPUs where REP STOSB triggers a microcoded high-performance memset, but has higher startup overhead than a 128b or 256b wide SSE/AVX implementation. See section 3.7.5, or search for memcpy. There's a CPUID feature bit for ERMSB (Enhanced Rep Movsb and StosB). -
Paul R over 8 years While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.
-
Peter Cordes almost 8 years You can and should make the godbolt link go directly to your code on godbolt, instead of expecting people to copy/paste for you. SSE2 is portable to any modern x86. It's only a mess if you write messy code.
-
Roman almost 8 years @PeterCordes I've tried to make a short link to my code, but I just couldn't make one. I always ended up with a 200-symbol link that would take up half of the post, and Google's URL shortener didn't help either; it just doesn't recognize the resulting URL as valid.
-
Peter Cordes almost 8 years Only use godbolt short-links for comments, because full links can't rot. When there's no relevant char-limit, just make full links that don't display in the text of your post. (godbolt shortlinks already use goo.gl internally. Oh that's weird, I think the shortlink button might be broken. I hardly ever use it.)
-
Peter Cordes almost 8 years It doesn't look like g++ actually auto-vectorized your code at -O3 (which includes -ftree-vectorize). I only see a pcmpeqd, which it uses to generate a vector of all-ones for storing to the stack. I see a lot of byte-at-a-time copying and comparing, and testing bitfields. If that's faster than the builtin strcmp, I wonder just how bad it is... -
Roman almost 8 years @PeterCordes Yep, gcc starts to mess up the code badly at -O3.
-
Peter Cordes almost 8 years Wow, that's weird. I've never seen gcc auto-vectorize something with -O2 -ftree-vectorize, but not at -O3. That's not a good sign for the quality / efficiency of your code. There weren't any high-level comments in it documenting the overall approach, so I didn't really try to read all that mass of templated C++. The vectorized asm output looks like it's emulating PMOVMSKB after the vector pcmpeqb by storing and then doing byte loads + shifts. That's insanely horrible :( -
Roman almost 8 years @PeterCordes I have updated the link to Compiler Explorer in the post with my command line, please check it out.
-
Peter Cordes almost 8 years Why did you leave out -Wall -Wextra? But yeah, that's the asm I was commenting on in my last comment. Vectorized compare, but then packing a bit from each compare-result byte one at a time into a GP register. -
Roman almost 8 years Let us continue this discussion in chat.
-
Roman almost 8 years @PeterCordes I implemented a slightly different micro-benchmark and changed the way registers were compared to zero. Now the main vectorized loop consists of SSE instructions almost entirely (a few test instructions present). Could you check the new code, please?
-
Peter Cordes almost 8 years It looks ok, but the horizontal OR down to a single byte takes a lot of steps. If you could horizontal OR down to 4 or 8 bytes, and check those as an int64_t, the compiler could shorten that chain of vpsrldq / vpor dramatically (just one, then vmovq rax,xmm0). (BTW, on godbolt you should probably use -march=haswell or something instead of -march=native, because that will change depending on what hardware godbolt.org is running on. Note that at the moment you're targeting AVX2 because of -march=native. It doesn't look like your code uses any 256b integer instructions, though.) -
Roman almost 8 years @PeterCordes '-march=haswell' doesn't change anything in the assembly. The auto-vectorizer works on arrays; because my "register" is a fixed-size 16-byte array it can't use AVX. I tried to compact the result into one register and then do a single compare, but it ended in massive code bloat. You can check out the results here, modified version on the right part: Screenshot
-
Peter Cordes almost 8 years Yes, godbolt currently runs on a Haswell server, so -march=haswell is the same as -march=native now. But in 5 years, it might include AVX512. And yes, your code is using the 3-operand non-destructive AVX versions of SSE2 instructions. That's why vpsrldq can put its output in a different register, instead of shifting in-place (avoiding a lot of movdqa insns). You might want to use -mtune=haswell to tune for Haswell without making code that doesn't work on older CPUs. -
Peter Cordes almost 8 years That sucks when gcc decides to widen byte vectors to quadwords with pmovzx. :/ I'm not sure how best to write code that would get gcc to emit one vpsrldq / vpor and then a vmovq rax,xmm0 / test rax,rax, but that would get the job done more efficiently than ORing all the way down to a single byte and extracting it with vpextrb. Maybe use a union { char b[16]; int64_t q[2]; }; or something, so you can access a range of elements? -
Roman almost 8 years @PeterCordes I have inserted a type cast to interpret the simd register as an array of CPU words (64-bit words for a 64-bit CPU). The compiler did a pretty neat code cleanup: only 3 instructions to test for zero now, and zero bit-shifting. Updated the links to the code, so you can check this out.
-
Peter Cordes almost 8 years Is it faster? The vector loop is now huge, AFAICT, from .L5 to the jmp .L5. There's some scalar stuff inside that it jumps over, and maybe 2 unrolled iterations (one just after .L5 and one just after .L7). It's doing the 64-bit OR with a movq and a store/reload to L1. I think it's doing so much storing/reloading that it's going to be a problem for throughput. (The latency isn't very important for strings with the only difference very far from the front, since checking each block of 16B should be a separate dep chain. The extra latency does make the final branch mispredict worse.) -
Peter Cordes almost 8 years Also, be careful with type-punning by reinterpret_casting a pointer. The language's aliasing rules do guarantee that char* can alias anything, which is why it works in this case. This is why I recommended a union (because that is guaranteed to always work between any types in GNU C++, although still not in ISO C++. I think the ISO rules are the same in C and C++, and the only "safe" way to type-pun is with memcpy (or other char*-based methods). But nobody wants that, so people use unions in practice.) -
Roman almost 8 years @PeterCordes honestly, I'm struggling with the benchmark now, always getting strange results. The new version seems to be a little bit faster. But I think we need a better micro-benchmark to see real results.
-
Peter Cordes almost 8 years Your benchmark is still meaningless because you only time one call to each. Warm-up effects will dominate, especially for the library function (because of lazy dynamic linking). Even cache misses and the chrono::nanoseconds overhead are probably significant. You need to create a loop which actually makes multiple calls to the same compare function with the same args many times. I'm not interested enough to do it for you or dig up a benchmark framework you could use, but you should watch this presentation to learn more about benchmarking / profiling: youtube.com/watch?v=nXaxk27zwlk -
Roman almost 8 years @PeterCordes I actually did what you wrote; I'm not so careless as to benchmark functions once without a reason. It is much harder to make a decent micro-benchmark with multiple calls, since a loop that calls the same function with the same arguments gets optimized away. I tried to solve that with a new variable, a bitset that contains the results of all of the 32 runs, but then the results are wrong: my function gets 20-30 times slower than native strcmp and its analogs. Then I added a flush_cache() call after each iteration of the loop. After that all functions had equal run time.
-
Peter Cordes almost 8 years I was commenting based on the main() that's still included in your Godbolt link. I tried it on my desktop and got ~300 vs. ~900, IIRC. I didn't look at coliru because running it on an online IDE is no good when you're measuring wall-clock time with chrono::high_resolution_clock (rather than "user" CPU time) on a non-idle system. -
Peter Cordes over 6 years This is a bogus argument. An optimized C library strcmp can use SIMD. It's harder to do safely with two pointers that could be misaligned relative to each other, though. See Is it safe to read past the end of a buffer within the same page on x86 and x64? Answer: yes in asm, dicey in C because of UB for reading outside objects. The compiler is emitting asm directly so it knows what is safe. -
Peter Cordes over 6 years repz cmpsb is not terrible for very short strings, and saves a call, but I don't think it's a good choice if one of the strings is known to be very long. -
Peter Cordes over 6 years A byte-at-a-time loop can't be optimal! glibc's SSE2 strcmp checks for page-crossing and then uses unaligned loads. It is of course fully safe, never reading from a page that doesn't contain any part of either string, because anything else would be unacceptable in a standard library. Having even less overhead is certainly possible if you can arrange to skip those checks, but it looks pretty reasonable. It uses pcmpeqb and pminub to check for non-match or 0 (end of string). -
Peter Cordes almost 2 years BTW, Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled? is basically the same GCC bug, inlining slow repe scasb at -O1, which leads to disaster for long strings. That bug is now fixed: GCC calls the libc functions. (Which use pcmpeqb or the AVX2 version, not pcmpistri. pcmpistri is not faster than SSE2 for strlen or strcmp.)