strstr faster than algorithms?


Solution 1

Why do you think strstr should be slower than all the others? Do you know what algorithm strstr uses? I think it's quite likely that strstr uses a fine-tuned, processor-specific, assembly-coded algorithm of the KMP type or better. In which case you don't stand a chance of out-performing it in C for such small benchmarks.

(The reason I think this is likely is that programmers love to implement such things.)

Solution 2

Horspool, KMP et al. excel at minimizing the number of byte comparisons.

However, that's not the bottleneck on a modern processor. On an x86/64 processor, your string is being loaded into L1 cache in cache-line-width chunks (typically 64 bytes). No matter how clever your algorithm is, unless it gives you strides that are larger than that, you gain nothing; and the more complicated Horspool code (at least one table lookup per step) can't compete.

Furthermore, you're stuck with the "C" string constraint of null-termination: SOMEWHERE the code has to examine every byte.

strstr() is expected to be optimal for a wide range of cases; e.g. searching for tiny strings like "\r\n" in a short string, as well as much longer ones where some smarter algorithm might have a hope. The basic strchr/memcmp loop is pretty hard to beat across the whole range of likely inputs.
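
That basic loop can be sketched in a few lines of portable C. This is just an illustration of the approach, not any particular libc's implementation, and `simple_strstr` is a made-up name:

```c
#include <stddef.h>
#include <string.h>

/* The strchr/memcmp loop: let strchr race ahead to each candidate
   first-byte match, then verify the rest with memcmp. */
char *simple_strstr(const char *hay, const char *needle)
{
    size_t nlen = strlen(needle);
    if (nlen == 0)
        return (char *)hay;              /* empty needle matches at start */
    for (const char *p = hay; (p = strchr(p, needle[0])) != NULL; p++)
        if (memcmp(p, needle, nlen) == 0)
            return (char *)p;
    return NULL;
}
```

Both strchr and memcmp are themselves typically vectorized, which is a big part of why this "naive" structure is hard to beat.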

Pretty much all x86-compatible processors since 2003 have supported SSE2. If you disassemble glibc's strlen() for x86, you'll notice that it uses SSE2 PCMPEQ and MOVMASK operations to search for the null terminator 16 bytes at a time. That is so efficient that it beats the obvious super-simple loop for anything longer than the empty string.
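
The idea looks roughly like this with SSE2 intrinsics. A hedged sketch of the technique, not glibc's actual source; `sse2_strlen` is my own name, and `__builtin_ctz` assumes GCC/Clang:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Scan 16 bytes at a time with PCMPEQB + PMOVMSKB. Aligned 16-byte
   loads never cross a page boundary, so reading slightly past the
   terminator is safe. */
static size_t sse2_strlen(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    uintptr_t addr = (uintptr_t)s & ~(uintptr_t)15;   /* align down to 16 */
    __m128i chunk = _mm_load_si128((const __m128i *)addr);
    unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
    mask &= ~0u << ((uintptr_t)s - addr);             /* ignore bytes before s */
    while (mask == 0) {
        addr += 16;
        chunk = _mm_load_si128((const __m128i *)addr);
        mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
    }
    return addr + (uintptr_t)__builtin_ctz(mask) - (uintptr_t)s;
}
```

PCMPEQB sets a whole byte per match, PMOVMSKB condenses those into one bit each, and a count-trailing-zeros finds the first terminator, all without a per-byte branch.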

I took that idea and came up with a strstr() that beats glibc's strstr() for all cases greater than 1 byte --- where the relative difference is pretty much moot. If you're interested, check out:

BTW you've probably figured out by now that the x86 REP SCASB/REP CMPSB ops fall on their ass for anything longer than 32 bytes, and are not much improvement for shorter strings. Wish Intel had devoted a little more attention to that, than to adding SSE4.2 "string" ops.

For strings large enough to matter, my perf tests show that BNDM is better than Horspool, across the board. BNDM is more tolerant of "pathological" cases, such as targets that heavily repeat the last byte of a pattern. BNDM can also make use of SSE2 (128-bit registers) in a way that competes with 32-bit registers in efficiency and start-up cost. Source code here.
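
For reference, the core of BNDM is only a few lines. This is my own sketch of the textbook formulation, limited to patterns of at most 32 bytes (one mask bit per pattern position) and using plain 32-bit words rather than the SSE2 variant mentioned above:

```c
#include <stdint.h>

/* BNDM: returns the index of the first match of pat[0..m-1] in
   txt[0..n-1], or -1 if there is none or m is out of range. */
int bndm_search(const unsigned char *pat, int m,
                const unsigned char *txt, int n)
{
    if (m <= 0 || m > 32 || m > n)
        return -1;
    uint32_t B[256] = {0};
    for (int i = 0; i < m; i++)          /* bit (m-1-i) marks pat[i] */
        B[pat[i]] |= 1u << (m - 1 - i);
    int j = 0;
    while (j <= n - m) {
        int i = m - 1, last = m;
        uint32_t d = ~0u;
        while (i >= 0 && d != 0) {       /* scan the window right-to-left */
            d &= B[txt[j + i]];
            i--;
            if (d & (1u << (m - 1))) {   /* suffix read so far is a prefix */
                if (i >= 0)
                    last = i + 1;        /* remember the shift candidate */
                else
                    return j;            /* whole window matched */
            }
            d <<= 1;
        }
        j += last;                       /* skip ahead by the safe shift */
    }
    return -1;
}
```

The bit-parallel `d` register simulates the nondeterministic suffix automaton of the reversed pattern, which is what lets it shrug off the repeated-last-byte cases that hurt Horspool.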

Solution 3

Without seeing your code, it's hard to say exactly. strstr is heavily optimized, and usually written in assembly language. It does things like reading data 4 bytes at a time and comparing them (bit-twiddling if necessary if the alignment isn't right) to minimize memory latency. It can also take advantage of things like SSE to load 16 bytes at a time. If your code is only loading one byte at a time, it's probably getting killed by memory latency.
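
The bit-twiddling referred to above is typically a SWAR trick like the following. A sketch of the technique, not glibc's code:

```c
#include <stdint.h>

/* Does a 32-bit word contain a zero byte? This exact form (including
   the & ~v term) gives no false positives; it is the classic building
   block of word-at-a-time strlen/strchr implementations. */
static int has_zero_byte(uint32_t v)
{
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}
```

Searching for a byte other than zero works the same way after XORing the word with that byte replicated four times, which is how a word-at-a-time strchr is built.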

Use your debugger and step through the disassembly of strstr -- you'll probably find some interesting things in there.

Solution 4

Imagine you want to get something cleaned. You could just clean it yourself, or you could hire ten professional cleaners to clean it. If the cleaning job is an office building, the latter solution would be preferable. If the cleaning job was one window, the former would be preferable.

You never get any payback for the time spent setting up to do the job efficiently because the job doesn't take very long.

Author: Josh

Updated on June 08, 2022

Comments

  • Josh
    Josh almost 2 years

    I have a file that's 21056 bytes.

    I've written a program in C that reads the entire file into a buffer, and then uses multiple search algorithms to search the file for a token that's 82 chars.

    I've used all the implementations of the algorithms from the “Exact String Matching Algorithms” page. I've used: KMP, BM, TBM, and Horspool. And then I used strstr and benchmarked each one.

    What I've found is that strstr outperforms all the other algorithms every time. The only one that is sometimes faster is BM.

    Shouldn't strstr be the slowest?

    Here's my benchmark code with an example of benchmarking BM:

    #include <windows.h>   /* QueryPerformanceCounter / QueryPerformanceFrequency */

    double get_time()
    {
        LARGE_INTEGER t, f;
        QueryPerformanceCounter(&t);
        QueryPerformanceFrequency(&f);   /* frequency is fixed at boot */
        return (double)t.QuadPart/(double)f.QuadPart;
    }

    before = get_time();
    BM(token, strlen(token), buffer, len);
    after = get_time();
    printf("Time: %f\n\n", after - before);
    

    Could someone explain to me why strstr is outperforming the other search algorithms? I'll post more code on request if needed.

    • Konrad Rudolph
      Konrad Rudolph over 12 years
      The error is in your code. At least Horspool should consistently outperform strstr – KMP may actually be slower. But since you didn’t post your code, we can’t help you. That said, you can choose your data deliberately to make the naive search win, so the choice of input data is also relevant.
    • Blindy
      Blindy over 12 years
      Are you seriously benchmarking searching for an 80 byte string in a 20k string?... Your sample size is so small you could do it by hand!
  • Konrad Rudolph
    Konrad Rudolph over 12 years
    It’s almost guaranteed that strstr uses the naive algorithm. It performs well enough, and, this is the important part, it’s general purpose. KMP, for instance, sometimes performs much worse so it certainly isn’t used.
  • Josh
    Josh over 12 years
    Huh, always thought strstr used some sort of brute force.
  • Konrad Rudolph
    Konrad Rudolph over 12 years
    @Josh It does, don’t be misled.
  • Piotr Praszmo
    Piotr Praszmo over 12 years
    glibc's implementation. It's not assembler but I'm sure it's heavily optimized.
  • wildplasser
    wildplasser over 12 years
    IIRC Glibc has optimised asm functions for popular platforms and functions, and (more or less) naive fall-back implementations in C.
  • TonyK
    TonyK over 12 years
    @Konrad: Look at Banthar's link. It supports my claim. And what's this about KMP sometimes performing much worse than the naive algorithm? I find that impossible to believe. Can you provide a reference?
  • TonyK
    TonyK over 12 years
    @Josh: It doesn't, don't be misled!
  • Nemo
    Nemo over 12 years
    TonyK: I suggest editing your answer to include @Banthar's link. Makes it obvious that you are right (well, it looks more like BM than KMP, but still).
  • Konrad Rudolph
    Konrad Rudolph over 12 years
    @TonyK KMP has good theoretical characteristics in terms of the number of comparisons, but in practice it’s slower than the naive search since its operations are more complex. This is well known, but if you need a reference, consider the standard work on this matter, Flexible Pattern Matching by Navarro and Raffinot, page 1 (the link is buggy, sorry – try pressing “Previous” and “Next” in the search results to get it to display correctly). Incidentally, this also shows that BM is in practice slower than Horspool.
  • Konrad Rudolph
    Konrad Rudolph over 12 years
    @Banthar Colour me surprised. I had looked at multiple strstr implementations (I’ve worked extensively on string search implementations over the last few years) and every implementation I saw was the naive search – which is a good choice, by the way (see above). I’m also surprised that they used BM instead of Horspool for long needles since the latter is known to perform better on average in typical situations (again, see above).
  • Konrad Rudolph
    Konrad Rudolph over 12 years
    The effects of 4-byte-comparisons and SSE on string search are quite limited, unfortunately since there exist linear dependencies between the elements. In particular, you can’t just compare in non-overlapping 4-byte blocks: even if you compare 4 bytes at a time, you still need to compare overlapping blocks, not gaining much.
  • Prof. Falken
    Prof. Falken over 12 years
    TonyK, I agree with your answer, but for it to be more futureproof, it should not mention what kind of technique strstr() is implemented with. Just say it's "likely to be very fast on the platform of choice".
  • TonyK
    TonyK over 12 years
    @Amigable: OK, I changed it slightly.
  • Prof. Falken
    Prof. Falken over 12 years
    And that tipped the scale. +1 I like this answer very much. :)
  • TonyK
    TonyK over 12 years
    @Konrad: When you said "much worse", I thought you meant "worse than O(mn)". I agree that naive search is likely to be faster than KMP for small strings. But then a good implementation will use a naive algorithm for such strings anyway.
  • Konrad Rudolph
    Konrad Rudolph over 12 years
    @TonyK It will be faster for small needles, regardless of the length of the haystack – this is important, because it’s the most common use-case by far: your text may be very large, but as long as the needle is relatively small, naive search will be faster than KMP.
  • Ankit Roy
    Ankit Roy over 12 years
    @Konrad: You're still wrong, it's not BM -- as it says on line 1 it uses the Two-Way Algorithm, which amazingly takes only linear time and constant space, unlike BM which requires space proportional to the needle size, or BMH which is not linear-time on some inputs: igm.univ-mlv.fr/~lecroq/string/node26.html
  • Ankit Roy
    Ankit Roy over 12 years
    @Konrad: When line 50 mentions the "bad character shift table", that should tell you that the sentence really means it's even more like BMH than BM (since both use that table, but BMH uses only that table).
  • Konrad Rudolph
    Konrad Rudolph over 12 years
    @j_random_hacker Ah, so it does. I had only skimmed the code itself and surmised that two-way would only be used for short needles. This is indeed a sensible choice although I still think Horspool could be faster in many realistic scenarios (and Navarro & Raffinot also seem to think this). But you are wrong about the constant space: two-way does need space linear in the size of the needle, otherwise its worst case is quadratic, not linear! And the glibc implementation also uses linear space for large needles.
  • Ankit Roy
    Ankit Roy over 12 years
    @Konrad: Where do you get the idea that it needs quadratic time from? Also, the glibc implementation needs O(alphabet_size) space for large needles, not space linear in the size of the needle.
  • Konrad Rudolph
    Konrad Rudolph over 12 years
    @j_random_hacker I’ll shut up now.
  • Ankit Roy
    Ankit Roy over 12 years
    @Konrad: Ah, I see the algorithm description was worded a bit confusingly on the page I linked. But it is indeed a weakness-free algorithm: constant-time and linear-space. That it exists should be better known and appreciated! :) What's more, it's right where it should be: in the library implementation of strstr(). :)
  • Adam Rosenfield
    Adam Rosenfield over 12 years
    @Konrad: strstr can basically be implemented as a loop around strchr and strcmp/memcmp (with memcmp requiring an initial strlen). strchr can be implemented with some bit-twiddling tricks and doing 4-byte loads. memcmp can load 4 bytes at a time from each of the two comparands and compare those words. If the two operands don't have the same alignment, you keep track of two words at a time and bit-twiddle them to get the misaligned word to compare against the other string. See sysdeps/x86_64/memcmp.S in glibc, for example.
  • Ankit Roy
    Ankit Roy over 12 years
    Whoops, I meant "linear-time and constant-space"!
  • Mischa
    Mischa over 12 years
    @Konrad: you gain something comparing the first 4 bytes of a pattern against a 4-byte sliding window, shifting and inserting the next target string byte into the bottom of a 32-bit word. It's not as much as you'd wish, since mixing 8- and 32-bit instructions on the same register stalls the instruction pipeline. SSE's a different story: you can compare 16 bytes from the target string with the first byte of the pattern, replicated 16 times. This seriously rocks. See my post below re an article with more details.
  • Konrad Rudolph
    Konrad Rudolph over 12 years
    Have you tried submitting a patch to the glibc? Of course, this requires careful benchmark over a wide range of different inputs but your above explanation seems sound.
  • mikew
    mikew about 11 years
    so given the cache optimization, is it a waste of time to try to outperform the naive search with Horspool?
  • user626528
    user626528 about 11 years
    @Mischa, Any advice on how to compile this code for Windows? I have no idea what ffs is or where to get it.
  • Mischa
    Mischa almost 7 years
    Wow, so sorry I never answered that. For the record, the MSVC equivalent is _BitScanForward(), but it's a bit clunky. I wrap it as (pardon the code in comments please, O Editor) static inline uint32_t intlowz(uint32_t x) { uint32_t pos; return x ? (_BitScanForward(&pos, x), pos) : 32; }