C++ string::find complexity

35,797

Solution 1

Why the c++'s implemented string::substr() doesn't use the KMP algorithm (and doesn't run in O(N + M)) and runs in O(N * M)?

I assume you mean find(), rather than substr() which doesn't need to search and should run in linear time (and only because it has to copy the result into a new string).

The C++ standard doesn't specify implementation details, and only specifies complexity requirements in some cases. The only complexity requirements on std::string operations are that size(), max_size(), operator[], swap(), c_str() and data() are all constant time. The complexity of anything else depends on the choices made by whoever implemented the library you're using.

The most likely reason for choosing a simple search over something like KMP is to avoid needing extra storage. Unless the string to be found is very long, and the string to search contains a lot of partial matches, the time taken to allocate and free that would likely be much more than the cost of the extra complexity.

Is that corrected in c++0x?

No, C++11 doesn't add any complexity requirements to std::string, and certainly doesn't add any mandatory implementation details.

If the complexity of current substr is not O(N * M), what is that?

That's the worst-case complexity, when the string to search contains a lot of long partial matches. If the characters have a reasonably uniform distribution, then the average complexity would be closer to O(N). So by choosing an algorithm with better worst-case complexity, you may well make more typical cases much slower.

Solution 2

FYI, The string::find in both gcc/libstdc++ and llvm/libcxx were very slow. I improved both of them quite significantly (by ~20x in some cases). You might want to check the new implementation:

GCC: PR66414 optimize std::string::find https://github.com/gcc-mirror/gcc/commit/fc7ebc4b8d9ad7e2891b7f72152e8a2b7543cd65

LLVM: https://reviews.llvm.org/D27068

The new algorithm is simpler and uses hand optimized assembly functions of memchr, and memcmp.

Solution 3

Where do you get the impression from that std::string::substr() doesn't use a linear algorithm? In fact, I can't even imagine how to implement in a way which has the complexity you quoted. Also, there isn't much of an algorithm involved: is it possible that you are think this function does something else than it does? std::string::substr() just creates a new string starting at its first argument and using either the number of characters specified by the second parameter or the characters up to the end of the string.

You may be referring to std::string::find() which doesn't have any complexity requirements or std::search() which is indeed allowed to do O(n * m) comparisons. However, this is a giving implementers the freedom to choose between an algorithm which has the best theoretical complexity vs. one which doesn't doesn't need additional memory. Since allocation of arbitrary amounts of memory is generally undesirable unless specifically requested, this seems a reasonable thing to do.

Solution 4

The C++ standard does not dictate the performance characteristics of substr (or many other parts, including the find you're most likely referring to with an M*N complexity).

It mostly dictates functional aspects of the language (with some exceptions like the non-legacy sort functions, for example).

Implementations are even free to implement qsort as a bubble sort (but only if they want to be ridiculed and possibly go out of business).

For example, there are only seven (very small) sub-points in section 21.4.7.2 basic_string::find of C++11, and none of them specify performance parameters.

Solution 5

Let's look into the CLRS book. On the page 989 of third edition we have the following exercise:

Suppose that pattern P and text T are randomly chosen strings of length m and n, respectively, from the d-ary alphabet Ʃd = {0; 1; ...; d}, where d >= 2. Show that the expected number of character-to-character comparisons made by the implicit loop in line 4 of the naive algorithm is enter image description here
over all executions of this loop. (Assume that the naive algorithm stops comparing characters for a given shift once it finds a mismatch or matches the entire pattern.) Thus, for randomly chosen strings, the naive algorithm is quite efficient.

NAIVE-STRING-MATCHER(T,P)
1 n = T:length
2 m = P:length
3 for s = 0 to n - m
4     if P[1..m] == T[s+1..s+m]
5         print “Pattern occurs with shift” s

Proof:

For a single shift we are expected to perform 1 + 1/d + ... + 1/d^{m-1} comparisons. Now use summation formula and multiply by number of valid shifts, which is n - m + 1. □

Share:
35,797
Farzam
Author by

Farzam

Updated on October 29, 2020

Comments

  • Farzam
    Farzam over 3 years

    Why the c++'s implemented string::find() doesn't use the KMP algorithm (and doesn't run in O(N + M)) and runs in O(N * M)? Is that corrected in C++0x? If the complexity of current find is not O(N * M), what is that?

    so what algorithm is implemented in gcc? is that KMP? if not, why? I've tested that and the running time shows that it runs in O(N * M)