Most efficient way to check if $string starts with $needle in perl

performance perl string-matching

38,904

Solution 1

rindex $string, $substring, 0

searches for $substring in $string at position <=0 which is only possible if $substring is a prefix of $string. Example:

> rindex "abc", "a", 0
0
> rindex "abc", "b", 0
-1

Solution 2

How important is this, really? I did a number of benchmarks, and the index method averaged 0.68 microseconds per iteration; the regex method 1.14μs; the substr method 0.16μs. Even my worst-case scenarios (2250-char strings that were equal), index took 2.4μs, regex took 5.7μs, and substr took 0.5μs.

My advice is to write a library routine:

sub begins_with
{
    return substr($_[0], 0, length($_[1])) eq $_[1];
}

and focus your optimization efforts elsewhere.

UPDATE: Based on criticism of my "worst-case" scenario described above, I ran a new set of benchmarks with a 20,000-char randomly-generated string, comparing it against itself and against a string that differed only in the last byte.

For such long strings, the regex solution was by far the worst (a 20,000 character regex is hell): 105μs for the match success, 100μs for the match failure.

The index and substr solutions were still quite fast. index was 11.83μs / 11.86μs for success/failure, and substr was 4.09μs / 4.15μs. Moving the code to a separate function added about 0.222±0.05μs.

Benchmark code available at: http://codepaste.net/2k1y8e

I do not know the characteristics of @Stephane's data, but my advice stands.

38,904

Author by

Stephane Chazelas

Updated on November 05, 2021

Comments

Stephane Chazelas over 2 years
Given two string variables $string and $needle in perl, what's the most efficient way to check whether $string starts with $needle.
- $string =~ /^\Q$needle\E/ is the closest match I could think of that does what is required but is the least efficient (by far) of the solutions I tried.
- index($string, $needle) == 0 works and is relatively efficient for some values of $string and $needle but needlessly searches for the needle in other positions (if not found at the start).
- substr($string, 0, length($needle)) eq $needle should be quite simple and efficient, but in most of my few tests is not more efficient than the previous one.
Is there a canonical way to do that in perl which I wouldn't be aware of or any way to optimise any of the above solutions?

(in my particular use case, $string and $needle are going to be different in each run, so precompiling a regexp is not an option).

Example of how to measure the performance of a given solution (here from a POSIX sh):
```
string='somewhat not so longish string' needle='somew'
time perl -e '
  ($n,$string,$needle) = @ARGV;
  for ($i=0;$i<$n;$i++) {

    index($string, $needle) == 0

  }' 10000000 "$string" "$needle"
```
With those values, index() performs better than substr()+eq with this system with perl 5.14.2, but with:
```
string="aaaaabaaaaabaaaaabaaaaabaaaaabaaaaab" needle="aaaaaa"
```
That's reversed.
Stephane Chazelas almost 9 years

For the sake of argument, you may assume that that string matching is the critical point in my code and that all other possible optimisations have been done. In that regard, using a function will only decrease performance. But primarily, my hope when asking the question was that there was a better/canonical way to do that in perl.
silbana almost 9 years

For earlier perls, you might want to omit the return statement.
silbana almost 9 years

@ikegami substr seems to be the fastest when the prefix does not match because it limits the extent of the search.
Sue D. Nymme almost 9 years

NOT useless, @ikegami. Half of my benchmark cases were matches, half were match failures.
Peter Cordes almost 9 years

@SueD.Nymme: your posted answer is worded in a way that implies your worst-case test was only matching strings. Clearly the worst-case for index is an extremely long haystack that doesn't contain the needle anywhere, so it has to check all the way to the end. I'd agree with your conclusion, though: just use substr, since we've shown that it's not slower in common cases. It should have a much better worst-case, which is important for resisting DOS attacks (or accidental slowdowns).
Sue D. Nymme almost 9 years

Instead of simply dismissing my benchmark results, you could try to reproduce them.
Peter Cordes almost 9 years

@SueD.Nymme: you're only testing with length($needle) == length($haystack), which is weird. You'd expect to see an even bigger difference between substr and index when index can spend time checking for matches at positions other than 0.
Stephane Chazelas almost 9 years

@PeterCordes, and among the situations where the needle is not found in the string, there are those that are worse than others like the last example in the question where 111 byte-to-byte comparisons at least ((6+5+4+3+2+1)*5+6) are needed for a string of length 34 and needle of length 6. (it might even be the worst case scenario for string/needle this length, which would make for another interesting question here)
Peter Cordes almost 9 years

@StephaneChazelas: some smart regex engines will avoid pitfalls like that by checking if the end of the string matches, to avoid going far into a matching prefix over and over. index could do that, too.
Stephane Chazelas almost 9 years

@PeterCordes, I'm not sure I understand what you mean. In string="aaaaabaaaaabaaaaabaaaaabaaaaabaaaaab" needle="aaaaaa", how would checking the trailing aaaaab against aaaaaa first reduce the number of comparisons needed?
Peter Cordes almost 9 years

@StephaneChazelas: oh you're right, in this case it won't help much if any, since there are more as after every b. So it's only one mismatching character, rather than a matching prefix with mismatches later.
Stephane Chazelas about 5 years

Perfect thanks. It's the feature I wasn't aware of and was looking for. I get similar timings for both test cases in the question, and it's faster than any of the other approaches on both.
Jim Balter almost 5 years

"How important is this, really?" Enough for the OP to ask the question and for you to write benchmarks. That opening question serves no purpose other than to ding the OP for asking, and is in conflict with the rest of the answer. The advice to write a library routine is independent of the question, and actually supports the question, since library routines should strive to be efficient. The most efficient implementation is rindex($_[0], $_[1], 0) == 0, and this unusual usage of rindex can be hidden away in the library routine along with a comment explaining it.