Most efficient way to check if $string starts with $needle in perl
Solution 1
rindex $string, $substring, 0
searches for $substring
in $string
at position <=0 which is only possible if $substring
is a prefix of $string
. Example:
> rindex "abc", "a", 0
0
> rindex "abc", "b", 0
-1
Solution 2
How important is this, really? I did a number of benchmarks, and the index
method averaged 0.68 microseconds per iteration; the regex method 1.14μs; the substr
method 0.16μs. Even my worst-case scenarios (2250-char strings that were equal), index
took 2.4μs, regex took 5.7μs, and substr
took 0.5μs.
My advice is to write a library routine:
sub begins_with
{
return substr($_[0], 0, length($_[1])) eq $_[1];
}
and focus your optimization efforts elsewhere.
UPDATE: Based on criticism of my "worst-case" scenario described above, I ran a new set of benchmarks with a 20,000-char randomly-generated string, comparing it against itself and against a string that differed only in the last byte.
For such long strings, the regex solution was by far the worst (a 20,000 character regex is hell): 105μs for the match success, 100μs for the match failure.
The index
and substr
solutions were still quite fast. index
was 11.83μs / 11.86μs for success/failure, and substr
was 4.09μs / 4.15μs. Moving the code to a separate function added about 0.222±0.05μs.
Benchmark code available at: http://codepaste.net/2k1y8e
I do not know the characteristics of @Stephane's data, but my advice stands.
Stephane Chazelas
Updated on November 05, 2021Comments
-
Stephane Chazelas over 2 years
Given two string variables
$string
and$needle
inperl
, what's the most efficient way to check whether$string
starts with$needle
.$string =~ /^\Q$needle\E/
is the closest match I could think of that does what is required but is the least efficient (by far) of the solutions I tried.index($string, $needle) == 0
works and is relatively efficient for some values of$string
and$needle
but needlessly searches for the needle in other positions (if not found at the start).substr($string, 0, length($needle)) eq $needle
should be quite simple and efficient, but in most of my few tests is not more efficient than the previous one.
Is there a canonical way to do that in
perl
which I wouldn't be aware of or any way to optimise any of the above solutions?(in my particular use case,
$string
and$needle
are going to be different in each run, so precompiling a regexp is not an option).
Example of how to measure the performance of a given solution (here from a POSIX
sh
):string='somewhat not so longish string' needle='somew' time perl -e ' ($n,$string,$needle) = @ARGV; for ($i=0;$i<$n;$i++) { index($string, $needle) == 0 }' 10000000 "$string" "$needle"
With those values,
index()
performs better thansubstr()+eq
with this system with perl 5.14.2, but with:string="aaaaabaaaaabaaaaabaaaaabaaaaabaaaaab" needle="aaaaaa"
That's reversed.
-
Stephane Chazelas almost 9 yearsFor the sake of argument, you may assume that that string matching is the critical point in my code and that all other possible optimisations have been done. In that regard, using a function will only decrease performance. But primarily, my hope when asking the question was that there was a better/canonical way to do that in perl.
-
silbana almost 9 yearsFor earlier
perl
s, you might want to omit the return statement. -
silbana almost 9 years@ikegami
substr
seems to be the fastest when the prefix does not match because it limits the extent of the search. -
Sue D. Nymme almost 9 yearsNOT useless, @ikegami. Half of my benchmark cases were matches, half were match failures.
-
Peter Cordes almost 9 years@SueD.Nymme: your posted answer is worded in a way that implies your worst-case test was only matching strings. Clearly the worst-case for
index
is an extremely long haystack that doesn't contain the needle anywhere, so it has to check all the way to the end. I'd agree with your conclusion, though: just usesubstr
, since we've shown that it's not slower in common cases. It should have a much better worst-case, which is important for resisting DOS attacks (or accidental slowdowns). -
Sue D. Nymme almost 9 yearsInstead of simply dismissing my benchmark results, you could try to reproduce them.
-
Peter Cordes almost 9 years@SueD.Nymme: you're only testing with
length($needle) == length($haystack)
, which is weird. You'd expect to see an even bigger difference betweensubstr
andindex
whenindex
can spend time checking for matches at positions other than0
. -
Stephane Chazelas almost 9 years@PeterCordes, and among the situations where the needle is not found in the string, there are those that are worse than others like the last example in the question where 111 byte-to-byte comparisons at least ((6+5+4+3+2+1)*5+6) are needed for a string of length 34 and needle of length 6. (it might even be the worst case scenario for string/needle this length, which would make for another interesting question here)
-
Peter Cordes almost 9 years@StephaneChazelas: some smart regex engines will avoid pitfalls like that by checking if the end of the string matches, to avoid going far into a matching prefix over and over.
index
could do that, too. -
Stephane Chazelas almost 9 years@PeterCordes, I'm not sure I understand what you mean. In
string="aaaaabaaaaabaaaaabaaaaabaaaaabaaaaab" needle="aaaaaa"
, how would checking the trailingaaaaab
againstaaaaaa
first reduce the number of comparisons needed? -
Peter Cordes almost 9 years@StephaneChazelas: oh you're right, in this case it won't help much if any, since there are more
a
s after everyb
. So it's only one mismatching character, rather than a matching prefix with mismatches later. -
Stephane Chazelas about 5 yearsPerfect thanks. It's the feature I wasn't aware of and was looking for. I get similar timings for both test cases in the question, and it's faster than any of the other approaches on both.
-
Jim Balter almost 5 years"How important is this, really?" Enough for the OP to ask the question and for you to write benchmarks. That opening question serves no purpose other than to ding the OP for asking, and is in conflict with the rest of the answer. The advice to write a library routine is independent of the question, and actually supports the question, since library routines should strive to be efficient. The most efficient implementation is
rindex($_[0], $_[1], 0) == 0
, and this unusual usage ofrindex
can be hidden away in the library routine along with a comment explaining it.