Is regex too slow? Real life examples where simple non-regex alternative is better

java regex performance algorithm string

26,793

Solution 1

I remember a textbook example of a regex gone bad. Be aware that none of the following approaches is recommended for production use! Use a proper CSV parser instead.

The mistake made in this example is quite common: Using a dot where a narrower character class is better suited.

In a CSV file containing on each line exactly 12 integers separated by commas, find the lines that have a 13 in the 6th position (no matter where else a 13 may be).

1, 2, 3, 4, 5, 6, 7, 8 ,9 ,10,11,12 // don't match
42,12,13,12,32,13,14,43,56,31,78,10 // match
42,12,13,12,32,14,13,43,56,31,78,10 // don't match

We use a regex containing exactly 11 commas:

".*,.*,.*,.*,.*,13,.*,.*,.*,.*,.*,.*"

This way, each ".*" is confined to a single number. This regex solves the task, but has very bad performance. (Roughly 600 microseconds per string on my computer, with little difference between matched and unmatched strings.)

A simple non-regex solution would be to split() each line and compare the 6th element. (Much faster: 9 microseconds per string.)

The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ".*" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.

So we replace the greedy quantifier with the reluctant one:

".*?,.*?,.*?,.*?,.*?,13,.*?,.*?,.*?,.*?,.*?,.*?"

This performs way better for a matched string (by a factor of 100), but has almost unchanged performance for a non-matched string.

A performant regex replaces the dot by the character class "[^,]":

"[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,13,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*"

(This needs 3.7 microseconds per string for the matched string and 2.4 for the unmatched strings on my computer.)

Solution 2

I experimented a bit with the performance of various constructs, and unfortunately I discovered that Java regex doesn't perform what I consider very doable optimizations.

Java regex takes `O(N)` to match `"(?s)^.*+$"`

This is very disappointing. It's understandable for ".*" to take O(N), but with the optimization "hints" in the form of anchors (^ and $) and single-line mode Pattern.DOTALL/(?s), even making the repetition possessive (i.e. no backtracking), the regex engine still could not see that this will match every string, and still have to match in O(N).

This pattern isn't very useful, of course, but consider the next problem.

Java regex takes `O(N)` to match `"(?s)^A.*Z$"`

Again, I was hoping that the regex engine can see that thanks to the anchors and single-line mode, this is essentially the same as the O(1) non-regex:

 s.startsWith("A") && s.endsWith("Z")

Unfortunately, no, this is still O(N). Very disappointing. Still, not very convincing because a nice and simple non-regex alternative exists.

Java regex takes `O(N)` to match `"(?s)^.*[aeiou]{3}$"`

This pattern matches strings that ends with 3 lowercase vowels. There is no nice and simple non-regex alternative, but you can still write something non-regex that matches this in O(1), since you only need to check the last 3 characters (for simplicity, we can assume that the string length is at least 3).

I also tried "(?s)^.*$(?<=[aeiou]{3})", in an attempt to tell the regex engine to just ignore everything else, and just check the last 3 characters, but of course this is still O(N) (which follows from the first section above).

In this particular scenario, however, regex can be made useful by combining it with substring. That is, instead of seeing if the whole string matches the pattern, you can manually restrict the pattern to attempt to match only the last 3 characters substring. In general, if you know before hand that the pattern has a finite length maximum match, you can substring the necessary amount of characters from the end of a very long string and regex on just that part.

Test harness

static void testAnchors() {
    String pattern = "(?s)^.*[aeiou]{3}$";
    for (int N = 1; N < 20; N++) {
        String needle = stringLength(1 << N) + "ooo";
        System.out.println(N);
        boolean b = true;
        for (int REPS = 10000; REPS --> 0; ) {
            b &= 
              needle
              //.substring(needle.length() - 3) // try with this
              .matches(pattern);
        }
        System.out.println(b);
    }
}

The string length in this test grows exponentially. If you run this test, you will find that it starts to really slow down after 10 (i.e. string length 1024). If you uncomment the substring line, however, the entire test will complete in no time (which also confirms that the problem is not because I didn't use Pattern.compile, which would yield a constant improvement at best, but rather because the patttern takes O(N) to match, which is problematic when the asymptotic growth of N is exponential).

Conclusion

It seems that Java regex does little to no optimization based on the pattern. Suffix matching in particular is especially costly, because the regex still needs to go through the entire length of the string.

Thankfully, doing the regex on the chopped suffix using substring (if you know the maximum length of the match) can still allow you to use regex for suffix matching in time independent of the length of the input string.

//update: actually I just realized that this applies to prefix matching too. Java regex matches a O(1) length prefix pattern in O(N). That is, "(?s)^[aeiou]{3}.*$" checks if a string starts with 3 lowercase letters in O(N) when it should be optimizable to O(1).

I thought prefix matching would be more regex-friendly, but I don't think it's possible to come up with a O(1)-runtime pattern to match the above (unless someone can prove me wrong).

Obviously you can do the s.substring(0, 3).matches("(?s)^[aeiou]{3}.*$") "trick", but the pattern itself is still O(N); you've just manually reduced N to a constant by using substring.

So for any kind of finite-length prefix/suffix matching of a really long string, you should preprocess using substring before using regex; otherwise it's O(N) where O(1) suffices.

Solution 3

In my tests, I found the following:

Using java's String.split method (which uses regex) took 2176ms under 1,000,000 iterations. Using this custom split method took 43ms under 1,000,000 iterations.

Of course, it will only work if your "regex" is completely literal, but in those cases, it will be much faster.

List<String> array = new ArrayList<String>();
String split = "ab";
String string = "aaabaaabaa";
int sp = 0;
for(int i = 0; i < string.length() - split.length(); i++){              
    if(string.substring(i, i + split.length()).equals(split)){
        //Split point found
        array.add(string.substring(sp, i));
        sp = i + split.length();
        i += split.length();
    }
}
if(sp != 0){
    array.add(string.substring(sp, string.length()));
}
return array;

So to answer your question, is it theoretically faster? Yes, absolutely, my algorithm is O(n), where n is the length of the string to split. (I'm not sure what regex would be). Is it practically faster? Well, over 1 million iterations, I saved basically 2 seconds. So, it depends on your needs I guess, but I wouldn't worry too much about backporting all code that uses regex to non-regex versions, and in fact, that might be necessary anyways, if the pattern is very complex, a literal split like this won't work. However, if you're splitting on, say, commas, this method will perform much better, though "much better" is subjective here.

Solution 4

Well, not always but sometimes slow, depends on patterns and implementations.

A quick example, 2x slower than normal replace, but I don't think its that slow.

>>> import time,re
>>>
>>> x="abbbcdexfbeczexczczkef111anncdehbzzdezf" * 500000
>>>
>>> start=time.time()
>>> y=x.replace("bc","TEST")
>>> print time.time()-start,"s"
0.350999832153 s
>>>
>>> start=time.time()
>>> y=re.sub("bc","TEST",x)
>>> print time.time()-start,"s"
0.751000165939 s
>>>

View more solutions

26,793

Author by

polygenelubricants

I mostly contributed in [java] and [regex] from February to August of 2010. I work for Palantir Technologies now, so I may not have much time on stackoverflow as I did then. We're currently hiring; you can e-mail me for a referral. A few habits I've developed on the site: I will no longer cast a downvote. It will stay at 54 forever. I don't like to engage in dramas on stackoverflow. If you really need to discuss politics and other non-technical issues with me, bring it to meta. I delete my comments once they've become obsolete I try to revise my answers periodically, so I prefer that you leave comments and feedbacks instead of editing my answers directly.

Updated on September 29, 2020

Comments

polygenelubricants over 3 years
I've seen people here made comments like "regex is too slow!", or "why would you do something so simple using regex!" (and then present a 10+ lines alternative instead), etc.

I haven't really used regex in industrial setting, so I'm curious if there are applications where regex is demonstratably just too slow, AND where a simple non-regex alternative exists that performs significantly (maybe even asymptotically!) better.

Obviously many highly-specialized string manipulations with sophisticated string algorithms will outperform regex easily, but I'm talking about cases where a simple solution exists and significantly outperforms regex.

What counts as simple is subjective, of course, but I think a reasonable standard is that if it uses only String, StringBuilder, etc, then it's probably simple.

Note: I would very much appreciate answers that demonstrate the following:
1. a beginner-level regex solution to a non-toy real-life problem that performs horribly
2. the simple non-regex solution
3. the expert-level regex rewrite that performs comparably
Henk Holterman about 14 years

+1, A "Real life example". But only because of the simplicity of "bc". Change the requirement to: replace every sequence of 1 or more 'b' characters and you can no longer use a single lib method.
Svante about 14 years

@Henk Holterman: Yes, but your new example is what regular expressions are there for, while simple, static replacement is not.
BalusC about 14 years

How would regex optimize based on the pattern? Fire another regex on it? ;) The java.util.Pattern source is by the way quite interesting. It uses a parser/interpreter to execute regex.
donut about 14 years

So the regex, in this case, is faster than the simple alternative of using split()
polygenelubricants about 14 years

Unless I'm mistaken, "(?s)^.*$(?<=[aeiou]{3})" should be optimizable to O(1). The way I understand it, in single-line (?s)/Pattern.DOTALL mode, ^.*$ is an instant O(1) match to everything. The lookbehind from the $ anchor is "obviously" a simple suffix matching attempt. I think it's very possible that some sophisticated regex implementation can optimize this to O(1), no?
Christian Semrau about 14 years

Yes it is, mainly because split() uses a regex internally. Even faster than the performant regex (but less readable) is a StringTokenizer: StringTokenizer st = new StringTokenizer(input, ","); for (int i = 0; i < 5; i++) { st.nextToken(); } boolean match = "13".equals(st.nextToken());
Alan Moore about 14 years

As a matter of fact, there's an RFE from 2007 requesting that matches() or find() skip regex matching entirely and simply return true in the case of .*. The submitter hadn't thought it out as far as you have, but I still don't think it's worth the effort. There can be many reasons to reject regexes as a solution, depending on the nature of the project, the tool set, and the capabilities of the programmers; performance is almost never the deciding factor. ref: bugs.sun.com/view_bug.do?bug_id=6565414
polygenelubricants about 14 years

@Alan: what I learned from this exploration is that .* itself isn't as trivial as it looks: "Hello!\nWorld!".matches(".*") is false! It's only true in single-line mode (?s).
Ingo about 14 years

The problem is always the .* One would not use it that way , but instead the s.matcher("^[aeiou]{3}").find() // or was it the other way round? With .* you want to collect n characters into group 0, so it must be O(N)
tar over 7 years

Please don't anything in this answer to parse real-world CSV data because there can be commas in a field, e.g. 1,2,"3,000",4.
Vectorjohn over 6 years

This is a flawed test harness. You're counting the time of compiling the regex in every iteration. The best possible optimizations are not going to be able to optimize that out. Sadly String doesn't have a matches() implementation that accepts a Pattern instance (so you'll have to make one yourself, outside of the loop). Also a nitpick, but none of your examples do "suffix matching". They all match the whole input which is different. E.g. "[aeiou]{3}$" would be a suffix match, yours all have "^.*" in them. I'm not sure if that change would actually make a difference, but it might.
Dici about 6 years

@donut The fastest way to find the nth part, by the way, would be to use String.indexOf(sep, fromIndex) repeatedly in a loop until reaching the nth match. split is slow for this task, in particular when n is close to 0 and the string is extremely long because it has to traverse the entire string AND allocate as many new strings as parts as well as an array to contain them. Very wasteful !
Dici about 6 years

The difference between a 5h job and a 10h job is pretty big. A 2x factor on a very large data set can be very penalizing.
Jochem Kuijpers over 4 years

Your algorithm may be as bad as O(nm) where n is the input string length and m is the split string length. For example you may want to split "aaaa...aaab" on "aaaaab". String equals has a worst case equality operation of O(m) where m is the string length.
Jerry Nixon over 4 years

Split uses Regex internally?
Christian Semrau over 4 years

Good point. String.split() uses regex, but it has a special case for a single-literal-character regex, at least in OpenJDK. So mystring.split(",") uses the fast special case, but mystring.split(", ") (with an additional space) uses a full regex-based split.
frodeborli over 3 years

@joachemkuijpers That is not entirely correct. It would be O((n-m)m), assuming that the .equals() method keeps comparing all characters even if the first character didn't match. Also, I don't know if substring() actually copies the source string, or if it only creates a memory reference under the hood. My guess, is a reference since strings are immutable.
jocull about 3 years

It's hard to derive from this tiny example that the result on a big job would be 2x slower. A good chunk of it may be initialization and wouldn't count for much in the scope of a big job. Benchmarks would reveal more :)