Regular Expression Lookbehind doesn't work with quantifiers ('+' or '*')

Solution 1

Many regular expression libraries allow only restricted expressions inside lookbehind assertions. Depending on the library, a lookbehind may:

  • only match strings of the same fixed length: (?<=foo|bar|\s,\s) (three characters each)
  • only match strings of fixed lengths: (?<=foobar|\r\n) (each branch with its own fixed length)
  • only match strings with an upper bound on their length: (?<=\s{0,4}) (up to four repetitions)

These limitations exist mainly because those libraries cannot process regular expressions backwards at all, or can do so only for a limited subset.

Another reason could be to keep authors from building overly complex regular expressions that are expensive to process because of so-called pathological behavior (see also ReDoS).

See also the section on limitations of lookbehind assertions on Regular-Expressions.info.
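
As a rough illustration of the strictest case, here is a minimal sketch that assumes Python's standard re module, which accepts only fixed-width lookbehinds. The helper name compiles is made up for this example, and other engines will accept more of these patterns than Python does.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Every branch has the same fixed length (three characters): accepted.
same_length = compiles(r"(?<=foo|bar|\s,\s)x")
fixed_single = compiles(r"(?<=this\sis\san\s)example")

# Anything whose length can vary is rejected by Python's re module.
mixed_branches = compiles(r"(?<=foobar|\r\n)x")        # branches of different lengths
bounded = compiles(r"(?<=\s{0,4})x")                   # upper bound, but not fixed
unbounded = compiles(r"(?<=this\sis\san\s*)example")   # the pattern from the question

print(same_length, fixed_single, mixed_branches, bounded, unbounded)
```

Engines that fall under the second or third bullet above would accept some of the patterns that Python rejects here.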

Solution 2

If you are not using Python (whose re module does not support \K), you can get the effect of a variable-length lookbehind assertion by tricking the regex engine: \K discards what has been matched so far and starts the reported match over.

This site explains it well: http://www.phpfreaks.com/blog/pcre-regex-spotlight-k

In short, when part of your expression has matched and you only want what comes after it, \K forces the reported match to start over at that point.

Example:

string = '<a this is a tag> with some information <div this is another tag > LOOK FOR ME </div>'

Matching /(\<a).+?(\<div).+?(\>)\K.+?(?=\<\/div)/ makes the engine discard everything matched up to and including the > that closes the opening div tag, so that part is not included in the result. The lookahead (?=\<\/div) then makes the match stop right before the closing div tag, leaving ' LOOK FOR ME '.
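
Python's standard re module has no \K, so the following is only a sketch of a common substitute, not \K itself: a capturing group around the part you want to keep. The variable names are made up for the example, and the lookahead is written against the closing </div tag.

```python
import re

text = ("<a this is a tag> with some information "
        "<div this is another tag > LOOK FOR ME </div>")

# Everything before the group plays the role of the part in front of \K;
# group 1 holds what \K would have left as the reported match.
pattern = re.compile(r"<a.+?<div.+?>(.+?)(?=</div)")

match = pattern.search(text)
result = match.group(1) if match else None
print(repr(result))
```

The difference from \K is that the overall match (group 0) still covers the prefix, so you have to read group 1 explicitly.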

Solution 3

What Amber said is true, but you can work around it with another approach, a non-capturing group:

(?<=this\sis\san)(?:\s*)example

That makes it a fixed-length lookbehind, so it should work.
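
Several comments on this page point out that the whitespace is still part of the match. A quick check, assuming Python's re module and a made-up sample string with three spaces, shows that behavior next to a capturing-group alternative:

```python
import re

text = "this is an   example"

# Pattern from the answer: the lookbehind is fixed-length, but \s* sits
# outside it, so the whitespace is consumed as part of the match.
with_group = re.search(r"(?<=this\sis\san)(?:\s*)example", text)

# Alternative sketch: match the prefix normally and capture only the word.
captured = re.search(r"this\sis\san\s*(example)", text)

print(repr(with_group.group(0)))
print(repr(captured.group(1)))
```

The non-capturing group only avoids creating a numbered group. It does not remove the spaces from the match, so the capture (or \K in engines that support it) is what isolates the word.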

Author: Noel De Martin (https://noeldemartin.com/now)

Updated on August 06, 2020

Comments

  • Noel De Martin
    Noel De Martin over 3 years

    I am trying to use lookbehinds in a regular expression and they don't seem to work as I expected. This is not my real use case, but to simplify I will give an example. Imagine I want to match "example" in a string that says "this is an example". According to my understanding of lookbehinds, this should work:

    (?<=this\sis\san\s*?)example
    

    What this should do is find "this is an", then any whitespace characters, and finally match the word "example". It doesn't work, and I don't understand why. Is it impossible to use '+' or '*' inside lookbehinds?

    I also tried those two and they work correctly, but don't fulfill my needs:

    (?<=this\sis\san\s)example
    this\sis\san\s*?example
    

    I am using this site to test my regular expressions: http://gskinner.com/RegExr/

    • Rich
      Rich about 12 years
      This needs a tag that identifies the language or environment where you use them. .NET's regular expressions handle this without a problem.
    • noob
      noob about 12 years
      Note: if your regex worked the way you want, it would also match example in this: this is anexample. So if you don't want that, you should remove the ?
    • Rich
      Rich about 12 years
      micha: They should probably just change the * to a +. Removing the ? has no effect in that regard. But indeed, *? as a quantifier is useless and unnecessary in this case as there isn't any more whitespace to match after that, so \s*? is equivalent to \s*.
  • Rich
    Rich about 12 years
    It's only the lookbehind that's problematic. Lookahead can be anything in all regex engines that support it.
  • noob
    noob about 12 years
    It's the same as (?<=this\sis\san)\s*?example, which means that it also matches the spaces. And for your information, (?: ) makes the process slower.
  • Rich
    Rich about 12 years
    micha, I'd worry more about the matching part in that case than about performance. I get on average 0.02451781 ms with the non-capturing group and 0.02370844 ms without it. I don't think that's a significant difference.
  • Bohemian
    Bohemian about 12 years
    @micha No. It is not the same. It's a non-capturing group. My regex only matches example (without the leading spaces), but your example includes leading spaces
  • akostadinov
    akostadinov over 9 years
    this works with ruby 2.x but fails with 1.9 and jruby 1.7.x; original comment: good one, I'm surprised I never knew this feature. Learn to format code in the editor and you'll be priceless
  • Abraham Murciano Benzadon
    Abraham Murciano Benzadon almost 7 years
    This regex will match any preceding spaces. eg this is an[ example]. (square brackets represent a match). Just because it is in a non-capturing group, doesn't mean it isn't matched. It just means it isn't captured in a group which would normally be captured in normal brackets. The right way to do this would be using \K like @Leon said
  • Josh Withee
    Josh Withee about 6 years
    In my answer to this question, I have listed some strategies/workarounds after I ran into this limitation on negative lookbehinds. Hope it can help some others too!
  • alstr
    alstr almost 4 years
    This doesn't work. Leading spaces are included in the match. Just copy and paste it in regex101.com.