Regex for "AND NOT" operation


Solution 1

This will match any character that is a word character and is not a p:

((?=[^p])\w)
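
As a quick check, here is a minimal Python sketch of that pattern (the constant and function names are only for illustration). It also shows an equivalent spelling with a negative lookahead, which I believe behaves the same way here:

```python
import re

# "(?=[^p])" peeks at the next character without consuming it,
# then "\w" consumes that same character, so both tests apply to it.
WORD_NOT_P = re.compile(r"(?=[^p])\w")
# Same idea written as "next character is not p".
WORD_NOT_P_ALT = re.compile(r"(?!p)\w")

def word_chars_except_p(text):
    """Return every word character in text that is not a lowercase p."""
    return WORD_NOT_P.findall(text)

print(word_chars_except_p("apple pie!"))  # ['a', 'l', 'e', 'i', 'e']
```

The space and the "!" are dropped by \w, and the p's are dropped by the lookahead.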

To solve your example, use a negative look-ahead for "My" anywhere in the input, i.e. (?!.*My):

^(?!.*My)((?<=(so|me|^))big(com?pl{1,3}ex([pA]t{2}ern)))

Note the start-of-input anchor ^, which is required to make it work: without it, the engine can retry the look-ahead at a position after "My" and still find a match.
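
To see why the anchor matters, here is a small Python sketch. The literal "bigpattern" is just a stand-in for the real positive pattern, and the helper name is made up:

```python
import re

# "bigpattern" is a placeholder for whatever the positive pattern is.
ANCHORED = re.compile(r"^(?!.*My).*bigpattern")
UNANCHORED = re.compile(r"(?!.*My).*bigpattern")

def matched_text(regex, text):
    """Return the matched substring, or None if there is no match."""
    m = regex.search(text)
    return m.group(0) if m else None

print(matched_text(ANCHORED, "some bigpattern"))   # some bigpattern
print(matched_text(ANCHORED, "My bigpattern"))     # None
print(matched_text(UNANCHORED, "My bigpattern"))   # y bigpattern
```

In the last call the look-ahead fails at position 0, but the engine simply moves one character forward, where "My" is no longer ahead of it.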

Solution 2

I wonder why people try to do complicated things in big monolithic regular expressions?

Why can't you just break down the problem into sub-parts and then write really easy regular expressions to match those individually? In this case, first match \w, then match [^p] if that first match succeeds. Perl (and other languages) allow you to construct really complicated-looking regular expressions that do exactly what you need in one big blobby-regex (or, as it may well be, a short and snappy crypto-regex), but for the sake of whoever has to read (and maintain!) the code once you've gone, you then need to document it fully. Better to make it easy to understand from the start.
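
For what it's worth, the two-step version of the \w / [^p] example could look roughly like this in Python (the function name is hypothetical):

```python
import re

def word_chars_not_p(text):
    """Step 1: collect word characters. Step 2: keep those matching [^p]."""
    candidates = re.findall(r"\w", text)
    return [c for c in candidates if re.fullmatch(r"[^p]", c)]

print(word_chars_not_p("pepper_9"))  # ['e', 'e', 'r', '_', '9']
```

Each expression is trivial on its own, and the "AND NOT" lives in ordinary code rather than in the pattern.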

Sorry, rant over.

Solution 3

After your edits, it's still the negative lookahead, but with an additional quantifier.

If you want to ensure that the whole string does not contain "My", then you can do this

(?!.*My)^.*$


This will match any sequence of characters (with the .* at the end), and the (?!.*My) at the beginning will fail when there is a "My" anywhere in the string.

If you want to match anything that is not exactly "My", then use anchors:

(?!^My$).*
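
A rough Python sketch of both patterns, assuming single-line input (the function names are invented for the example). For the second pattern I use fullmatch, since with a plain search the engine can skip the first character and still match the rest:

```python
import re

NO_MY_ANYWHERE = re.compile(r"(?!.*My)^.*$")
NOT_EXACTLY_MY = re.compile(r"(?!^My$).*")

def contains_no_my(text):
    """True if the (single-line) string has no "My" anywhere."""
    return NO_MY_ANYWHERE.search(text) is not None

def is_not_exactly_my(text):
    """True unless the whole string is exactly "My"."""
    # fullmatch forces the match to start at position 0 and cover everything.
    return NOT_EXACTLY_MY.fullmatch(text) is not None

print(contains_no_my("Hello world"))      # True
print(contains_no_my("Hello My world"))   # False
print(is_not_exactly_my("My"))            # False
print(is_not_exactly_my("My dog"))        # True
```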

Solution 4

So after looking through these topics on RegEx's: lookahead, lookbehind, nesting, AND operator, recursion, subroutines, conditionals, anchors, and groups, I've come to the conclusion that there is no solution that satisfies what you're asking for.

The reason why lookahead doesn't work is because it fails in this relatively simple case:

Three words without My included as one.

Regex:

(?!.*My.*)(\b\w+\b\s\b\w+\b\s\b\w+\b)

Matches:

included as one

The first three words fail to match because "My" occurs after them. If "My" is at the end of the entire string, you'll never match anything, because every lookahead will see it and fail.
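
Here is a small Python check of that case (a sketch; the function name is made up). Note that "included as one" is only found when the lookahead is retried at each position. With a leading ^ it is tested once, at position 0, and nothing matches at all:

```python
import re

TEXT = "Three words without My included as one."
THREE_WORDS = r"(\b\w+\b\s\b\w+\b\s\b\w+\b)"

def three_word_runs(text, anchored=False):
    """Find three-word runs that have no "My" anywhere after their start."""
    prefix = "^" if anchored else ""
    return re.findall(prefix + r"(?!.*My.*)" + THREE_WORDS, text)

print(three_word_runs(TEXT))                      # ['included as one']
print(three_word_runs(TEXT, anchored=True))       # []
print(three_word_runs("Three words without My"))  # []
```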

The problem appears to be that while lookahead has an implicit anchor as to where it begins its match, there's no way of terminating where lookahead ends its search with an anchor based upon the result of another part of the RegEx. That means you really have to duplicate all of the RegEx into the negative lookahead to manually create the anchor you're after.
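
To illustrate what "duplicating the RegEx into the lookahead" might look like for the three-word case, here is one possible Python sketch. The pattern is my own construction and only works because the shape of the positive pattern (three words) is repeated by hand inside the lookahead:

```python
import re

TEXT = "Three words without My included as one."

# The lookahead repeats the shape of the main pattern: up to two whole
# words, then "My" somewhere in the next word. That bounds its search
# to the same three words the main pattern is about to consume.
BOUNDED = re.compile(r"\b(?!(?:\w+\s){0,2}\w*My)\w+\s\w+\s\w+\b")

def three_words_without_my(text):
    """Three-word runs in which none of the three words contains "My"."""
    return BOUNDED.findall(text)

print(three_words_without_my(TEXT))
# ['Three words without', 'included as one']
```

It works, but every change to the positive pattern has to be mirrored in the lookahead, which is exactly the maintenance pain described above.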

This is frustrating and a pain. The "solution" appears to be to use a scripting language to apply two regexes, one on top of the other. I'm surprised this kind of functionality isn't better built into regular expression engines.
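
A minimal Python sketch of that two-regex approach (the function and parameter names are hypothetical): find everything the positive pattern matches, then drop any match whose own text matches the negative pattern.

```python
import re

def regex_and_not(text, positive_pattern, negative_pattern):
    """Return matches of positive_pattern whose text does not match negative_pattern."""
    negative = re.compile(negative_pattern)
    return [m for m in re.finditer(positive_pattern, text)
            # The negative pattern is searched only inside the matched text.
            if not negative.search(m.group(0))]

words = regex_and_not("apple banana grape kiwi", r"\w+", r"p")
print([m.group(0) for m in words])  # ['banana', 'kiwi']
```

One caveat: finditer returns non-overlapping matches, so a rejected match still consumes its text and can hide a candidate that overlaps it.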

Author: Joshua Honig

I'm a software developer in Grand Rapids, MI. Before becoming a full-time developer I worked as both an external and internal IT Auditor, and as a business intelligence guy for a major retailer. For a couple of years I participated on StackOverflow and the MSDN forums as "jmh_gr".

Updated on May 14, 2020

Comments

  • Joshua Honig
    Joshua Honig almost 4 years

    I'm looking for a general regex construct to match everything in pattern x EXCEPT matches to pattern y. This is hard to explain both completely and concisely...see Material Nonimplication for a formal definition.

    For example, match any word character (\w) EXCEPT 'p'. Note I'm subtracting a small set (the letter 'p') from a larger set (all word characters). I can't just say [^p] because that doesn't take into account the larger limiting set of only word characters. For this little example, sure, I could manually reconstruct something like [a-oq-zA-OQ-Z0-9_], which is a pain but doable. But I'm looking for a more general construct so that at least the large positive set can be a more complex expression. Like match ((?<=(so|me|^))big(com?pl{1,3}ex([pA]t{2}ern))) except when it starts with "My".

    Edit: I realize that was a bad example, since excluding stuff at the beginning or end is a situation where negative look-ahead and look-behind expressions work. (Bohemian, I still gave you an upvote for illustrating this.) So...what about excluding matches that contain "My" somewhere in the middle?...I'm still really looking for a general construct, like a regex equivalent of the following pseudo-sql

    select [captures] from [input]
    where (
        input MATCHES [pattern1]
        AND NOT capture MATCHES [pattern2]
    )
    

    If the answer is "it does not exist and here is why..." I'd like to know that too.

    Edit 2: If I wanted to define my own function to do this it would be something like (here's a C# LINQ version):

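    // Requires: using System.Linq; using System.Text.RegularExpressions;
    // Returns every match of positivePattern whose text does not match negativePattern.
    public static Match[] RegexMNI(string input,
                                   string positivePattern,
                                   string negativePattern) {
        return (from Match m in Regex.Matches(input, positivePattern)
                where !Regex.IsMatch(m.Value, negativePattern)
                select m).ToArray();
    }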
    

    I'm STILL just wondering if there is a native regex construct that could do this.

  • Donal Fellows
    Donal Fellows over 12 years
    Ah yes, the Zawinski effect, whereby using REs expands the number of problems. (My favorite was when someone asked for an RE to accept valid IEEE doubles that had been written into an XML document…)
  • Joshua Honig
    Joshua Honig over 12 years
    The reason I want to match things in one go is that I want to capture and operate on numerous captures in an input string, such as finding and reformatting declarations that match a certain pattern in a few hundred lines of code. I could toss out regex altogether and go parsing character-by-character...but if there's a good power tool, might as well use it!
  • Bohemian
    Bohemian about 11 years
    Edited to change the negative look-ahead to assert "My" doesn't appear anywhere in the input (previously it only checked for "My" at the start).
  • horta
    horta almost 9 years
    The OP makes it clear that he's after "My" not being in the matching expression he found. Your negative lookahead searches the entire string input rather than the subset. Really he's wanting to pipe one regex through another regex which doesn't seem possible without a script as far as I know. Any thoughts on how to solve this without a script or without making the lookahead as complex as the main regex pattern?
  • escape-llc
    escape-llc over 6 years
    Programming languages are a different class of grammars (context free) than what regular expressions recognize (regular), so be careful...