JavaScript regex pattern: match multiple strings (AND, OR) against a single string

Solution 1

A single regex is not the right tool for this, IMO:

/^(?=.*\bnano)(?=(?:.*\bregulat|.*toxic|(?=.*(?:\brisk\b|\bhazard\b))(?=.*(?:\bexposure\b|\brelease\b))))/i.test(subject)

would return true if the string fulfills the criteria you set forth, but I find nested lookaheads quite incomprehensible. If JavaScript supported commented regexes, it would look like this:

^                 # Anchor search to start of string
(?=.*\bnano)      # Assert that the string contains a word that starts with nano
(?=               # AND assert that the string contains...
 (?:              #  either
  .*\bregulat     #   a word starting with regulat
 |                #  OR
  .*toxic         #   any word containing toxic
 |                #  OR
  (?=             #   assert that the string contains
   .*             #    any string
   (?:            #    followed by
    \brisk\b      #    the word risk
   |              #    OR
    \bhazard\b    #    the word hazard
   )              #    (end of inner OR alternation)
  )               #   (end of first AND condition)
  (?=             #   AND assert that the string contains
   .*             #    any string
   (?:            #    followed by
    \bexposure\b  #    the word exposure
   |              #    OR
    \brelease\b   #    the word release
   )              #    (end of inner OR alternation)
  )               #   (end of second AND condition)
 )                #  (end of outer OR alternation)
)                 # (end of lookahead assertion)

Note that the entire regex is composed of lookahead assertions, so the match result itself will always be the empty string.
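Since the pattern cannot carry comments inline, one workaround is to assemble it from commented string pieces with new RegExp. The following is only a sketch: the variable names are made up for illustration, and the pieces are meant to be equivalent to the regex above rather than a copy of it. It also shows that a successful match is the empty string at index 0:

```javascript
// Each piece is a plain string, so every backslash has to be doubled.
const startsWithNano = "(?=.*\\bnano)";                       // a word starting with "nano"
const riskOrHazard = "(?=.*\\b(?:risk|hazard)\\b)";           // the word "risk" or "hazard"
const exposureOrRelease = "(?=.*\\b(?:exposure|release)\\b)"; // the word "exposure" or "release"

// regulat* OR *toxic* OR ((risk OR hazard) AND (exposure OR release))
const secondCondition =
  "(?=(?:.*\\bregulat|.*toxic|" + riskOrHazard + exposureOrRelease + "))";

const pattern = new RegExp("^" + startsWithNano + secondCondition, "i");

const subject =
  "Workshop on the Second Regulatory Review on Nanomaterials, 30 January 2013, Brussels";
const result = subject.match(pattern);

console.log(pattern.test(subject));                   // true
console.log(JSON.stringify(result[0]), result.index); // "" 0
```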

Instead, you could use several simple regexes and combine them with ordinary boolean operators:

if (/\bnano/i.test(str) &&
    (
        /\bregulat|toxic/i.test(str) ||
        (
            /\b(?:risk|hazard)\b/i.test(str) &&
            /\b(?:exposure|release)\b/i.test(str)
        )
    )
) {
    /* all tests pass */
}
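For filtering a whole collection, the same tests can be wrapped in a predicate and passed to Array.prototype.filter. This is a minimal sketch: matchesQuery is a hypothetical name, and every sample title except the first (which comes from the question) is invented for illustration.

```javascript
// nano* AND (regulat* OR *toxic* OR ((risk OR hazard) AND (exposure OR release)))
function matchesQuery(str) {
  const hasNano = /\bnano/i.test(str);
  const hasRegulatOrToxic = /\bregulat|toxic/i.test(str);
  const hasRiskOrHazard = /\b(?:risk|hazard)\b/i.test(str);
  const hasExposureOrRelease = /\b(?:exposure|release)\b/i.test(str);
  return hasNano &&
    (hasRegulatOrToxic || (hasRiskOrHazard && hasExposureOrRelease));
}

const titles = [
  "Workshop on the Second Regulatory Review on Nanomaterials, 30 January 2013, Brussels",
  "Nanosilver: hazard assessment and release pathways", // invented sample
  "A brisk walk past the nanotech lab",                 // invented sample
  "Ecotoxicology of bulk materials",                    // invented sample
];

const filtered = titles.filter(matchesQuery);
console.log(filtered.length); // 2 (the first two titles)
```

Naming each sub-test keeps the AND/OR structure of the original query visible, which is the main advantage over the single lookahead regex.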

Solution 2

Regular expressions have to move through the string in order. You have "nano" before "regulat" in the pattern, but they are swapped in the test string. Instead of using regexen to do this, I'd stick with plain old string parsing:

// indexOf is case-sensitive, so search a lowercased copy of the string
var s = str.toLowerCase();
if (s.indexOf('nano') > -1) {
    if (s.indexOf('regulat') > -1 || s.indexOf('toxic') > -1
        || ((s.indexOf('risk') > -1 || s.indexOf('hazard') > -1)
        && (s.indexOf('exposure') > -1 || s.indexOf('release') > -1)
    )) {
        /* all tests pass */
    }
}

If you want to actually capture the words (e.g. get "Regulatory" from where "regulat" is), I would split the sentence by word breaks and inspect individual words.
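One way to read "split by word breaks" is sketched below. The helper name findWordsStartingWith is hypothetical, and splitting on /\W+/ assumes plain ASCII words, which may not hold for arbitrary tweets.

```javascript
// Return every word in the sentence that starts with the given prefix,
// ignoring case. Non-word characters (spaces, commas, ...) act as separators.
function findWordsStartingWith(sentence, prefix) {
  const wanted = prefix.toLowerCase();
  return sentence
    .split(/\W+/)
    .filter(Boolean) // drop empty strings from leading/trailing separators
    .filter((word) => word.toLowerCase().startsWith(wanted));
}

const sentence =
  "Workshop on the Second Regulatory Review on Nanomaterials, 30 January 2013, Brussels";

console.log(JSON.stringify(findWordsStartingWith(sentence, "regulat"))); // ["Regulatory"]
console.log(JSON.stringify(findWordsStartingWith(sentence, "nano")));    // ["Nanomaterials"]
```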

Author: Q Studio

Q Studio: WordPress Shaper

Updated on July 18, 2022

Comments

  • Q Studio
    Q Studio almost 2 years

    I need to filter a collection of strings based on a rather complex query - in its "raw" form it looks like this:

    nano* AND (regulat* OR *toxic* OR ((risk OR hazard) AND (exposure OR release)) )
    

    An example of one of the strings to match against:

    Workshop on the Second Regulatory Review on Nanomaterials, 30 January 2013, Brussels
    

    So I need to match using AND, OR and wildcard characters, which I presume means using a regex in JavaScript.

    I have it all looping correctly, filtering and generally working, but I'm 100% sure my regex is wrong - and some results are being omitted wrongly - here it is:

    /(nano[a-zA-Z])?(regulat[a-zA-Z]|[a-zA-Z]toxic[a-zA-Z]|((risk|hazard)*(exposure|release)))/i
    

    Any help would be greatly appreciated - I really can't abstract my mind correctly to understand this syntax!

    UPDATE:

    A few people have pointed out the importance of the order in which the regex is constructed; however, I have no control over the text strings that will be searched, so I need a solution that works regardless of the order of either.

    UPDATE:

    I eventually used a PHP solution, due to the deprecation of Twitter API 1.0; see pastebin for an example function (I know it's better to paste code here, but there's a lot...):

    function: http://pastebin.com/MpWSGtHK usage: http://pastebin.com/pP2AHEvk

    Thanks for all help

  • Q Studio
    Q Studio about 11 years
    @EP - please see my comment above, the order of the string I'm matching against is as random as its content. I'm just trying to "filter" over a large collection of tweets based on the regex - perhaps this is the wrong approach?
  • Explosion Pills
    Explosion Pills about 11 years
    @QLStudio is my suggestion inappropriate for that?
  • Q Studio
    Q Studio about 11 years
    @EP - yes, sorry - your solution solves the order problem.. but can I still use wildcard ( * ) characters in a normal JS search?
  • Q Studio
    Q Studio about 11 years
    I need to match nano* ( eg. nanotechnology ) and regulat* ( eg. regulation )
  • Explosion Pills
    Explosion Pills about 11 years
    indexOf works on substrings, not words, so "nanotechnology".indexOf('nano') returns 0 (which is greater than -1)
  • Q Studio
    Q Studio about 11 years
    @EP - ok, so.. I've added this and it's working - happy to be moving away from regex.. I'll do a little more testing and accept later on -thanks!
  • Tim Pietzcker
    Tim Pietzcker about 11 years
    @QLStudio: The only problem with this solution is that it will also match words that only contain your search substrings (for example, risk is contained in briskly, so this algorithm would report a false positive match). If you need to match entire words (and your examples of nano* and *toxic* suggest you do), then you need word boundaries that you only get with regex matches.
  • Q Studio
    Q Studio about 11 years
    please could you explain the [\b] - I read that "\b is a backspace character" but I'm not sure how that's relevant?
  • Tim Pietzcker
    Tim Pietzcker about 11 years
    @QLStudio: In a normal string, "\b" is indeed a backspace character. In a regex, /\b/ (equivalent to new RegExp("\\b")) is a word boundary anchor. This anchor matches at the start or end of an alphanumeric word. Therefore /\brisk\b/ matches in "risk" or "There is a risk!", but not in "brisk" or "risky".
  • Q Studio
    Q Studio about 11 years
    thanks for the explanation - I've moved away from JavaScript, because version 1.0 of the API is shutting down, but the regexes should work almost as is in PHP I think - I'll post a complete answer when I've got it all fixed up.
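
The points about \b raised in the comments can be checked with a short sketch. The sample strings are the ones mentioned in the thread; the variable names are only illustrative.

```javascript
// In a string literal, "\b" is the backspace character (code 8).
const backspace = "\b";
console.log(backspace.length, backspace.charCodeAt(0)); // 1 8

// In a regex, \b is a word boundary. When the pattern is given as a string,
// the backslash must be doubled so that RegExp receives \b and not backspace.
const fromLiteral = /\brisk\b/;
const fromString = new RegExp("\\brisk\\b");
console.log(fromLiteral.source === fromString.source); // true

const samples = ["risk", "There is a risk!", "brisk", "risky"];
const wholeWordHits = samples.filter((s) => fromLiteral.test(s));
console.log(JSON.stringify(wholeWordHits)); // ["risk","There is a risk!"]

// Plain substring search has no notion of word boundaries.
console.log("briskly".indexOf("risk") > -1); // true (a false positive)

// Inside a character class, [\b] means backspace again, not a word boundary.
console.log(/[\b]/.test(backspace)); // true
```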