Python regex: match optional square brackets


Solution 1

I got all of them to match using this (you'll need to add the case-insensitive flag):

(^[a-z][a-z\'&\(\) ]+\bv\b[a-z&\'\(\) ]+(?:.*?) \[?\d+ \w+ \d{4}\]?)

Regex Demo

Explanation:

  • ( Begin capture group
    • ^ Match the start of the string
    • [a-z] Match a single letter
    • [a-z\'&\(\) ]+ Match one or more of the characters in this class
    • \b Match a word boundary
    • v Match the character 'v' literally
    • \b Match a word boundary
    • [a-z&\'\(\) ]+ Match one or more of the characters in this class
    • (?: Begin non-capturing group
      • .*? Match any characters, as few as possible (lazy)
    • ) End non-capturing group
    • \[?\d+ \w+ \d{4}\]? Match a space followed by a date, optionally surrounded by brackets
  • ) End capture group
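For anyone who wants to try this in Python rather than in the demo, here is a minimal sketch that runs the pattern against the three strings from the question. re.I is the flag mentioned above. re.S is an assumption added here (the question's own pattern used it): without it, .*? cannot cross the line break in the third string, so that one would not match.

```python
import re

# Solution 1's pattern. re.I is the case-insensitive flag the answer asks for;
# re.S is an assumption added here so that .*? can cross the line break
# inside the third sample string.
CASE_NAME = re.compile(
    r"(^[a-z][a-z\'&\(\) ]+\bv\b[a-z&\'\(\) ]+(?:.*?) \[?\d+ \w+ \d{4}\]?)",
    re.I | re.S,
)

# The three strings from the question.
samples = [
    "R J BRUCE & OTHERS V B J & W L A EDWARDS And Ors CA CA19/02 27 February 2003",
    "H v DIRECTOR OF PROCEEDINGS [2014] NZHC 1031 [16 May 2014]",
    "GREGORY LANCASTER AND JOHN HENRY HUNTER V CULLEN INVESTMENTS LIMITED AND  \n"
    "ERIC JOHN WATSON CA CA51/03 26 May 2003",
]

matches = [CASE_NAME.search(s) for s in samples]
for m in matches:
    # Show whether each sample matched and what group 1 captured.
    print(bool(m), repr(m.group(1)) if m else None)
```

If the handles sit inside a larger text, as described in the question, re.M would probably be needed as well so that ^ can match at the start of any line.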

Solution 2

Square brackets can be made optional like this:

[\[]* The * makes the opening [ optional. Strictly speaking it allows zero or more of them, so [[ would also match; use \[? if at most one bracket should be allowed.

A few recommendations if I may:

  • \d\d\d\d can be written more compactly as \d{4}

  • In [v|V] the | is not necessary. A character class already means "any one of these characters", so the | is treated as a literal character to match. Use [vV] instead.

And here is an online demo.
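A quick way to check these points in Python. This is only a sketch: the date sub-pattern is borrowed from Solution 1, and applying * to the closing bracket as well is my own assumption for the sake of the comparison.

```python
import re

# [\[]* accepts zero or more opening brackets; \[? accepts at most one.
# Using * / ? on the closing bracket too is an assumption for illustration.
star = re.compile(r"[\[]*\d+ \w+ \d{4}[\]]*")
optional = re.compile(r"\[?\d+ \w+ \d{4}\]?")

for text in ["16 May 2014", "[16 May 2014]", "[[16 May 2014]]"]:
    # Only the * version accepts the doubled brackets.
    print(text, bool(star.fullmatch(text)), bool(optional.fullmatch(text)))

# Inside a character class, | is an ordinary character,
# so [v|V] also matches a literal "|" while [vV] does not.
pipe_in_class = re.fullmatch(r"[v|V]", "|") is not None
pipe_in_clean_class = re.fullmatch(r"[vV]", "|") is not None
print(pipe_in_class, pipe_in_clean_class)

# \d{4} and \d\d\d\d accept the same input.
same_year = bool(re.fullmatch(r"\d{4}", "2014")) == bool(re.fullmatch(r"\d\d\d\d", "2014"))
print(same_year)
```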

Solution 3

Using your regex and input strings, it looks like you will match only the 2nd line (if you get rid of the '^' at the beginning of the regex). I've added inline comments to each section of the regular expression you provided to make it clearer.

Can you indicate what you are trying to capture from each line? Do you want the entire string? Only the word immediately preceding the lone letter 'v'? Do you want the date captured separately?

Depending on the portions that you wish to capture, each section can be broken apart into its own match group: regex101.com example. This is a little looser than yours (capturing the entire section between quotation marks instead of only the single word immediately preceding the lone 'v'), and broken apart to help readability (each "group" on its own line).

This example also assumes the newline is intentional, and supports the newline component (warning: it COULD suck up more than you intend, depending on whether the date at the end gets matched or not).
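The linked example is not reproduced here, so the following is only a rough Python sketch of the same idea, not the pattern from that link: each part sits on its own line (re.X) and gets its own group. The group names parties and date are hypothetical, and re.S is included on the assumption that the line break is intentional.

```python
import re

# Hypothetical layout for illustration; "parties" and "date" are made-up names.
HANDLE = re.compile(
    r"""
    ^(?P<parties>
        [a-z][a-z'&() ]+          # first party
        \bv\b                     # the lone 'v' for versus
        [a-z'&() ]+               # second party (greedy, so it can run long)
    )
    .*?                           # court, file number and so on (lazy)
    [ ]\[?                        # a space, then an optional opening bracket
    (?P<date>\d+[ ]\w+[ ]\d{4})   # for example: 16 May 2014
    \]?                           # optional closing bracket
    """,
    re.I | re.S | re.X,
)

bracketed = HANDLE.search("H v DIRECTOR OF PROCEEDINGS [2014] NZHC 1031 [16 May 2014]")
plain = HANDLE.search(
    "R J BRUCE & OTHERS V B J & W L A EDWARDS And Ors CA CA19/02 27 February 2003"
)

print(bracketed.group("parties").strip(), "|", bracketed.group("date"))
print(plain.group("date"))
```

Note that the parties group is loose: in the second string it also swallows the letters of the court code, because they are in the same character class as the names.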

Author by

user740875

Updated on January 04, 2020

Comments

  • user740875
    user740875 over 4 years

    I have the following strings:

    1 "R J BRUCE & OTHERS V B J & W L A EDWARDS And Ors CA CA19/02 27 February 2003",     
    2 "H v DIRECTOR OF PROCEEDINGS [2014] NZHC 1031 [16 May 2014]",  
    3 '''GREGORY LANCASTER AND JOHN HENRY HUNTER V CULLEN INVESTMENTS LIMITED AND  
    ERIC JOHN WATSON CA CA51/03 26 May 2003''' 
    

    I am trying to find a regular expression which matches all of them. I don't know how to match optional square brackets around the date at the end of the string, e.g. [16 May 2014].

    casename = re.compile(r'(^[A-Z][A-Za-z\'\(\) ]+\b[v|V]\b[A-Za-z\'\(\) ]+(.*?)[ \[ ]\d+    \w+ \d\d\d\d[\] ])', re.S) 
    

    The date regex at the end only matches cases with dates in square brackets, not the ones without.

    Thanks to everybody who answered. @Matt Clarkson what I am trying to match is a judicial decision 'handle' in a much larger text. There is a large variation within those handles, but they all start at the beginning of a line, have 'v' for versus between the party names, and end with a date. The party names are mostly in capitals, but not exclusively. I am trying to get only one match per document and no false positives.

  • user740875
    user740875 over 9 years
    The thing with the question mark was exactly what I was looking for: "\[?". That solved my problem. Do you know what that feature is called or where it is documented?
  • RevanProdigalKnight
    RevanProdigalKnight over 9 years
    ? in regex means "0 or 1 of the preceding character (range)", which pretty much means that it's optional. It can be there, or it could not be there, just so long as there's no more than one.
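    To illustrate the "preceding character (range)" part: a small sketch showing that ? applies to whatever single item comes right before it, whether that is an escaped character, a character class, or a whole group. The helper name matches is made up for this example.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# An escaped character: zero or one "[" is fine, two is too many.
print(matches(r"\[?2014", "2014"), matches(r"\[?2014", "[2014"), matches(r"\[?2014", "[[2014"))

# A character class: zero or one of "v" or "V".
print(matches(r"a[vV]?b", "ab"), matches(r"a[vV]?b", "aVb"), matches(r"a[vV]?b", "avVb"))

# A group: the whole bracketed year is optional.
print(matches(r"NZHC(?: \[\d{4}\])?", "NZHC"), matches(r"NZHC(?: \[\d{4}\])?", "NZHC [2014]"))
```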