RegEx Get string between two strings that has line breaks

Solution 1

Use the re.S (re.DOTALL) flag, or prepend the regular expression with (?s), to make . match any character, including a newline.

Without the flag, . does not match a newline.

(?s)(?<=Test)(.*?)(?=</td>)

Example:

>>> s = '''<td scope="row" align="left">
...       My Class: TEST DATA<br>
...       Test Section: <br>
...       MY SECTION<br>
...       MY SECTION 2<br>
...     </td>'''
>>>
>>> import re
>>> re.findall('(?<=Test)(.*?)(?=</td>)', s)  # without flags
[]
>>> re.findall('(?<=Test)(.*?)(?=</td>)', s, flags=re.S)
[' Section: <br>\n      MY SECTION<br>\n      MY SECTION 2<br>\n    ']
>>> re.findall('(?s)(?<=Test)(.*?)(?=</td>)', s)
[' Section: <br>\n      MY SECTION<br>\n      MY SECTION 2<br>\n    ']
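The match above still contains the rest of the label and the <br> tags. As a rough sketch of one way to get just the section names, the pattern below anchors on the full "Test Section:" label and then splits the captured text on <br>. The variable names (html, sections) and the split-and-strip cleanup are illustrative assumptions, not part of the original answer.

```python
import re

html = '''<td scope="row" align="left">
      My Class: TEST DATA<br>
      Test Section: <br>
      MY SECTION<br>
      MY SECTION 2<br>
    </td>'''

# Anchor on the full label so the label itself is not part of the result.
# (?s) lets . cross the line breaks between the label and </td>.
match = re.search(r'(?s)Test Section:(.*?)</td>', html)

# Split the captured block on <br> and keep the non-empty, stripped pieces.
sections = [part.strip() for part in match.group(1).split('<br>') if part.strip()]
print(sections)
```

This assumes the tags are written exactly as <br>; markup that varies (for example <br/> or <BR>) would need a more tolerant split or an HTML parser.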

Solution 2

Get the matched group from index 1

Test Section:([\S\s]*)</td>

Note: adjust the last part of the pattern (</td>) to suit your needs.

Sample code:

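import re

# r'' works in Python 2 and 3 (the ur'' prefix is Python 2 only).
# No re.MULTILINE: it only affects ^ and $, which this pattern does not use.
p = re.compile(r'Test Section:([\S\s]*)</td>')
test_str = u"..."

# findall returns the contents of group 1 for each match.
p.findall(test_str)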
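A note on flags, since the question mentions trying multiline mode: re.MULTILINE changes only where ^ and $ match, while re.DOTALL (re.S) is the flag that lets . match a newline. The short sketch below illustrates the difference on a made-up two-line string; the sample text and variable names are assumptions for illustration.

```python
import re

text = 'first line\nsecond line'

# re.MULTILINE lets ^ (and $) match at every line boundary...
line_starts = re.findall(r'^\w+', text, flags=re.MULTILINE)

# ...but . still stops at a newline, so this cannot span both lines.
multiline_span = re.findall(r'first.*second', text, flags=re.MULTILINE)

# re.DOTALL (re.S) is the flag that lets . cross the newline.
dotall_span = re.findall(r'first.*second', text, flags=re.DOTALL)

print(line_starts)
print(multiline_span)
print(dotall_span)
```

The [\S\s] character class used in this solution sidesteps the issue entirely, because it matches any character regardless of flags.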

Pattern Explanation:

  Test Section:            'Test Section:'
  (                        group and capture to \1:
    [\S\s]*                  any character of: non-whitespace (all
                             but \n, \r, \t, \f, and " "), whitespace
                             (\n, \r, \t, \f, and " ") (0 or more
                             times (matching the most amount
                             possible))
  )                        end of \1
  </td>                    '</td>'
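Because [\S\s]* is greedy ("matching the most amount possible"), it runs to the last </td> in the input, not the first. The sketch below compares it with the lazy form [\S\s]*? on a hypothetical string containing two cells; the input is invented for illustration.

```python
import re

# Hypothetical input with two cells, to show where each quantifier stops.
html = '<td>Test Section: A</td><td>B</td>'

# Greedy: consumes as much as possible, so it ends at the LAST </td>.
greedy = re.findall(r'Test Section:([\S\s]*)</td>', html)

# Lazy: consumes as little as possible, so it ends at the FIRST </td>.
lazy = re.findall(r'Test Section:([\S\s]*?)</td>', html)

print(greedy)
print(lazy)
```

With a single </td> in the input, as in the question, both forms give the same result; the lazy form is the safer choice when the text may contain several cells.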
Author by

CodeLikeBeaker

Updated on July 28, 2022

Comments

  • CodeLikeBeaker
    CodeLikeBeaker almost 2 years

    I have the following text (formatted just like below):

    <td scope="row" align="left">
          My Class: TEST DATA<br>
          Test Section: <br>
          MY SECTION<br>
          MY SECTION 2<br>
        </td>
    

    I'm attempting to get the text between "Test Section:" and the </td> that follows the MY SECTION lines.

    I've tried several attempts with different RegEx patterns and I'm not getting anywhere.

    If I do:

    (?<=Test)(.*?)(?=<br)
    

    Then I get the correct response of:

    ' Section: '
    

    But, if I do

    (?<=Test)(.*?)(?=</td>)
    

    I get no results. The results should be "MY SECTION
    MY SECTION 2
    "

    I've tried using the RegEx multiline flag as well, with no results.

    Any help would be appreciated.

    If it matters I'm coding in Python 2.7.

    If something is not clear, or you need more info, please let me know.

  • CodeLikeBeaker
    CodeLikeBeaker almost 10 years
    Thank you for the great response!