Different ways to specify matching a newline in Python regex

The sequence \n denotes a newline character both in Python string literals and in re patterns (https://docs.python.org/2.0/ref/strings.html).

In a regular Python string, \n gets translated to a newline. The newline code is then fed into the re parser as a literal character.

A double backslash in a Python string literal gets translated to a single backslash. Therefore, the literal "\\n" is stored internally as the two characters \ and n, and when that is sent to the re parser, it in turn recognizes the sequence \n as indicating a newline.
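The two translation steps described above can be checked directly. This is a minimal sketch; the variable names are only for illustration, and the sample text is the one used in the question.

```python
import re

text = 'abc\n123'

newline_literal = '\n'   # Python turns this into one newline character
escaped_literal = '\\n'  # Python turns this into two characters: \ and n

print(len(newline_literal))  # 1
print(len(escaped_literal))  # 2

# Both patterns end up matching the same thing: the single newline in text.
m1 = re.search(newline_literal, text)
m2 = re.search(escaped_literal, text)
print(m1.group() == m2.group() == '\n')  # True
print(m1.span(), m2.span())              # (3, 4) (3, 4)
```

In the first case Python does the translation before re ever sees the pattern; in the second, re does it itself.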

The r notation is a shortcut that avoids having to double every backslash:

backslashes are not handled in any special way in a string literal prefixed with 'r' (https://docs.python.org/2/library/re.html)
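As an illustration of how much typing the r prefix saves, here is a small sketch that matches a literal backslash. The sample string `path` is made up for this example.

```python
import re

plain = '\\\\'  # four typed backslashes, stored as the two characters \\
raw = r'\\'     # raw literal, stored as the same two characters

path = 'C:\\temp'  # the string C:\temp (one backslash)

# The re parser reads the two-character pattern \\ as "one literal backslash".
found = re.search(raw, path).group()

print(plain == raw)   # True
print(len(raw))       # 2
print(found == '\\')  # True
```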

user2628641
Author by

user2628641

Updated on June 04, 2022

Comments

  • user2628641
    user2628641 almost 2 years

    I found out there are different ways to match a newline in a Python regex. For example, all patterns used in the code below can match a newline:

    import re

    text = 'abc\n123'
    pattern = '\n'     # print outputs a newline
    pattern2 = '\\n'   # print outputs \n
    pattern3 = '\\\n'  # print outputs \ and a newline
    pattern4 = r'\n'   # print outputs \n
    s = re.search(pattern, text).group()
    print('a' + s + 'a')
    

    I have 2 questions about this:

    1. pattern is a newline, while pattern2 and pattern4 are \n. Why does the Python regex engine produce the same pattern from different strings?

    2. I'm not sure why pattern3 also produces the same pattern. When passed to the re parser, pattern3 stands for \ + newline. Why does the re parser translate that into just matching a newline?

    I am using Python 3

  • user2628641
    user2628641 about 8 years
    Thank you! But what about the third pattern, '\\\n'? How does the re parser parse it? It is one backslash + newline.
  • Jongware
    Jongware about 8 years
    @user2628641: it is parsed exactly the same. The two backslashes are parsed as a single one, followed again by a regular newline combo \n.
  • user2628641
    user2628641 about 8 years
    So '\\\n' = \ + a newline character. When the re parser sees this, it will try to escape the newline character but can't, so it will just take the newline character as the pattern. I think that is what happened?
  • Jongware
    Jongware about 8 years
    @user2628641: ah, I see what you mean. Yes, the combination \ + (literal newline) has no special meaning. What happens next depends on the specific re engine; most indeed then ignore the backslash and store only the following character in the to-search expression.
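The behaviour discussed in the last two comments can be observed with Python's own re module. This is only a sketch of what CPython's re does with the third pattern from the question; other regex engines may treat a backslash followed by a literal newline differently.

```python
import re

text = 'abc\n123'
pattern3 = '\\\n'  # two characters: a backslash, then an actual newline

print(len(pattern3))  # 2

# re treats the backslash as escaping the newline, which leaves a plain
# newline to search for.
match = re.search(pattern3, text)
print(match.group() == '\n')  # True
print(match.span())           # (3, 4)
```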