Regular expression matching a multiline block of text


Solution 1

Try this:

re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. In multiline mode, ^ matches at the start of the string and at the position immediately following a newline, and $ matches at the end of the string and at the position immediately preceding a newline. Neither anchor consumes the newline itself.
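A small sketch of that point, using a made-up two-line string: the anchors mark positions, so a pattern that crosses a line boundary has to consume the \n explicitly.

```python
import re

text = "first\nsecond\n"

# In MULTILINE mode, ^ and $ match positions next to newlines, not the newlines.
line_re = re.compile(r"^(.+)$", re.MULTILINE)
lines = line_re.findall(text)  # each match stops just before the "\n"

# Without an explicit \n the pattern cannot step from one line to the next.
no_newline = re.search(r"^first$^second$", text, re.MULTILINE)
with_newline = re.search(r"^first$\n^second$", text, re.MULTILINE)

print(lines)                  # ['first', 'second']
print(no_newline)             # None
print(with_newline.group(0))  # the two lines, joined by the consumed "\n"
```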

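Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use a more inclusive version of the regex. In Python the dot excludes only \n, so it will match \r; the version below uses [^\r\n] in place of the dot to keep a stray \r out of the captures:

re.compile(r"^([^\r\n]+)(?:\n|\r\n?)((?:(?:\n|\r\n?)[^\r\n]+)+)", re.MULTILINE)

Python's ^ still matches only at the start of the string and after a \n, so text that uses bare \r line endings should be converted to \n first.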
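Python's dot matches \r, and ^ in MULTILINE mode matches only after \n, so an alternative to encoding the newline variants in the pattern is to normalize the line endings first and then apply the simple linefeed-only regex. This is a different technique from the regex above, sketched under the assumption that the whole text fits in memory; normalize_newlines is a hypothetical helper and the sample text is made up.

```python
import re

def normalize_newlines(text):
    # Convert CRLF first, then any remaining bare CR, to a plain linefeed.
    return text.replace("\r\n", "\n").replace("\r", "\n")

# The linefeed-only pattern: a title line, a blank line, then one or more lines.
block_re = re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

crlf_text = "title one\r\n\r\nAAAA\r\nBBBB\r\n\r\ntitle two\r\n\r\nCCCC\r\n"
records = block_re.findall(normalize_newlines(crlf_text))

print(records)
```

Each tuple holds the title and the block of lines that follows it; the block still contains its newlines, which can be stripped afterwards.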

BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
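To see why, here is a short comparison on a made-up sample. With DOTALL the dot also matches \n, so the first group is free to run across several records.

```python
import re

text = "title one\n\nAAAA\nBBBB\n\ntitle two\n\nCCCC\n"
pattern = r"^(.+)\n((?:\n.+)+)"

without_dotall = re.findall(pattern, text, re.MULTILINE)
with_dotall = re.findall(pattern, text, re.MULTILINE | re.DOTALL)

print(len(without_dotall))  # 2: one match per record
print(len(with_dotall))     # 1: the first group swallowed both titles
print(repr(with_dotall[0][0]))
```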

Solution 2

This will work:

>>> import re
>>> rx_sequence=re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)",re.MULTILINE)
>>> rx_blanks=re.compile(r"\W+") # to remove blanks and newlines
>>> text="""Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
...   title, sequence = match.groups()
...   title = title.strip()
...   sequence = rx_blanks.sub("",sequence)
...   print("Title:", title)
...   print("Sequence:", sequence)
...   print()
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK

Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW

Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)

  • The first character (^) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).
  • Then (.+?)\n\n means "match as few characters as possible (any character except a newline is allowed) until you reach two newlines". The result (without the newlines) is put in the first group.
  • [A-Z]+\n means "match as many upper case letters as possible until you reach a newline". This defines what I will call a textline.
  • ((?:textline)+) means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.
  • You could add a final \n in the regular expression if you want to enforce a double newline at the end.
  • Also, if you are not sure about what type of newline you will get (\n or \r or \r\n) then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).
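The pieces described in the list can be assembled step by step. The sketch below builds the pattern from named parts and swaps \n for the more general newline, assuming a made-up sample with \r\n line endings; NL and TEXTLINE are names introduced here for illustration. Text that uses bare \r line endings would still need converting, because Python's ^ only matches after a \n.

```python
import re

NL = r"(?:\n|\r\n?)"       # one newline: \n, \r\n or \r
TEXTLINE = r"[A-Z]+" + NL  # a run of capitals ending in a newline

# Title, two newlines, then one or more textlines captured as a single group.
rx_sequence = re.compile(
    r"^(.+?)" + NL + NL + r"((?:" + TEXTLINE + r")+)", re.MULTILINE
)

sample = (
    "Some varying text1\r\n\r\nAAABBB\r\nCCCDDD\r\n\r\n"
    "Some varying text 2\r\n\r\nEEEFFF\r\n"
)

records = [
    (title.strip(), re.sub(r"\s+", "", sequence))  # drop newlines from the sequence
    for title, sequence in rx_sequence.findall(sample)
]
print(records)
```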

Solution 3

The following is a regular expression matching a multiline block of text:

import re
# text holds the string to search; startText and endText stand for your literal delimiters
result = re.findall(r'(startText)(.+)((?:\n.+)+)(endText)', text)
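A quick sketch of how the four groups come out, with BEGIN and END as hypothetical delimiters standing in for startText and endText. Note that the pattern needs at least one newline between the two delimiters.

```python
import re

sample = "BEGIN alpha\nbeta\ngamma END"

# Group 1: start delimiter, group 2: rest of the first line,
# group 3: the following lines (each with its leading newline), group 4: end delimiter.
result = re.findall(r"(BEGIN)(.+)((?:\n.+)+)(END)", sample)

print(result)
```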

Solution 4

If each file has only one sequence of amino acids, I wouldn't use regular expressions at all. Just something like this:

def read_amino_acid_sequence(path):
    with open(path) as sequence_file:
        title = sequence_file.readline() # read 1st line
        aminoacid_sequence = sequence_file.read() # read the rest

    # some cleanup, if necessary
    title = title.strip() # remove trailing white spaces and newline
    aminoacid_sequence = aminoacid_sequence.replace(" ","").replace("\n","")
    return title, aminoacid_sequence
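If a file holds many records, the same no-regex idea can be stretched a little. The sketch below is one possible approach, not part of the answer above: it assumes linefeed-only text in which a title, a blank line, and a block of sequence lines alternate strictly, with blank lines between records. parse_records is a hypothetical helper.

```python
def parse_records(text):
    # Blank lines split the text into chunks: title, sequence, title, sequence, ...
    chunks = [chunk for chunk in text.split("\n\n") if chunk.strip()]
    records = []
    for title, sequence in zip(chunks[0::2], chunks[1::2]):
        # Join the sequence lines by dropping all whitespace, newlines included.
        records.append((title.strip(), "".join(sequence.split())))
    return records

sample = "Some varying text1\n\nAAAB\nCCCD\n\nSome varying text 2\n\nEEEF\nGGGH\n"
print(parse_records(sample))
```

If the alternation assumption does not hold (for example, a title with no sequence after it), the pairing would drift, so real data may need an extra check per chunk.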

Solution 5

find:

^>([^\n\r]+)[\n\r]([A-Z\n\r]+)

\1 = some_varying_text

\2 = lines of all CAPS

Edit (proof that this works):

text = """> some_Varying_TEXT

DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF
GATACAACATAGGATACA
GGGGGAAAAAAAATTTTTTTTT
CCCCAAAA

> some_Varying_TEXT2

DJASDFHKJFHKSDHF
HHASGDFTERYTERE
GAGAGAGAGAG
PPPPPAAAAAAAAAAAAAAAP
"""

import re

regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)
matches = [m.groups() for m in regex.finditer(text)]

for m in matches:
    print('Name: %s\nSequence:%s' % (m[0], m[1]))
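With this pattern the captured name keeps its leading space and the captured sequence keeps its newlines. A short sketch of tidying both, using a shortened made-up sample in the same layout:

```python
import re

text = "> some_Varying_TEXT\n\nGATACA\nGGGAAA\n\n> some_Varying_TEXT2\n\nCCCC\nAAAA\n"

regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)

records = [
    # strip() trims the name; split() + join() removes every newline in the sequence
    (m.group(1).strip(), "".join(m.group(2).split()))
    for m in regex.finditer(text)
]
print(records)
```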
Author by

Jan

Currently working at a Copenhagen-based start-up within Network and Asset Management. Formerly at the Barcelona Supercomputer Center, Storage Systems Group developing kernel-level network file systems. Specialty and interest for distributed file systems development in general. M.Sc. Computer Science, Copenhagen University

Updated on July 08, 2022

Comments

  • Jan
    Jan almost 2 years

    I'm having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. The example text is ('\n' is a newline)

    some Varying TEXT\n
    \n
    DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n
    [more of the above, ending with a newline]\n
    [yep, there is a variable number of lines here]\n
    \n
    (repeat the above a few hundred times).
    

    I'd like to capture two things: the 'some_Varying_TEXT' part, and all of the lines of uppercase text that come two lines below it in one capture (I can strip out the newline characters later). I've tried a few approaches:

    re.compile(r"^>(\w+)$$([.$]+)^$", re.MULTILINE) # try to capture both parts
    re.compile(r"(^[^>][\w\s]+)$", re.MULTILINE|re.DOTALL) # just textlines
    

    and a lot of variations hereof with no luck. The last one seems to match the lines of text one by one, which is not what I really want. I can catch the first part, no problem, but I can't seem to catch the 4-5 lines of uppercase text. I'd like match.group(1) to be some_Varying_Text and group(2) to be line1+line2+line3+etc until the empty line is encountered.

    If anyone's curious, it's supposed to be a sequence of amino acids that make up a protein.

    • UncleZeiv
      UncleZeiv about 15 years
      Is there something else in the file besides the first line and the uppercase text? I'm not sure why you would use a regex instead of splitting all the text at newline characters and taking the first element as "some_Varying_TEXT".
    • Admin
      Admin about 15 years
      yes, regex are the wrong tool for this.
    • MiniQuark
      MiniQuark about 15 years
      Your sample text doesn't have a leading > character. Should it?
  • MiniQuark
    MiniQuark about 15 years
    Unfortunately, this regular expression will also match groups of capital letters separated by empty lines. It might not be a big deal though.
  • MiniQuark
    MiniQuark about 15 years
    You may want to replace the second dot in the regex by [A-Z] if you don't want this regular expression to match just about any text file with an empty second line. ;-)
  • Alan Moore
    Alan Moore about 15 years
    My impression is that the target files will conform to a definite (and repeating) pattern of empty vs. non-empty lines, so it shouldn't be necessary to specify [A-Z], but it probably won't hurt, either.
  • Alan Moore
    Alan Moore about 15 years
    match() only returns one match, at the very beginning of the target text, but the OP said there would be hundreds of matches per file. I think you would want finditer() instead.
  • Andrew Dalke
    Andrew Dalke about 15 years
    Looks like coonj likes FASTA files. ;)
  • Jan
    Jan about 15 years
    Definitely the easiest way if there was only one, and it's also workable with more, if some more logic is added. There are about 885 proteins in this specific dataset though, and I felt that a regex should be able to handle this.
  • Jan
    Jan about 15 years
    This solution worked beautifully. As an aside, I apologize, since I obviously didn't clarify the situation enough (and also for the lateness of this reply). Thanks for your help!
  • pauljohn32
    pauljohn32 about 3 years
    This is the best, most direct answer, IMHO.
  • grantr
    grantr about 2 years
    This is a great answer. You may have to modify it if you need to span multiple line breaks in a row (\n\n).