How do I grep for multiple patterns on multiple lines?

command-line grep text-processing

73,678

Solution 1

Updated 18-Nov-2016: grep's behavior has changed. With the -P parameter (combined with -z), grep no longer supports the ^ and $ anchors [on Ubuntu 16.04 with kernel v4.4.0-21-generic]. (wrong (non-)fix)

$ grep -Pzo "begin(.|\n)*\nend" file
begin
Some text goes here.  
end
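To try this out, here is a minimal sketch that builds a throwaway sample file from the text in the question and runs the newline-anchored variant reported as working in the comments. The temporary file and the variable names `sample` and `block` are only illustrative, and it assumes a GNU grep built with -P (PCRE) support. Since -z makes grep terminate each printed match with a NUL byte, the sketch strips it with tr before printing.

```shell
# Hypothetical sample file, built from the block shown in the question.
sample=$(mktemp)
printf 'Some text\nbegin\nSome text goes here.\nend\nSome more text\n' > "$sample"

# With -z the whole file is one record, so \n can appear in the pattern.
# grep ends each -o match with a NUL byte under -z; tr removes it.
block=$(grep -Pzo 'begin\n(.|\n)*\nend\n' "$sample" | tr -d '\0')

printf '%s\n' "$block"
rm -f "$sample"
```

The pattern is single-quoted here, so nothing in it needs escaping for the shell.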

Note: in the other commands below, replace the '^' and '$' anchors with the newline character '\n'.

With grep command:

grep -Pzo "^begin\$(.|\n)*^end$" file

If you don't want to include the patterns "begin" and "end" in the result, use grep with lookbehind and lookahead assertions.

grep -Pzo "(?<=^begin$\n)(.|\n)*(?=\n^end$)" file

You can also use the \K escape instead of a lookbehind assertion.

grep -Pzo "^begin$\n\K(.|\n)*(?=\n^end$)" file

\K discards everything matched before it from the reported match, so the preceding pattern itself is not printed.
\n is included in the pattern to avoid printing empty lines in the output.

Or, as @AvinashRaj suggests, there are simpler grep commands, as follows:

grep -Pzo "(?s)^begin$.*?^end$" file

grep -Pzo "^begin\$[\s\S]*?^end$" file

(?s) tells grep to allow the dot to match newline characters.
[\s\S] matches any character that is either whitespace or non-whitespace.

And the versions that do not include "begin" and "end" in the output are as follows:

grep -Pzo "^begin$\n\K[\s\S]*?(?=\n^end$)" file # or grep -Pzo "(?<=^begin$\n)[\s\S]*?(?=\n^end$)"

grep -Pzo "(?s)(?<=^begin$\n).*?(?=\n^end$)" file
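For a grep that rejects ^ and $ together with -Pz, the same idea can be sketched with '\n' in place of the anchors, as the note near the top suggests. This is only a sketch under assumptions: the sample file and the variable names `sample` and `inner` are hypothetical, and grep must be built with -P support.

```shell
# Hypothetical sample file, built from the block shown in the question.
sample=$(mktemp)
printf 'Some text\nbegin\nSome text goes here.\nend\nSome more text\n' > "$sample"

# (?s) lets the dot match newlines; \K drops "begin\n" from the match;
# the lookahead requires "\nend\n" without including it in the output.
# tr removes the NUL byte that -z appends to each printed match.
inner=$(grep -Pzo '(?s)begin\n\K.*?(?=\nend\n)' "$sample" | tr -d '\0')

printf '%s\n' "$inner"
rm -f "$sample"
```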

See the full test of all commands here (out of date, since grep's behavior with the -P parameter has changed).

Note:

^ marks the beginning of a line and $ marks the end of a line. They are placed around "begin" and "end" so that these words match only when they are alone on a line.
In two commands I escaped $ because, inside double quotes, $( also starts "Command Substitution" ($(command)), which replaces the command with its output.
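As a small illustration of the quoting point, the sketch below prints the pattern as grep would receive it, once from a double-quoted string with the escaped $ and once from a single-quoted string where no escaping is needed. The variable names `dq` and `sq` are only illustrative.

```shell
# Double quotes: \$ is needed so that $( is not read as command substitution.
dq="^begin\$(.|\n)*^end$"

# Single quotes: the shell passes everything through literally.
sq='^begin$(.|\n)*^end$'

# Both lines printed here should be identical.
printf '%s\n' "$dq" "$sq"
```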

From man grep:

-o, --only-matching
      Print only the matched (non-empty) parts of a matching line,
      with each such part on a separate output line.

-P, --perl-regexp
      Interpret PATTERN as a Perl compatible regular expression (PCRE)

-z, --null-data
      Treat the input as a set of lines, each terminated by a zero byte (the ASCII 
      NUL character) instead of a newline. Like the -Z or --null option, this option 
      can be used with commands like sort -z to process arbitrary file names.

Solution 2

In case your grep doesn't support Perl syntax (-P), you can try joining the lines, matching the pattern, then expanding the lines again as below (this assumes the text itself contains no commas, since every comma is turned back into a newline):

$ tr '\n' , < foo.txt | grep -o "begin.*end" | tr , '\n'
begin
Some text goes here.
end
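If the text may contain commas, one possible variation (not part of the original answer) is to join the lines with a character that is unlikely to occur in the file. The sketch below uses the control character \001 for that, which is an assumption about the input; the sample file and the names `sample` and `joined` are hypothetical. Note that `.*` is greedy, so with several blocks it would span from the first "begin" to the last "end".

```shell
# Hypothetical sample file; the block now contains a comma.
sample=$(mktemp)
printf 'Some text\nbegin\nSome text, with a comma.\nend\nSome more text\n' > "$sample"

# Join lines with \001, match on the single joined line, then split again.
joined=$(tr '\n' '\001' < "$sample" | grep -o 'begin.*end' | tr '\001' '\n')

printf '%s\n' "$joined"
rm -f "$sample"
```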


Iker

Updated on September 18, 2022

Comments

Iker over 1 year
To be precise
```
Some text
begin
Some text goes here.
end
Some more text
```
and I want to extract the entire block that starts at "begin" and runs through "end".

With awk we can do it like this: awk '/begin/,/end/' text.

How can I do this with grep?
- h3. over 9 years
  
  Same question on Unix & Linux. Don't do that.
Avinash Raj over 9 years

Change your grep to grep -Pzo "(?<=begin\n)(.|\n)*(?=\nend)" file so that it does not print the \n character at the end of the begin line.
Avinash Raj over 9 years

Use the DOTALL modifier to make the dot match newline characters as well: grep -Pzo "(?s)begin.*?end" file
αғsнιη over 9 years

@AvinashRaj thank you, I added that to avoid the \n, but you can post your other solution as your own answer ;)
Avinash Raj over 9 years

Why? add it to yours. I have more reps :-)
terdon over 9 years

You might want to use grep -Pzo "begin(.|\n)*\nend" file instead to make sure that end only matches at the beginning of a line and not in things like bend.
αғsнιη over 9 years

@terdon Can I use ^end instead? or even better ^end$?
terdon over 9 years

Huh, yes you can. I had thought that the ^ would only match the beginning of the file when using -z but apparently not.
terdon over 9 years

The man page says: "-z: Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline." so I would expect the ^ and $ to match just before and just after a \0 instead. Apparently, they're hard coded to match \n.
musbach over 7 years

The solution doesn't work. It produces an error: grep: ein nicht geschütztes ^ oder $ wird mit -Pz nicht unterstützt. The translation of the error is something like: grep: an unescaped ^ or $ is not supported with -Pz
terdon over 7 years

I guess grep's behavior has changed. I just tested and musbach is right, the ^ and $ don't work with -Pz. It should work as expected if you replace ^ and $ with \n though.
αғsнιη over 7 years

@terdon paste.ubuntu.com/9096940
terdon over 7 years

Yes, I know, that's in your answer. I'm sure it worked when you posted this, but try it again today. The behavior of grep seems to have changed.
musbach over 7 years

@terdon you are right. This works: grep -Pzo "begin\n(.|\n)*\nend\n" file. If I put a \n before begin (grep -Pzo "\nbegin\n(.|\n)*\nend\n" file), I get a blank line and then the correct output. I guess that \n produces a linefeed, but it looks strange to me. @KasiyA I am on Ubuntu 16.04. Which OS are you on?
terdon over 7 years

@musbach yes, \n is the newline character. You get an extra newline because with \nbegin you are including the newline character at the end of the previous line, so that's printed as a blank line.
αғsнιη over 7 years

At that time I was on 14.04, but right now I'm away from my Ubuntu 16.04 machine and can't test it. Once I'm back on 16.04 I will double-check, but grep's behavior has certainly changed, as terdon confirmed, @musbach
musbach over 7 years

Yes, the answer should be corrected or flagged as wrong.
αғsнιη over 7 years

I have asked about this wrong grep behavior: "grep command doesn't support start '^' and '$' end of line anchors when it's with -Pz" @terdon
musbach over 7 years

I also added that it works on 14.04, but it doesn't work on 10.4 and it doesn't work on 16.04 (see below). Why it only works on 14.04 is very strange.
terdon over 7 years

@musbach there's no way (and no reason) to flag an answer as wrong. You've left a comment explaining it, that's all that's needed. The answer was correct when posted, after all.