How to get the part of a file after the first line that matches a regular expression

Solution 1

The following will print the line matching TERMINATE till the end of the file:

sed -n -e '/TERMINATE/,$p'

Explained: -n disables sed's default behavior of printing each line after running its script on it; -e introduces a script for sed; /TERMINATE/,$ is an address (line) range meaning from the first line matching the TERMINATE regular expression (as with grep) to the end of the file ($); and p is the print command, which prints the current line.
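A minimal sketch of that command on a throwaway file; the file contents and the variable names (sample, from_match) are made up for illustration:

```shell
# Hypothetical sample input; the contents are invented for this example.
sample=$(mktemp)
printf '%s\n' 'alpha' 'beta' 'TERMINATE here' 'gamma' 'delta' > "$sample"

# Print from the first matching line through the end of the file.
from_match=$(sed -n -e '/TERMINATE/,$p' "$sample")
printf '%s\n' "$from_match"
# TERMINATE here
# gamma
# delta

rm -f "$sample"
```

The matching line itself is part of the output, because the range starts on it.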

This will print from the line that follows the line matching TERMINATE till the end of the file: (from AFTER the matching line to EOF, NOT including the matching line)

sed -e '1,/TERMINATE/d'

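Explained: 1,/TERMINATE/ is an address (line) range meaning from the first line of the input to the first line matching the TERMINATE regular expression, and d is the delete command, which deletes the current line and skips to the next line. Because sed prints lines by default, it prints the lines after TERMINATE to the end of the input. Note that sed only starts looking for the end of the range on line 2, so if TERMINATE is on the very first line, the range runs on to the next match (or to the end of the input if there is none); GNU sed accepts 0,/TERMINATE/ to handle that case.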
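A small sketch of the delete form, including what happens when the match sits on line 1. The sample files and variable names are invented; the blank-line trick at the end is one portable workaround among several, not the only way to do it:

```shell
# Hypothetical sample input with the match in the middle.
sample=$(mktemp)
printf '%s\n' 'alpha' 'TERMINATE' 'gamma' 'delta' > "$sample"

# Delete everything up to and including the first match.
after_match=$(sed -e '1,/TERMINATE/d' "$sample")
printf '%s\n' "$after_match"
# gamma
# delta

# Edge case: the match is on the very first line.
first=$(mktemp)
printf '%s\n' 'TERMINATE' 'gamma' 'delta' > "$first"

# sed looks for the end of the range from line 2 on, finds no further
# match, and therefore deletes the whole input.
edge_plain=$(sed -e '1,/TERMINATE/d' "$first")

# Workaround sketch: prepend an empty line so the match is never on line 1.
edge_padded=$({ echo; cat "$first"; } | sed -e '1,/TERMINATE/d')
printf '%s\n' "$edge_padded"
# gamma
# delta

rm -f "$sample" "$first"
```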

If you want the lines before TERMINATE:

sed -e '/TERMINATE/,$d'

And if you want both lines before and after TERMINATE in two different files in a single pass:

sed -e '1,/TERMINATE/w before
/TERMINATE/,$w after' file

Both the before and after files will contain the TERMINATE line, so to drop it when processing each one, use (head -n -1 requires GNU head):

head -n -1 before
tail -n +2 after
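The split and the trimming can be tried end to end as follows. This is a sketch under a few assumptions: the scratch directory and sample input are made up, -n is added so that sed does not also echo the input to standard output, and sed '$d' stands in for head -n -1 on systems without GNU head:

```shell
# Hypothetical scratch directory and input; all names here are made up.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'alpha' 'beta' 'TERMINATE' 'gamma' 'delta' > file

# -n is an addition (not in the original command): it stops sed from also
# printing every input line to stdout while it writes the two files.
sed -n -e '1,/TERMINATE/w before
/TERMINATE/,$w after' file

# Both files include the TERMINATE line; trim it from each.
before_only=$(sed '$d' before)     # portable stand-in for GNU "head -n -1"
after_only=$(tail -n +2 after)

printf 'before:\n%s\n' "$before_only"
printf 'after:\n%s\n' "$after_only"

cd / && rm -rf "$workdir"
```

With this input, before_only holds alpha and beta, and after_only holds gamma and delta.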

If you do not want to hard-code the filenames in the sed script, you can use shell variables:

before=before.txt
after=after.txt
sed -e "1,/TERMINATE/w $before
/TERMINATE/,\$w $after" file

But then you have to escape the $ that means the last line, so that the shell does not try to expand a variable named $w (note that the script is now in double quotes instead of single quotes).

The newline after each filename in the script is important: it tells sed where the filename ends.

How would you replace the hardcoded TERMINATE by a variable?

You would make a variable for the matching text and then do it the same way as the previous example:

matchtext=TERMINATE
before=before.txt
after=after.txt
sed -e "1,/$matchtext/w $before
/$matchtext/,\$w $after" file

To use a variable for the matching text with the previous examples:

## Print the line containing the matching text, till the end of the file:
## (from the matching line to EOF, including the matching line)
matchtext=TERMINATE
sed -n -e "/$matchtext/,\$p"
## Print from the line that follows the line containing the
## matching text, till the end of the file:
## (from AFTER the matching line to EOF, NOT including the matching line)
matchtext=TERMINATE
sed -e "1,/$matchtext/d"
## Print all the lines before the line containing the matching text:
## (from line-1 to BEFORE the matching line, NOT including the matching line)
matchtext=TERMINATE
sed -e "/$matchtext/,\$d"

The important points about replacing text with variables in these cases are:

  1. Variables ($variablename) enclosed in single quotes ['] won't "expand" but variables inside double quotes ["] will. So, you have to change all the single quotes to double quotes if they contain text you want to replace with a variable.
  2. The sed ranges also contain a $ and are immediately followed by a letter like: $p, $d, $w. They will also look like variables to be expanded, so you have to escape those $ characters with a backslash [\] like: \$p, \$d, \$w.
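The two quoting rules can be checked side by side. The sketch below uses an invented sample file and invented variable names; the last command shows an alternative that avoids the backslash by double-quoting only the variable:

```shell
# Hypothetical sample input.
sample=$(mktemp)
printf '%s\n' 'alpha' 'TERMINATE' 'gamma' > "$sample"
matchtext=TERMINATE

# Single quotes: the shell passes $matchtext through untouched, so sed
# searches for that literal text and prints nothing.
single=$(sed -n -e '/$matchtext/,$p' "$sample")

# Double quotes: $matchtext expands; \$ keeps sed's "last line" address.
double=$(sed -n -e "/$matchtext/,\$p" "$sample")

# Alternative: double-quote the variable, single-quote the rest.
mixed=$(sed -n -e "/$matchtext/"',$p' "$sample")

printf '%s\n' "$double"
# TERMINATE
# gamma

rm -f "$sample"
```

Keep in mind that sed still treats the expanded text as a regular expression, and a / inside it would end the address early.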

Solution 2

As a simple approximation you could use

grep -A100000 TERMINATE file

which greps for TERMINATE and outputs up to 100,000 lines following that line.

From the man page:

-A NUM, --after-context=NUM

Print NUM lines of trailing context after matching lines. Places a line containing a group separator (--) between contiguous groups of matches. With the -o or --only-matching option, this has no effect and a warning is given.
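Instead of guessing a large number, the context size can be taken from the file itself, as suggested in the comments. A sketch with an invented sample file, assuming a grep that supports -A (GNU and BSD grep do; it is not part of POSIX):

```shell
# Hypothetical sample input.
sample=$(mktemp)
printf '%s\n' 'alpha' 'TERMINATE' 'gamma' 'delta' > "$sample"

# The total line count is always a large enough context size.
# The arithmetic expansion strips any padding that wc may print.
total=$(( $(wc -l < "$sample") ))

# Matching line plus everything after it.
with_match=$(grep -A "$total" TERMINATE "$sample")

# Drop the matching line itself with tail.
without_match=$(grep -A "$total" TERMINATE "$sample" | tail -n +2)

printf '%s\n' "$without_match"
# gamma
# delta

rm -f "$sample"
```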

Solution 3

A tool to use here is AWK:

cat file | awk 'BEGIN{ found=0} /TERMINATE/{found=1}  {if (found) print }'

How does this work:

  1. We set the variable 'found' to zero, which evaluates to false.
  2. If a line matches the regular expression 'TERMINATE', we set it to one.
  3. If our 'found' variable evaluates to true, we print the line :)

Solutions that first read the whole file into a shell variable might consume a lot of memory on very large files; awk processes the input one line at a time.
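The flag logic described in the steps above can be tried on a small file. In this sketch the sample contents are invented, and awk reads the file directly rather than through cat:

```shell
# Hypothetical sample input with two matching lines.
sample=$(mktemp)
printf '%s\n' 'alpha' 'TERMINATE' 'gamma' 'TERMINATE again' 'delta' > "$sample"

# Set the flag on the first match, then print every line while it is set.
from_match=$(awk 'BEGIN { found = 0 } /TERMINATE/ { found = 1 } { if (found) print }' "$sample")
printf '%s\n' "$from_match"
# TERMINATE
# gamma
# TERMINATE again
# delta

rm -f "$sample"
```

Because the flag is never reset, later matches change nothing: output starts at the first match and runs to the end.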

Solution 4

If I understand your question correctly you do want the lines after TERMINATE, not including the TERMINATE-line. AWK can do this in a simple way:

awk '{if(found) print} /TERMINATE/{found=1}' your_file

Explanation:

  1. Although not best practice, you can rely on the fact that all variables default to 0 or the empty string if not defined. So the first expression (if(found) print) will not print anything to start with.
  2. After the printing is done, we check if this is the starter-line (that should not be included).

This will print all lines after the TERMINATE-line.
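For comparison, here is the same command next to the shorter pattern-only form mentioned in the comments, plus a variant that takes the pattern from a shell variable. The sample file and the names explicit, short, from_var, and pat are made up for this sketch:

```shell
# Hypothetical sample input.
sample=$(mktemp)
printf '%s\n' 'alpha' 'TERMINATE' 'gamma' 'delta' > "$sample"

# Form used in this answer: print first, then check for the match.
explicit=$(awk '{ if (found) print } /TERMINATE/ { found = 1 }' "$sample")

# Shorter form: a bare pattern prints the line when it is true.
short=$(awk 'found; /TERMINATE/ { found = 1 }' "$sample")

# Pattern passed in from the shell with -v instead of being hard-coded.
matchtext=TERMINATE
from_var=$(awk -v pat="$matchtext" 'found; $0 ~ pat { found = 1 }' "$sample")

printf '%s\n' "$explicit"
# gamma
# delta

rm -f "$sample"
```

All three produce the same output here. Unlike the sed range 1,/TERMINATE/, this also works when the match is on the first line.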


Generalization:

  • You have a file with start- and end-lines and you want the lines between those lines excluding the start- and end-lines.
  • start- and end-lines could be defined by a regular expression matching the line.

Example:

$ cat ex_file.txt
not this line
second line
START
A good line to include
And this line
Yep
END
Nope more
...
never ever
$ awk '/END/{found=0} {if(found) print} /START/{found=1}' ex_file.txt
A good line to include
And this line
Yep
$

Explanation:

  1. If the end-line is found no printing should be done. Note that this check is done before the actual printing to exclude the end-line from the result.
  2. Print the current line if found is set.
  3. If the start-line is found then set found=1 so that the following lines are printed. Note that this check is done after the actual printing to exclude the start-line from the result.

Notes:

  • The code relies on the fact that all AWK variables default to 0 or the empty string if not defined. This is valid, but it may not be best practice, so you could add BEGIN{found=0} to the start of the AWK program.
  • If multiple start-end-blocks are found, they are all printed.
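The note about multiple blocks can be seen with a file that contains two START/END pairs. This is a sketch with invented sample data; the second command is one possible way to stop after the first block and is not part of the original answer:

```shell
# Hypothetical sample input with two START/END blocks.
sample=$(mktemp)
printf '%s\n' 'skip' 'START' 'a' 'b' 'END' 'skip' 'START' 'c' 'END' 'tail' > "$sample"

# All blocks: same logic as above, using the pattern-only print form.
all_blocks=$(awk '/END/ { found = 0 } found; /START/ { found = 1 }' "$sample")
printf '%s\n' "$all_blocks"
# a
# b
# c

# First block only: leave awk as soon as the first end-line is reached.
first_block=$(awk '/END/ && found { exit } found; /START/ { found = 1 }' "$sample")
printf '%s\n' "$first_block"
# a
# b

rm -f "$sample"
```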

Solution 5

grep -A 10000000 'TERMINATE' file

is much, much faster than sed, especially on a really big file. It prints up to 10M lines after the match (or whatever number you put in), so there is no harm in making the number big enough to cover any file you are likely to encounter.

Author: Yugal Jindle

Updated on July 25, 2022

Comments

  • Yugal Jindle
    Yugal Jindle almost 2 years

    I have a file with about 1000 lines. I want the part of my file after the line which matches my grep statement.

    That is:

    cat file | grep 'TERMINATE'     # It is found on line 534
    

    So, I want the file from line 535 to line 1000 for further processing.

    How can I do that?

    • Jacob
      Jacob almost 13 years
      UUOC (Useless Use of cat): grep 'TERMINATE' file
    • Yugal Jindle
      Yugal Jindle almost 13 years
      I know that; it's just the way I use it. Let's come back to the question.
    • aioobe
      aioobe almost 13 years
      This is a perfectly fine programming question, and well suited for stackoverflow.
    • runeks
      runeks almost 8 years
      @Jacob It's not useless use of cat at all. Its use is to print a file to standard output, which means we can use greps standard input interface to read data in, rather than having to learn what switch to apply to grep, and sed, and awk, and pandoc, and ffmpeg etc. when we want to read from a file. It saves time because we don't have to learn a new switch every time we want to do the same thing: read from a file.
    • LOAS
      LOAS almost 7 years
      @runeks I agree with your sentiment - but you can achieve that without cat: grep 'TERMINATE' < file. Maybe it does make the reading a bit harder - but this is shell scripting, so that's always going to be a problem :)
  • Yugal Jindle
    Yugal Jindle almost 13 years
    Can you explain what are you doing ?
  • Yugal Jindle
    Yugal Jindle almost 13 years
    That might work for this, but I need to code it into my script to process many files. So, show some generic solution.
  • Yugal Jindle
    Yugal Jindle almost 13 years
    --after-context is fine but not in all cases.
  • Yugal Jindle
    Yugal Jindle almost 13 years
    Can you suggest something else.. ??
  • Mu Qiao
    Mu Qiao almost 13 years
    I copied the content of "file" into the $content variable. Then I removed all the characters until "TERMINATE" was seen. It didn't use greedy matching, but you can use greedy matching by ${content##*TERMINATE}.
  • Mu Qiao
    Mu Qiao almost 13 years
    here is the link of the bash manual: gnu.org/software/bash/manual/…
  • Yugal Jindle
    Yugal Jindle almost 13 years
    How can we get the lines before TERMINATE and delete all that follows ?
  • michelgotta
    michelgotta about 11 years
    I think this is one practical solution!
  • PiyusG
    PiyusG about 10 years
    similarly -B NUM, --before-context=NUM Print NUM lines of leading context before matching lines. Places a line containing a group separator (--) between contiguous groups of matches. With the -o or --only-matching option, this has no effect and a warning is given.
  • Znik
    Znik over 9 years
    what will happen if file is 100GB size ?
  • Znik
    Znik over 9 years
    file is scanned twice. what if it is 100GB size?
  • 123
    123 about 9 years
    For the number you can also use more +7 file
  • Sébastien Clément
    Sébastien Clément over 8 years
    How would you replace the hardcoded TERMINATE by a variable?
  • tripleee
    tripleee over 8 years
    Extracting a line number with grep so you can feed it to tail is a wasteful antipattern. Finding the match and printing up through the end of the file (or, conversely, printing and stopping at the first match) is eminently done with the normal, essential regex tools themselves. The massive grep | tail | sed | awk is also in and of itself a massive useless use of grep and friends.
  • Jose Martinez
    Jose Martinez over 8 years
    this solution worked for me because i can easily use variables as my string to check for.
  • fbicknel
    fbicknel almost 8 years
    I think s*he was trying to give us something that would find the /last instance/ of 'TERMINATE' and give the lines from that instance on. Other implementations give you the first instance onward. The LINE_NUMBER should probably look like this, instead: LINE_NUMBER=$(grep -o -n 'TERMINATE' $OSCAM_LOG | tail -n 1| awk -F: '{print $1}') Maybe not the most elegant way, but it seems to get the job done. ^.^
  • fbicknel
    fbicknel almost 8 years
    ... or all in one line, but ugly: tail -n +$(grep -o -n 'TERMINATE' $YOUR_FILE_NAME | tail -n 1| awk -F: '{print $1}') $YOUR_FILE_NAME
  • fbicknel
    fbicknel almost 8 years
    .... and I was going to go back and edit out $OSCAM_LOG in lieu of $YOUR_FILE_NAME... but can't for some reason. No idea where $OSCAM_LOG came from; I just mindlessly parroted it. o.O
  • mivk
    mivk almost 8 years
    This includes the matching line, which is not what is wanted in this question.
  • fedorqui
    fedorqui almost 8 years
    @mivk well, this is also the case of the accepted answer and the 2nd most upvoted, so the problem may be with a misleading title.
  • tripleee
    tripleee almost 8 years
    Doing this in Awk alone is a common task in Awk 101. If you are already using a more capable tool just to get the line number, let go of tail and do the task in the more capable tool altogether. Anyway, the title clearly says "first match".
  • tripleee
    tripleee almost 8 years
    Downvote: This is horrible (reading the file into a variable) and wrong (using the variable without quoting it; and you should properly use printf or make sure you know exactly what you are passing to echo.).
  • Mad Physicist
    Mad Physicist almost 8 years
    Downvoted because this is a crappy solution, but then upvoted because 90% of the answer is caveats.
  • mato
    mato over 7 years
    One use case that's missing here is how to print lines after the last marker (if there can be multiple of them in the file .. think log files etc).
  • Karalga
    Karalga over 7 years
    The example sed -e "1,/$matchtext/d" does not work when $matchtext occurs in the first line. I had to change it to sed -e "0,/$matchtext/d".
  • Samveen
    Samveen about 7 years
    If the line number is known, then grep isn't even required; you can just use tail -n $NUM, so this isn't really an answer.
  • Lemming
    Lemming almost 7 years
    Nice idea! If you are uncertain about the size of the context you may count the lines of file instead: grep -A$(cat file | wc -l) TERMINATE file
  • Aleksander Stelmaczonek
    Aleksander Stelmaczonek almost 7 years
    Simple, elegant and very generic. In my case it was printing everything until second occurrence of '###': cat file | awk 'BEGIN{ found=0} /###/{found=found+1} {if (found<2) print }'
  • Timothy Swan
    Timothy Swan over 6 years
    I need something that limits characters, not lines.
  • Ahmed
    Ahmed almost 6 years
    If you want exactly the remaining lines in your file after the pattern TERMINATE, you can use this: grep -A$(($(cat file | wc -l)-$(grep -n TERMINATE file | awk -F":" '{print $1}'))) TERMINATE file
  • aioobe
    aioobe almost 6 years
    @Ahmed, how is that better than grep -A$(wc -l < file) TERMINATE file?
  • Ahmed
    Ahmed almost 6 years
    @aioobe because it returns only the lines that remain for the end of file $(($(cat file | wc -l)-$(grep -n TERMINATE file | awk -F":" '{print $1}')))
  • aioobe
    aioobe almost 6 years
    @Ahmed, but so does grep -A$(wc -l < file) TERMINATE file, right?
  • tripleee
    tripleee almost 6 years
    A tool not to use here is cat. awk is perfectly capable of taking one or more filenames as arguments. See also stackoverflow.com/questions/11710552/useless-use-of-cat
  • user1169420
    user1169420 over 5 years
    Awesome, awesome example. Just spent 2 hours looking at csplit, sed, and all manner of overcomplicated awk commands. Not only did this do what I wanted, but it was simple enough to infer how to modify it to do a few other related things I needed. Makes me remember awk is great and not just an indecipherable mess of crap. Thanks.
  • szmoore
    szmoore over 5 years
    Using wc -l to make sure you don't accidentally truncate lines is nice, but you just need NUM > lines remaining not NUM == lines remaining. The calculation of the "exact" number of lines remaining is going to read the file many more times than is necessary and is more complicated than the sed or awk solutions (the main advantage of grep is it's the easiest to remember).
  • user000001
    user000001 about 5 years
    {if(found) print} is a bit of an anti-pattern in awk, it's more idiomatic to replace the block with just found or found; if you need another filter afterwards.
  • John_Smith
    John_Smith about 5 years
    @user000001 please explain. I do not understand what to replace and how. Anyway I think the way its written makes it very clear what is going on.
  • user000001
    user000001 about 5 years
    You would replace awk '{if(found) print} /TERMINATE/{found=1}' your_file with awk 'found; /TERMINATE/{found=1}' your_file, they should both do the same thing.
  • Znik
    Znik over 3 years
    unfortunately grep doesn't support INFINITE as NUM for -A and -B option :( then we must add very big numbers, but we don't know what is maximum int for them.
  • Pavan Kumar
    Pavan Kumar almost 3 years
    One stop shop for my problem. Prefer to double upvote this answer, but I can't.
  • Peter Mortensen
    Peter Mortensen over 2 years
    What do you mean by "handle about anything you hit" (seems incomprehensible)? Please respond by editing (changing) your answer, not here in comments (without "Edit:", "Update:", or similar - the answer should appear as if it was written today).
  • Mxt
    Mxt about 2 years
    @Karalga had the same issue, except sed -e "0,/$matchtext/d" still displays $matchtext for me, so I did this: sed -e "0,/$matchtext/d" | tail -n +2. But sed -e '1i\\n' | sed -e "1,/$matchtext/d" should work universally.