Extract paragraph separated with *** using AWK

Solution 1

Tell awk to print between the two delimiters. Specifically:

awk '/\*{4,}/,/<np>/' file

That will also print the lines containing the delimiters, so you can remove them with:

awk '/\*{4,}/,/<np>/' file | tail -n +2 | head -n -1

Alternatively, you can set a variable to true if a line matches the 1st delimiter and to false when it matches the second and only print when it is true:

awk '/\*{4,}/{a=1; next}/<np>/{a=0}(a==1){print}' file

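The command above sets a to 1 when the current line contains 4 or more * characters and then skips to the next line, so the line of asterisks is never printed. The <np> line is not printed either, because a is reset to 0 before the print rule is reached. If your awk does not support interval expressions such as {4,} (some older versions do not), write the pattern as /\*\*\*\*/ instead.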
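
As a rough illustration of the flag approach, here is a minimal sketch on a made-up input. The sample contents and the names sample, extract and result are hypothetical. The pattern is spelled /\*\*\*\*/ rather than /\*{4,}/, since the comments report that the interval form failed on the asker's awk. It also assumes that the <np> shown by nedit (^L in vi) is a form feed character, which the question never confirms, so the end pattern accepts either a literal <np> or \f.

```shell
# Build a small hypothetical input; printf '\f' emits a form feed (^L).
# Assumption: the real end marker is a form feed, displayed as <np> by nedit.
sample=$(mktemp)
printf 'header\n****\nrow1\nrow2\n \f\nfooter\n' > "$sample"

# Turn the flag on after the line of asterisks, turn it off at the end
# marker, and print only while the flag is set.
extract() {
    awk '/\*\*\*\*/ {a=1; next} /<np>|\f/ {a=0} a' "$1"
}

result=$(extract "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Because both delimiter lines are handled before the print rule, no tail or head post-processing is needed.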


This was in answer to the original, misunderstood, version of the question. I'm leaving it here since it can be useful in a slightly different situation.

First of all, you don't want FS (field separator), you want RS (record separator). Then, to pass a literal *, you need to escape it twice. Once to escape the * and once to escape the backslash (otherwise, awk will try to match it in the same way as \r or \t). Then, you print the 2nd "line":

$ awk -vRS='\\*\\*\\*' 'NR==2' file

thingsIwantToRead1   
thingsIwantToRead2   
thingsIwantToRead3  

To avoid the blank lines around the output, use:

$ awk -vRS='\n\\*\\*\\*\n' 'NR==2' file
thingsIwantToRead1   
thingsIwantToRead2   
thingsIwantToRead3  

Note that this assumes a *** after each paragraph, not only after the first one as you show.

Solution 2

In addition to @terdon's answer, with awk (and sed) you can use range pattern:

awk '/sep1/,/sep2/{print}' file

or

sed -n '/sep1/,/sep2/p' file

will print everything from sep1 through sep2, including both delimiter lines. That is:

~$ awk '/sep1/,/sep2/{print}' file
sep1
thingsIwantToRead1
thingsIwantToRead2
thingsIwantToRead3
sep2

In your case:

~$ awk '/\*\*\*/,/^$/{print}' file
***
thingsIwantToRead1
thingsIwantToRead2
thingsIwantToRead3
 

Then you might want to delete the first and last line.

For instance with:

~$ sed -n '/\*\*\*/,/^$/p' file | sed '1d;$d'
thingsIwantToRead1
thingsIwantToRead2
thingsIwantToRead3

or

~$ awk '/\*\*\*/,/^$/{print}' file | awk 'NR>1&&!/^$/ {print}'
thingsIwantToRead1
thingsIwantToRead2
thingsIwantToRead3

Either works fine if your paragraph isn't too long.
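
The two-stage pipelines above could probably be folded into a single sed call that drops the delimiter lines inside the range itself. This is only a sketch: the sample contents and the names sample and paragraph are made up for the demonstration, and it assumes the paragraph ends at the first blank line after ***.

```shell
# Hypothetical input modelled on the question's first example.
sample=$(mktemp)
printf 'blablabla\n***\nthingsIwantToRead1\nthingsIwantToRead2\n\nblablabla\n' > "$sample"

# Within the range from the *** line to the next blank line,
# delete the two delimiter lines and print everything else.
paragraph=$(sed -n '/\*\*\*/,/^$/{/\*\*\*/d;/^$/d;p;}' "$sample")
printf '%s\n' "$paragraph"
rm -f "$sample"
```

Unlike sed '1d;$d', this does not rely on the delimiters being the first and last lines of the output, so it should also behave if the input contains more than one *** section.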

Solution 3

With sed there are two ways to go with this. You can select inclusively or exclusively. In your case, an inclusive selection means printing all lines beginning with a match for '^*\*\*' up to and including either ^ *<np> (whatever that is) or ^$, a blank line.

An inclusive selection can be specified with any of the range expressions demonstrated in the other answers and involves specifying a "start printing here" pattern through to an "all the way through here" pattern.

An exclusive selection works in the opposite way. It specifies a stop printing before here pattern through to a start printing after here pattern. For your example data - and allowing for a stop printing before here pattern which will match either of a blank-line or that <np> thing:

sed -e 'x;/^\( *<np>.*\)*$/,/^*\** *$/c\' -e '' <infile >outfile
  • x
    • Swaps the hold and pattern spaces. This institutes a look-behind: sed is always one line behind input, and the first line is always blank.
  • /^\( *<np>.*\)*$/
    • This selects a stop printing before here line that matches from head to tail zero or more occurrences in the match group. Two kinds of lines can match zero or more occurrences of that - either a blank line or one with any number of <spaces> at the head of the line followed by the string <np>.
  • /^*\** *$/
    • This selects a start printing after here line which opens with at least one * asterisk character and continues to the end of the line with only zero or more occurrences of the * asterisk and possibly closed by any number of spaces.
  • c\' -e ''
    • This changes the entire blocked selection to a single blank line, squeezing all unwanted lines to the string EOF.

So any number of lines occurring before ^*\** *$ and after the first following ^\( *<np>.*\)*$ are always squeezed down to only a single blank, and only the first occurring paragraph after a match for ^*\** *$ is printed to stdout. It prints...


2012/10/01 00:00:00.000     6998.239     0.001233     97.95558     77.41733     89.98551    290.75808    359.93398
2012/10/01 00:05:00.000     6993.163     0.001168     97.95869     77.41920    124.72698    274.57362    359.93327
2012/10/01 00:10:00.000     6987.347     0.001004     97.96219     77.42327    170.94020    246.92395    359.94706
2012/10/01 00:15:00.000     6983.173     0.000893     97.96468     77.42930    224.76158    211.67042    359.97311 

That assumes you want to handle any number of occurrences of the paragraph pattern in input. If you only want the first, however, and provided you have GNU grep and infile is a regular, lseekable file:

{   grep -xm1 '*\** *'        >&2
    sed -n '/^\( *<np>.*\)*$/q;p'
}   <infile 2>/dev/null >outfile

... will work as well.

And actually, I guess, there are three ways. The third might look like:

sed 'H;$!d;x;s/\(\n\*\** *\n\(\([0-9./: ]*\n\)*\)\)*./\2/g'

...which reads in the whole file and then globally substitutes away every character which doesn't fall within the specifications of the matched lines. It prints the same as before, but those are a pain to write, and they're only safe performance-wise when you balance the optionals against any character.

Solution 4

Updated version based on question's edit:

Using Perl:

< inputfile perl -0777 -pe 's/.*[*]+\n(.*) <np>\n.*/$1/s' > outputfile
  • < inputfile: redirects the content of inputfile to perl's stdin
  • -0777: forces Perl to slurp the whole file at once instead of line by line
  • -p: forces Perl to print the lines
  • -e: forces Perl to read a line of program from the arguments
  • > outputfile: redirects the content of perl's stdout to outputfile

Regex breakdown:

  • s: asserts to perform a substitution
  • /: starts the search pattern
  • .*[*]+\n: matches any number of any character up to the end of a string ending with one or more * characters immediately followed by a newline character
  • (.*) <np>\n: matches and groups any number of any character up to a space immediately followed by a <np>\n string
  • .*: matches any number of any character
  • /: stops the search pattern / starts the replace pattern
  • $1: replaces with the captured group
  • /: stops the replace pattern / starts the modifiers
  • s: asserts to treat the input string as a single line, forcing . to also match newline characters

Sample output:

~/tmp$ cat inputfile
13.2000000000     , 3*0.00000000000       ,  11.6500000000     , 3*0.00000000000       ,  17.8800000000

Blablabla

  SATELLITE EPHEMERIS
     ===================
Output frame: Mean of J2000

       Epoch                  A            E            I           RA           AofP          TA      Flight Ang
*****************************************************************************************************************
2012/10/01 00:00:00.000     6998.239     0.001233     97.95558     77.41733     89.98551    290.75808    359.93398
2012/10/01 00:05:00.000     6993.163     0.001168     97.95869     77.41920    124.72698    274.57362    359.93327
2012/10/01 00:10:00.000     6987.347     0.001004     97.96219     77.42327    170.94020    246.92395    359.94706
2012/10/01 00:15:00.000     6983.173     0.000893     97.96468     77.42930    224.76158    211.67042    359.97311
 <np>
 ----------------
 Predicted Orbit:
 ----------------

 Blablabla
~/tmp$ < inputfile perl -0777 -pe 's/.*[*]+\n(.*) <np>\n.*/$1/s'
2012/10/01 00:00:00.000     6998.239     0.001233     97.95558     77.41733     89.98551    290.75808    359.93398
2012/10/01 00:05:00.000     6993.163     0.001168     97.95869     77.41920    124.72698    274.57362    359.93327
2012/10/01 00:10:00.000     6987.347     0.001004     97.96219     77.42327    170.94020    246.92395    359.94706
2012/10/01 00:15:00.000     6983.173     0.000893     97.96468     77.42930    224.76158    211.67042    359.97311
~/tmp$ 

Original version:

Using Perl:

< inputfile perl -0777 -pe 's/.*[*]{3}\n(.*\n)\n.*/$1/s' > outputfile
  • < inputfile: redirects the content of inputfile to perl's stdin
  • -0777: forces Perl to slurp the whole file at once instead of line by line
  • -p: forces Perl to print the lines
  • -e: forces Perl to read a line of program from the arguments
  • > outputfile: redirects the content of perl's stdout to outputfile

Regex breakdown:

  • s: asserts to perform a substitution
  • /: starts the search pattern
  • .*[*]{3}\n: matches any number of any character up to the end of a ***\n string
  • (.*\n)\n: matches and groups any number of any character up to a newline character immediately followed by a newline character
  • .*: matches any number of any character
  • /: stops the search pattern / starts the replace pattern
  • $1: replaces with the captured group
  • /: stops the replace pattern / starts the modifiers
  • s: asserts to treat the input string as a single line, forcing . to also match newline characters

Sample output:

~/tmp$ cat inputfile
blablabla
blablabla
***
thingsIwantToRead1
thingsIwantToRead2
thingsIwantToRead3

blablabla
blablabla
~/tmp$ < inputfile perl -0777 -pe 's/.*[*]{3}\n(.*\n)\n.*/$1/s'
thingsIwantToRead1
thingsIwantToRead2
thingsIwantToRead3
~/tmp$ 
Author by

JoVe

Updated on September 18, 2022

Comments

  • JoVe
    JoVe over 1 year

    I have a file like below:

    blablabla
    blablabla
    ***
    thingsIwantToRead1
    thingsIwantToRead2
    thingsIwantToRead3
    
    blablabla
    blablabla
    

    I want to extract the paragraph with thingsIwantToRead. When I had to deal with such a problem, I used AWK like this:

    awk 'BEGIN{ FS="Separator above the paragraph"; RS="" } {print $2}' $file.txt | awk 'BEGIN{ FS="separator below the paragraph"; RS="" } {print $1}'
    

    And it worked.

    In this case, I tried to put FS="***", "\*{3}", "\*\*" (it is not working because AWK treats it like a normal asterisk), "\\*\\*" or whatever regex I could think of, but it's not working (it's printing nothing).

    Do you know why?

    If not, do you know another way to deal with my problem?

    Below an extract of the file I want to parse:

    13.2000000000     , 3*0.00000000000       ,  11.6500000000     , 3*0.00000000000       ,  17.8800000000
    
    Blablabla
    
      SATELLITE EPHEMERIS
         ===================
    Output frame: Mean of J2000
    
           Epoch                  A            E            I           RA           AofP          TA      Flight Ang
    *****************************************************************************************************************
    2012/10/01 00:00:00.000     6998.239     0.001233     97.95558     77.41733     89.98551    290.75808    359.93398
    2012/10/01 00:05:00.000     6993.163     0.001168     97.95869     77.41920    124.72698    274.57362    359.93327
    2012/10/01 00:10:00.000     6987.347     0.001004     97.96219     77.42327    170.94020    246.92395    359.94706
    2012/10/01 00:15:00.000     6983.173     0.000893     97.96468     77.42930    224.76158    211.67042    359.97311
     <np>
     ----------------
     Predicted Orbit:
     ----------------
    
     Blablabla
    

    And I want to extract:

    2012/10/01 00:00:00.000     6998.239     0.001233     97.95558     77.41733     89.98551    290.75808    359.93398
    2012/10/01 00:05:00.000     6993.163     0.001168     97.95869     77.41920    124.72698    274.57362    359.93327
    2012/10/01 00:10:00.000     6987.347     0.001004     97.96219     77.42327    170.94020    246.92395    359.94706
    2012/10/01 00:15:00.000     6983.173     0.000893     97.96468     77.42930    224.76158    211.67042    359.97311
    

    And the command I tried to use to get the numbers after the line of *'s:

    `awk 'BEGIN{ FS="\\*{2,}"; RS="" } {print $2}' file | awk 'BEGIN{ FS="<np>"; RS="" } {print $1}'`
    
    • terdon
      terdon almost 9 years
      Are there *** after the target paragraph?
    • JoVe
      JoVe almost 9 years
      No. In the real file, there is a line containing <np> or ^L depending on the editor I use (nedit or vi), and I don't know what this means...
    • terdon
      terdon almost 9 years
      So, what part of the file do you want to extract? Are the slashes (at the beginning of the * and -- lines) actually part of the line? Do you want the data between **** and <np>? Or until the next blank line?
    • JoVe
      JoVe almost 9 years
      Sorry, no backslashes. I added them when the text was not in a code block and forgot to remove them. Should be ok now, thanks.
    • JoVe
      JoVe almost 9 years
      The file is generated by a routine I have no knowledge of, and depending on the editor I use there is either <np> (with nedit) or ^L (with vi, or less) after the paragraph I want to extract (the numbers after the * line and before <np>), so I don't know how to deal with it......
    • mikeserv
      mikeserv almost 9 years
      What? So does a literal <np> in the file you wish to edit delimit the end of the paragraph - or is it a blank line instead? If you don't know the answer to that question then you have asked the wrong question. You need to first ask: What the hell is <np> anyway?
    • JoVe
      JoVe almost 9 years
      The first problem I encountered was the **** line. The <np> comes after. I don't think it's a literal <np>, it must mean something else.
    • terdon
      terdon almost 9 years
      @JoVe please show us your desired output. It really isn't clear from your description. Make sure to put it in a code block.
    • terdon
      terdon almost 9 years
      OK, see updated answer.
    • kos
      kos almost 9 years
      How many paragraphs (or *** separated sections) are we to expect in your output?
    • kos
      kos almost 9 years
      Also, are *** separated sections always terminated by a <np> line?
    • JoVe
      JoVe almost 9 years
      There is only one paragraph to extract, the one between the line of *'s and the <np> line (a special character I don't know the meaning of). But I need to parse different files of the same type as the file I put as an example.
  • terdon
    terdon almost 9 years
    @mikeserv because the title states "paragraph separated by ***" and because the output only shows the first set of lines. Since the OP refers to it as "the paragraph after ***", I assumed that the *** was between each paragraph. If that's not the case, the OP can use fredtantini's very good answer.
  • JoVe
    JoVe almost 9 years
    The way I was doing it, I treated the entire file as one record (hence the RS=""), and I used FS to read the paragraph I want (read as a field by awk). It worked until I had to deal with a "******" separator. Do you know why ? Even when I double escape the * it's not working. Your solution looks quicker though : can you explain how is it working ? I am not familiar with the -vRS option.
  • terdon
    terdon almost 9 years
    @JoVe the -v lets you set a variable. For example awk -v g="foo" 'BEGIN{print g}' will return foo since that's the value of the variable g. I just wrote it with no space and used it to set the RS variable. What ******? I don't see that in your example. If you show us the actual file you are trying to parse (including this problematic ******), I can update my answer.
  • JoVe
    JoVe almost 9 years
    I edited my post : I put an extract of the file and an example of the awk command I used.
  • terdon
    terdon almost 9 years
    @JoVe please use the formatting tools to format your post. Just paste it directly into your question, select it and then hit the {} button.
  • terdon
    terdon almost 9 years
    @mikeserv why? To avoid the quoting issues?
  • mikeserv
    mikeserv almost 9 years
    It's mostly a hunch, actually, but this is all I know about it. I dunno if awk does anything with * as it does for \t, really. But according to that ENVIRON is one of only two ways to hand it string literals.
  • JoVe
    JoVe almost 9 years
    Thanks for the edited answer, but there are still a few problems left. awk '/\*{4,}/,/<np>/' file is not working but awk '/\*\*\*\*/,/<np>/' file works, why ? <np> seems to be a special character, it is not recognized by awk. But I can parse until "Predicted orbit" and remove the last 3 lines.
  • mikeserv
    mikeserv almost 9 years
    By the way - the tail|head thing doesn't work for anything but the first paragraph - it still prints ***\n.*\n<np> for all but the first paragraph.
  • terdon
    terdon almost 9 years
    @JoVe huh? What is <np>? Is that an actual <, n, p and > or is it supposed to represent something else? There's nothing special about it. It's possible that your awk flavor/version doesn't support {N,} but I can't know since you haven't told us anything about it. What OS are you on?
  • JoVe
    JoVe almost 9 years
    @mikeserv I'm a beginner in all this, so instead of reacting like that, tell me what is stupid in my question so I can learn. In my beginner's head, \*{4,} not working can happen because of 2 things: either it does not recognize * (so I tested it to be sure it recognizes it) or {4,} is not working and I find it strange. How is that stupid ?? Anyway it's not important. To Terdon: Well it's not a <, n, p and >, it is supposed to represent something else but I don't know what. Again, it's not important. Sorry to have wasted your time and thank you for your help !
  • JoVe
    JoVe almost 9 years
    @mikeserv Yes I understand. I'll do better next time. ;) I got around the problem of <np> by doing this: '/\*\*/ {parse=1; next} /^[^2]/ {parse=0} parse==1 {print}' file. Thanks again !
  • terdon
    terdon almost 9 years
    @JoVe OK, so, presumably, your <np> is actually "lines that don't begin with a 2" (that's what you're matching, anyway). Could you please accept one of the answers by clicking on the check mark to the left? That will mark the question as answered and is the way thanks are expressed on the Stack Exchange sites.
  • mikeserv
    mikeserv almost 9 years
    Does this handle the sample data provided?
  • kos
    kos almost 9 years
    @mikeserv Yes, I've added a sample output of the command
  • mikeserv
    mikeserv almost 9 years
    Hmmm... That doesn't look like the sample I see... Oh, I guess it is still here in the question. The op edited to provide a more accurately representative sample. Have you tried it there?
  • kos
    kos almost 9 years
    @mikeserv No I didn't, I actually didn't even notice that the question was updated (I have left this open for a while before answering). It won't work on the updated input file, so I'm editing it accordingly, thanks
  • mikeserv
    mikeserv almost 9 years
    Is the \n(.*) greedy? Will it not edit out all but the last occurrence of a matching paragraph? Or, I guess that should be, will it include everything between paragraph 1 and 2 if there are two possible matches for *[*]*\n(.*)\n<np>?
  • kos
    kos almost 9 years
    @mikeserv Hm, I'm having trouble understanding. It will match only the last occurrence of the target pattern if there are multiple possible matches. This however doesn't seem to be likely to happen, at least judging from the sample input
  • mikeserv
    mikeserv almost 9 years
    Yeah - I'm unclear on that too - the question just really did a 180. I guess I'm still holding onto my initial impression from first reading it - which was how to print only a paragraph in a file following a unique marker. I ask though because it looks like *[*]\n matches the first occurrence of that string in input and (.*) matches everything up to the last \n<np>. And so if there were two - would it not get paragraph1 <np> middle junk \*\*\* paragraph2? I'm useless w/ perl by the way, and so I'm really honestly curious.
  • kos
    kos almost 9 years
    @mikeserv Perl is greedy by default, and in my experience specifying an exact pattern after a greedy pattern (say .*) will match until the last occurrence of the specified pattern and continue matching from there, so in this case the .* at the start itself will match until the last occurrence of [*]+\n; in fact it will only print the last * separated section, and (.*) will match until the last occurrence of <np>\n, which is (presumably) going to be only one per each * separated section
  • kos
    kos almost 9 years
    @mikeserv However I asked OP about this also
  • mikeserv
    mikeserv almost 9 years
    Oh, yeah - i didn't notice the leading .* - so, yeah, that squeezes it down to the last match.