Combine multiple sed commands

Solution 1

Use the -e option. It is specified by POSIX, so it is not limited to GNU sed. From the GNU sed manual:

-e script, --expression=script: Add the commands in script to the set of commands to be run while processing the input.

Each -e adds one more expression to the same script, and sed applies them in order to every input line. (Do not confuse this with the e command, a GNU extension that executes a shell command.)

So in your case you could do:

cat tmp.txt | grep '<td>[0-9]*.[0-9]' \
| sed -e 's/[\t ]//g' \
-e "s/<td>//g" \
-e "s/kB\/s\((.*)\)//g" \
-e "s/<\/td>//g" > traffic.txt

You can also write it in another way as:

grep "<td>.*</td>" tmp.txt | sed 's/<td>\([0-9.]\+\).*/\1/g'

The \+ matches one or more instances, but it is a GNU extension and does not work in non-GNU versions of sed (macOS ships BSD sed, for example).

With help from @tripleee's comment below, this is the most refined version I could get which will work on non-GNU versions of sed as well:

sed -n 's/<td>\([0-9]*\.[0-9]*\).*/\1/p' tmp.txt
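
A quick way to check what a one-pass command of this kind prints is to run it on the sample rows from the question. The sketch below is only an illustration: the temporary file stands in for tmp.txt, the variable name all_cells is made up, and the pattern adds a leading .* and an escaped dot (additions made for this sketch) so that indentation is dropped. Note that it prints every cell, not only the second one of each row.

```shell
#!/bin/sh
# Hypothetical sample file standing in for tmp.txt (rows copied from the question).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<tr class="in">
  <th scope="row">In</th>
  <td>1.2 kB/s (0.0%)</td>
  <td>8.3 kB/s (0.0%) </td>
  <td>3.2 kB/s (0.0%) </td>
</tr>
<tr class="out">
  <th scope="row">Out</th>
  <td>6.7 kB/s (0.6%) </td>
  <td>4.2 kB/s (0.1%) </td>
  <td>1.5 kB/s (0.6%) </td>
</tr>
EOF

# -n suppresses automatic printing; the p flag prints only lines where the
# substitution succeeded, so no separate grep is needed.
all_cells=$(sed -n 's/.*<td>\([0-9]*\.[0-9]*\).*/\1/p' "$sample")

printf '%s\n' "$all_cells"
rm -f "$sample"
```

Picking the second cell of each row still needs an extra step, such as the line-selection or counting approaches in the other solutions.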

As a side note, you could also simply pipe the output of each sed into the next one instead of saving it to a temporary file, which is what I see people generally do for ad-hoc tasks:

  cat tmp.txt | grep '<td>[0-9]*.[0-9]' \
    | sed -e 's/[\t ]//g' \
    | sed "s/<td>//g" \
    | sed "s/kB\/s\((.*)\)//g" \
    | sed "s/<\/td>//g" > traffic.txt

The -e version is more efficient because it runs a single sed process, but the piped version is more convenient to build up step by step, I guess.
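
To make the equivalence concrete, here is a minimal sketch that runs the same four substitutions in three ways on one sample line and stores each result. The variable names are made up for the example, and [[:space:]] is swapped in for [\t ] as an assumption about portability, since \t inside brackets is not understood by every sed.

```shell
#!/bin/sh
# One sample line in the question's format (hypothetical test input).
line='  <td>8.3 kB/s (0.0%) </td>'

# Form 1: several -e expressions, one sed process.
with_e=$(printf '%s\n' "$line" | sed -e 's/[[:space:]]//g' \
    -e 's/<td>//g' \
    -e 's/kB\/s\((.*)\)//g' \
    -e 's/<\/td>//g')

# Form 2: the same commands in one script, separated by semicolons.
with_semicolons=$(printf '%s\n' "$line" \
    | sed 's/[[:space:]]//g; s/<td>//g; s/kB\/s\((.*)\)//g; s/<\/td>//g')

# Form 3: one sed process per command, connected by pipes.
with_pipes=$(printf '%s\n' "$line" \
    | sed 's/[[:space:]]//g' \
    | sed 's/<td>//g' \
    | sed 's/kB\/s\((.*)\)//g' \
    | sed 's/<\/td>//g')

printf '%s\n' "$with_e" "$with_semicolons" "$with_pipes"
```

All three forms apply the commands in the same order to each line, so they should produce the same text; the difference is only in how many processes are started.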

Solution 2

This might work for you (GNU sed):

 sed '/^<tr/,/^<\/tr>/!d;/<td/H;/^<\/tr/!d;x;s/\n//g;s/<td>/\n/2;s/.*\n\(\S*\).*/\1/' file

Explanation:

  • Focus on lines between start <tr> and end </tr> tags. /^<tr/,/^<\/tr>/!d
  • Store <td> lines in the hold space (HS). /<td/H
  • Delete all lines in range except the last. /^<\/tr/!d
  • Swap to HS. x
  • Delete all newlines. s/\n//g
  • Replace 2nd <td> with a newline. s/<td>/\n/2
  • Delete all text in the pattern space (which now holds the former HS contents) except for the first non-space field following the inserted newline, and print it. s/.*\n\(\S*\).*/\1/
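
The steps above can be sketched as a multi-line script with one comment per command. This is the same one-liner spread out for readability, assuming GNU sed (it relies on \S and on \n in the replacement) and assuming <tr> and </tr> start at the beginning of a line. The temporary sample file and the variable name second_cells are made up for the example.

```shell
#!/bin/sh
# Hypothetical sample file (rows copied from the question).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<tr class="in">
  <th scope="row">In</th>
  <td>1.2 kB/s (0.0%)</td>
  <td>8.3 kB/s (0.0%) </td>
  <td>3.2 kB/s (0.0%) </td>
</tr>
<tr class="out">
  <th scope="row">Out</th>
  <td>6.7 kB/s (0.6%) </td>
  <td>4.2 kB/s (0.1%) </td>
  <td>1.5 kB/s (0.6%) </td>
</tr>
EOF

second_cells=$(sed '
  # Drop everything outside a <tr> ... </tr> block.
  /^<tr/,/^<\/tr>/!d
  # Append each <td> line to the hold space.
  /<td/H
  # Discard every line except the closing </tr>.
  /^<\/tr/!d
  # Swap: the collected <td> lines become the pattern space.
  x
  # Join them into a single line.
  s/\n//g
  # Mark the second <td> with a newline.
  s/<td>/\n/2
  # Keep only the non-space run right after that marker.
  s/.*\n\(\S*\).*/\1/
' "$sample")

printf '%s\n' "$second_cells"
rm -f "$sample"
```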

Solution 3

You can use braces to create a block which is operated on by an address or set of addresses:

sed -n '/<td>[0-9]*.[0-9]/ {s/[\t ]//g; s/<td>//g; s/kB\/s\((.*)\)<\/td>//g;p}' tmp.txt

I think that you can probably do something tricky with sed's hold and pattern spaces in order to get the second and fourth lines (I've seen solutions which can undo double-spacing of files this way).
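
One possible shape for such a trick, offered only as a sketch and not as the approach this answer had in mind: nested brace blocks that keep a per-row counter in the hold space (one x character per <td> seen) and print a cell only when the counter reads xx. It assumes each <td> sits on its own line. The sample file and the variable name second_cells are made up for the example.

```shell
#!/bin/sh
# Hypothetical sample file (rows copied from the question).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<tr class="in">
  <th scope="row">In</th>
  <td>1.2 kB/s (0.0%)</td>
  <td>8.3 kB/s (0.0%) </td>
  <td>3.2 kB/s (0.0%) </td>
</tr>
<tr class="out">
  <th scope="row">Out</th>
  <td>6.7 kB/s (0.6%) </td>
  <td>4.2 kB/s (0.1%) </td>
  <td>1.5 kB/s (0.6%) </td>
</tr>
EOF

second_cells=$(sed -n '
  /<tr/{
    # New row: clear the counter kept in the hold space.
    x
    s/.*//
    x
  }
  /<td>/{
    # Bring the counter into the pattern space and add one mark.
    x
    s/^/x/
    /^xx$/!{
      # Not the second cell: restore the line and skip it.
      x
      d
    }
    # Second cell: restore the line, keep the leading number, print it.
    x
    s/.*<td>\([0-9.]*\).*/\1/p
  }
' "$sample")

printf '%s\n' "$second_cells"
rm -f "$sample"
```

The braces do the grouping here: the outer blocks run only on lines matching their address, and the inner block runs only when the counter is not exactly two.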

Solution 4

Your questions about running multiple sed commands appear to have been answered, but sed is the wrong tool for this. Assuming the input format is rigid, <tr> is always at the start of a line, and the td tags you are looking for are always preceded by exactly 2 spaces on the line (this solution can easily be modified if that is not the case), you can do:

awk -F'</?td>' '/^<tr/{i=0} /^  <td/{i++} i==2{print $2}' input-file

The first argument tells awk to split each line on either <td> or </td>, so the data you are interested in becomes the 2nd field. The first clause of the 2nd argument resets the counter i to zero whenever <tr appears at the start of a line. The next increments i each time <td appears after 2 spaces. The last prints the 2nd field for the 2nd <td> line. And the last argument specifies your input file.

Of course, that gives you everything between the <td> tags, which I see is not what you want. To just get the chunk of text between <td> and the first whitespace, try:

awk '/^<tr/{i=0} /^  <td/{i++} i==2{gsub( "<td>", ""); print $1}' input-file
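
If the indentation is not fixed, one way to relax the assumptions is to match <tr and <td> anywhere on the line. The following is a sketch of that modification, not a drop-in replacement: it still assumes one <td> per line, and the sample file and the variable name second_cells are made up for the example. The print is nested inside the <td> rule so that a non-cell line following the second cell cannot trigger it.

```shell
#!/bin/sh
# Hypothetical sample file with mixed indentation.
sample=$(mktemp)
cat > "$sample" <<'EOF'
    <tr class="in">
      <th scope="row">In</th>
      <td>1.2 kB/s (0.0%)</td>
  <td>8.3 kB/s (0.0%) </td>
      <td>3.2 kB/s (0.0%) </td>
    </tr>
<tr class="out">
<th scope="row">Out</th>
<td>6.7 kB/s (0.6%) </td>
<td>4.2 kB/s (0.1%) </td>
<td>1.5 kB/s (0.6%) </td>
</tr>
EOF

second_cells=$(awk '
  # New row: reset the cell counter (the pattern does not match </tr>).
  /<tr/ { i = 0 }
  # Count cells; act only on the second one in the row.
  /<td>/ {
    i++
    if (i == 2) {
      sub(/^.*<td>/, "")   # drop everything up to and including <td>
      print $1             # first whitespace-separated field, e.g. 8.3
    }
  }
' "$sample")

printf '%s\n' "$second_cells"
rm -f "$sample"
```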

Solution 5

[Edit] Thanks to Barton for pointing out the mistake. Corrected version:

cat tmp.txt | grep td | sed 's/<td>\([0-9]\.[0-9]\).*/\1/g' > newtmp.txt
sed -n '2,${p;n;n}' newtmp.txt > final.txt; rm newtmp.txt

The first line will pick out the digit.digit pattern after td on each line.

The second line prints every third line starting from the second line (which effectively gives you the second line out of every group of three in the file).
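
To see the line selection on its own, here is a small sketch on six dummy lines. It compares the p;n;n form with two alternatives that are not part of this answer: the first~step address, which is a GNU sed extension, and an awk test on the line number, which should work with any awk. The variable names are made up for the example.

```shell
#!/bin/sh
# Six dummy lines; b and e are the second of each group of three.
input=$(printf 'a\nb\nc\nd\ne\nf')

# Form used above: print, then consume two more lines without printing.
with_pnn=$(printf '%s\n' "$input" | sed -n '2,${p;n;n}')

# GNU sed only: first~step address, every 3rd line starting at line 2.
with_step=$(printf '%s\n' "$input" | sed -n '2~3p')

# Any awk: print lines whose number leaves remainder 2 when divided by 3.
with_awk=$(printf '%s\n' "$input" | awk 'NR % 3 == 2')

printf '%s\n' "$with_pnn" "$with_step" "$with_awk"
```

Any of these can be attached to the end of a pipeline, which would also avoid the intermediate newtmp.txt file.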

Author by

Marley

Updated on June 04, 2022

Comments

  • Marley
    Marley almost 2 years

    I have the following file:

    <tr class="in">
      <th scope="row">In</th>
      <td>1.2 kB/s (0.0%)</td>
      <td>8.3 kB/s (0.0%) </td>
      <td>3.2 kB/s (0.0%) </td>
    </tr>
    <tr class="out">
      <th scope="row">Out</th>
      <td>6.7 kB/s (0.6%) </td>
      <td>4.2 kB/s (0.1%) </td>
      <td>1.5 kB/s (0.6%) </td>
    </tr>
    

    I want to get the values between each second <td></td> (and save it to a file) like this:

    8.3
    4.2
    

    My code so far:

    # get the lines with <td> tags
    cat tmp.txt | grep '<td>[0-9]*.[0-9]' > tmp2.txt
    
    # delete whitespaces
    sed -i 's/[\t ]//g' tmp2.txt
    
    # remove <td> tag
    cat tmp2.txt | sed "s/<td>//g" > tmp3.txt
    
    # remove "kB/s (0.0%)"
    cat tmp3.txt | sed "s/kB\/s\((.*)\)//g" > tmp4.txt
    
    # remove </td> tag and save to traffic.txt
    cat tmp4.txt | sed "s/<\/td>//g" > traffic.txt
    
    #rm -R -f tmp*
    

    How can I do this the usual way? This code is really noobish.

    Thanks in Advance, Marley