Extracting a specific string after a given string from HTML file using a bash script

Solution 1

I can't sensibly advise doing this, because parsing HTML with regex is not likely to end well, but you might be able to get the string MANIKA with

sed -nr '/MOM:/ s/.*MOM:([^"]+).*/\1/p' file

It works OK on your sample anyway...

Notes

  • -n don't print anything until we ask for it
  • -r use extended regular expressions (ERE)
  • /string/ find lines with string
  • s/old/new/ replace old with new
  • .* any number of any characters
  • ([^"]+) save some characters that are not "
  • \1 backreference to saved characters
  • p print just the lines we changed
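
To see the notes above in action, here is a minimal sketch that runs the same sed command against a temporary file and stores the result in a variable. The temporary file and the variable names `sample` and `name` are assumptions made for illustration; the file content is a shortened copy of the fragment in the question.

```shell
# Hypothetical sample file holding a shortened copy of the fragment from the question.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
<th height="12" bgcolor="#808080"><label for="
 LSCRM:Abhijeet
 MCRM:Bhargav
 TLGAPI:GAURAVAURAV
 MOM:MANIKA"></td>
EOF

# -n: print nothing by default; -r: extended regex (GNU sed).
# /MOM:/ selects the line, s/old/new/ keeps only the captured text, p prints it.
name="$(sed -nr '/MOM:/ s/.*MOM:([^"]+).*/\1/p' "$sample")"
rm -f "$sample"

echo "$name"   # prints MANIKA for this sample
```

Because the capture group is `[^"]+`, the result depends on a `"` following the value, which holds for the sample but may not hold elsewhere.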

Solution 2

grep -Po 'MOM:\K[^"]+' file.html

Warning: this is not a very robust solution, and your HTML is not valid.
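
As a rough illustration of why this is fragile, the sketch below feeds the same pattern two lines: one where the value is followed by `"` and one where it is not. The helper name `extract` and the second input line are assumptions for the example, not part of the question.

```shell
# \K drops everything matched so far (here "MOM:") from the reported match;
# -o prints only the match, -P selects Perl-compatible regex (GNU grep).
extract() { grep -Po 'MOM:\K[^"]+'; }   # hypothetical helper

quoted="$(printf '%s\n' ' MOM:MANIKA"></td><br>' | extract)"
unquoted="$(printf '%s\n' ' MOM:MANIKA</td><br>' | extract)"

echo "$quoted"     # MANIKA
echo "$unquoted"   # MANIKA</td><br>
```

Without a closing `"`, `[^"]+` keeps matching to the end of the line, so trailing markup ends up in the result.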

Solution 3

The string you're looking for always has MOM: before it, but you have not said if it always has " after it. For the purpose of this answer I will assume that you are looking for strings that are permitted to contain any lower or upper case alphabetic characters, numerals, or underscores. These are known as word characters in the terminology of regular expressions. Matching such "words" of text is useful enough that most dialects of regular expressions have features to help do so. If this isn't what you want, you can modify this solution accordingly or you can use the techniques in the other answers.
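
To make the "word characters" assumption concrete, here is a small sketch. The input values (`MANIKA_01`, `ANNA-MARIA`) are made up for illustration. It assumes GNU grep with `-P` and a sed that accepts `-E`; `[[:alnum:]_]` is the POSIX bracket spelling of roughly the same set that `\w` covers.

```shell
line=' MOM:MANIKA_01"></td><br>'   # hypothetical value with a digit and an underscore

# \w+ (Perl-style shorthand) and [[:alnum:]_]+ (POSIX class) match the same text here.
perl_style="$(printf '%s\n' "$line" | grep -oP 'MOM:\K\w+')"
posix_style="$(printf '%s\n' "$line" | sed -nE 's/.*MOM:([[:alnum:]_]+).*/\1/p')"

# A hyphen is not a word character, so the match stops in front of it.
hyphenated="$(printf '%s\n' ' MOM:ANNA-MARIA">' | grep -oP 'MOM:\K\w+')"

echo "$perl_style"    # MANIKA_01
echo "$posix_style"   # MANIKA_01
echo "$hyphenated"    # ANNA
```

If the values can contain hyphens, spaces, or other punctuation, the character class needs to be widened accordingly.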

I echo David Foerster's, Zanna's, and JJoao's wise warnings about parsing HTML with regex and about this not being robust. Please be careful, and consider if what you have requested is really exactly what you want to do. In your example code you assigned the path to the input file to the variable $file, so I will assume this has been done. You've assigned the output of your command to $y, so I will do the same.

With grep

This is similar to JJoao's method, and you can use that method with command substitution as well if the regular expression there is more suited to your needs.

y="$(grep -oPm1 'MOM:\K\w+' "$file")"

-oPm1 is just a more compact way to write -o -P -m 1.

Note that you can also add -m1 to the command in JJoao's answer so it uses only matches from the first line that has any.

If the first line with a match contains multiple matches, this grep method gives you all of them. For example, if that line is MOM:MANIKA MOM:JANE"></td><br> then $y will hold the value:

MANIKA
JANE
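
If only the first match on that line is wanted, one possible approach (an addition here, not part of the original command) is to pipe the output through `head -n 1`. The sketch below uses the example line from above; the variable names are illustrative.

```shell
line='MOM:MANIKA MOM:JANE"></td><br>'

# -m1 stops after the first matching line, but -o still prints every match on it.
all="$(printf '%s\n' "$line" | grep -oPm1 'MOM:\K\w+')"

# head -n 1 keeps just the first of those matches.
first="$(printf '%s\n' "$line" | grep -oPm1 'MOM:\K\w+' | head -n 1)"

echo "$all"     # MANIKA and JANE, one per line
echo "$first"   # MANIKA
```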

With sed

This resembles Zanna's method.

y="$(sed -rn '0,/.*MOM:(\w+).*/ s//\1/p' "$file")"

Besides being enclosed as a command substitution, the differences are that I:

  • stop after the first line that contains a match
  • match one or more word characters (\w+) instead of characters up to a " ([^"]+)
  • consume zero or more arbitrary characters (.*) first, so that MOM: doesn't have to appear at the very beginning of the line
  • use a more compact syntax that avoids writing the pattern twice.

The technique I used for this requires GNU sed, but that's the sed implementation provided in Ubuntu.

If the first line with a match contains multiple matches, this sed method gives you just the last one. From MOM:MANIKA MOM:JANE"></td><br> you get:

JANE
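
The last-match behaviour comes from the greedy leading `.*`. If the first match is wanted instead, one possible variant (a sketch that assumes GNU sed, not the command from this answer) marks the first occurrence with a newline and then strips everything before it:

```shell
line='MOM:MANIKA MOM:JANE"></td><br>'

# Command from the answer: the greedy .* pushes the capture to the last MOM: on the line.
last="$(printf '%s\n' "$line" | sed -rn '0,/.*MOM:(\w+).*/ s//\1/p')"

# Variant: s/// acts on the leftmost MOM:, replacing it and the rest of the
# line with a newline plus the captured word; the second s removes whatever
# preceded that newline. q quits after the first line containing MOM:.
first="$(printf '%s\n' "$line" | sed -rn '/MOM:/{s/MOM:(\w+).*/\n\1/; s/.*\n//; p; q}')"

echo "$last"    # JANE
echo "$first"   # MANIKA
```

The newline works as a marker because a line read by sed never contains one until the script inserts it.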

Author by

Abhijeet Anand

Updated on September 18, 2022

Comments

  • Abhijeet Anand over 1 year

    I have an HTML file momcpy.html from which I want to extract a specific string after a given string. The file content is like:

    <tr><br>
    <th height="12" bgcolor="#808080"><label for="<br>
     LSCRM:Abhijeet<br>
     <br>
     MCRM:Bhargav<br>
     <br>
     TLGAPI:GAURAVAURAV<br>
     <br>
     MOM:MANIKA"></td><br>
    

    This is present on one of the lines of HTML.

    I want to extract MANIKA and store it in a variable. Basically, I want to extract whatever string is present after MOM:, and it could be dynamic.

    I have tried:

    file='/home/websphe/tomcat/webapps/MOM/web/momcpy.html'
    y=$( awk '$1=="MOM:"{print $2}' $file)
    echo "$y"
    

    But that didn't work.

    • Abhijeet Anand over 6 years
      Here I want to extract the string after "MOM:": <tr><br> <th height="12" bgcolor="#808080"><label for="<br> LSCRM:Abhijeet<br> <br> MCRM:Bhargav<br> <br> TLGAPI:GAURAVAURAV<br> <br> MOM:MANIKA">Agenda:</label></th><br>
    • Abhijeet Anand over 6 years
      edited question
    • David Foerster over 6 years
      This document is not well-formed HTML and thus not actually HTML. Element attributes may not contain unescaped < or > and the for attribute of the label element may only contain IDs of other (form) elements. It would also help if you included the full element tree from the root down to the element in question so that one may build a solution based on an actual (X)HTML parser.
  • Abhijeet Anand over 6 years
    That's really appreciated, but I just want the string Manika; here I'm getting mANIKA\nmANIKA</td><br>
  • Abhijeet Anand over 6 years
    Working fine, thank you :)
  • WinEunuuchs2Unix over 5 years
    +1 and I agree about HTML caution. However this answer is great for other applications.