Extracting a specific string after a given string from HTML file using a bash script
Solution 1
I can't sensibly recommend doing this, because parsing HTML with regex is unlikely to end well, but you might be able to get the string MANIKA with
sed -nr '/MOM:/ s/.*MOM:([^"]+).*/\1/p' file
It works OK on your sample anyway...
Notes
- -n: don't print anything until we ask for it
- -r: use ERE (extended regular expressions)
- /string/: find lines with string
- s/old/new/: replace old with new
- .*: any number of any characters
- ([^"]+): save some characters that are not "
- \1: backreference to saved characters
- p: print just the lines we changed
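As a quick check of the command above, here is a minimal sketch that runs it against a throwaway file. The temporary file and the variable names sample and result are hypothetical, the sample line is a shortened version of the one in the question, and GNU sed is assumed for -r.

```shell
# Hypothetical sample file holding a shortened copy of the line from the question.
sample="$(mktemp)"
printf '%s\n' '<th height="12"><label for="<br> TLGAPI:GAURAVAURAV<br> <br> MOM:MANIKA"></td><br>' > "$sample"

# -n suppresses default output, -r enables ERE,
# and the p flag prints only the lines the substitution changed.
result="$(sed -nr '/MOM:/ s/.*MOM:([^"]+).*/\1/p' "$sample")"
echo "$result"   # MANIKA

rm -f "$sample"
```

The capture stops at the first " after MOM:, which is why the trailing "></td><br> is not part of the result.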
Solution 2
grep -Po 'MOM:\K[^"]+' file.html
Warning: this is not a very robust solution, and your HTML is not valid.
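To see why this is fragile, here is a small sketch, assuming GNU grep built with PCRE support. The two input strings and the variable names are made up for illustration. [^"]+ stops only at a double quote, so if the value is not followed by one, the match runs on to the end of the line.

```shell
# Two hypothetical input lines: one where the value is followed by a quote,
# and one where it is not.
with_quote='MOM:MANIKA"></td><br>'
without_quote='MOM:MANIKA</td><br>'

# \K discards "MOM:" from the reported match; [^"]+ then takes everything up to the next ".
a="$(printf '%s\n' "$with_quote" | grep -Po 'MOM:\K[^"]+')"
b="$(printf '%s\n' "$without_quote" | grep -Po 'MOM:\K[^"]+')"

echo "$a"   # MANIKA
echo "$b"   # MANIKA</td><br>
```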
Solution 3
The string you're looking for always has MOM: before it, but you have not said whether it always has " after it. For the purpose of this answer I will assume that you are looking for strings that are permitted to contain any lower or upper case alphabetic characters, numerals, or underscores. These are known as word characters in the terminology of regular expressions. Matching such "words" of text is useful enough that most dialects of regular expressions have features to help do so. If this isn't what you want, you can modify this solution accordingly or you can use the techniques in the other answers.
I echo David Foerster's, Zanna's, and JJoao's wise warnings about parsing HTML with regex and about this not being robust. Please be careful, and consider whether what you have requested is really exactly what you want to do. In your example code you assigned the path to the input file to the variable $file, so I will assume this has been done. You've assigned the output of your command to $y, so I will do the same.
With grep
This is similar to JJoao's method, and you can use that method with command substitution as well if the regular expression there is more suited to your needs.
y="$(grep -oPm1 'MOM:\K\w+' "$file")"
-oPm1 is just a more compact way to write -o -P -m 1.
- -o prints only the matches, not the whole line.
- -P uses PCRE, which supports \K to drop text matched so far so it's not included in the matched text that is returned.
- -m 1 stops after the first line that matches the pattern. This way, you assign only what was matched on that line to the variable, rather than the matches from every matching line separated by newlines.
Note that you can also add -m1 to the command in JJoao's answer so it uses only matches from the first line that has any.
If the first line with a match contains multiple matches, this grep method gives you all of them. For example, if that line is MOM:MANIKA MOM:JANE"></td><br> then $y will hold the value:
MANIKA
JANE
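If only the very first match is wanted even in that case, one option (not part of the original command, just a sketch) is to pipe the result through head -n 1. The variable names line, all and first are hypothetical, and GNU grep with PCRE support is assumed.

```shell
# The example line from above, with two MOM: values on it.
line='MOM:MANIKA MOM:JANE"></td><br>'

# -m1 limits grep to the first matching line, but -o still prints every match on it.
all="$(printf '%s\n' "$line" | grep -oPm1 'MOM:\K\w+')"

# head -n 1 keeps only the first of those matches.
first="$(printf '%s\n' "$line" | grep -oPm1 'MOM:\K\w+' | head -n 1)"

echo "$first"   # MANIKA
```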
With sed
This resembles Zanna's method.
y="$(sed -rn '0,/.*MOM:(\w+).*/ s//\1/p' "$file")"
Besides being enclosed as a command substitution, the differences are that I:
- stop after the first line that contains a match
- match one or more word characters (\w+) instead of characters up to a " ([^"]+)
- consume zero or more arbitrary characters (.*) first, so that MOM: doesn't have to appear at the very beginning of the line
- use a more compact syntax that avoids writing the pattern twice.
The technique I used for this requires GNU sed, but that's the sed implementation provided in Ubuntu.
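For a system without GNU sed, a rough portable equivalent might look like the sketch below. It is not the command from this answer: it uses POSIX basic regular expressions, approximates \w with the bracket expression [A-Za-z0-9_], and uses q to stop at the first matching line. The sample file contents and the variable name sample are hypothetical.

```shell
# Hypothetical three-line sample; only the second and third lines contain MOM:.
sample="$(mktemp)"
printf '%s\n' '<tr>' '<label for="MOM:MANIKA"></td>' '<label for="MOM:OTHER">' > "$sample"

# BRE syntax: \( \) is a capture group and \{1,\} means "one or more".
# q quits right after the first matching line, so later lines are never processed.
y="$(sed -n '/MOM:[A-Za-z0-9_]/{
s/.*MOM:\([A-Za-z0-9_]\{1,\}\).*/\1/p
q
}' "$sample")"

echo "$y"   # MANIKA

rm -f "$sample"
```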
If the first line with a match contains multiple matches, this sed method gives you just the last one. From MOM:MANIKA MOM:JANE"></td><br> you get:
JANE
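The last match wins because the leading .* is greedy. If the first match on the line is wanted instead, one possible variant (a sketch assuming GNU sed, not something from the original answer) drops the leading .* and uses a temporary newline as a marker. The variable names line, last and first are hypothetical.

```shell
# The example line from above, with two MOM: values on it.
line='MOM:MANIKA MOM:JANE"></td><br>'

# Command from the answer: the greedy .* pushes the match to the last MOM: on the line.
last="$(printf '%s\n' "$line" | sed -rn '0,/.*MOM:(\w+).*/ s//\1/p')"

# Variant: replace everything from the first MOM: onward with a newline plus the word,
# then delete everything up to and including that newline and print the rest.
first="$(printf '%s\n' "$line" | sed -rn '0,/MOM:\w/ { s/MOM:(\w+).*/\n\1/; s/.*\n//p; }')"

echo "$last"    # JANE
echo "$first"   # MANIKA
```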
Abhijeet Anand
Updated on September 18, 2022
Comments
-
Abhijeet Anand over 1 year
I have an HTML file momcpy.html from which I want to extract a specific string after a given string. File content is like:
<tr><br> <th height="12" bgcolor="#808080"><label for="<br> LSCRM:Abhijeet<br> <br> MCRM:Bhargav<br> <br> TLGAPI:GAURAVAURAV<br> <br> MOM:MANIKA"></td><br>
This is present on one of the lines of HTML.
I want to extract Manika and store it in a variable. So basically I want to extract whatever string is present after MOM:, and it could be dynamic. I have tried:
file='/home/websphe/tomcat/webapps/MOM/web/momcpy.html'
y=$( awk '$1=="MOM:"{print $2}' $file)
echo "$y"
But that didn't work.
-
Abhijeet Anand over 6 years
Here I want to extract the string after "MOM:"
<tr><br> <th height="12" bgcolor="#808080"><label for="<br> LSCRM:Abhijeet<br> <br> MCRM:Bhargav<br> <br> TLGAPI:GAURAVAURAV<br> <br> MOM:MANIKA">Agenda:</label></th><br>
-
Abhijeet Anand over 6 years
edited question
-
David Foerster over 6 years
This document is not well-formed HTML and thus not actually HTML. Element attributes may not contain unescaped < or > and the for attribute of the label element may only contain IDs of other (form) elements. It would also help if you included the full element tree from the root down to the element in question so that one may build a solution based on an actual (X)HTML parser.
-
Abhijeet Anand over 6 years
That's really appreciated, but I just want the string Manika; here I'm getting mANIKA\nmANIKA</td><br>
-
Abhijeet Anand over 6 years
working fine, ty :)
-
WinEunuuchs2Unix over 5 years
+1 and I agree about HTML caution. However this answer is great for other applications.