Extracting a specific string after a given string from HTML file using a bash script

command-line bash text-processing

41,383

Solution 1

I can't sensibly advise doing this, because parsing HTML with regex is not likely to end well, but you might be able to get the string MANIKA with

sed -nr '/MOM:/ s/.*MOM:([^"]+).*/\1/p' file

It works OK on your sample anyway...

Notes

-n don't print anything until we ask for it
-r use ERE
/string/ find lines with string
s/old/new/ replace old with new
.* any number of any characters
([^"]+) save some characters that are not "
\1 backreference to saved characters
p print just the lines we changed
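
To try the command without touching a real file, here is a minimal sketch that rebuilds the sample from the question in a temporary file. The variable names `sample` and `name` are only for illustration, and `-r` assumes GNU sed.

```shell
#!/usr/bin/env bash
# Recreate the relevant lines from the question in a temporary file.
sample="$(mktemp)"
printf '%s\n' '<th height="12" bgcolor="#808080"><label for="' \
              ' LSCRM:Abhijeet' \
              ' MCRM:Bhargav' \
              ' TLGAPI:GAURAVAURAV' \
              ' MOM:MANIKA"></td>' > "$sample"

# Same sed command as above, with its output captured in a variable.
name="$(sed -nr '/MOM:/ s/.*MOM:([^"]+).*/\1/p' "$sample")"
echo "$name"   # prints MANIKA for this sample

rm -f "$sample"
```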

Solution 2

grep -Po 'MOM:\K[^"]+' file.html

Warning: this is not a very robust solution, and your HTML is not valid.
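
As a rough illustration of the robustness concern, the sketch below runs the same pattern on two made-up lines (neither comes from the question's file). It assumes a grep built with PCRE support for `-P`, as GNU grep on Ubuntu normally is.

```shell
#!/usr/bin/env bash
# Hypothetical input lines, for illustration only.
quoted='<label for="MOM:MANIKA"></td>'
unquoted='<td>MOM:MANIKA and more text</td>'

# With a closing quote, the match stops where expected.
with_quote="$(printf '%s\n' "$quoted" | grep -Po 'MOM:\K[^"]+')"

# Without one, [^"]+ keeps going to the end of the line.
without_quote="$(printf '%s\n' "$unquoted" | grep -Po 'MOM:\K[^"]+')"

printf '%s\n' "$with_quote" "$without_quote"
```

The first result is MANIKA; the second also swallows the trailing text and markup, which is probably not what you want.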

Solution 3

The string you're looking for always has MOM: before it, but you have not said if it always has " after it. For the purpose of this answer I will assume that you are looking for strings that are permitted to contain any lower or upper case alphabetic characters, numerals, or underscores. These are known as word characters in the terminology of regular expressions. Matching such "words" of text is useful enough that most dialects of regular expressions have features to help do so. If this isn't what you want, you can modify this solution accordingly or you can use the techniques in the other answers.
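
To make the word-character assumption concrete, here is a small sketch using `\w+` on three made-up values (the names are hypothetical, and `grep -P` is assumed to be available):

```shell
#!/usr/bin/env bash
# \w matches letters, digits, and underscores, so a match ends at
# the first character outside that set (here a quote or a hyphen).
name_plain="$(printf '%s\n' 'MOM:MANIKA"></td>' | grep -oP 'MOM:\K\w+')"
name_hyphen="$(printf '%s\n' 'MOM:MARY-ANN"></td>' | grep -oP 'MOM:\K\w+')"
name_under="$(printf '%s\n' 'MOM:MARY_ANN2"></td>' | grep -oP 'MOM:\K\w+')"

printf '%s\n' "$name_plain" "$name_hyphen" "$name_under"
```

This prints MANIKA, MARY, and MARY_ANN2, so a hyphenated name would be cut short under this assumption.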

I echo David Foerster's, Zanna's, and JJoao's wise warnings about parsing HTML with regex and about this not being robust. Please be careful, and consider if what you have requested is really exactly what you want to do. In your example code you assigned the path to the input file to the variable $file, so I will assume this has been done. You've assigned the output of your command to $y, so I will do the same.

With `grep`

This is similar to JJoao's method, and you can use that method with command substitution as well if the regular expression there is more suited to your needs.

y="$(grep -oPm1 'MOM:\K\w+' "$file")"

-oPm1 is just a more compact way to write -o -P -m 1.

-o prints only the matches, not the whole line.
-P uses PCRE, which supports \K to drop text matched so far so it's not included in the matched text that is returned.
-m 1 stops reading the file after the first line that matches. This way, you assign only the matches from that line to the variable, rather than matches from every matching line separated by newlines.

Note that you can also add -m1 to the command in JJoao's answer so it uses only matches from the first line that has any.

If the first line with a match contains multiple matches, this grep method gives you all of them. For example, if that line is MOM:MANIKA MOM:JANE"></td>  then $y will hold the value:

MANIKA
JANE
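
If you want only the first match from that line, one possible approach (not part of the command above) is to pipe the output through `head -n 1`. A minimal sketch, with the example line held in a hypothetical variable `line`:

```shell
#!/usr/bin/env bash
line='MOM:MANIKA MOM:JANE"></td>'

# grep -o prints every match on the line, one per output line.
all_matches="$(printf '%s\n' "$line" | grep -oPm1 'MOM:\K\w+')"

# head -n 1 keeps only the first of them.
first_match="$(printf '%s\n' "$line" | grep -oPm1 'MOM:\K\w+' | head -n 1)"

echo "$first_match"   # prints MANIKA
```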

With `sed`

This resembles Zanna's method.

y="$(sed -rn '0,/.*MOM:(\w+).*/ s//\1/p' "$file")"

Besides being enclosed as a command substitution, the differences are that I:

stop after the first line that contains a match
match one or more word characters (\w+) instead of characters up to a " ([^"]+)
consume zero or more arbitrary characters (.*) first, so that MOM: doesn't have to appear at the very beginning of the line
use a more compact syntax that avoids writing the pattern twice.

The technique I used for this requires GNU sed, but that's the sed implementation provided in Ubuntu.

If the first line with a match contains multiple matches, this sed method gives you just the last one. From MOM:MANIKA MOM:JANE"></td>  you get:

JANE
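
The last-match behaviour comes from the greedy leading `.*`. The sketch below reproduces it, and also shows one possible way to get the first match instead. That second command is my own variation, not something from the other answers; it relies on GNU sed accepting `\n` in the replacement, and the variable names are only illustrative.

```shell
#!/usr/bin/env bash
line='MOM:MANIKA MOM:JANE"></td>'

# The command above: the greedy .* runs up to the last MOM: on the line.
last_match="$(printf '%s\n' "$line" | sed -rn '0,/.*MOM:(\w+).*/ s//\1/p')"

# Variation: mark the first match with a newline, strip everything
# up to that newline, print, and quit after the first matching line.
first_match="$(printf '%s\n' "$line" |
  sed -rn '/MOM:/{s/MOM:(\w+).*/\n\1/;s/.*\n//p;q}')"

printf '%s\n' "$last_match" "$first_match"
```

On this line the first command yields JANE and the second yields MANIKA.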


Abhijeet Anand

Updated on September 18, 2022

Comments

Abhijeet Anand over 1 year
I have an HTML file momcpy.html from which I want to extract a specific string after a given string. File content is like:
```
<tr> 
<th height="12" bgcolor="#808080"><label for=" 
 LSCRM:Abhijeet 
 
 MCRM:Bhargav 
 
 TLGAPI:GAURAVAURAV 
 
 MOM:MANIKA"></td> 
```
This is present on one of the lines of HTML.

I want to extract Manika and store it in a variable. So basically I want to extract whatever string is present after MOM:, and that string could be dynamic.

I have tried:
```
file='/home/websphe/tomcat/webapps/MOM/web/momcpy.html'
 y=$( awk '$1=="MOM:"{print $2}' $file)
 echo "$y"
```
But that didn't work.
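
For reference, the awk attempt prints nothing because awk splits each line on whitespace by default, so the first field is the whole token MOM:MANIKA"></td> and never equals MOM: exactly. Below is a minimal sketch of one way to adapt it, assuming the value always ends at a double quote. Using MOM: as the field separator is a swapped-in technique rather than anything from the attempt above, and the variable names are hypothetical.

```shell
#!/usr/bin/env bash
line=' MOM:MANIKA"></td> '

# Original test: $1 is 'MOM:MANIKA"></td>', so the comparison fails.
original="$(printf '%s\n' "$line" | awk '$1=="MOM:"{print $2}')"

# Split on the literal text MOM:, then cut the second field at the
# first double quote and stop after the first line that has MOM:.
fixed="$(printf '%s\n' "$line" |
  awk -F'MOM:' 'NF > 1 { split($2, a, "\""); print a[1]; exit }')"

echo "original: [$original]  fixed: [$fixed]"
```

With a real file, the last argument to awk would be "$file" instead of reading from the pipe.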
- Abhijeet Anand over 6 years
 
 Here I want to extract the string after "MOM:": <tr> <th height="12" bgcolor="#808080"><label for=" LSCRM:Abhijeet MCRM:Bhargav TLGAPI:GAURAVAURAV MOM:MANIKA">Agenda:</label></th> 
- Abhijeet Anand over 6 years
 
 edited question
- David Foerster over 6 years
 
 This document is not well-formed HTML and thus not actually HTML. Element attributes may not contain unescaped < or > and the for attribute of the label element may only contain IDs of other (form) elements. It would also help if you included the full element tree from the root down to the element in question so that one may build a solution based on an actual (X)HTML parser.
Abhijeet Anand over 6 years

That's much appreciated, but I just want the string Manika; here I'm getting mANIKA\nmANIKA</td>
Abhijeet Anand over 6 years

working fine, ty :)
WinEunuuchs2Unix over 5 years

+1 and I agree about HTML caution. However this answer is great for other applications.