Regex Extract HTML Body
Solution 1
Don't use a regular expression for this - use something like the Html Agility Pack.
This is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you don't have to understand XPath or XSLT to use it). It is a .NET code library that lets you parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to the one System.Xml provides, but for HTML documents (or streams).
Then you can extract the body with an XPath expression.
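As a rough sketch of that approach, assuming the HtmlAgilityPack NuGet package is referenced: the class and method names below (AgilityBodyExtractor, ExtractBody) are made up for illustration and are not part of the library. It loads the markup, selects the body element with the XPath //body, and returns its inner HTML, or null when no body element is found.

```csharp
using HtmlAgilityPack;

// Hypothetical helper for illustration; only HtmlDocument, LoadHtml,
// DocumentNode, SelectSingleNode and InnerHtml come from Html Agility Pack.
public static class AgilityBodyExtractor
{
    public static string ExtractBody(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        return body?.InnerHtml;
    }
}
```

Because the parser works on elements rather than raw text, attributes on the body tag should not need special handling here.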
Solution 2
How about something like this?
It captures everything between the <body> and </body> tags (case-insensitively, due to RegexOptions.IgnoreCase) into a group named theBody. RegexOptions.Singleline makes . match newlines as well, which lets us handle multiline HTML as a single string.
If the HTML does not contain <body></body> tags, the Success property of the match will be false.
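using System.Text.RegularExpressions;

string html = "";
// Populate the html string here

// IgnoreCase matches <BODY> as well as <body>; Singleline lets '.' match newlines.
RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
// Note: this pattern expects a bare <body> tag with no attributes or extra spaces.
Regex regx = new Regex( "<body>(?<theBody>.*)</body>", options );
Match match = regx.Match( html );
if ( match.Success ) {
    string theBody = match.Groups["theBody"].Value;
}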
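If the body tag may carry attributes, one possible variation is sketched below. It is not the original answer's pattern: the class and method names (BodyExtractor, ExtractBody) are hypothetical, and the only change to the regex is allowing anything up to the closing > of the opening tag and optional whitespace inside the closing tag. It still will not cope with every form of malformed HTML.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class BodyExtractor
{
    // <body\b[^>]*> accepts an opening tag with attributes, e.g. <BODY id='content'>.
    // </body\s*> tolerates whitespace before the closing '>'.
    private static readonly Regex BodyRegex = new Regex(
        @"<body\b[^>]*>(?<theBody>.*)</body\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the text between the body tags, or null when they are missing.
    public static string ExtractBody(string html)
    {
        if (html == null) return null;
        Match match = BodyRegex.Match(html);
        return match.Success ? match.Groups["theBody"].Value : null;
    }
}
```

Calling BodyExtractor.ExtractBody on a document without body tags returns null, so the caller can decide whether to fall back to the whole input.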
Comments
-
Bruce Adams, about 2 years ago: How would I use a regex to extract the body from an HTML document, taking into account that the html and body tags might be in uppercase or lowercase, or might not exist?
-
M4N, about 15 years ago: Duplicate of stackoverflow.com/questions/356340/… ?
-
Saif Khan, about 15 years ago: I agree. I've used this and must say it's fast, neat and clean.
-
Darryl, about 13 years ago: Thank you! That's what I strive for.
-
Thomas Amar, over 11 years ago: Great, that does exactly what I needed.
-
Lightweight, over 10 years ago: Thanks for answering the question!
-
Quango, over 10 years ago: A good simple solution, but beware of body tags with spaces or attributes: < body id='content'> would not match.
-
ShaileshDev, over 7 years ago: Please provide a detailed solution.