How can I extract just the text from the HTML?
Solution 1
You can use the body's InnerText:
string html = @"
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src=""abc.jpg""/>
</body>
</html>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;
Next, you may want to collapse spaces and new lines:
text = Regex.Replace(text, @"\s+", " ").Trim();
Note, however, that while this works in this case, markup such as hello<br>world or hello<i>world</i> will be converted by InnerText to helloworld, removing the tags. That issue is difficult to solve, as the display is often determined by the CSS, not just by the markup.
Solution 2
How about using the XPath expression '//body//text()' to select all text nodes?
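A minimal sketch of that idea with Html Agility Pack, written with top-level statements (C# 9 or later). The helper name ExtractBodyText is hypothetical, not part of the library. It selects every text node under the body, splits each one on whitespace, and joins the words with single spaces, so no regular expression is needed. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

// Hypothetical helper: collects the text nodes under <body> and joins them with spaces.
static string ExtractBodyText(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    // SelectNodes returns null (not an empty collection) when nothing matches.
    var textNodes = doc.DocumentNode.SelectNodes("//body//text()");
    if (textNodes == null)
    {
        return string.Empty;
    }

    char[] whitespace = { ' ', '\t', '\r', '\n' };

    // Decode entities such as &amp;, then split on whitespace so that
    // runs of spaces and line breaks collapse without using Regex.
    var words = textNodes.SelectMany(node =>
        HtmlEntity.DeEntitize(node.InnerText)
            .Split(whitespace, StringSplitOptions.RemoveEmptyEntries));

    return string.Join(" ", words);
}

string sampleHtml = @"
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src=""abc.jpg""/>
</body>
</html>";

Console.WriteLine(ExtractBodyText(sampleHtml));
```

Because each text node is joined with a space, hello<br>world comes out as two words rather than helloworld. The trade-off is that hello<i>world</i> also gets a space it would not have when rendered. If the body contains script or style elements, their contents would likely be selected too and may need to be filtered out.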
Solution 3
Normally, for parsing HTML I would recommend an HTML parser. However, since you only want to remove all the HTML tags, a simple regex should work.
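The answer does not give a pattern, so the following is only one possible sketch of the regex approach, written with top-level statements (C# 9 or later) and using nothing outside the .NET base library. The helper name StripTags and the pattern <[^>]+> are assumptions, not something from the original answer.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: removes anything that looks like a tag, then tidies whitespace.
static string StripTags(string html)
{
    // Replace each tag with a space so that adjacent words do not run together.
    string withoutTags = Regex.Replace(html, @"<[^>]+>", " ");

    // Turn entities such as &amp; back into plain characters.
    string decoded = WebUtility.HtmlDecode(withoutTags);

    // Collapse runs of whitespace into single spaces.
    return Regex.Replace(decoded, @"\s+", " ").Trim();
}

string bodyHtml = "<h1> This is a big title.</h1> How are doing you? <h3> I am fine </h3> <img src=\"abc.jpg\"/>";
Console.WriteLine(StripTags(bodyHtml)); // prints: This is a big title. How are doing you? I am fine
```

This only works on simple markup. Run on a whole document it keeps the title text as well, and it can be confused by script or style contents, comments, and attribute values that contain a > character, which is why a parser is the safer choice in general.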
Solution 4
You can use NUglify that supports text extraction from HTML:
var result = Uglify.HtmlToText("<div> <p>This is <em> a text </em></p> </div>");
Console.WriteLine(result.Code); // prints: This is a text
Because it uses its own HTML5 parser, it should be quite robust (especially if the document doesn't contain any errors), and it is very fast (no regular expressions involved, just a pure recursive descent parser; faster than HtmlAgilityPack and more GC-friendly).
TCM
Updated on July 06, 2022
Comments
-
TCM almost 2 years
I have a requirement to extract all the text that is present in the <body> of the HTML. Sample HTML input:
<html> <title>title</title> <body> <h1> This is a big title.</h1> How are doing you? <h3> I am fine </h3> <img src="abc.jpg"/> </body> </html>
The output should be:
This is a big title. How are doing you? I am fine
I want to use only HtmlAgility for this purpose. No regular expressions please.
I know how to load an HtmlDocument and then get the body contents using an XPath query like '//body'. But how do I strip the HTML as I have shown in the output?
Thanks in advance :)
-
Richard Schneider almost 13 years
Note that "/html/body" for XPath is much faster.
-
ShaileshDev about 7 years
It's giving an error: unable to find the namespace for HtmlDocument.
-
Kobi about 7 years
@Er.ShaileshS.Bankar - Do you have the Html Agility Pack library?
-
ShaileshDev about 7 years
No, do I have to add it first?
-
Xavier Poinas over 6 years
It seems to be using HtmlAgilityPack under the hood, as suggested by the accepted answer.
-
xoofx over 6 years
@XavierPoinas no, NUglify is not using HtmlAgilityPack; it has its own HTML5 custom parser.
-
Xavier Poinas over 6 years
Sorry, you're right. I saw it in the project but it's only there for benchmarking purposes.