How can I extract just the text from HTML?

c# html-agility-pack

Solution 1

You can use the body's InnerText:

```
using HtmlAgilityPack;

string html = @"
<html>
    <title>title</title>
    <body>
           <h1> This is a big title.</h1>
           How are doing you?
           <h3> I am fine </h3>
           <img src=""abc.jpg""/>
    </body>
</html>";

// Parse the markup and read the text content of the <body> element.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;
```

Next, you may want to collapse spaces and new lines:

```
text = Regex.Replace(text, @"\s+", " ").Trim();
```

Note, however, that while this works in this case, markup such as hello<br>world or hello<i>world</i> will be converted by InnerText to helloworld: the tags are removed without leaving any separator. That issue is difficult to solve in general, as display is often determined by the CSS, not just by the markup.

Solution 2

How about using the XPath expression '//body//text()' to select all text nodes?
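
A minimal sketch of that idea, assuming the Html Agility Pack package is referenced. The class and method names (`BodyTextExtractor`, `GetBodyText`) are hypothetical and only for illustration. It selects the text nodes under the body, decodes entities, and joins the pieces with single spaces, without using regular expressions:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class BodyTextExtractor
{
    // Hypothetical helper: collects every text node under <body>.
    public static string GetBodyText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection textNodes = doc.DocumentNode.SelectNodes("//body//text()");
        if (textNodes == null)
        {
            return string.Empty;
        }

        // Decode entities such as &amp; and join the nodes with a space,
        // so that hello<br>world does not collapse into helloworld.
        string joined = string.Join(" ",
            textNodes.Select(n => HtmlEntity.DeEntitize(n.InnerText)));

        // Splitting on a null separator splits on any whitespace; this
        // collapses runs of spaces and newlines without a regex.
        string[] words = joined.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }

    public static void Main()
    {
        string html = @"
<html>
    <title>title</title>
    <body>
           <h1> This is a big title.</h1>
           How are doing you?
           <h3> I am fine </h3>
           <img src=""abc.jpg""/>
    </body>
</html>";

        Console.WriteLine(GetBodyText(html));
    }
}
```

Joining with a space is a design choice with a trade-off: it separates hello<br>world correctly, but it also turns hello<i>world</i> into "hello world". Any script or style elements inside the body would likely need to be filtered out separately.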

Solution 3

Normally I would recommend an HTML parser for parsing HTML; however, since you only want to remove all HTML tags, a simple regex should work.
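
The answer does not give the regex it has in mind, so the following is only one possible sketch, using the standard library alone. The pattern `<[^>]+>` and the helper name `StripTags` are assumptions, not the answerer's code. It will misbehave on input such as comments or attribute values that contain `>`, which is the usual argument for a real parser:

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class RegexTextExtractor
{
    // Hypothetical helper: strips tags with regular expressions only.
    public static string StripTags(string html)
    {
        // Drop <script> and <style> elements together with their content.
        string withoutCode = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Replace each remaining tag with a space so neighbouring words stay apart.
        string withoutTags = Regex.Replace(withoutCode, @"<[^>]+>", " ");

        // Decode entities such as &amp; and collapse whitespace runs.
        string decoded = WebUtility.HtmlDecode(withoutTags);
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        // Only the body fragment is passed in; this sketch does not locate <body> itself.
        string body = "<body><h1> This is a big title.</h1> How are doing you? "
                    + "<h3> I am fine </h3><img src=\"abc.jpg\"/></body>";

        Console.WriteLine(StripTags(body));
        // prints: This is a big title. How are doing you? I am fine
    }
}
```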

Solution 4

You can use NUglify, which supports text extraction from HTML:

```
using System;
using NUglify;

var result = Uglify.HtmlToText("<div>  <p>This is <em>   a text    </em></p>   </div>");
Console.WriteLine(result.Code);   // prints: This is a text
```

Because it uses a custom HTML5 parser, it should be quite robust (especially if the document doesn't contain any errors) and very fast: no regex is involved, just a pure recursive descent parser, which is faster than HtmlAgilityPack and more GC-friendly.

Author by

TCM

Updated on July 06, 2022

Comments

TCM almost 2 years
I have a requirement to extract all the text that is present in the <body> of the HTML. Sample HTML input:
```
<html>
    <title>title</title>
    <body>
           <h1> This is a big title.</h1>
           How are doing you?
           <h3> I am fine </h3>
           <img src="abc.jpg"/>
    </body>
</html>
```
The output should be:
```
This is a big title. How are doing you? I am fine
```
I want to use only Html Agility Pack for this purpose. No regular expressions, please.

I know how to load an HtmlDocument and then use an XPath query like '//body' to get the body contents. But how do I strip the HTML as I have shown in the output?

Thanks in advance :)
Richard Schneider almost 13 years

Note that "/html/body" for XPath is much faster.
ShaileshDev about 7 years

It's giving an error: unable to find the namespace for HtmlDocument.
Kobi about 7 years

@Er.ShaileshS.Bankar - Do you have the Html Agility Pack library?
ShaileshDev about 7 years

No, do I have to add it first?
Xavier Poinas over 6 years

It seems to be using HtmlAgilityPack under the hood, as suggested by the accepted answer.
xoofx over 6 years

@XavierPoinas no, NUglify is not using HtmlAgilityPack, it has its own HTML5 custom parser.
Xavier Poinas over 6 years

Sorry, you're right. I saw it in the project but it's only there for benchmarking purposes.