How can I extract just text from the html


Solution 1

You can use the body's InnerText:

string html = @"
<html>
    <title>title</title>
    <body>
           <h1> This is a big title.</h1>
           How are doing you?
           <h3> I am fine </h3>
           <img src=""abc.jpg""/>
    </body>
</html>";

// Requires the Html Agility Pack library and "using HtmlAgilityPack;" at the top of the file.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;

Next, you may want to collapse spaces and new lines:

// Requires "using System.Text.RegularExpressions;".
text = Regex.Replace(text, @"\s+", " ").Trim();

Note, however, that while this works in this case, markup such as hello<br>world or hello<i>world</i> will be converted by InnerText to helloworld, with the tags simply removed. That issue is difficult to solve, as display is often determined by the CSS, not just by the markup.

Solution 2

How about using the XPath expression '//body//text()' to select all text nodes?
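
A minimal sketch of that idea, assuming the Html Agility Pack library is referenced; the class and method names (TextExtractor, ExtractBodyText) are made up for illustration. It joins the text nodes with a space and collapses whitespace with Split/Join rather than a regular expression, so hello<br>world comes out as "hello world". Note that it also puts a space between hello and <i>world</i>, and that text inside <script> or <style> elements in the body may be picked up as well.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class TextExtractor
{
    // Hypothetical helper, not part of Html Agility Pack.
    public static string ExtractBodyText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        var textNodes = doc.DocumentNode.SelectNodes("//body//text()");
        if (textNodes == null)
            return string.Empty;

        // Decode entities such as &amp; and join the pieces with a space.
        var joined = string.Join(" ", textNodes.Select(n => HtmlEntity.DeEntitize(n.InnerText)));

        // Split on any whitespace and re-join: collapses runs of spaces and newlines.
        var words = joined.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }
}
```

For the sample input in the question, this should give the single line of text the asker is looking for.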

Solution 3

Normally I would recommend an HTML parser for parsing HTML; however, since you want to remove all HTML tags, a simple regex should work.
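
For completeness, a sketch of what such a regex might look like; the pattern and the helper name are illustrative assumptions, not taken from the answer. It replaces anything that looks like a tag with a space and then collapses whitespace. It does not decode entities, skip <script> content, or cope with a > inside an attribute value, and the question explicitly asks for no regular expressions.

```csharp
using System.Text.RegularExpressions;

static class RegexTextExtractor
{
    // Hypothetical helper: strips tag-like spans, then collapses whitespace.
    public static string StripTags(string html)
    {
        string withoutTags = Regex.Replace(html, @"<[^>]+>", " ");
        return Regex.Replace(withoutTags, @"\s+", " ").Trim();
    }
}
```

Applied to a whole document rather than just the body markup, this would also keep the text of the <title> element.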

Solution 4

You can use NUglify, which supports text extraction from HTML:

var result = Uglify.HtmlToText("<div>  <p>This is <em>   a text    </em></p>   </div>");
Console.WriteLine(result.Code);   // prints: This is a text

Because it uses a custom HTML5 parser, it should be quite robust (especially if the document doesn't contain any errors), and it is very fast: it involves no regular expressions, just a pure recursive descent parser, and it is faster than HtmlAgilityPack and more GC friendly.

Author by

TCM

Updated on July 06, 2022

Comments

  • TCM
    TCM almost 2 years

    I have a requirement to extract all the text that is present in the <body> of the html. Sample Html input :-

    <html>
        <title>title</title>
        <body>
               <h1> This is a big title.</h1>
               How are doing you?
               <h3> I am fine </h3>
               <img src="abc.jpg"/>
        </body>
    </html>
    

    The output should be :-

    This is a big title. How are doing you? I am fine
    

    I want to use only HtmlAgility for this purpose. No regular expressions please.

    I know how to load an HtmlDocument and then, using an XPath expression like '//body', get the body contents. But how do I strip the HTML tags, as I have shown in the output?

    Thanks in advance :)

  • Richard Schneider
    Richard Schneider almost 13 years
    Note that "/html/body" for xpath is much faster.
  • ShaileshDev
    ShaileshDev about 7 years
    It's giving error. Unable to find namespace for HtmlDocument .
  • Kobi
    Kobi about 7 years
    @Er.ShaileshS.Bankar - Do you have the Html Agility Pack library?
  • ShaileshDev
    ShaileshDev about 7 years
    No, do I have to add it first?
  • Xavier Poinas
    Xavier Poinas over 6 years
    It seems to be using HtmlAgilityPack under the hood, as suggested by the accepted answer.
  • xoofx
    xoofx over 6 years
    @XavierPoinas no, NUglify is not using HtmlAgilityPack, it has its own HTML5 custom parser.
  • Xavier Poinas
    Xavier Poinas over 6 years
    Sorry, you're right. I saw it in the project but it's only there for benchmarking purposes.