How can I strip HTML tags from a string in ASP.NET?

Solution 1

If all you need is to strip every HTML tag from a string, a regex handles this reliably as well. Replace:

<[^>]*(>|$)

with the empty string, globally. Don't forget to normalize the string afterwards, replacing:

[\s\r\n]+

with a single space, and trimming the result. Optionally decode any HTML character entities back to the actual characters.

Note:

  1. There is a limitation: HTML and XML allow > in attribute values. When the regex meets such a value, it ends the match early and leaves the rest of the tag in the output.
  2. The solution is technically safe, as in: the result will never contain anything that could be used for cross-site scripting or to break a page layout, provided entities are not decoded afterwards. Decoded text can contain < and > again and must be stripped or re-encoded before it is written to a page. It is just not very clean.
  3. As with all things HTML and regex:
    Use a proper parser if you must get it right under all circumstances.
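
The steps above (strip tags, optionally decode entities, normalize whitespace) can be sketched in C# roughly as follows. This is a minimal sketch, not a definitive implementation: the class name HtmlText, the method name StripTags and the decodeEntities parameter are made up for illustration, and using WebUtility.HtmlDecode for the optional decoding step is an assumption taken from the comments rather than from this answer. With decodeEntities set to true, the result can contain < and > again, so it should be encoded (for example with WebUtility.HtmlEncode) before being written into a page.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class HtmlText
{
    // A tag, or an unterminated tag that runs to the end of the input.
    private static readonly Regex Tag = new Regex(@"<[^>]*(>|$)");

    // Any run of whitespace, including \r and \n.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string StripTags(string html, bool decodeEntities = false)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // Step 1: remove everything that looks like a tag.
        string text = Tag.Replace(html, string.Empty);

        // Step 2 (optional): turn entities such as &amp; back into characters.
        // The decoded text may contain '<' and '>' again.
        if (decodeEntities)
        {
            text = WebUtility.HtmlDecode(text);
        }

        // Step 3: collapse whitespace runs to one space and trim the ends.
        return Whitespace.Replace(text, " ").Trim();
    }
}
```

For the example in the question, HtmlText.StripTags("<ul><li>Hello</li></ul>") returns "Hello".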

Solution 2

Go download HtmlAgilityPack, now! ;)

This allows you to load and parse HTML. Then you can navigate the DOM and extract the inner text of all nodes. Seriously, it will take you about 10 lines of code at the maximum. It is one of the greatest free .NET libraries out there.

Here is a sample:

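            // resultsStream is the caller's stream containing the HTML.
            string htmlContents = new System.IO.StreamReader(resultsStream, System.Text.Encoding.UTF8, true).ReadToEnd();

            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(htmlContents);

            // Concatenate the text content of every top-level node.
            var builder = new System.Text.StringBuilder();
            foreach (var node in doc.DocumentNode.ChildNodes)
            {
                builder.Append(node.InnerText);
            }
            string output = builder.ToString();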

Solution 3

Regex.Replace(htmlText, "<.*?>", string.Empty);
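
One caveat with this pattern, raised in the comments: by default `.` does not match a newline, so a tag that spans more than one line is left in place unless the regex runs with RegexOptions.Singleline. The sketch below only illustrates that difference; the class name LazyTagRegexDemo and the method name Strip are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class LazyTagRegexDemo
{
    // Applies the lazy pattern from this answer with the given options.
    public static string Strip(string html, RegexOptions options)
    {
        return Regex.Replace(html, "<.*?>", string.Empty, options);
    }
}
```

Given the input "<a\nhref='x'>link</a>", Strip with RegexOptions.None removes only the closing tag and leaves the opening tag in the output, while Strip with RegexOptions.Singleline returns "link".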

Solution 4

protected string StripHtml(string Txt)
{
    return Regex.Replace(Txt, "<(.|\\n)*?>", string.Empty);
}    

Protected Function StripHtml(Txt as String) as String
    Return Regex.Replace(Txt, "<(.|\n)*?>", String.Empty)
End Function

Solution 5

I've posted this on the asp.net forums, and it still seems to be one of the easiest solutions out there. I won't guarantee it's the fastest or most efficient, but it's pretty reliable. In .NET you can use the HTML Web Control objects themselves. All you really need to do is insert your string into a temporary HTML object such as a DIV, then use the built-in 'InnerText' to grab all text that is not contained within tags. See below for a simple C# example:


System.Web.UI.HtmlControls.HtmlGenericControl htmlDiv = new System.Web.UI.HtmlControls.HtmlGenericControl("div");
htmlDiv.InnerHtml = htmlString;
String plainText = htmlDiv.InnerText;

Author: daniel

Updated on January 15, 2020

Comments

  • daniel
    daniel over 4 years

    Using ASP.NET, how can I strip the HTML tags from a given string reliably (i.e. not using regex)? I am looking for something like PHP's strip_tags.

    Example:

    <ul><li>Hello</li></ul>

    Output:

    "Hello"

    I am trying not to reinvent the wheel, but I have not found anything that meets my needs so far.

  • Axarydax
    Axarydax about 13 years
    this doesn't seem to work, I tested it with simple InnerHtml="<b>foo</b>"; and InnerText has value "<b>foo</b>" :(
  • jessehouwing
    jessehouwing about 12 years
    you can even query every text() node, trim the contents and string.Join those with space. IEnumerable<string> allText = doc.DocumentNode.SelectNodes("//text()").Select(n => n.InnerText.Trim())
  • jessehouwing
    jessehouwing about 12 years
    or simply use doc.DocumentNode.InnerText, though this has some issues with whitespace handling it seems...
  • avesse
    avesse about 12 years
    Why the if (doc == null) check? This is always false, not so?
  • Yahoo Serious
    Yahoo Serious over 11 years
    Although not requested, I think a lot of readers will also want to strip HTML encoding, like &quot;. I combine it with WebUtility.HtmlDecode for that (which in turn will not remove tags). Use it after tag removal, since it may rewrite &gt; and &lt;. E.g. WebUtility.HtmlDecode(Regex.Replace(myTextVariable, "<[^>]*(>|$)", string.Empty))
  • ChrisF
    ChrisF about 11 years
    Has many issues: it doesn't deal with attributes having < or > in them, and it doesn't do well with tags that span more than one line unless run with RegexOptions.Singleline.
  • ChrisF
    ChrisF about 11 years
    Doesn't work for lots of cases including non-unix linebreaks.
  • Sven Grosen
    Sven Grosen almost 10 years
    As @Serpiton points out, there isn't such a method in the BCL. Could you point to an implementation of this method or provide your own?
  • SearchForKnowledge
    SearchForKnowledge about 9 years
    @YahooSerious Thank you for providing an example. This works great. Thank you.
  • Rama
    Rama almost 9 years
    Don't do this. This solution injects un-encoded html directly into the output. This would leave you wide open to Cross Site Scripting attacks - you have just allowed anyone that can change the html string to inject any arbitrary html and javascript into your application!
  • Bojangles
    Bojangles almost 9 years
    Html Agility Pack is the way to go, I used it way back in webforms to strip entire web pages to use content!
  • Lemdor
    Lemdor over 7 years
    Totally a newbie at this. How would I implement the above WebUtility.HtmlDecode code in my source? I am using CKEditor.
  • Admin
    Admin about 7 years
    @YahooSerious this will let an XSS vector in, however: &lt;script&gt;alert("XSS");&lt;/script&gt; will not be sanitized by the regex, but HtmlDecode converts it to <script>alert("XSS");</script>
  • Tomalak
    Tomalak about 7 years
    @Heather Very good point. HTML tag stripping would have to be done again after entity decoding.
  • Paul Kienitz
    Paul Kienitz about 6 years
    Noooo, use "<[^>]*>".