How do I filter all HTML tags except a certain whitelist?

Solution 1

Here's a function I wrote for this task:

static string SanitizeHtml(string html)
{
    string acceptable = "script|link|title";
    string stringPattern = @"</?(?(?=" + acceptable + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>";
    return Regex.Replace(html, stringPattern, "sausage");
}

Edit: For some reason I posted a correction to my previous answer as a separate answer, so I am consolidating them here.

I will explain the regex a bit, because it is a little long.

The first part matches an open bracket and 0 or 1 slashes (in case it's a close tag).

Next you see an if-then construct with a lookahead: (?(?=SomeTag)then|else). It checks whether the next part of the string is one of the acceptable tags. I concatenate the regex string with the acceptable variable, which holds the acceptable tag names separated by vertical bars so that any of them will match. If the lookahead matches, the "then" branch requires the literal word "notag", which no tag would match, so the match fails and the acceptable tag is left alone. Otherwise the "else" branch matches any tag name: [a-zA-Z0-9]+

Next, I want to match 0 or more attributes, which I assume are in the form attribute="value". I group the part representing one attribute, using ?: to keep the group from being captured (for speed): (?:\s[a-zA-Z0-9\-]+=?(?:(["']?).*?\1?)?)

Here I begin with the whitespace character that sits between the tag name and the attribute name, then match an attribute name: [a-zA-Z0-9\-]+

Next I match an optional equals sign, and then an optional quote of either kind. I group the quote so it is captured, and I can use the backreference \1 later to match the same type of quote. Between these two quotes I use the period to match anything, but with the lazy quantifier *? instead of the greedy * so that it only matches up to the next quote that would end this value.

Next we put a * after closing the groups with parentheses so that it matches multiple attribute/value combinations (or none). Last we match optional whitespace with \s*, then 0 or 1 ending slashes for XML-style self-closing tags, and the closing bracket.

You can see I'm replacing the tags with sausage, because I'm hungry, but you could replace them with empty string too to just clear them out.

Solution 2

This is a good working example on html tag filtering:

Sanitize HTML

Solution 3

Attributes are the major problem with using regexes to try to work with HTML. Consider the sheer number of potential attributes, and the fact that most of them are optional, and also the fact that they can appear in any order, and the fact that ">" is a legal character in quoted attribute values. When you start trying to take all of that into account, the regex you'd need to deal with it all will quickly become unmanageable.

What I would do instead is use an event-based HTML parser, or one that gives you a DOM tree that you can walk through.

Solution 4

I just noticed the current solution allows tags that start with any of the acceptable tags. Thus, if "b" is an acceptable tag, "blink" is too. Not a huge deal, but something to consider if you are strict about how you filter HTML. You certainly wouldn't want to allow "s" as an acceptable tag, as it would allow "script".
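
To make the prefix problem concrete, here is a minimal sketch that reuses the pattern from Solution 1 with the tag list passed in. The names PrefixDemo and Strip are hypothetical, and wrapping the list as (?:...)\b is only one possible way to anchor the names, not something the original answer does.

```csharp
using System;
using System.Text.RegularExpressions;

static class PrefixDemo
{
    // Same shape as the pattern in Solution 1; only the tag list varies.
    // Matched (non-whitelisted) tags are replaced with an empty string.
    public static string Strip(string html, string acceptable)
    {
        string pattern = @"</?(?(?=" + acceptable
            + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>";
        return Regex.Replace(html, pattern, "");
    }

    static void Main()
    {
        string html = "<b>bold</b><blink>hi</blink>";

        // "b" also matches the first letter of "blink", so <blink> is treated
        // as acceptable and left in place.
        Console.WriteLine(Strip(html, "b"));

        // Assumed fix: require a word boundary after the tag name, so only
        // the exact name "b" counts as acceptable and <blink> is stripped.
        Console.WriteLine(Strip(html, @"(?:b)\b"));
    }
}
```

With more than one tag, the same idea would be (?:b|i|u)\b, so that the boundary applies to every name in the list rather than only the last one.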

Solution 5

The reason that adding the word boundary \b didn't work is that you didn't put it inside the lookahead. Thus, \b will be attempted after < where it will always match if the < starts an HTML tag.

Put it inside the lookahead like this:

<(?!/?(i|b|h3|h4|a|img)\b)[^>]+>

This also shows how you can put the / before the list of tags, rather than with each tag.
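
As a rough illustration of the difference, the sketch below runs both placements of \b over the same input, using the tag list from the question. The class and method names are made up for the example, and IgnoreCase is set because the question says it is.

```csharp
using System;
using System.Text.RegularExpressions;

static class BoundaryDemo
{
    const string Allowed = "i|b|h3|h4|a|img";

    // \b placed after the lookahead: it is tested right after "<" (or "</"),
    // so it says nothing about where the tag name ends.
    public static string BoundaryOutside(string html) =>
        Regex.Replace(html, @"</?(?!" + Allowed + @")\b[^>]*>", "",
                      RegexOptions.IgnoreCase);

    // \b placed inside the lookahead: the allowed name must end at a word
    // boundary, so "iframe" no longer passes as "i".
    public static string BoundaryInside(string html) =>
        Regex.Replace(html, @"<(?!/?(" + Allowed + @")\b)[^>]+>", "",
                      RegexOptions.IgnoreCase);

    static void Main()
    {
        string html = "<i>ok</i><iframe src=\"x\"></iframe><abbrev>t</abbrev>";

        // <iframe> and <abbrev> survive: they start with "i" and "a".
        Console.WriteLine(BoundaryOutside(html));

        // Only the <i> pair survives.
        Console.WriteLine(BoundaryInside(html));
    }
}
```

Note that this pattern still leaves attributes on the allowed tags untouched; it only decides which tags are removed.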

Author by

Fear605

ASP.NET (VB.NET, C#, SQL Server) developer.

Updated on July 16, 2020

Comments

  • Fear605
    Fear605 almost 4 years

    This is for .NET. IgnoreCase is set and MultiLine is NOT set.

    Usually I'm decent at regex, maybe I'm running low on caffeine...

    Users are allowed to enter HTML-encoded entities (&lt;, &amp;, etc.), and to use the following HTML tags:

    u, i, b, h3, h4, br, a, img
    

    Self-closing <br/> and <img/> are allowed, with or without the extra space, but are not required.

    I want to:

    1. Strip all starting and ending HTML tags other than those listed above.
    2. Remove attributes from the remaining tags, except anchors can have an href.

    My search pattern (replaced with an empty string) so far:

    <(?!i|b|h3|h4|a|img|/i|/b|/h3|/h4|/a|/img)[^>]+>
    

    This seems to be stripping all but the start and end tags I want, but there are three problems:

    1. Having to include the end tag version of each allowed tag is ugly.
    2. The attributes survive. Can this happen in a single replacement?
    3. Tags starting with the allowed tag names slip through. E.g., "<abbrev>" and "<iframe>".

    The following suggested pattern does not strip out tags that have no attributes.

    </?(?!i|b|h3|h4|a|img)\b[^>]*>
    

    As mentioned below, ">" is legal in an attribute value, but it's safe to say I won't support that. Also, there will be no CDATA blocks, etc. to worry about. Just a little HTML.

    Loophole's answer is the best one so far, thanks! Here's his pattern (hoping the PRE works better for me):

    static string SanitizeHtml(string html)
    {
        string acceptable = "script|link|title";
        string stringPattern = @"</?(?(?=" + acceptable + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>";
        return Regex.Replace(html, stringPattern, "sausage");
    }
    

    Some small tweaks I think could still be made to this answer:

    1. I think this could be modified to capture simple HTML comments (those that do not themselves contain tags) by adding "!--" to the "acceptable" variable and making a small change to the end of the expression to allow for an optional trailing "\s--".

    2. I think this would break if there are multiple whitespace characters between attributes (example: heavily-formatted HTML with line breaks and tabs between attributes).

    Edit 2009-07-23: Here's the final solution I went with (in VB.NET):

     Dim AcceptableTags As String = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
     Dim WhiteListPattern As String = "</?(?(?=" & AcceptableTags & _
          ")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
     html = Regex.Replace(html, WhiteListPattern, "", RegExOptions.Compiled)
    

    The caveat is that the HREF attribute of A tags still gets scrubbed, which is not ideal.

  • Fear605
    Fear605 over 15 years
    lol... there's still a comma in the last character range. Thanks for the update! I adjusted the code in the OP.
  • sohtimsso1970
    sohtimsso1970 over 11 years
    The RefactorMyCode website has been down for a while. I believe it's no longer in service.
  • Christian C. Salvadó
    Christian C. Salvadó over 11 years
    @sohtimsso1970, yeah, I hadn't noticed until now. Here's the archived webpage from September 2010: web.archive.org/web/20100901160940/http://refactormycode.com/…
  • Admin
    Admin over 11 years
    Could you please add some explanation of why and how this answers the question?
  • Saber
    Saber over 11 years
    Thanks for the code! Is this code up to date, or should the comma be removed from the expression?
  • BikerP
    BikerP over 10 years
    Just to add a note of caution: my HTML input for this came from an external source, and it had an invalid br tag "<br<", which caused the regex to go into an infinite loop. So make sure you validate your HTML before passing it in.
  • Issac
    Issac almost 9 years
    Works very nicely. I tweaked the regex a bit to include smart tags (mostly Office formats like <o:p>, <st1:*>). string stringPattern = @"</?(?(?=" + acceptable + @")notag|[a-zA-Z0-9]+:?[a-zA-Z0-9]+?)(?:\s[a-zA-Z0-9\-]+=?(?:(["",']?).*?\1?)?)*\s*/?>";
  • Tim Maxey
    Tim Maxey over 8 years
    This solution actually worked for what I needed. I need to strip all html except for a(link) tags... string[] ignorableTags = {"a"}; StripHtml(mytextwithlinks, true, ignorableTags);
  • Tedd Hansen
    Tedd Hansen almost 8 years
    This is a very bad solution. Not only will it mess up your HTML code, but it actually only removes tags if they have a strict closing tag. So simply putting a space after the closing tag will allow malicious code: <script>alert("EXPLOIT");</script > And it attempts to be a blacklister, not a whitelister. So any unknowns will pass through it gladly.
  • Tedd Hansen
    Tedd Hansen almost 8 years
    Same as my comment on accepted answer: Not secure, easily bypassed.
  • Tedd Hansen
    Tedd Hansen almost 8 years
    Looking at the code this is the strictest and best of the regex answers I've seen here. I can't see any immediate flaw in it, although I would recommend against attempting HTML sanitizing with regex.
  • ahjashish
    ahjashish almost 8 years
    Tedd, this answer is 8 years old. If you have a better way, feel free to post your own answer.
  • ahjashish
    ahjashish almost 8 years
    The actual answer should be included in the post. If the link goes bad, this answer becomes worthless. Stack Overflow 101 people.
  • Chirag
    Chirag over 6 years
    I see what you are talking about. Just change the regex to <script[^<]*</script[ \n\t]*>
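
Regarding BikerP's comment above about the regex hanging on "<br<": the likely cause is heavy backtracking in the nested attribute group rather than a true infinite loop, though that is an assumption and not something confirmed in the thread. One way to guard against it in .NET 4.5 or later is the Regex.Replace overload that takes a match timeout. The sketch below is hypothetical (TimeoutDemo and SanitizeWithTimeout are made-up names) and falls back to HTML-encoding the whole input if the timeout is hit.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class TimeoutDemo
{
    // Same pattern shape as Solution 1, with a match timeout added.
    public static string SanitizeWithTimeout(string html, string acceptable, TimeSpan timeout)
    {
        string pattern = @"</?(?(?=" + acceptable
            + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>";
        try
        {
            return Regex.Replace(html, pattern, "", RegexOptions.None, timeout);
        }
        catch (RegexMatchTimeoutException)
        {
            // Fail closed: if matching takes too long, encode everything
            // instead of returning partially filtered markup.
            return WebUtility.HtmlEncode(html);
        }
    }

    static void Main()
    {
        string clean = SanitizeWithTimeout("<p class=\"x\">Hi</p>", "b|i",
                                           TimeSpan.FromSeconds(1));
        Console.WriteLine(clean);
    }
}
```

The one-second timeout is arbitrary; pick a value that suits the size of the input you expect.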