Remove HTML tags from string including &nbsp in C#

142,509

Solution 1

If you can't use an HTML-parser-based solution to filter out the tags, here's a simple regex for it.

string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

Ideally, make a second pass with a regex that collapses runs of multiple whitespace characters:

string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");
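As a minimal sketch, the two passes can be run together on input shaped like the question's. The class name `StripDemo`, the method name `Strip`, and the sample string are illustrative assumptions, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripDemo
{
    // Pass 1 removes tags and literal &nbsp; entities, then trims.
    // Pass 2 collapses any run of two or more whitespace characters.
    public static string Strip(string inputHTML)
    {
        string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();
        return Regex.Replace(noHTML, @"\s{2,}", " ");
    }

    public static void Main()
    {
        string input = "<div>hello</div><div><br></div><div>&nbsp; &nbsp; &nbsp;</div><div>world</div>";
        Console.WriteLine(Strip(input)); // prints "hello world"
    }
}
```

Note that text from adjacent block elements is joined with nothing in between unless the input itself has whitespace there, so `<div>a</div><div>b</div>` comes out as `ab`.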

Solution 2

I took @Ravi Thapliyal's code and wrapped it in a method. It is simple and might not clean everything, but so far it is doing what I need it to do.

public static string ScrubHtml(string value) {
    var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
    var step2 = Regex.Replace(step1, @"\s{2,}", " ");
    return step2;
}

Solution 3

I've been using this function for a while. It removes pretty much any messy HTML you can throw at it and leaves the text intact.

        private static readonly Regex _tags_ = new Regex(@"<[^>]+?>", RegexOptions.Multiline | RegexOptions.Compiled);

        // Add characters that should not be removed to this regex.
        private static readonly Regex _notOkCharacter_ = new Regex(@"[^\w;&#@.:/\\?=|%!() -]", RegexOptions.Compiled);

        public static String UnHtml(String html)
        {
            html = HttpUtility.UrlDecode(html);
            html = HttpUtility.HtmlDecode(html);

            html = RemoveTag(html, "<!--", "-->");
            html = RemoveTag(html, "<script", "</script>");
            html = RemoveTag(html, "<style", "</style>");

            //replace matches of these regexes with space
            html = _tags_.Replace(html, " ");
            html = _notOkCharacter_.Replace(html, " ");
            html = SingleSpacedTrim(html);

            return html;
        }

        private static String RemoveTag(String html, String startTag, String endTag)
        {
            Boolean bAgain;
            do
            {
                bAgain = false;
                Int32 startTagPos = html.IndexOf(startTag, 0, StringComparison.CurrentCultureIgnoreCase);
                if (startTagPos < 0)
                    continue;
                Int32 endTagPos = html.IndexOf(endTag, startTagPos + 1, StringComparison.CurrentCultureIgnoreCase);
                if (endTagPos <= startTagPos)
                    continue;
                html = html.Remove(startTagPos, endTagPos - startTagPos + endTag.Length);
                bAgain = true;
            } while (bAgain);
            return html;
        }

        private static String SingleSpacedTrim(String inString)
        {
            StringBuilder sb = new StringBuilder();
            Boolean inBlanks = false;
            foreach (Char c in inString)
            {
                switch (c)
                {
                    case '\r':
                    case '\n':
                    case '\t':
                    case ' ':
                        if (!inBlanks)
                        {
                            inBlanks = true;
                            sb.Append(' ');
                        }   
                        continue;
                    default:
                        inBlanks = false;
                        sb.Append(c);
                        break;
                }
            }
            return sb.ToString().Trim();
        }

Solution 4

var noHtml = Regex.Replace(inputHTML, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();
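Listing entities one by one only covers the ones you thought of. As a hedged alternative sketch, the entities can be decoded generically with `System.Net.WebUtility.HtmlDecode` after the tags are stripped. This is a different behaviour from the one-liner above: entities such as `&raquo;` are converted to their characters instead of being deleted. The names `EntityDemo` and `StripAndDecode` are illustrative assumptions.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class EntityDemo
{
    public static string StripAndDecode(string inputHTML)
    {
        // Remove tags first, including an unterminated tag at the end of the input.
        string noTags = Regex.Replace(inputHTML, @"<[^>]*(>|$)", string.Empty);

        // Decode named and numeric entities; &nbsp; becomes U+00A0 (no-break space).
        string decoded = WebUtility.HtmlDecode(noTags);

        // In .NET, \s also matches U+00A0, so no-break spaces are collapsed here too.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        Console.WriteLine(StripAndDecode("<p>Tom&nbsp;&amp;&nbsp;Jerry <b>rock"));
        // prints "Tom & Jerry rock"
    }
}
```

Decoding after stripping, rather than before, means escaped markup such as `&lt;b&gt;` survives as literal text instead of being removed as a tag.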

Solution 5

I used @RaviThapliyal's and @Don Rolling's code with one small modification: &nbsp was being replaced with an empty string, but it should be replaced with a space, so I added a separate step for that. It worked like a charm for me.

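public static string FormatString(string value) {
    var step1 = Regex.Replace(value, @"<[^>]+>", "");   // strip tags
    var step2 = Regex.Replace(step1, @"&nbsp;", " ");   // entity becomes a space
    var step3 = Regex.Replace(step2, @"\s{2,}", " ");   // collapse whitespace runs
    // Trim last, so spaces produced from leading or trailing &nbsp; are removed too.
    return step3.Trim();
}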

(I wrote &nbsp without the semicolon in the text above because Stack Overflow would otherwise render the entity.)



Author: rampuriyaaa

Updated on April 09, 2020

Comments

  • rampuriyaaa
    rampuriyaaa about 4 years

    How can I remove all the HTML tags, including &nbsp, using regex in C#? My string looks like:

      "<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"
    
  • Jimmy
    Jimmy about 10 years
    Just to confirm: the SingleSpacedTrim() function does the same thing as string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " "); from Ravi Thapliyal's answer?
  • David S.
    David S. about 10 years
    @Jimmy as far as I can see, that regex doesn't catch single tabs or newlines like SingleSpacedTrim() does. That could be a desirable effect though, in that case just remove the cases as needed.
  • Mahesh Malpani
    Mahesh Malpani about 9 years
    Regex.Replace(inputHTML, @"<[^>]+>|&nbsp|\n;", "").Trim(); \n is not getting removed
  • Ravi K Thapliyal
    Ravi K Thapliyal about 9 years
    @MaheshMalpani I tried it and it works with newlines too. Try using \r or \r\n instead, because your input may be coming from a non-Unix platform.
  • Tauseef
    Tauseef almost 9 years
    I would recommend adding a space rather than an empty string, since we are collapsing extra spaces anyway: Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", " ")
  • Ravi K Thapliyal
    Ravi K Thapliyal over 8 years
    @Tauseef If you use a space in the first replace call, you may end up leaving spaces where there were none in the original input. Say you receive Sound<b>Cloud</b> as an input; you'll end up with Sound Cloud while it should've been stripped as SoundCloud because that's how it gets displayed in HTML.
  • Ehsan88
    Ehsan88 over 5 years
    @Revious I think you are right. Maybe my answer is not related much to the OP's question as they did not mention the purpose of removing html tags. But if the purpose is to prevent attacks, as it is in many cases, then using an already developed sanitizer may be a better approach. BTW I have no knowledge about what the meaning of normalizing html is.
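
Two points from the comments above can be checked directly: replacing tags with a space splits text that HTML renders as one word, and `\s{2,}` leaves a single tab or newline untouched where `\s+` would normalise it. The following is a small sketch of both; `CommentDemo` and `StripWith` are illustrative names, not code from any answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentDemo
{
    // Strip tags and literal &nbsp; entities, substituting the given replacement.
    public static string StripWith(string html, string replacement) =>
        Regex.Replace(html, @"<[^>]+>|&nbsp;", replacement).Trim();

    public static void Main()
    {
        // An empty replacement keeps inline-formatted words together.
        Console.WriteLine(StripWith("Sound<b>Cloud</b>", ""));  // prints "SoundCloud"
        // A space replacement introduces a gap that the rendered HTML never had.
        Console.WriteLine(StripWith("Sound<b>Cloud</b>", " ")); // prints "Sound Cloud"

        // \s{2,} needs at least two whitespace characters, so a lone tab survives.
        Console.WriteLine(Regex.Replace("a\tb", @"\s{2,}", " ") == "a\tb"); // prints "True"
        // \s+ also rewrites single tabs and newlines to a space.
        Console.WriteLine(Regex.Replace("a\tb", @"\s+", " "));  // prints "a b"
    }
}
```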
