HtmlAgilityPack - remove script and style?


Solution 1

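// Requires: using System.Linq;
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

// ToList() snapshots the matches first, so removing nodes does not disturb the enumeration
doc.DocumentNode.Descendants()
                .Where(n => n.Name == "script" || n.Name == "style")
                .ToList()
                .ForEach(n => n.Remove());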

Solution 2

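You can do this with the HtmlDocument class:

HtmlDocument doc = new HtmlDocument();

doc.LoadHtml(input);

var nodes = doc.DocumentNode.SelectNodes("//style|//script");

// SelectNodes returns null when nothing matches, so guard before removing
if (nodes != null)
    nodes.ToList().ForEach(n => n.Remove());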

Solution 3

Some excellent answers; System.Linq is handy!

For an approach that does not rely on LINQ:

private HtmlAgilityPack.HtmlDocument RemoveScripts(HtmlAgilityPack.HtmlDocument webDocument)
{
    // Get all script nodes (use "//script|//style" to include style nodes as well)
    HtmlAgilityPack.HtmlNodeCollection nodes = webDocument.DocumentNode.SelectNodes("//script");

    // SelectNodes returns null when nothing matches
    if (nodes == null)
        return webDocument;

    // Remove each matched node from the document
    foreach (HtmlAgilityPack.HtmlNode node in nodes)
        node.Remove();

    return webDocument;
}
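
The approaches above can be folded into a single null-safe helper. The following is only a minimal sketch, not code from any of the answers: the class name HtmlCleaner and the method name RemoveNodes are hypothetical, and it assumes the default HtmlAgilityPack behaviour in which SelectNodes returns null when nothing matches. The default XPath also covers comments via //comment(), which the thread asks about further down.

```csharp
using HtmlAgilityPack;

public static class HtmlCleaner // hypothetical helper name
{
    // Removes every node matched by the XPath expression and
    // returns how many nodes were removed.
    public static int RemoveNodes(
        HtmlDocument doc,
        string xpath = "//script|//style|//comment()")
    {
        // By default SelectNodes returns null, not an empty collection,
        // when the expression matches nothing.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
            return 0;

        int removed = 0;
        // The result is a separate collection, so removing nodes from
        // the document does not invalidate this loop.
        foreach (HtmlNode node in nodes)
        {
            node.Remove();
            removed++;
        }
        return removed;
    }
}
```

Calling HtmlCleaner.RemoveNodes(doc) after doc.LoadHtml(html) should leave the document without script, style, or comment nodes; passing "//script" alone reproduces Solution 3.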
Author by

Jacqueline

Updated on August 23, 2022

Comments

  • Jacqueline
    Jacqueline over 1 year

    I'm using the following method to extract text from HTML:

        public string getAllText(string _html)
        {
            string _allText = "";
            try
            {
                HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
                document.LoadHtml(_html);
    
    
                var root = document.DocumentNode;
                var sb = new StringBuilder();
                foreach (var node in root.DescendantNodesAndSelf())
                {
                    if (!node.HasChildNodes)
                    {
                        string text = node.InnerText;
                        if (!string.IsNullOrEmpty(text))
                            sb.AppendLine(text.Trim());
                    }
                }
    
                _allText = sb.ToString();
    
            }
            catch (Exception)
            {
            }
    
            _allText = System.Web.HttpUtility.HtmlDecode(_allText);
    
            return _allText;
        }
    

    The problem is that I also get the contents of script and style tags.

    How can I exclude them?

  • Jacqueline
    Jacqueline over 11 years
    How do I foreach through that?
  • L.B
    L.B over 11 years
    @Jacqueline When you run the above code, all script and style tags will be removed from doc.
  • Jacqueline
    Jacqueline over 11 years
    Ahh, I see. Can it be extended to also remove comments such as <!-- comment -->?
  • L.B
    L.B over 11 years
    @Jacqueline .Where(n => n.Name == "script" || n.Name == "style" || n.Name == "#comment")
  • The Muffin Man
    The Muffin Man over 9 years
    Does the Name property need to be compared case-insensitively? I'm thinking an attacker could have used <SCRIPT>.
  • MonkeyDreamzzz
    MonkeyDreamzzz over 6 years
    Shouldn't it be doc.DocumentNode.SelectNodes("//style|//script").ToList().ForEach(n => n.Remove());?
  • johnw86
    johnw86 about 6 years
    @Rubanov Yeah it should be, I had an extension method so I didn't require the .ToList in my code. Answer updated, thanks.
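
Pulling the thread together, the text extraction from the question could drop script, style, and comment nodes before collecting text. This is a sketch under assumptions rather than a definitive implementation: the names TextExtractor and GetVisibleText are hypothetical, System.Net.WebUtility.HtmlDecode is swapped in for the System.Web.HttpUtility call in the question, and only text nodes are collected instead of every childless node. As far as I know, HtmlAgilityPack exposes element names in lower case through Name, so the case-insensitive comparison is a defensive measure against markup such as <SCRIPT> rather than a requirement.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text;
using HtmlAgilityPack;

public static class TextExtractor // hypothetical helper name
{
    // Node names whose content should not appear in the extracted text.
    private static readonly string[] Skipped = { "script", "style", "#comment" };

    public static string GetVisibleText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ToList() snapshots the lazy Descendants() sequence before
        // any node is removed from the tree.
        var unwanted = doc.DocumentNode.Descendants()
            .Where(n => Skipped.Contains(n.Name, StringComparer.OrdinalIgnoreCase))
            .ToList();
        foreach (var node in unwanted)
            node.Remove();

        var sb = new StringBuilder();
        foreach (var node in doc.DocumentNode.DescendantNodesAndSelf())
        {
            // Only text nodes carry content a reader would see.
            if (node.NodeType != HtmlNodeType.Text)
                continue;

            string text = WebUtility.HtmlDecode(node.InnerText).Trim();
            if (text.Length > 0)
                sb.AppendLine(text);
        }
        return sb.ToString();
    }
}
```

Note that removing "#comment" nodes may also drop a <!DOCTYPE ...> declaration, since HtmlAgilityPack appears to report it as a comment node; that is harmless for text extraction but matters if the cleaned document is saved back out.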