HtmlAgilityPack - remove script and style?
Solution 1
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
doc.DocumentNode.Descendants()
.Where(n => n.Name == "script" || n.Name == "style")
.ToList()
.ForEach(n => n.Remove());
Solution 2
You can do so using the HtmlDocument class:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(input);
// SelectNodes returns null when nothing matches, so guard with ?. before calling ToList().
doc.DocumentNode.SelectNodes("//style|//script")?.ToList().ForEach(n => n.Remove());
Solution 3
Some excellent answers; System.Linq is handy!
For a non-LINQ approach:
private HtmlAgilityPack.HtmlDocument RemoveScripts(HtmlAgilityPack.HtmlDocument webDocument)
{
    // Get all script nodes:
    HtmlAgilityPack.HtmlNodeCollection Nodes = webDocument.DocumentNode.SelectNodes("//script");

    // Make sure the result is not null (SelectNodes returns null when nothing matches):
    if (Nodes == null)
        return webDocument;

    // Remove all matched nodes:
    foreach (HtmlNode node in Nodes)
        node.Remove();

    return webDocument;
}
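The approaches above can be combined into one helper. The following is a minimal sketch, not a definitive implementation: the class name HtmlCleaner and the method name RemoveNonContentNodes are hypothetical and not part of HtmlAgilityPack. It also strips comment nodes and compares tag names case-insensitively as a precaution, assuming the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class HtmlCleaner
{
    // Tag names to strip. Compared case-insensitively as a precaution.
    private static readonly string[] UnwantedTags = { "script", "style" };

    public static string RemoveNonContentNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Materialize the matches with ToList() first, so nodes are not
        // removed from the tree while Descendants() is still enumerating it.
        var unwanted = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Comment
                     || UnwantedTags.Contains(n.Name, StringComparer.OrdinalIgnoreCase))
            .ToList();

        foreach (var node in unwanted)
        {
            node.Remove();
        }

        // Serialize the cleaned tree back to markup.
        return doc.DocumentNode.OuterHtml;
    }
}
```

Checking NodeType for comments avoids depending on the "#comment" name string, and the explicit foreach keeps the removal step easy to step through in a debugger.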
Author by
Jacqueline
Updated on August 23, 2022
Comments
-
Jacqueline over 1 year
I'm using the following method to extract text from HTML:
public string getAllText(string _html)
{
    string _allText = "";
    try
    {
        HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
        document.LoadHtml(_html);
        var root = document.DocumentNode;
        var sb = new StringBuilder();
        // Collect the text of every leaf node (nodes without children).
        foreach (var node in root.DescendantNodesAndSelf())
        {
            if (!node.HasChildNodes)
            {
                string text = node.InnerText;
                if (!string.IsNullOrEmpty(text))
                    sb.AppendLine(text.Trim());
            }
        }
        _allText = sb.ToString();
    }
    catch (Exception)
    {
    }
    _allText = System.Web.HttpUtility.HtmlDecode(_allText);
    return _allText;
}
The problem is that I also get the contents of script and style tags.
How can I exclude them?
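One way to approach this without modifying the document is to skip text whose parent element is script or style while collecting text. This is a sketch under stated assumptions, not the accepted method: the names VisibleTextExtractor and GetVisibleText are hypothetical, the HtmlAgilityPack NuGet package is assumed to be referenced, and it assumes script and style contents are exposed as text nodes directly under those elements.

```csharp
using System;
using System.Net;
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class VisibleTextExtractor
{
    public static string GetVisibleText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sb = new StringBuilder();
        foreach (var node in doc.DocumentNode.Descendants())
        {
            // Only text nodes are collected; this also skips comment nodes.
            if (node.NodeType != HtmlNodeType.Text)
                continue;

            // Assumption: script/style contents are text nodes whose
            // direct parent is the <script> or <style> element.
            var parentName = node.ParentNode?.Name;
            if (string.Equals(parentName, "script", StringComparison.OrdinalIgnoreCase)
                || string.Equals(parentName, "style", StringComparison.OrdinalIgnoreCase))
                continue;

            // Decode entities such as &amp; and drop whitespace-only text.
            var text = WebUtility.HtmlDecode(node.InnerText).Trim();
            if (text.Length > 0)
                sb.AppendLine(text);
        }

        return sb.ToString();
    }
}
```

Filtering during traversal leaves the parsed document untouched, which may matter if the same HtmlDocument is reused afterwards.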
-
Jacqueline over 11 years
How do I foreach through that?
-
L.B over 11 years
@Jacqueline When you run the above code, all script and style tags will be removed from doc.
-
Jacqueline over 11 years
Ahh, I see. Can it be extended to also support comments such as <!-- comment -->?
-
L.B over 11 years
@Jacqueline
.Where(n => n.Name == "script" || n.Name == "style" || n.Name == "#comment")
-
The Muffin Man over 9 years
Does the Name property need to be compared case-insensitively? I'm thinking an attacker could have used <SCRIPT>.
-
MonkeyDreamzzz over 6 years
Shouldn't it be
doc.DocumentNode.SelectNodes("//style|//script").ToList().ForEach(n => n.Remove());
?
-
johnw86 about 6 years
@Rubanov Yeah, it should be. I had an extension method, so I didn't need the .ToList in my code. Answer updated, thanks.