HTMLAgilityPack iterate all text nodes only
Solution 1
Something like this:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourHtmlFile);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()[normalize-space(.) != '']"))
{
    Console.WriteLine(node.InnerText.Trim());
}
This will output:
Select your Age:
0 to 10
20 and above
Help/Hints:
This is required field.
Make sure select the right age.
Learn More
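One caveat with the loop above: as far as I know, SelectNodes returns null rather than an empty collection when the XPath matches nothing, so a document with no text nodes would throw in the foreach. Also, InnerText on a text node keeps HTML entities such as &amp;amp; in their encoded form. Here is a minimal sketch that guards against both. The class and method names (TextNodeSketch, GetTextNodes) are made up for illustration and are not part of HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TextNodeSketch
{
    // Hypothetical helper: returns the trimmed, entity-decoded content
    // of every non-blank text node in the given HTML string.
    public static List<string> GetTextNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // SelectNodes yields null (not an empty list) when nothing matches.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//text()[normalize-space(.) != '']");
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            // DeEntitize turns "&amp;" into "&", "&lt;" into "<", and so on.
            result.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return result;
    }
}
```

Calling TextNodeSketch.GetTextNodes(html) and writing each entry with Console.WriteLine should give the same kind of listing as above, without the null check leaking into the calling code.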
Solution 2
I tested @Simon Mourier's answer on the Google home page and got lots of CSS and JavaScript in the output, so I added an extra filter to remove it:
// Requires: using System.Linq; using HtmlAgilityPack;
public string getBodyText(string html)
{
    string str = "";
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    // Remove script & style nodes (ToList() takes a snapshot so removing while iterating is safe)
    doc.DocumentNode.Descendants().Where(n => n.Name == "script" || n.Name == "style").ToList().ForEach(n => n.Remove());

    // Simon Mourier's answer; SelectNodes returns null when nothing matches,
    // so check for that instead of swallowing every exception
    var textNodes = doc.DocumentNode.SelectNodes("//text()[normalize-space(.) != '']");
    if (textNodes != null)
    {
        foreach (HtmlNode node in textNodes)
        {
            str += node.InnerText.Trim() + " ";
        }
    }
    return str;
}
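If you would rather not modify the document, a possible alternative is to skip text nodes whose parent is a script or style element instead of removing those elements first. The sketch below is one way to do that, not the approach from the answer above. It uses Descendants() with a NodeType check rather than XPath and a StringBuilder rather than string concatenation. BodyTextSketch and GetVisibleText are hypothetical names.

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

public static class BodyTextSketch
{
    // Hypothetical helper: joins all non-blank text nodes with single spaces,
    // ignoring text that sits directly inside <script> or <style>.
    public static string GetVisibleText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sb = new StringBuilder();
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            // Descendants() also yields elements and comments; keep text only.
            if (node.NodeType != HtmlNodeType.Text)
            {
                continue;
            }

            string parent = node.ParentNode.Name;
            if (parent == "script" || parent == "style")
            {
                continue;
            }

            string text = node.InnerText.Trim();
            if (text.Length == 0)
            {
                continue;
            }

            if (sb.Length > 0)
            {
                sb.Append(' ');
            }
            sb.Append(text);
        }
        return sb.ToString();
    }
}
```

Because nothing is removed, the same HtmlDocument can still be used afterwards for other queries.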
Author: Sha Le
Updated on June 27, 2022
Comments
-
Sha Le, almost 2 years ago
Here is an HTML snippet. All I want is to get only the text nodes and iterate over them. Please let me know. Thanks.
<div>
  <div>
    Select your Age:
    <select>
      <option>0 to 10</option>
      <option>20 and above</option>
    </select>
  </div>
  <div>
    Help/Hints:
    <ul>
      <li>This is required field.
      <li>Make sure select the right age.
    </ul>
    <a href="#">Learn More</a>
  </div>
</div>
Result:
- Select your Age:
- 0 to 10
- 20 and above
- Help/Hints:
- This is required field.
- Make sure select the right age.
- Learn More
-
8oris, about 2 years ago
Trying to implement your code, I got a "BC30491: Expression does not produce a value" error on n.Remove()