C# parse HTML with XPath

Solution 1

Try the following XPath, //tr[preceding-sibling::tr[@class='LomakeTaustaVari']]:

var nodes = doc.DocumentNode.SelectNodes("//tr[preceding-sibling::tr[@class='LomakeTaustaVari']]");

It selects every tr element that has a preceding sibling tr with the class LomakeTaustaVari, so the first LomakeTaustaVari row itself is not matched.

Note: if no nodes are found, the SelectNodes method returns null rather than an empty collection.
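
Because of that null return, a foreach directly over the result throws when nothing matches. A minimal sketch of a guard is below; the RowSelector class and SelectRows method are hypothetical names introduced for illustration, not part of HtmlAgilityPack.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class RowSelector
{
    // Hypothetical helper: returns the nodes matched by the XPath,
    // or an empty sequence when SelectNodes returns null (no match).
    public static IEnumerable<HtmlNode> SelectRows(HtmlDocument doc, string xpath)
    {
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return Enumerable.Empty<HtmlNode>();
        }
        return nodes;
    }
}
```

With this in place, foreach (var row in RowSelector.SelectRows(doc, "//tr[@class='LomakeTaustaVari' or preceding-sibling::tr[@class='LomakeTaustaVari']]")) is safe on an empty page, and the "or" predicate also keeps the first LomakeTaustaVari row that the preceding-sibling test alone leaves out.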

Solution 2

If you manage to get a reference to the <tr class="LomakeTaustaVari"> element, I see two possible solutions.

You can navigate to the parent and then find all its <tr> children:

lomakeTaustaVariElement.ParentNode.SelectNodes("tr"); // iterate over these if needed

You can also use NextSibling to get the next <tr>:

var trWithoutClass = lomakeTaustaVariElement.NextSibling;

Note that with the second alternative you may run into issues, because whitespace between elements in the HTML is parsed as a separate text node, so NextSibling may return that text node instead of the next <tr>.

To work around this, keep following NextSibling until you reach a tr element.
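
A minimal sketch of that sibling walk, written as a loop: SiblingHelper and NextRow are hypothetical names introduced here, while NextSibling, NodeType and Name are existing HtmlNode members.

```csharp
using HtmlAgilityPack;

public static class SiblingHelper
{
    // Hypothetical helper: walks forward through the siblings of `start`
    // and returns the first <tr> element, skipping whitespace text nodes
    // and comments. Returns null when no later <tr> sibling exists.
    public static HtmlNode NextRow(HtmlNode start)
    {
        HtmlNode current = start.NextSibling;
        while (current != null &&
               !(current.NodeType == HtmlNodeType.Element && current.Name == "tr"))
        {
            current = current.NextSibling;
        }
        return current;
    }
}
```

Calling SiblingHelper.NextRow(lomakeTaustaVariElement) would then give the unclassed row that follows, or null for the last row of the table.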

Solution 3

This iterates over every tr element in the document. You will probably want a more specific starting node, so that you select only the rows you are interested in.

foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//tr")) 
{
    Console.WriteLine(row.InnerText);     
}
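
One way to narrow the starting node is to select a container first and then query relative to it. The sketch below assumes the caller knows an XPath for the container (the source does not show the surrounding table, so containerXPath is a placeholder), and ScopedRows and RowTexts are hypothetical names. Note the leading dot in ".//tr": an XPath starting with "//" searches from the document root even when SelectNodes is called on a sub-node.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ScopedRows
{
    // Hypothetical helper: returns the trimmed text of every <tr> found
    // beneath the first node matched by containerXPath. Returns an empty
    // list when the container or its rows are not found.
    public static List<string> RowTexts(HtmlDocument doc, string containerXPath)
    {
        var result = new List<string>();

        HtmlNode container = doc.DocumentNode.SelectSingleNode(containerXPath);
        if (container == null)
        {
            return result;
        }

        // ".//tr" is relative to the container; "//tr" would not be.
        HtmlNodeCollection rows = container.SelectNodes(".//tr");
        if (rows == null)
        {
            return result;
        }

        foreach (HtmlNode row in rows)
        {
            result.Add(row.InnerText.Trim());
        }
        return result;
    }
}
```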
Author by

Admin

Updated on March 04, 2021

Comments

  • Admin
    Admin about 3 years

    I'm trying to parse stock exchange information out of an HTML document with a simple piece of C#. The problem is that I can't get my head around the syntax: the tr class="LomakeTaustaVari" rows get parsed out, but how do I get the second kind of row, which has no class on the tr?

    Here's a piece of the HTML; it repeats itself with different values.

    <tr class="LomakeTaustaVari">
        <td><div class="Ensimmainen">12:09</div></td>
        <td><div>MSI</div></td>
        <td><div>POH</div></td>
        <td><div>42</div></td>
        <td><div>64,50</div></td>
    </tr>
    <tr>
        <td><div class="Ensimmainen">12:09</div></td>
        <td><div>SRE</div></td>
        <td><div>POH</div></td>
        <td><div>156</div></td>
        <td><div>64,50</div></td>
    </tr>
    

    My C# code:

    {
        HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
        HtmlAgilityPack.HtmlDocument doc = web.Load ("https://www.op.fi/op/henkiloasiakkaat/saastot-ja-sijoitukset/kurssit-ja-markkinat/markkinat?sivu=alltrades.html&sym=KNEBV.HSE&from=10:00&to=19:00&id=32453");
    
        foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//tr[@class='LomakeTaustaVari']")) 
        {
            Console.WriteLine(row.InnerText);     
        }
        Console.ReadKey();
    }
    
  • Admin
    Admin over 10 years
    Thank you, this did the trick... almost. It skipped the first row, but gave me all the entries when I changed LomakeTaustaVari to the parent class TaulukkoOtsikkorivi (outside the HTML I gave you). This definitely pointed me in the right direction.