HTML Agility Pack - using XPath to get a single node - Object Reference not set to an instance of an object

Solution 1

You can't rely on developer tools such as Firebug or Chrome's inspector to determine the XPath for the nodes you're after, because the XPath those tools give you corresponds to the in-memory HTML DOM, while the Html Agility Pack only knows about the raw HTML sent back by the server.

What you need to do is look at what's actually sent back (or just do a view source). You'll see there is no TBODY element, for example. So you want to find something distinctive in the markup and build on it, using XPath axes for example. Also, your XPath, even if it worked, would not be very resistant to changes in the document, so you need to find something more "stable" for the scraping to be more future-proof.
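The TBODY point can be checked with a small sketch. The markup below is hypothetical (it is not taken from the page in the question) and simply has no TBODY, as raw server output often doesn't; `TbodyDemo` and `CellText` are illustrative names. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class TbodyDemo
{
    // Hypothetical raw markup: a table with no <tbody>, as a server might send it.
    public const string RawHtml =
        "<html><body><table><tr><td>Low</td><td>1640.25</td></tr></table></body></html>";

    // Returns the inner text of the first match, or null when the XPath matches nothing.
    public static string CellText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node?.InnerText;
    }

    public static void Main()
    {
        // Browser-style path: relies on a <tbody> that is not in the raw markup.
        Console.WriteLine(CellText(RawHtml, "//table/tbody/tr/td[2]") ?? "(no match)");

        // Path written against the raw markup.
        Console.WriteLine(CellText(RawHtml, "//table/tr/td[2]") ?? "(no match)");
    }
}
```

The first lookup should find nothing, because the parser works on the markup as sent and does not add the TBODY a browser would show; the second should find the cell.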

Here is some code that seems to work:

HtmlNode node = doc.DocumentNode.SelectSingleNode("//td[@class='dnTableCell']//a[text()='High']/../../td[3]");

This is what it does:

  • find a TD element with a CLASS attribute set to 'dnTableCell'. The // token means the search covers all descendants at any depth in the hierarchy.
  • beneath it, find an A element whose text (inner text) equals 'High'.
  • navigate two parents up (this gets us to the closest TR element).
  • select the 3rd TD element from there.
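The steps above can be tried against a small sample, together with a null check so that a failed match does not end in "Object reference not set to an instance of an object". This is only a sketch: the sample row is made up to mirror the structure the XPath expects, and `HighLookup` and `TryGetHigh` are hypothetical names, not part of the Html Agility Pack.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class HighLookup
{
    // Made-up row shaped like the XPath expects: a label link, then two value cells.
    public const string SampleHtml =
        "<html><body><table><tr>" +
        "<td class='dnTableCell'><a href='#'>High</a></td>" +
        "<td class='dnTableCell'>1650.00</td>" +
        "<td class='dnTableCell'>1655.25</td>" +
        "</tr></table></body></html>";

    // Returns false (and a null value) when the XPath matches nothing,
    // instead of dereferencing a null node.
    public static bool TryGetHigh(string html, out string value)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when there is no match.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(
            "//td[@class='dnTableCell']//a[text()='High']/../../td[3]");

        value = node == null ? null : HtmlEntity.DeEntitize(node.InnerText).Trim();
        return node != null;
    }

    public static void Main()
    {
        if (TryGetHigh(SampleHtml, out string high))
            Console.WriteLine("High: " + high);
        else
            Console.WriteLine("No matching cell; check the XPath against the raw HTML.");
    }
}
```

Checking the result before reading InnerText turns a changed page layout into a handled case rather than an exception.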

Solution 2

As Simon Mourier explained, you obtained the raw HTML sent by the server. The element you need has not been rendered yet, so you can't retrieve it because it does not exist in the DOM. A simple workaround is to use a web renderer to build the DOM; then you can grab the HTML and scrape it. I use WatiN like this:

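// Keep the automated IE window hidden and leave the mouse pointer where it is.
WatiN.Core.Settings.MakeNewInstanceVisible = false;
WatiN.Core.Settings.AutoMoveMousePointerToTopLeft = false;
IE ie = new IE();
ie.GoTo(urlLink);
// Wait for the page to finish loading before reading the HTML.
ie.WaitForComplete();
string html = ie.Html;
ie.Close();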
Author by

dontpanic

Updated on November 15, 2020

Comments

  • dontpanic
    dontpanic over 3 years

    This is my first attempt to get an element value using HAP. I'm getting a null object error when I try to use InnerText.

    The URL I am scraping is http://www.mypivots.com/dailynotes/symbol/659/-1/e-mini-sp500-june-2013 and I am trying to get the value for the current high from the Day Change Summary table.

    My code is at the bottom. Firstly, I would just like to know if I am going about this the right way. If so, is it simply that my XPath value is incorrect?

    The XPath value was obtained using a utility I found called htmlagility helper. The Firebug version of the XPath, shown here, also gives the same error: /html/body/div[3]/div/table/tbody/tr[3]/td/table/tbody/tr[5]/td[3]

    My code:

    WebClient myPivotsWC = new WebClient();
    string nodeValue;
    string htmlCode = myPivotsWC.DownloadString("http://www.mypivots.com/dailynotes/symbol/659/-1/e-mini-sp500-june-2013");
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(htmlCode);
    HtmlNode node = doc.DocumentNode.SelectSingleNode("/html[1]/body[1]/div[3]/div[1]/table[1]/tbody[1]/tr[3]/td[1]/table[1]/tbody[1]/tr[5]/td[3]");
    nodeValue=(node.InnerText);
    

    Thanks, Will.

  • dontpanic
    dontpanic about 11 years
    Thanks heaps, that works. I will study your explanation of the XPath. I thought I would be able to get this done without actually learning how to use XPath, but clearly I will have to.
  • MattH
    MattH about 11 years
    @dontpanic: Learn xpath, ask questions on refining your xpath queries, there are lots of xpath Q&A on SO.
  • goodfella
    goodfella almost 7 years
    @Simon: So I'm having a similar issue; I tried following your suggestion and still had no luck. Would you be able to show an example for any value from the table at weather.deltixlab.com? I have the following written: string day1 = doc.DocumentNode.SelectNodes("//*[@class='table-block']/../../../tr[1]/td[1]")[0].InnerText;
  • Simon Mourier
    Simon Mourier almost 7 years
    @goodfella - From what I understand, the site does not transport its table information in the HTML on the wire; cells are created using js/websocket, so you can't get these from HTML scraping.
  • goodfella
    goodfella almost 7 years
    @Simon - Thanks for replying on this old post. Just wanted to mention that when I tested the following XPath: //*[@id=\"free-data-table\"]/tbody/tr[1]/td[1] in the XPath Helper Chrome plugin, it shows me the value, but in HtmlAgilityPack it's null. Are you saying that when HtmlAgilityPack loads the URL for parsing, the table is not brought in?
  • Simon Mourier
    Simon Mourier almost 7 years
    @goodfella - Yep. This is a very common pattern today: the data is not in the HTML but is loaded dynamically in some way.
  • goodfella
    goodfella almost 7 years
    Thank you, so basically I need to render the page completely to get the missing DOM. Now what worries me is the delay this will cause.
  • Tomer W
    Tomer W almost 6 years
    I think Simon's approach is better... and BTW @guy.gc the link is dead...