HtmlAgilityPack and dynamic content issue


Solution 1

I just spent hours trying to get HtmlAgilityPack to render some dynamic ajax content from a webpage, going from one useless post to another until I found this one.

The answer is hidden in a comment under the initial post and I thought I should straighten it out.

This is the method that I used initially, which didn't work:

private void LoadTraditionalWay(String url)
{
    // Downloads the raw HTML only; no JavaScript runs, so ajax-loaded content is missing.
    WebRequest myWebRequest = WebRequest.Create(url);
    using (WebResponse myWebResponse = myWebRequest.GetResponse())
    using (Stream receiveStream = myWebResponse.GetResponseStream())
    using (TextReader reader = new StreamReader(receiveStream, System.Text.Encoding.UTF8))
    {
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.Load(reader);
    }
}

WebRequest only downloads the HTML as the server sends it. It does not execute JavaScript, so the ajax queries that fill in the missing content never run.

This is the solution that worked:

private void LoadHtmlWithBrowser(String url)
{
    webBrowser1.ScriptErrorsSuppressed = true;
    webBrowser1.Navigate(url);

    // Block until the control reports the page as fully loaded.
    waitTillLoad(this.webBrowser1);

    // Read the live DOM (after scripts have run) and pass its markup to HtmlAgilityPack.
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    var documentAsIHtmlDocument3 = (mshtml.IHTMLDocument3)webBrowser1.Document.DomDocument;
    StringReader sr = new StringReader(documentAsIHtmlDocument3.documentElement.outerHTML);
    doc.Load(sr);
}

private void waitTillLoad(WebBrowser webBrControl)
{
    WebBrowserReadyState loadStatus;
    int waittime = 100000;
    int counter = 0;

    // Phase 1: pump messages until the control reports that navigation has
    // started, or give up after `waittime` iterations.
    while (true)
    {
        loadStatus = webBrControl.ReadyState;
        Application.DoEvents();
        if ((counter > waittime) || (loadStatus == WebBrowserReadyState.Uninitialized) || (loadStatus == WebBrowserReadyState.Loading) || (loadStatus == WebBrowserReadyState.Interactive))
        {
            break;
        }
        counter++;
    }

    // Phase 2: pump messages until the page is complete and the control is idle.
    // Note that this loop has no timeout.
    while (true)
    {
        loadStatus = webBrControl.ReadyState;
        Application.DoEvents();
        if (loadStatus == WebBrowserReadyState.Complete && !webBrControl.IsBusy)
        {
            break;
        }
    }
}

The idea is to load the page with the WebBrowser control, which is capable of running the ajax content, and wait until the page has fully rendered. Then use the Microsoft.mshtml library to read the rendered HTML back out and parse it with the agility pack.
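Once the rendered markup is in HtmlAgilityPack, it can be queried like any other document. The following is only a minimal sketch of that last step: the class name `RenderedHtmlQuery`, the method `PrintMatches`, and the `xpath` argument are hypothetical names for illustration, and `renderedHtml` is assumed to be the `outerHTML` string obtained from the browser.

```csharp
using System;
using HtmlAgilityPack;

static class RenderedHtmlQuery   // hypothetical helper, not part of the original answer
{
    // Parses already-rendered HTML and prints the text of every node matching the XPath.
    public static void PrintMatches(string renderedHtml, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(renderedHtml);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            Console.WriteLine("No nodes matched.");
            return;
        }

        foreach (HtmlNode node in nodes)
        {
            Console.WriteLine(node.InnerText.Trim());
        }
    }
}
```

Because the markup comes from the browser's DOM, it may contain elements the browser added itself (such as `tbody` inside tables), so an XPath written against the raw page source may need adjusting.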

This was the only way I could get access to the dynamic data.

Hope it helps someone

Solution 2

Would Selenium do the trick? As far as I am aware, it creates instances of browser engines (sort of), so it should allow JS to be executed and let you get the result of the manipulated DOM.
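A rough sketch of what that could look like in C#, assuming the Selenium.WebDriver, Selenium.Support and HtmlAgilityPack NuGet packages and a locally available Chrome driver. The class name `SeleniumLoader`, the method `LoadRendered`, the `elementId` parameter and the 10 second timeout are all hypothetical choices for illustration.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

static class SeleniumLoader   // hypothetical helper name
{
    // Opens the page in a real browser, waits for a dynamically created element,
    // then parses the rendered DOM with HtmlAgilityPack.
    public static HtmlAgilityPack.HtmlDocument LoadRendered(string url, string elementId)
    {
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl(url);

            // Assumed timeout; Until throws WebDriverTimeoutException if it elapses.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
            wait.Until(d => d.FindElements(By.Id(elementId)).Count > 0);

            // PageSource reflects the DOM after scripts have modified it.
            var doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(driver.PageSource);
            return doc;
        }
    }
}
```

Waiting for a specific element, rather than sleeping for a fixed time, is one way to know the ajax content has arrived; which element to wait for depends on the page being scraped.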

Author: Chyngyz Sydykov


Updated on July 09, 2022

Comments

  • Chyngyz Sydykov
    Chyngyz Sydykov almost 2 years

    I want to create a web scraper application, and I want to do it with the WebBrowser control, HtmlAgilityPack and XPath.

    Right now I have managed to create an XPath generator (I used the WebBrowser control for this purpose), which works fine, but sometimes I cannot grab content that is generated dynamically (via JavaScript or ajax). I also found out that the WebBrowser control (actually the IE browser) generates some extra tags like "tbody", while HtmlAgilityPack, loaded with `htmlWeb.Load(webBrowser.DocumentStream);`, doesn't see them.

    Another note: I found out that the following code actually grabs the current webpage source, but I couldn't supply it to HtmlAgilityPack: `(mshtml.IHTMLDocument3)webBrowser.Document.DomDocument;`

    Can you please help me with it?

  • Daniel
    Daniel almost 10 years
    Good work, Nick! Thanks for posting your solution -- it was very useful for me! What a chore! I'll add that MSHTML is named "Microsoft HTML object library" when adding the reference.
  • Phill Healey
    Phill Healey almost 10 years
    Is the document for passing to HtmlAgilityPack now in 'sr', and this just needs manipulating?
  • Lee Englestone
    Lee Englestone over 8 years
    I tried this myself last night with Selenium (albeit with a wait) and it allowed the javascript on the page to update the DOM and I could access the changes to the DOM via code.
  • meds
    meds over 6 years
    What type is webBrowser1?
  • Korli
    Korli over 6 years
    Just for reference: if you're not running in a WinForms (or any STA) context, you will have to start the WebBrowser in an STA container. Something like this: `var t = new Thread(MyThreadStartMethod); t.SetApartmentState(ApartmentState.STA); t.Start();`
  • Khan Engineer
    Khan Engineer almost 6 years
    I am having the same problem. I want to get the content of a table that is dynamically loaded with JS. The div created by JS has the id packageTabContainer, but I get null. I have tried the solution but didn't get the content. Here is the link I need to extract: ikea.com/qa/en/catalog/products/60368726
  • benJima
    benJima almost 5 years
    I would like to add that the WebBrowser control needs to be configured accordingly. By default, ajax calls (i.e. scripts) do not work (at least on my system). Check the accepted answer here if you face this problem as well.
  • Abdullah Tahan
    Abdullah Tahan over 4 years
    How do we do it for an API?
  • Dan Cundy
    Dan Cundy over 2 years
    FYI, Selenium is not compatible with Azure Functions due to GDI+ restrictions once in the cloud.
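Korli's comment above about needing an STA thread can be sketched roughly as follows. This is an illustration under assumptions, not a tested drop-in: the class name `StaBrowserRunner` and the method `GetRenderedHtml` are hypothetical, the code uses the `DocumentCompleted` event instead of the polling loop from Solution 1, and it has no timeout.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

static class StaBrowserRunner   // hypothetical helper name
{
    // Hosts a WebBrowser on a dedicated STA thread and returns the rendered markup.
    public static string GetRenderedHtml(string url)
    {
        string html = null;

        var t = new Thread(() =>
        {
            using (var browser = new WebBrowser())
            {
                browser.ScriptErrorsSuppressed = true;
                browser.DocumentCompleted += (sender, e) =>
                {
                    // The event can fire once per frame; wait for the whole page.
                    if (browser.ReadyState != WebBrowserReadyState.Complete) return;

                    var roots = browser.Document.GetElementsByTagName("html");
                    if (roots.Count > 0) html = roots[0].OuterHtml;

                    Application.ExitThread();   // stop the message loop started below
                };

                browser.Navigate(url);
                Application.Run();              // the control needs a message loop
            }
        });

        t.SetApartmentState(ApartmentState.STA);   // must be set before Start()
        t.Start();
        t.Join();
        return html;
    }
}
```

The returned string can then be passed to HtmlAgilityPack's `LoadHtml`. Content that scripts add after the page reports `Complete` may still be missing, so an additional wait may be needed for some pages.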