Running Scripts in HtmlAgilityPack


Solution 1

You are getting exactly what the server returns, the same thing a web browser receives. The difference is that a browser then runs the scripts. Html Agility Pack is only an HTML parser: it has no way to interpret the JavaScript or bind it to its internal representation of the document. To run the script, you need a web browser.

The perfect answer to your problem would be a complete "headless" web browser: something that incorporates an HTML parser, a JavaScript interpreter, and a model that simulates the browser DOM, all working together. Basically, that's a web browser without the rendering part. At this time there isn't such a thing that works entirely within the .NET environment.
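To make the "parser only" point concrete, here is a minimal sketch in Python using the standard-library `html.parser` module. It is not Html Agility Pack, and the page markup, the `data` element id, and the helper names are all hypothetical, but the behaviour it illustrates is the same: the parser sees the script as inert text and the element the script was supposed to fill stays empty.

```python
from html.parser import HTMLParser

# Hypothetical page: the server sends an empty container plus a script
# that would populate it only if a browser executed it.
PAGE = """
<html><body>
<div id="data"></div>
<script>
document.getElementById("data").textContent = "42 results";
</script>
</body></html>
"""


class DataAndScriptCollector(HTMLParser):
    """Collects text inside <div id="data"> and raw <script> source separately."""

    def __init__(self):
        super().__init__()
        self.current = None        # which element we are currently inside
        self.div_text = []         # text found in the target div
        self.script_source = []    # unexecuted script text

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "data":
            self.current = "div"
        elif tag == "script":
            self.current = "script"

    def handle_endtag(self, tag):
        if tag in ("div", "script"):
            self.current = None

    def handle_data(self, data):
        if self.current == "div":
            self.div_text.append(data)
        elif self.current == "script":
            self.script_source.append(data)


def parse_page(html):
    """Return (text inside the data div, script source) as the parser sees them."""
    parser = DataAndScriptCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.div_text).strip(), "".join(parser.script_source).strip()


div_text, script_source = parse_page(PAGE)
print(repr(div_text))                     # the div is still empty
print("getElementById" in script_source)  # the script is present, but only as text
```

The div comes back empty because nothing ever ran the script. Getting the populated content requires something that actually executes JavaScript against a DOM, which is what a browser does.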

Your best bet is to use a WebBrowser control and actually load and run the page in Internet Explorer under programmatic control. This won't be fast or pretty, but it will do what you need to do.

Also see my answer to a similar question, "Load a DOM and Execute javascript, server side, with .Net", which discusses the technology available in .NET for doing this. Most of the pieces exist right now, but unfortunately they either aren't quite there yet or haven't been integrated in the right way.

Solution 2

You can use Awesomium for this: http://www.awesomium.com/. It works fairly well, but it has no x64 support and is not thread safe. I'm using it to scan some web sites 24x7; it runs fine for at least a couple of days in a row, but then it usually crashes.

Author: Aabela

Updated on January 27, 2020

Comments

  • Aabela about 4 years

    I'm trying to scrape a particular webpage which works as follows.

    First the page loads, then it runs some sort of javascript to fetch the data it needs to populate the page. I'm interested in that data.

    If I get the page with HtmlAgilityPack, the script doesn't run, so I get what is essentially a mostly blank page.

    Is there a way to force it to run a script, so I can get the data?

  • Aabela almost 12 years
    The GDI+Handle/Memory leak caused by the WebBrowser control is what is driving me to seek alternatives. I'm just sorry that there isn't a proper solution to this problem.
  • Jamie Treworgy almost 12 years
    Bummer. Yeah, this is one of those areas that isn't quite there yet, at least if you keep it all within .NET. If you can live with a hybrid app, there are definitely ways to do this, but it will be more complicated. I keep hoping someone will do the work to create a real unified headless browser entirely in .NET, but it's definitely no small task. Like I said, a lot of the pieces are there, but someone needs to put them together.