Can I get the original page source (vs current DOM) with phantomjs/casperjs?

Solution 1

Did you try using events? For example:

casper.on('load.started', function(resource) {
    casper.echo(casper.getPageContent());
});

I don't think it will work, but try it anyway.

The problem is that you can't do this in a normal CasperJS step, because by then the scripts on your page have already executed. It could work if we could bind to the DOM-ready event, or if Casper had a specific event for it. But the page must be loaded before Casper can send any JavaScript to the DOM environment, so I don't see how binding to DOM-ready would be possible. I think PhantomJS only lets us scrape data after the load event, that is, once the page is rendered.

So if it's not possible to hack it with the events and maybe some delay, your only solution is to block the scripts that modify your DOM.

There is still the PhantomJS option for disabling JavaScript, which you can set in Casper:

casper.options.pageSettings.javascriptEnabled = false;

The problem is that you need JS enabled to get the data back, so it can't work on its own... :p

Otherwise you have to use events to block the resource/script that modifies the DOM.

Or you could use the resource.received event to scrape the wanted data before the specific resources that modify the DOM appear.

In fact I don't think that's possible: if you create a step that grabs data from the page just before specific resources appear, then by the time your step executes, those resources will have loaded. You would need to freeze the following resources while your step scrapes the data.

I don't know how to do that, but these events could help you:

casper.on('resource.requested', function(request) {
    console.log(" request " + request.url);
});

casper.on('resource.received', function(resource) {
    console.log(resource.url);
});

casper.on('resource.error',function (request) {
    this.echo('[res : id and url + error description] <-- ' + request.id + ' ' + request.url + ' ' + request.errorString);
});
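The resource.received idea above could be sketched roughly as follows. This is an untested sketch, not a confirmed solution: `snapshotOn` and the `/main\.js$/` pattern are hypothetical, and as discussed, the DOM may already have been modified by the time the handler runs. The `casper` object is passed in as a parameter so the helper can also be exercised with a stub.

```javascript
// Hypothetical helper: store the page content each time a resource whose
// URL matches `pattern` starts arriving.
function snapshotOn(casper, pattern, snapshots) {
    casper.on('resource.received', function (resource) {
        // PhantomJS reports a 'start' and an 'end' stage for each resource;
        // only react to the first one.
        if (resource.stage === 'start' && pattern.test(resource.url)) {
            snapshots.push(casper.getPageContent());
        }
    });
}

// Usage inside a Casper script (the URL pattern is an assumption):
// var snapshots = [];
// snapshotOn(casper, /main\.js$/, snapshots);
```

Whether the stored snapshot is still the unmodified source depends entirely on timing, so treat it as something to experiment with.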

See also "How do you Disable css in CasperJS?". The solution that would work: identify the scripts and block them. But if you need them, well, I don't know; it's a good question. Maybe we could defer the execution of a specific script, but I don't think Casper and PhantomJS make that easy. The only useful option is abort(); what we would really want is something like timeout("time -> ms")!

onResourceRequested

Here is a similar question: "Injecting script before other"
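The identify-and-block approach can be sketched like this. It assumes the DOM-modifying code lives in external files whose URLs you can match, and a CasperJS version that passes the network request object as the second argument of resource.requested. `blockedPatterns` and `shouldBlock` are hypothetical names, and the patterns are placeholders. Inline scripts are not affected by request blocking.

```javascript
// Hypothetical list of URL patterns for scripts that modify the DOM.
var blockedPatterns = [/\/widgets\/.*\.js$/, /analytics\.js$/];

// Return true when `url` matches any of the given patterns.
function shouldBlock(url, patterns) {
    return patterns.some(function (pattern) {
        return pattern.test(url);
    });
}

// Wire it into CasperJS only when a `casper` instance exists, so the
// helper can also be loaded and tested on its own.
if (typeof casper !== 'undefined') {
    casper.on('resource.requested', function (requestData, request) {
        if (shouldBlock(requestData.url, blockedPatterns)) {
            request.abort(); // drop the request before it is sent
        }
    });
}
```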

Solution 2

As Fanch pointed out, it seems it's not possible to do this within a single page load. If you are able to make two requests, then this gets easy. Simply do one request with JavaScript disabled and one with it enabled, so you can scrape the page source from each and compare them.

casper
    .then(function(){
        this.options.pageSettings.javascriptEnabled = false;
    })
    .thenOpen(url, function(){
        this.echo("before JavaScript");
        this.echo(this.getHTML());
    })
    .then(function(){
        this.options.pageSettings.javascriptEnabled = true;
    })
    .thenOpen(url, function(){
        this.echo("after JavaScript");
        this.echo(this.getHTML());
    });

You can change the order according to your needs. If you're already on a page that you want to have the original markup of, then you can use casper.getCurrentUrl() to get the current URL:

casper
    .then(function(){
        // submit or whatever
    })
    .thenOpen(url, function(){
        this.echo("after JavaScript");
        this.echo(this.getHTML());
        this.options.pageSettings.javascriptEnabled = false;

        this.thenOpen(this.getCurrentUrl(), function(){
            this.echo("before JavaScript");
            this.echo(this.getHTML());
        })
    });
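
For the compare step, a small helper can report where the two captures first diverge. This is a minimal sketch: `firstDifference` is a hypothetical name, and the two strings are assumed to be the results of the this.getHTML() calls in the snippets above.

```javascript
// Return the index of the first character at which the two strings differ,
// or -1 when they are identical.
function firstDifference(before, after) {
    var limit = Math.min(before.length, after.length);
    for (var i = 0; i < limit; i++) {
        if (before.charAt(i) !== after.charAt(i)) {
            return i;
        }
    }
    // One string is a prefix of the other, or they are equal.
    return before.length === after.length ? -1 : limit;
}

// Usage: keep the output of this.getHTML() from both steps, then
// var index = firstDifference(htmlWithoutJs, htmlWithJs);
```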
Author: supercoco

Updated on June 24, 2022

Comments

  • supercoco
    supercoco almost 2 years

    I am trying to get the original source for a particular web page.

    The page executes some scripts that modify the DOM as soon as it loads. I would like to get the source before any script or user changes any object in the document.

    With Chrome or Firefox (and probably most browsers) I can either look at the DOM (debug utility F12) or look at the original source (right-click, view source). The latter is what I want to accomplish.

    Is it possible to do this with phantomjs/casperjs?

    Before getting to the page I have to log in. This is working fine with casperjs. If I browse to the page and render the results I know I am on the right page.

    casper.thenOpen('http://'+customUrl, function(response) {
        this.page.render('example.png'); // *** Renders correct page (current DOM) ***
        console.log(this.page.content); // *** Gets current DOM ***
        casper.download('view-source:'+customUrl, 'b.html', 'GET'); // *** Blank page ***
        console.log(this.getHTML()); // *** Gets current DOM ***
        this.debugPage(); // *** Gets current DOM ***
        utils.dump(response); // *** No BODY ***
        casper.download('http://'+customUrl, 'a.html', 'GET');  // *** Not logged in ?! ***
    });
    

    I've tried this.download(url, 'a.html') but it doesn't seem to share the same context since it returns HTML as if I was not logged in, even if I run with cookies casperjs test.casper.js --cookies-file=cookies.txt.

    I believe I should keep analyzing this option.


    I have also tried casper.open('view-source:url') instead of casper.open('http://url') but it seems it doesn't recognize the url since I just get a blank page.

    I have looked at the raw HTTP Response I get from the server with a utility I have and the body of this message (which is HTML) is what I need but when the page loads in the browser the DOM has already been modified.

    I tried:

    casper.thenOpen('http://'+url, function(response) {
        ...
    });
    

    But the response object only contains the headers and some other information but not the body.


    I also tried with the event onResourceRequested.

    The idea is to abort the download of any resource needed by a specific web page (the referer).

    onResourceRequested: function(casperObj, requestData, networkRequest) {
        // Abort every request whose Referer header is the target page
        for (var i = 0; i < requestData.headers.length; i++) {
            var obj = requestData.headers[i];
            if (obj.name === "Referer" && obj.value === 'http://'+customUrl) {
                networkRequest.abort();
                break;
            }
        }
    }
    

    Unfortunately, the script that initially modifies the DOM seems to be inline in the main HTML page (or this code is not doing what I would like it to do).


    Any ideas?

    Here is the full code:

    phantom.casperTest = true;
    phantom.cookiesEnabled = true;
    
    var utils = require('utils');
    var casper = require('casper').create({
        clientScripts:  [],
        pageSettings: {
            loadImages:  false,
            loadPlugins: false,
            javascriptEnabled: true,
            webSecurityEnabled: false
        },
        logLevel: "error",
        verbose: true
    });
    
    casper.userAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X)');
    
    casper.start('http://www.xxxxxxx.xxx/login');
    
    casper.waitForSelector('input#login',
        function() {
            this.evaluate(function(customLogin, customPassword) {
                document.getElementById("login").value = customLogin;
                document.getElementById("password").value = customPassword;
                document.getElementById("button").click();
            }, {
                "customLogin": customLogin,
                "customPassword": customPassword
            });
        },
        function() {
            console.log("Can't log in.");
        },
        15000
    );
    
    casper.waitForSelector('div#home',
        function() {
            console.log('Login successful.');
        },
        function() {
            console.log('Login failed.');
        },
        15000
    );
    
    casper.thenOpen('http://'+customUrl, function(response) {
        this.page.render('example.png'); // *** Renders correct page (current DOM) ***
        console.log(this.page.content); // *** Gets current DOM ***
        casper.download('view-source:'+customUrl, 'b.html', 'GET'); // *** Blank page ***
        console.log(this.getHTML()); // *** Gets current DOM ***
        this.debugPage(); // *** Gets current DOM ***
        utils.dump(response); // *** No BODY ***
        casper.download('http://'+customUrl, 'a.html', 'GET');  // *** Not logged in ?! ***
    });
    
  • Artjom B.
    Artjom B. almost 10 years
    OP wants the completely unchanged source, but debugPage prints the current page. This is not an answer.
  • the binary
    the binary almost 10 years
    Updated my answer to use #debugHTML() instead of #debugPage()
  • supercoco
    supercoco almost 10 years
    That doesn't work either. It does not return the original HTML.
  • Fanch
    Fanch almost 10 years
    And instead of aborting, did you try grabbing the HTML when a suitable resource is received? For example: var fs = require('fs'); fs.write("results.html", casper.getPageContent(), 'w');
  • Gogowitsch
    Gogowitsch almost 5 years
    @thebinary: please remove this non-answer.