How to scroll down with PhantomJS to load dynamic content

Solution 1

I found a way to do this and tried to adapt it to your situation. I didn't test the best way of detecting the bottom of the page because my context was different, but check the solution below. The key point is that you have to wait a little for the page to load, and because JavaScript works asynchronously, you have to use setInterval or setTimeout to achieve this.

var page = require('webpage').create();

page.open('http://example.com/?q=houston', function () {

  // Check for the bottom div and scroll down from time to time
  window.setInterval(function() {
      // Check inside the page whether an element with the class
      // "has-more-items" still exists (per the question, it is there
      // while more content can be loaded)
      var hasMore = page.evaluate(function() {
        return document.querySelector('.has-more-items') !== null;
      });

      if(hasMore) { // Found: more content can still be loaded
        page.evaluate(function() {
          // Scroll to the bottom of page
          window.document.body.scrollTop = document.body.scrollHeight;
        });
      }
      else { // Didn't find: everything is loaded
        // Do what you want
        // ...
        phantom.exit();
      }
  }, 500); // Number of milliseconds to wait between scrolls

});
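
A polling loop like the one above keeps running forever if its stop condition is never met. A minimal sketch of a bounded variant follows. The names `decideNextStep`, `scrollUntilDone`, `maxAttempts` and the `onFinish` callback are hypothetical and not part of PhantomJS. It assumes, as the question describes, that `.has-more-items` is present while more content can be loaded.

```javascript
// Hypothetical pure helper: decide what the poller should do next.
//   markerPresent: is the "more items" marker still in the DOM?
//   attempts:      number of scrolls issued so far
//   maxAttempts:   safety cap so the loop cannot run forever
function decideNextStep(markerPresent, attempts, maxAttempts) {
  if (!markerPresent) { return 'done'; }
  if (attempts >= maxAttempts) { return 'give-up'; }
  return 'scroll';
}

// Hypothetical wiring for PhantomJS; only call this inside a PhantomJS script.
function scrollUntilDone(url, maxAttempts, onFinish) {
  var page = require('webpage').create();
  var attempts = 0;

  page.open(url, function (status) {
    if (status !== 'success') { onFinish(page, 'load-failed'); return; }

    var timer = setInterval(function () {
      // This callback runs inside the page, so only DOM objects are visible
      var markerPresent = page.evaluate(function () {
        return document.querySelector('.has-more-items') !== null;
      });

      var step = decideNextStep(markerPresent, attempts, maxAttempts);
      if (step === 'scroll') {
        attempts += 1;
        page.evaluate(function () {
          window.document.body.scrollTop = document.body.scrollHeight;
        });
      } else {
        clearInterval(timer);
        onFinish(page, step); // 'done' or 'give-up'
      }
    }, 500); // milliseconds between checks
  });
}
```

A call might look like `scrollUntilDone('http://example.com/?q=houston', 40, function (page, outcome) { console.log(outcome); phantom.exit(); });`. Keeping the decision in a separate function makes it possible to check the stop logic without a browser.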

Solution 2

I know this was answered a long time ago, but I also found a solution for my specific scenario. The result is a piece of JavaScript that scrolls to the bottom of the page and is optimized to reduce waiting time.

It is not written for PhantomJS, so it will have to be modified. However, for a beginner or someone who doesn't have root access, an iframe with injected JavaScript (run Google Chrome with --disable-javascript parameter) is a good alternative method for scraping a smaller set of AJAX pages. The main benefit is that it is easy to debug, because you can watch what your scraper is doing.

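var scrolltime = 0; // ms waited since the height last changed

// ScrapeCurrentPage is your own scraping function
function ScrollForAjax (iframeselector) {

    var scrollintervals = 50;  // ms between height checks
    var scrollmaxtime = 1000;  // stop after this many ms without growth

    // Height of the iframe's document before scrolling
    var scrolldocheight1 = $(iframeselector).contents().find("body").height();

    $("body").scrollTop(scrolldocheight1);
    setTimeout(function(){

        // Measure the same document again after waiting
        var scrolldocheight2 = $(iframeselector).contents().find("body").height();

        if(scrolltime>=scrollmaxtime){
            // Timed out: assume everything is loaded
            scrolltime = 0;
            $("body").scrollTop(0);
            ScrapeCurrentPage(iframeselector);
        }

        else if(scrolldocheight2>scrolldocheight1){
            // New content arrived: reset the timer and keep scrolling
            scrolltime = 0;
            ScrollForAjax (iframeselector);
        }

        else {
            // No change yet: keep waiting
            ScrollForAjax (iframeselector);
        }

    },scrollintervals);

    scrolltime += scrollintervals;
}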

scrollmaxtime is the timeout: the maximum time in milliseconds to wait for the page height to change before scraping. Hope this is useful to someone :)
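
The idle-timeout idea can also be written without jQuery or an iframe. The sketch below is one way to express it as a pure function. `nextScrollState` and the shape of its return value are hypothetical names introduced for illustration. Unlike the code above, it checks for growth before checking the timeout.

```javascript
// Hypothetical helper: decide the next action from two height measurements.
//   idleTime:     ms already waited without the height growing
//   heightBefore: document height before the last scroll
//   heightAfter:  document height after waiting one interval
//   interval:     ms between checks (50 in the code above)
//   maxIdle:      ms without growth before giving up (1000 in the code above)
function nextScrollState(idleTime, heightBefore, heightAfter, interval, maxIdle) {
  if (heightAfter > heightBefore) {
    // New content arrived: reset the idle timer and scroll again
    return { action: 'scroll', idleTime: 0 };
  }
  var waited = idleTime + interval;
  if (waited >= maxIdle) {
    // Nothing changed for long enough: assume the page is complete
    return { action: 'scrape', idleTime: 0 };
  }
  // No change yet: keep waiting and scrolling
  return { action: 'scroll', idleTime: waited };
}
```

The caller keeps the returned `idleTime` between checks and stops scrolling once `action` is `'scrape'`.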

Solution 3

The "correct" solution didn't work for me. From what I've read, CasperJS doesn't use window (though I may be wrong about that), which makes me doubt that window works.

The following works for me in the Firefox/Chrome console, but doesn't work in CasperJS (within the casper.evaluate function):

$(document).scrollTop($(document).height());

What did work for me in CasperJS was:

casper.scrollToBottom();
casper.wait(1000, function waitCb() {
  casper.capture("loadedContent.png");
});

This also worked when I moved casper.capture into Casper's then function.

However, the above solution won't work on some sites like Twitter. jQuery seems to break the casper.scrollToBottom() function, and I had to remove the clientScripts reference to jQuery when working with Twitter.

var casper = require('casper').create({
    clientScripts: [
       // 'jquery.js'
    ]
});

Some websites (e.g. BoingBoing.net) seem to work fine with jQuery and CasperJS scrollToBottom(). Not sure why some sites work and others don't.
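
According to one of the comments on this page, CasperJS computes its scroll target as the maximum of several height properties. A minimal sketch of that computation as a standalone function follows. `documentHeight` is a hypothetical name, and the document is passed in as a parameter only so the function can be checked outside a browser.

```javascript
// Hypothetical helper: largest of the common height properties
// of <body> and <html> for the given document.
function documentHeight(doc) {
  var body = doc.body;
  var html = doc.documentElement;
  return Math.max(
    body.scrollHeight, html.scrollHeight,
    body.offsetHeight, html.offsetHeight,
    body.clientHeight, html.clientHeight
  );
}
```

Inside a page this would be used as `window.scrollTo(0, documentHeight(document));`. Because code passed to an evaluate call runs in the page's own context, the function would have to be defined inside that callback rather than in the outer script.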

Solution 4

The code snippet below works just fine for Pinterest. I researched a lot to scrape Pinterest without PhantomJS, but it is impossible to find the infinite-scroll trigger link. I think the code below will also help with scraping other infinite-scroll pages.

page.open(pageUrl).then(function (status) {
    var count = 0;
    // Scrolls to the bottom of page
    function scroll2btm() {
        if (count < 500) {
            page.evaluate(function() {
                window.scrollTo(0, document.body.scrollHeight || document.documentElement.scrollHeight);
                // use the class of the desired content (e.g. pin) to count the items present
                return document.getElementsByClassName('pinWrapper').length;
            }).then(function(c) {
                count = c;
                console.log(count); // print the number of items found so far
                setTimeout(scroll2btm,3000); // scroll again once the count is known
            });
        } else {
            // required number of items found
        }
    }
    scroll2btm();
});

Author by

Puneet Saini

Updated on October 10, 2020

Comments

  • Puneet Saini
    Puneet Saini over 3 years

    I am trying to scrape links from a page that generates content dynamically as the user scrolls down to the bottom (infinite scrolling). I have tried different things with PhantomJS but have not been able to gather links beyond the first page. Let's say the element at the bottom which loads content has the class .has-more-items. It is available until the final content is loaded while scrolling and then becomes unavailable in the DOM (display:none). Here are the things I have tried:

    • Setting viewportSize to a large height right after var page = require('webpage').create();

    page.viewportSize = { width: 1600, height: 10000, };

    • Using page.scrollPosition = { top: 10000, left: 0 } inside page.open, which has no effect:
    page.open('http://example.com/?q=houston', function(status) {
       if (status == "success") {
          page.scrollPosition = { top: 10000, left: 0 };  
       }
    });
    
    • Also tried putting it inside the page.evaluate function, but that gives

    Reference error: Can't find variable page

    • Tried using jQuery and JS code inside page.evaluate and page.open, but to no avail:

    $("html, body").animate({ scrollTop: $(document).height() }, 10, function() { //console.log('check for execution'); });

    both as is and inside document.ready. Similarly for the JS code:

    window.scrollBy(0,10000)
    

    both as is and inside window.onload.

    I have been really stuck on this for 2 days now and cannot find a way. Any help or hint would be appreciated.

    Update

    I have found a helpful piece of code at https://groups.google.com/forum/?fromgroups=#!topic/phantomjs/8LrWRW8ZrA0

    var hitRockBottom = false;
    while (!hitRockBottom) {
        // Scroll the page (not sure if this is the best way to do so...)
        page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 };

        // Check if we've hit the bottom
        hitRockBottom = page.evaluate(function() {
            return document.querySelector(".has-more-items") === null;
        });
    }
    

    Here .has-more-items is the class of the element I want to access. It is at the bottom of the page initially, moves further down as we scroll until all data is loaded, and then becomes unavailable.

    However, when I tested it, it was clear that it runs into an infinite loop without scrolling down (I render pictures to check). I have also tried replacing page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 }; with each of the snippets below (one at a time):

    window.document.body.scrollTop = '1000';
    location.href = ".has-more-items";
    page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 };
    document.location.href=".has-more-items";
    

    But nothing seems to work.

    • f.cipriani
      f.cipriani almost 11 years
      Can you provide an example url?
    • Puneet Saini
      Puneet Saini almost 11 years
      @f.cipriani My URL isn't public (it is behind a login). However, the Twitter stream provides the very same scenario. Take for example this account: twitter.com/GSASTeaching. The bottom of the tweet stream shows a loading image inside some element. I need to scroll to that element while it is available. When all content has loaded, that element is no longer available, both in my case and in the Twitter stream case. I have edited my question to add more things I have tried.
    • Jimit Patel
      Jimit Patel over 6 years
      What if that class is still available? I am working on a page where the class products-bottom products-bottom--small hide is used, and it remains there once everything is loaded. Judging by the names of the other classes, it seems to have been built using ReactJS.
  • Shakil
    Shakil almost 11 years
    It worked like a charm...thanks was stuck for several days... window.document.body.scrollTop = document.body.scrollHeight;
  • user373480
    user373480 over 10 years
    We tried your solution, but it doesn't seem to work. Am I missing anything?
  • Artjom B.
    Artjom B. almost 10 years
    It might be helpful to use window.scrollTo(0, Math.max(Math.max(document.body.scrollHeight, document.documentElement.scrollHeight), Math.max(document.body.offsetHeight, document.documentElement.offsetHeight), Math.max(document.body.clientHeight, document.documentElement.clientHeight))); because it is what casperjs does internally.
  • GeH
    GeH over 8 years
    @ArtjomB. window.scrollTo works sometimes, but not always! It certainly depends on the libraries included by the webpage. I just found a case where I needed to use window.document.body.scrollTop instead of window.scrollTo to trigger the correct process.
  • Valentin V
    Valentin V over 8 years
    @ArtjomB. your snippet contains weird characters that look like English but are not :) Specifically, the last Math.max is broken
  • Artjom B.
    Artjom B. about 8 years
    This looks like a Slimer.js script and not a PhantomJS script.
  • PGT
    PGT about 7 years
    Not sure if this is still working, I'm on [email protected] and the scrolling does not work. I have 2 second intervals, so should be enough time to load, but it doesn't seem like it's doing anything (I also did some page renders). Looks like the javascript isn't running? Weird thing is that I set the viewport with page.viewportSize = {width: 1400, height: 1200};, but page render PNG is 1400 x 2324. Anyone have any idea?
  • Christian Butzke
    Christian Butzke over 6 years
    just in case anybody wants to use @Artjom.B.'s snippet: window.scrollTo( 0, Math.max( Math.max(document.body.scrollHeight, document.documentElement.scrollHeight), Math.max(document.body.offsetHeight, document.documentElement.offsetHeight), Math.max(document.body.clientHeight, document.documentElement.clientHeight) ) );
  • Luis A. Florit
    Luis A. Florit over 3 years
    This works, but the scrolling is taking longer and longer as the page scrolls down, to the point of almost stalling.