Scrape websites with infinite scrolling


Solution 1

You can use Selenium to scrape infinite-scrolling websites like Twitter or Facebook.

Step 1: Install Selenium using pip

pip install selenium 

Step 2: Use the code below to automate the infinite scroll and extract the page source

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import unittest


class Sel(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "https://twitter.com"

    def test_sel(self):
        driver = self.driver
        driver.get(self.base_url + "/search?q=stackoverflow&src=typd")
        # Selenium 4 removed find_element_by_link_text; use By locators.
        driver.find_element(By.LINK_TEXT, "All").click()
        # Scroll to the bottom of the page repeatedly; each scroll triggers
        # the site's JavaScript to load another batch of content.
        for i in range(1, 100):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)  # give the new content time to load
        html_source = driver.page_source
        data = html_source.encode('utf-8')

    def tearDown(self):
        self.driver.quit()


if __name__ == "__main__":
    unittest.main()

Step 3: Parse and print the data as required.
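For example, here is a minimal sketch that parses the captured page source with BeautifulSoup (an assumption; any HTML parser will do). The CSS selector is hypothetical and must be adapted to the real page:

from bs4 import BeautifulSoup

# html_source comes from driver.page_source above; the selector below is
# a placeholder -- inspect the real page to find the right one.
soup = BeautifulSoup(html_source, "html.parser")
for tweet in soup.select("p.tweet-text"):  # hypothetical selector
    print(tweet.get_text(strip=True))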

Solution 2

Most sites that have infinite scrolling do (as Lattyware notes) have a proper API as well, and you will likely be better served by using this rather than scraping.

But if you must scrape...

Such sites are using JavaScript to request additional content from the site when you reach the bottom of the page. All you need to do is figure out the URL of that additional content and you can retrieve it. Figuring out the required URL can be done by inspecting the script, by using the Firefox Web console, or by using a debug proxy.

For example, open the Firefox Web Console, turn off all the filter buttons except Net, and load the site you wish to scrape. You'll see all the files as they are loaded. Scroll the page while watching the Web Console and you'll see the URLs being used for the additional requests. Then you can request that URL yourself and see what format the data is in (probably JSON) and get it into your Python script.
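Once you have identified the endpoint, you can fetch it directly from Python. Below is a minimal sketch using the requests library; the URL and query parameters are hypothetical placeholders for whatever the Web Console shows the page actually requesting:

import requests

# Hypothetical endpoint and paging parameters -- substitute whatever the
# Web Console shows the page requesting as you scroll.
url = "https://example.com/api/items"
params = {"page": 2, "per_page": 20}

response = requests.get(url, params=params, timeout=10)
response.raise_for_status()
data = response.json()  # such endpoints usually return JSON
for item in data.get("items", []):
    print(item)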

Solution 3

Finding the URL of the AJAX source will be the best option, but it can be cumbersome for certain sites. Alternatively, you could use a headless browser such as QWebKit from PyQt and send keyboard events while reading the data from the DOM tree. QWebKit has a nice and simple API.
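Note that QtWebKit has since been deprecated in Qt; its successor is QtWebEngine. Below is a minimal sketch along those lines using PyQt5's QWebEngineView, driving the scrolling with injected JavaScript rather than keyboard events; the target URL and the number of scroll steps are hypothetical:

import sys
from PyQt5.QtCore import QTimer, QUrl
from PyQt5.QtWebEngineWidgets import QWebEngineView
from PyQt5.QtWidgets import QApplication

SCROLLS = 10  # arbitrary number of scroll steps

app = QApplication(sys.argv)
view = QWebEngineView()
remaining = SCROLLS

def scroll_step():
    global remaining
    if remaining > 0:
        remaining -= 1
        # Scroll to the bottom so the page loads the next batch of content.
        view.page().runJavaScript("window.scrollTo(0, document.body.scrollHeight);")
        QTimer.singleShot(2000, scroll_step)  # wait before scrolling again
    else:
        view.page().toHtml(handle_html)  # grab the fully rendered DOM

def handle_html(html):
    print(html[:500])  # hand the HTML to your parser here
    app.quit()

# Hypothetical target URL; start scrolling once the initial load finishes.
view.load(QUrl("https://example.com/infinite-feed"))
view.loadFinished.connect(lambda ok: QTimer.singleShot(2000, scroll_step))
view.show()
sys.exit(app.exec_())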


Comments

  • add-semi-colons
    add-semi-colons almost 4 years

    I have written many scrapers, but I am not really sure how to handle infinite scrollers. These days most websites, e.g. Facebook and Pinterest, have infinite scrollers.

    • Gareth Latty
      Gareth Latty almost 12 years
      Your best bet here is to use those sites' APIs - that will be, by far, the easiest and most reliable option. Other than that, you are going to have to process JavaScript, which basically means running a full browser - such libraries do exist.
    • add-semi-colons
      add-semi-colons almost 12 years
      Sounds like a simulation; can you point me towards some examples of handling JavaScript?
  • tatsuhirosatou
    tatsuhirosatou over 11 years
    Could you please give a concrete example of how to do it?
  • kindall
    kindall over 11 years
    I've added some more detail, hope it helps!
  • Satarupa Guha
    Satarupa Guha almost 9 years
    Does the loop mean we scroll down 100 times, where 100 is an arbitrarily chosen number? Is that right?
  • Pawan Kumar
    Pawan Kumar over 8 years
    @Satarupa You are right, the loop is there to scroll down repeatedly (note that range(1, 100) actually runs 99 times)