Selenium with Scrapy for dynamic page


Solution 1

It really depends on how you need to scrape the site and what data you want to get.

Here's an example of how you can follow pagination on eBay using Scrapy + Selenium:

import scrapy
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self, *args, **kwargs):
        super(ProductSpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            try:
                # find_element raises when there is no "next" link,
                # so it has to live inside the try block
                next_page = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a')
                next_page.click()

                # get the data and write it to scrapy items
            except NoSuchElementException:
                # no more pages to load
                break

        self.driver.close()
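
The snippet above only drives the browser through the pages; to hand the rendered HTML back to Scrapy (which is what the question is really after, and what the comments below discuss), a common pattern is to wrap self.driver.page_source in a Selector. A rough sketch, not from the original answer, with a placeholder XPath and an assumed parse_product callback:

from scrapy.selector import Selector

def parse(self, response):
    self.driver.get(response.url)
    # ... click through the pages with Selenium as in the example above ...
    sel = Selector(text=self.driver.page_source)  # let Scrapy parse the rendered HTML
    for href in sel.xpath('//h3[@class="lvtitle"]/a/@href').extract():  # placeholder XPath
        yield scrapy.Request(response.urljoin(href), callback=self.parse_product)  # assumed callback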

There are other examples of "selenium spiders" around. There is also an alternative to having to use Selenium with Scrapy: in some cases, the ScrapyJS middleware is enough to handle the dynamic parts of a page.
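
As a rough illustration (not part of the original answer), a ScrapyJS-backed request only differs from a regular one in the meta it carries; this sketch assumes the Splash middleware is enabled in settings.py, a Splash instance is reachable, and the URL is a placeholder:

import scrapy

class JsSpider(scrapy.Spider):
    name = "js_spider"
    start_urls = ['http://example.com/dynamic-page']  # placeholder URL

    def start_requests(self):
        for url in self.start_urls:
            # ScrapyJS/Splash renders the page before it reaches the callback
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'args': {'wait': 0.5}  # give the page's JavaScript time to run
                }
            })

    def parse(self, response):
        # response.body now holds the rendered HTML
        pass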

Solution 2

If the URL doesn't change between the two pages, then you should add dont_filter=True to your scrapy.Request(), or Scrapy will treat this URL as a duplicate after processing the first page.
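
For example (a minimal sketch; parse_page is an assumed callback name):

# the URL is identical on page 1 and page 2, so tell Scrapy not to drop it as a duplicate
yield scrapy.Request(response.url, callback=self.parse_page, dont_filter=True)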

If you need to render pages with JavaScript, you should use scrapy-splash; you can also check this Scrapy middleware, which can handle JavaScript pages using Selenium, or you can do it by launching any headless browser.
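
With scrapy-splash, for instance, the request side looks roughly like this (a sketch assuming the Splash middlewares are configured in settings.py and a Splash server is running):

from scrapy_splash import SplashRequest

def start_requests(self):
    for url in self.start_urls:
        # Splash executes the JavaScript and hands the rendered HTML to the callback
        yield SplashRequest(url, self.parse, args={'wait': 1})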

But a more effective and faster solution is to inspect your browser and see what requests are made when you submit a form or trigger a certain event. Try to simulate the same requests your browser sends. If you can replicate the request(s) correctly, you will get the data you need.

Here is an example:

import json

from scrapy import Spider, Request
from myproject.items import QuoteItem  # placeholder path: QuoteItem is a scrapy.Item defined in the project's items.py

class ScrollScraper(Spider):
    name = "scrollingscraper"

    quote_url = "http://quotes.toscrape.com/api/quotes?page="
    start_urls = [quote_url + "1"]

    def parse(self, response):
        data = json.loads(response.body)
        for item in data.get('quotes', []):
            quote_item = QuoteItem()  # create a fresh item for every quote
            quote_item['author'] = item.get('author', {}).get('name')
            quote_item['quote'] = item.get('text')
            quote_item['tags'] = item.get('tags')
            yield quote_item

        if data['has_next']:
            next_page = data['page'] + 1
            yield Request(self.quote_url + str(next_page))

When the pagination URL is the same for every page and a POST request is used, you can use scrapy.FormRequest() instead of scrapy.Request(); they are essentially the same, but FormRequest adds a new argument (formdata=) to the constructor.
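
In other words, something like this (a minimal sketch with an assumed endpoint and field name):

# POST to the same endpoint every time; the page number travels in the form data
yield scrapy.FormRequest(
    url='http://example.com/ajax',        # pagination endpoint (placeholder)
    formdata={'page': str(next_page)},    # formdata= is the extra argument FormRequest adds
    callback=self.parse,
    dont_filter=True,                     # the URL never changes between pages
)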

Here is another spider example from this post:

import json

import scrapy
from scrapy.http import FormRequest
from scrapy.selector import Selector

class SpiderClass(scrapy.Spider):
    # spider name and all
    name = 'ajax'
    page_incr = 1
    start_urls = ['http://www.pcguia.pt/category/reviews/#paginated=1']
    pagination_url = 'http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php'

    def parse(self, response):
        sel = Selector(response)

        if self.page_incr > 1:
            # AJAX responses come back as JSON; the HTML fragment lives under 'content'
            json_data = json.loads(response.body)
            sel = Selector(text=json_data.get('content', ''))

        # your code here

        # pagination code starts here
        if sel.xpath('//div[@class="panel-wrapper"]'):
            self.page_incr += 1
            formdata = {
                'sorter': 'recent',
                'location': 'main loop',
                'loop': 'main loop',
                'action': 'sort',
                'view': 'grid',
                'columns': '3',
                'paginated': str(self.page_incr),
                'currentquery[category_name]': 'reviews'
            }
            yield FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parse)
Author: Z. Lin

I have used Python for lots of things, from auto-generating code, verifying chip design, to developing web/mobile app backend.

Updated on April 28, 2020

Comments

  • Z. Lin
    Z. Lin about 4 years

    I'm trying to scrape product information from a webpage, using scrapy. My to-be-scraped webpage looks like this:

    • starts with a product_list page with 10 products
    • a click on "next" button loads the next 10 products (url doesn't change between the two pages)
    • I use a LinkExtractor to follow each product link into the product page and get all the information I need

    I tried to replicate the next-button AJAX call but couldn't get it working, so I'm giving Selenium a try. I can run Selenium's webdriver in a separate script, but I don't know how to integrate it with Scrapy. Where shall I put the Selenium part in my Scrapy spider?

    My spider is pretty standard, like the following:

    # imports for this snippet (old-style Scrapy APIs, matching the code below)
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.log import INFO

    class ProductSpider(CrawlSpider):
        name = "product_spider"
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/shanghai']
        rules = [
            Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="productList"]//dl[@class="t2"]//dt'), callback='parse_product'),
        ]

        def parse_product(self, response):
            self.log("parsing product %s" % response.url, level=INFO)
            hxs = HtmlXPathSelector(response)
            # actual data follows
    

    Any idea is appreciated. Thank you!

  • Z. Lin
    Z. Lin almost 11 years
    thanks for your help. Actually my biggest problem is in the part after next.click(). Every time I get a new page, but can I still use a LinkExtractor to extract all the product urls then use a callback to parse them?
  • Ethereal
    Ethereal over 10 years
    Is there a way to re-use the response that has already been grabbed by Scrapy instead of using self.driver.get(response.url)?
  • alecxe
    alecxe almost 10 years
    @Z.Lin is the problem you've described still present? If you've solved it, could you share the solution? Thanks. Also, consider accepting the answer if it helped.
  • alecxe
    alecxe almost 10 years
    @Ethereal I'm afraid this is the overhead you get here. Good point though.
  • KrisWebDev
    KrisWebDev over 9 years
    To install on ubuntu: sudo pip install selenium. To hide the browser window: install & example on this post.
  • Amistad
    Amistad about 9 years
    Hi @alecxe, I had a very similar question at stackoverflow.com/questions/28420078/… which is something right up your alley. If you do get some time, please have a look into it.
  • Halcyon Abraham Ramirez
    Halcyon Abraham Ramirez almost 9 years
    If we were to use this code, wouldn't it be better to just use Selenium all the way? I mean, Scrapy isn't doing anything here.
  • alecxe
    alecxe almost 9 years
    @HalcyonAbrahamRamirez this is just an example with the selenium part in the scrapy spider. After selenium is done, usually the self.driver.page_source is passed in to a Selector instance for Scrapy to parse the HTML, form the item instances, pass them to pipelines etc. Or, selenium cookies can be parsed and passed to Scrapy to make additional requests. But, if you don't need the power of the scrapy framework architecture, then, sure, you can use just selenium - it is itself quite powerful in locating the elements.
  • Halcyon Abraham Ramirez
    Halcyon Abraham Ramirez almost 9 years
    @alecxe yeah, while I get the concept, I'm still confused at the part where you extract the page source using Selenium and pass the elements you want scraped to Scrapy. For example, there is a "load more" button; clicking it shows more items, and you extract the XPath for those items. Now how do you pass those XPaths to Scrapy? Because only the items shown when you first requested the page will be parsed by Scrapy, and not the ones loaded after clicking the "load more" button with Selenium.
  • alecxe
    alecxe almost 9 years
    @HalcyonAbrahamRamirez got it, I would load more items until there are no more to add. Then, take the driver.page_source and pass it to the Selector().
  • Halcyon Abraham Ramirez
    Halcyon Abraham Ramirez almost 9 years
    OK, I kinda get it, thank you @alecxe. One last question: using that approach with Selenium and loading more items, is it possible to use the CrawlSpider class to extract info from the newly loaded items?
  • alecxe
    alecxe almost 9 years
    @HalcyonAbrahamRamirez this is something I would need additional info and see your code. Could you create a separate question so that more people can help you? Thanks.
  • Halcyon Abraham Ramirez
    Halcyon Abraham Ramirez almost 9 years
    I would if I could but im banned from asking questions. F*** me. anyway thanks for the time :D
  • Volatil3
    Volatil3 almost 9 years
    I'm doing it without Scrapy, just with the Selenium driver. The condition does not hold true once it navigates to the 2nd page. How do I keep it clicking?
  • Benjamin James
    Benjamin James about 8 years
    @alecxe Using your advice I get "TypeError: cannot create weak reference to 'unicode' object" .. I've never seen this before- any help?
  • alecxe
    alecxe about 8 years
    @BenjaminJames this is not related to the answer itself. Try googling or asking a separate question here on SO.
  • oldboy
    oldboy almost 6 years
    @alecxe I'm getting the following error ([4960:6000:0612/235425.186:ERROR:shader_disk_cache.cc(237)] Failed to create shader cache entry: -2). Can you please tell me what I'm doing wrong if I show you my script?! Anybody?