How to scrape dynamic content from a website?

So how do I scrape a website which has dynamic content?

There are a few options:

  1. Use Selenium, which drives a real browser: it opens the page, lets it render, and then lets you pull the rendered HTML source.
  2. Sometimes you can look at the XHR requests the page makes and fetch the data directly from the same endpoint (as you would from an API).
  3. Sometimes the data sits inside the <script> tags of the HTML source. You can search through those and call json.loads() once you have trimmed the text down to valid JSON.
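As a rough illustration of option 3, here is a minimal sketch that pulls a JSON object out of a <script> tag using only the standard library. The variable name `window.__DATA__`, the sample HTML, and the field names are all made up for the example; on a real page you would look at the source to find the actual assignment.

```python
import json
import re

# Hypothetical page source: the variable name and fields are invented for
# illustration and are not taken from any real site.
SAMPLE_HTML = """
<html><body>
<script>window.__DATA__ = {"products": [{"title": "Book A", "price": 199}]};</script>
</body></html>
"""


def extract_embedded_json(html, variable="window.__DATA__"):
    """Return the object assigned to `variable` inside the HTML, or None."""
    match = re.search(re.escape(variable) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here the ';' and the closing tag).
    data, _end = json.JSONDecoder().raw_decode(html, match.end())
    return data


data = extract_embedded_json(SAMPLE_HTML)
print(data["products"][0]["title"])
```

Using `raw_decode` avoids having to cut the string at exactly the right closing brace, which is the fiddly part of "manipulating the text into a JSON format". It only works when the embedded value is strict JSON; a JavaScript object literal with unquoted keys would still need extra cleanup.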

What exactly is the difference between dynamic and static content?

Dynamic means the data is loaded by further requests made after the initial page request (usually by JavaScript). Static means all the data is already in the response to the original request to the site.
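If the follow-up request returns JSON, you can often call it yourself and skip the HTML entirely. The sketch below assumes a hypothetical endpoint whose payload has an "items" list with "title" and "price" keys; the real URL and field names have to be read off the request you find in the dev tools.

```python
import json
from urllib.request import Request, urlopen


def parse_products(payload_text):
    """Turn the JSON text of an XHR response into (title, price) tuples.

    The "items", "title" and "price" keys are assumptions for this example.
    """
    payload = json.loads(payload_text)
    return [(item["title"], item["price"]) for item in payload.get("items", [])]


def fetch_products(api_url):
    """Request the endpoint the page itself calls and parse its JSON body."""
    request = Request(api_url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return parse_products(response.read().decode("utf-8"))


# Offline demonstration with a made-up payload (no network access needed).
sample = '{"items": [{"title": "Book A", "price": 199}]}'
print(parse_products(sample))
```

Keeping the parsing separate from the download makes it easy to test against a response body saved from the Network tab before pointing `fetch_products` at the live endpoint.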

How do I extract other information like price and image from the website? And how do I target particular classes, for example the one holding a price?

Refer to the answer to your first question.
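Once you have HTML that actually contains the data (by any of the options above), pulling out a price or an image is a matter of matching on tag and class. Here is a minimal standard-library sketch; the class names `price` and `cover` and the sample markup are hypothetical, not Amazon's real ones.

```python
from html.parser import HTMLParser

# Made-up markup for illustration; real class names must be read from the page.
SAMPLE_PRODUCT_HTML = (
    '<div class="product">'
    '<img class="cover" src="/img/a.jpg">'
    '<span class="price">199.00</span>'
    "</div>"
)


class ProductParser(HTMLParser):
    """Collect the text of <span class="price"> and the src of <img class="cover">."""

    def __init__(self):
        super().__init__()
        self.prices = []
        self.images = []
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "span" and "price" in classes:
            self._in_price = True
        elif tag == "img" and "cover" in classes:
            self.images.append(attributes.get("src"))

    def handle_data(self, data):
        # Only keep text that appears inside a price span.
        if self._in_price and data.strip():
            self.prices.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False


parser = ProductParser()
parser.feed(SAMPLE_PRODUCT_HTML)
print(parser.prices, parser.images)
```

Inside a Scrapy callback the same idea is usually written with selectors, for example `response.css('span.price::text').getall()` and `response.css('img.cover::attr(src)').getall()`, again with whatever class names the real page uses. Note that this simple parser stops collecting at the first closing span, so nested spans inside the price element would need more bookkeeping.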

How would I know that the data is dynamically created?

You'll know it's dynamically created if you see it in the rendered page in the browser's dev tools, but not in the HTML source returned by your initial request. You can also check whether the data comes from additional requests by opening the dev tools and looking at Network -> XHR.
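That check can be automated roughly like this: download the page without running any JavaScript and see whether a piece of text you can see in the browser is in there. The function names are my own, and the User-Agent header is just a common workaround for sites that reject the default one.

```python
from urllib.request import Request, urlopen


def fetch_raw_html(url):
    """Download the HTML exactly as the server sends it (no JavaScript is run)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def is_in_initial_html(raw_html, visible_text):
    """True if text seen in the rendered page is already in the raw HTML."""
    return visible_text.lower() in raw_html.lower()


# Offline demonstration with two made-up responses.
static_page = "<p>Price: 199</p>"
dynamic_page = "<div id='root'></div><script src='app.js'></script>"
print(is_in_initial_html(static_page, "Price: 199"))
print(is_in_initial_html(dynamic_page, "Price: 199"))
```

If `is_in_initial_html(fetch_raw_html(url), "some text from the page")` comes back False for text you can clearly see in the browser, the content is most likely filled in afterwards by JavaScript.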

Lastly, Amazon does offer an API to access the data. Try looking into that as well.

Author: Srikant Singh

Updated on June 08, 2022

Comments

  • Srikant Singh, almost 2 years ago

    I'm using Scrapy to scrape data from Amazon's books section, but I found out that some of the data is dynamic. I want to know how dynamic data can be extracted from the website. Here's what I've tried so far:

    import scrapy
    from ..items import AmazonsItem
    
    class AmazonSpiderSpider(scrapy.Spider):
        name = 'amazon_spider'
        start_urls = ['https://www.amazon.in/s?k=agatha+christie+books&crid=3MWRDVZPSKVG0&sprefix=agatha%2Caps%2C269&ref=nb_sb_ss_i_1_6']
    
        def parse(self, response):
            items = AmazonsItem()
            # Product titles, read from the data-attribute of each title element
            products_name = response.css('.s-access-title::attr("data-attribute")').extract()
            for product_name in products_name:
                print(product_name)
            # Follow the "next page" link, if there is one
            next_page = response.css('li.a-last a::attr(href)').get()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)
    

    I was using SelectorGadget to pick the class I need to scrape, but it doesn't work on a dynamic website.

    1. So how do I scrape a website which has dynamic content?
    2. What exactly is the difference between dynamic and static content?
    3. How do I extract other information like price and image from the website? And how do I target particular classes, for example the one holding a price?
    4. How would I know that the data is dynamically created?