How to scrape dynamic content from a website?
So how do I scrape a website which has dynamic content?
There are a few options:
- Use Selenium, which lets you drive a real browser, let the page render, and then pull the rendered HTML source.
- Sometimes you can look at the XHR requests in the browser's dev tools and see if you can fetch the data directly (like from an API).
- Sometimes the data is within the <script> tags of the HTML source. You could search through those and use json.loads() once you manipulate the text into a JSON format.
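As a rough illustration of the last option, here is a minimal sketch that pulls a JSON object out of a script tag using only the standard library. The variable name `window.__DATA__`, the sample HTML, and the helper `extract_embedded_json` are all made up for this example; a real page will use its own names, so inspect the page source first.

```python
import json
import re

# Hypothetical page source: the variable name `window.__DATA__` and the
# payload shape are assumptions made for this example only.
SAMPLE_HTML = """
<html><body>
<script>window.__DATA__ = {"products": [{"title": "Book A", "price": 199}]};</script>
</body></html>
"""


def extract_embedded_json(html, var_name="window.__DATA__"):
    """Return the object assigned to `var_name` inside the HTML, or None."""
    # Find the assignment; the match ends right where the JSON value starts.
    match = re.search(re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here the trailing ";</script>").
    decoder = json.JSONDecoder()
    try:
        obj, _end = decoder.raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return obj


data = extract_embedded_json(SAMPLE_HTML)
print(data["products"][0]["title"])
```

Using `raw_decode` avoids having to trim the text by hand, but it only works when the embedded value is strict JSON. A JavaScript object literal with unquoted keys or single quotes would still need the manual cleanup mentioned above.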
What exactly is the difference between dynamic and static content?
Dynamic means the data is generated by a request made after the initial page request (typically by JavaScript running in the browser). Static means all the data is already there in the response to the original call to the site.
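When the data does come from a later request, that request can sometimes be replayed directly. The sketch below assumes the endpoint returns JSON shaped like `sample_payload`; the payload shape and the helper names `fetch_json` and `titles_and_prices` are hypothetical, and the real URL and headers have to be copied from the request you see in the dev tools.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, timeout=10):
    """GET `url` and decode the JSON body, roughly as the page's own XHR call would."""
    request = Request(url, headers={
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",  # some endpoints check for this
        "User-Agent": "Mozilla/5.0",
    })
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def titles_and_prices(payload):
    """Pull (title, price) pairs out of a payload shaped like the sample below."""
    return [(item["title"], item.get("price")) for item in payload.get("items", [])]


# Hypothetical payload standing in for what fetch_json(...) might return.
sample_payload = json.loads(
    '{"items": [{"title": "Book A", "price": 199}, {"title": "Book B"}]}'
)
print(titles_and_prices(sample_payload))
```

`fetch_json` is not called here because the endpoint is site-specific. In practice you would pass it the URL shown under Network -> XHR.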
How do I extract other information like price and image from the website? And how do I get particular classes, for example a price?
Refer to the answer to your first question: the same options apply to any field on the page.
How would I know that data is dynamically created?
You'll know it's dynamically created if you see it in the browser's dev tools (the rendered page), but not in the HTML source returned by your first request. You can also check whether the data is loaded by additional requests by opening the dev tools and looking at Network -> XHR.
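That comparison can be scripted as a quick check: download the raw HTML without running any JavaScript and look for text you can see in the browser. This is only a rough sketch. The helper names and the sample strings are made up, and a plain substring test can miss text that is HTML-escaped or split across tags.

```python
from urllib.request import Request, urlopen


def fetch_raw_html(url, timeout=10):
    """Download the HTML exactly as the server sends it (no JavaScript is run)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def is_in_raw_html(raw_html, visible_text):
    """True if text seen in the browser is already present in the raw HTML."""
    return visible_text.lower() in raw_html.lower()


# Hypothetical responses: one page ships the title in its HTML, the other
# only ships an empty container that a script fills in later.
static_page = "<h2>Murder on the Orient Express</h2>"
dynamic_page = '<div id="results"></div><script src="app.js"></script>'

print(is_in_raw_html(static_page, "Murder on the Orient Express"))
print(is_in_raw_html(dynamic_page, "Murder on the Orient Express"))
```

If the text is missing from the raw HTML but visible in the browser, it was most likely added after the initial request.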
Lastly, Amazon does offer an API to access the data. Try looking into that as well.
Srikant Singh
Updated on June 08, 2022
Comments
- Srikant Singh almost 2 years
So I'm using Scrapy to scrape data from Amazon's books section, but I found out that it has some dynamic data. I want to know how dynamic data can be extracted from the website. Here's what I've tried so far:
    import scrapy
    from ..items import AmazonsItem


    class AmazonSpiderSpider(scrapy.Spider):
        name = 'amazon_spider'
        start_urls = ['https://www.amazon.in/s?k=agatha+christie+books&crid=3MWRDVZPSKVG0&sprefix=agatha%2Caps%2C269&ref=nb_sb_ss_i_1_6']

        def parse(self, response):
            items = AmazonsItem()
            products_name = response.css('.s-access-title::attr("data-attribute")').extract()
            for product_name in products_name:
                print(product_name)
            next_page = response.css('li.a-last a::attr(href)').get()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)
I was using SelectorGadget to pick the class I need to scrape, but on a dynamic website it doesn't work.
- So how do I scrape a website which has dynamic content?
- What exactly is the difference between dynamic and static content?
- How do I extract other information like price and image from the website? And how do I get particular classes, for example a price?
- How would I know that data is dynamically created?