Is it possible to scrape a React website (Instagram) with Cheerio?
In the general case: if the website is SEO-friendly, you can spoof the user agent string of a web crawler. The server then returns a rendered DOM that Cheerio can parse.
In the specific case: Instagram returns a rendered DOM on its mobile web site. Spoof the user agent string of a mobile phone and you can parse the data that is returned.
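var request = require('request');
var cheerio = require('cheerio');

// `user.instagram_url` is defined elsewhere in the app (the profile URL to scrape).
var options = {
  url: user.instagram_url,
  headers: {
    // Spoof a mobile browser so the server returns rendered HTML.
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4'
  }
};

request(options, function(error, response, html) {
  if (!error) {
    console.log('Scraper running on Instagram user page.');
    // Use Cheerio to load the page.
    var $ = cheerio.load(html);
    // Code to parse the DOM here
  }
});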
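For the general case, the same request can be sent with a crawler user agent instead of a mobile one. The sketch below is not part of the original code: `buildOptions` and `looksRendered` are hypothetical helper names, the Googlebot string is one commonly published crawler user agent, and the "rendered" check is only a rough heuristic (it looks for visible text in the body once scripts, styles and tags are stripped). It uses no third-party modules, so it can be run on its own before wiring it into the `request` call above.

```javascript
// Hypothetical helpers for illustration; only the options shape comes from the answer above.
var CRAWLER_UA = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

// Build an options object in the shape the `request` call above expects.
function buildOptions(url, userAgent) {
  return {
    url: url,
    headers: { 'User-Agent': userAgent }
  };
}

// Rough heuristic (an assumption, not a guarantee): treat the page as
// server-rendered if the body still contains visible text after scripts,
// styles and tags are removed. An empty React mount point returns false.
function looksRendered(html) {
  var bodyMatch = /<body[^>]*>([\s\S]*)<\/body>/i.exec(html);
  var body = bodyMatch ? bodyMatch[1] : html;
  var text = body
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<style[\s\S]*?<\/style>/gi, '')
    .replace(/<[^>]+>/g, '')
    .trim();
  return text.length > 0;
}
```

A possible use: call `looksRendered(html)` inside the `request` callback before `cheerio.load(html)`, and retry with a different user agent from `buildOptions` when it returns false. Whether a given site serves rendered HTML to a given user agent can change over time.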
Kyle Chadha
Management consultant turned software engineer. Go, Node.js, GCP, Kubernetes, Big Data (Scala, Apache Beam, Dataflow)
Updated on September 15, 2022
Comments
-
Kyle Chadha about 1 year
I'm trying to scrape Instagram (built with React) with Node.js / Cheerio. Debugging the document shows an object returned, but it doesn't look like the typical response.
I'm guessing this has to do with React. Is there a way to get around this, and pull the rendered DOM to parse with Cheerio? Or am I missing something entirely?
-
Kyle Chadha over 8 years
This is a conceptual question with a binary answer -- thanks for being unhelpful.
-
Kyle Chadha over 8 years
Fair enough. I've posted the answer below. My original code was the same as the code in the answer, minus the User-Agent header. Unfortunately there is no jsFiddle since this is server-side code, and no error message since a response was returned, just not one that Cheerio could parse (React creates a virtual DOM).
-
xmojmr over 8 years
Can you explain "virtual DOM rendered on a mobile web site not parseable by Cheerio"? Some "see also" hyperlink, or some html snippet sample returned from the unspoofed query? Something so that someone else can comprehend what kind of problem you've found and solved? I know what Instagram, node.js, cheerio, html, css, javascript, document object model, search engine optimization and other stuff are, but I still find it hard to imagine what you see when looking at your computer screen...
-
huzefa biyawarwala almost 8 years
@Kyle: I am not able to find an Instagram mobile website that can be opened on my desktop. Please give a link if you have one. Thank you.
-
Kyle Chadha almost 8 years
You have to change your user agent string. You can do so using Chrome browser emulation or in the request options, as I have done above.
-
Tim Malone almost 7 years
@KyleChadha Thanks for posting this. Did you ever manage to take this concept further for cases when the site returns the same React string whether or not you've used a search engine/mobile UA?
-
Tim Malone almost 7 years
@KyleChadha Actually, just found this: stackoverflow.com/questions/29972996/how-to-parse-dom-react
-
Kyle Chadha almost 7 years
@TimMalone Hi Tim, didn't have to in my case. This is about 2 years old though, so things may have changed... here's my code in the off chance it's helpful: github.com/kylechadha/lookbook-scraper/blob/master/app/services/…