Is it possible to scrape a React website (Instagram) with Cheerio?

In the general case -- if the website is SEO-friendly, you can do it by spoofing the user agent string of a web crawler. The site then returns a rendered DOM that Cheerio can parse.

In the specific case -- Instagram returns a rendered DOM on its mobile website. Spoof the user agent string of a mobile phone and you can parse the data that is returned.

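      var request = require('request');
      var cheerio = require('cheerio');

      // Spoof a mobile user agent so Instagram returns a rendered DOM.
      // user.instagram_url is the profile URL taken from the surrounding app.
      var options = {
        url: user.instagram_url,
        headers: {
          'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4'
        }
      };

      request(options, function(error, response, html) {
        if (error) {
          // Network-level failure: nothing to parse.
          console.error('Request failed:', error);
          return;
        }

        console.log('Scraper running on Instagram user page.');

        // Use Cheerio to load the page.
        var $ = cheerio.load(html);

        // Code to parse the DOM here

      });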
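
For the general case, the same request can be sent with a crawler's user agent instead of a phone's. The following is only a sketch: `USER_AGENTS` and `buildOptions` are hypothetical names introduced for illustration, the crawler string is Googlebot's published one, and whether a given site actually serves pre-rendered HTML to it is an assumption you would need to check per site.

```javascript
// Hypothetical lookup table of user agent strings to spoof.
var USER_AGENTS = {
  // Googlebot's published crawler string (general, SEO-friendly case).
  crawler: 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
  // The iPhone string used in the answer above (Instagram-specific case).
  mobile: 'Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4'
};

// Hypothetical helper: returns an options object in the shape the
// `request` library accepts, with the chosen User-Agent header set.
function buildOptions(url, kind) {
  if (!Object.prototype.hasOwnProperty.call(USER_AGENTS, kind)) {
    throw new Error('Unknown user agent kind: ' + kind);
  }
  return {
    url: url,
    headers: { 'User-Agent': USER_AGENTS[kind] }
  };
}
```

The result can be passed wherever `options` is used above, for example `request(buildOptions(user.instagram_url, 'crawler'), callback)`.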

Author: Kyle Chadha

Management consultant turned software engineer. Go, Node.js, GCP, Kubernetes, Big Data (Scala, Apache Beam, Dataflow)

Updated on September 15, 2022

Comments

  • Kyle Chadha
    Kyle Chadha about 1 year

    I'm trying to scrape Instagram (built with React) with Node.js / Cheerio. Debugging the document shows an object returned, but it doesn't look like the typical response.

    I'm guessing this has to do with React. Is there a way to get around this, and pull the rendered DOM to parse with Cheerio? Or am I missing something entirely?

    • Kyle Chadha
      Kyle Chadha over 8 years
      This is a conceptual question with a binary answer -- thanks for being unhelpful.
    • Kyle Chadha
      Kyle Chadha over 8 years
      Fair enough. I've posted the answer below. The code is what is below, minus the User-Agent. Unfortunately no jsFiddle since this is server side code, and no error message since there was a response returned, just not one that was parseable by Cheerio (React creates a virtual DOM).
  • xmojmr
    xmojmr over 8 years
    Can you explain "virtual DOM rendered on a mobile web site not parseable by Cheerio"? Some "see also" hyperlink or some HTML snippet returned from the unspoofed query? Something so that someone else can comprehend what kind of problem you've found and solved? I know what Instagram, Node.js, Cheerio, HTML, CSS, JavaScript, the document object model, search engine optimization and other stuff are, but I still find it hard to imagine what you see when looking at your computer screen...
  • huzefa biyawarwala
    huzefa biyawarwala almost 8 years
    @Kyle: I am not able to find a mobile website for Instagram that can be opened on my desktop. Please give a link if you have one. Thank you.
  • Kyle Chadha
    Kyle Chadha almost 8 years
    You have to change your user agent string. You can do so using Chrome browser emulation or in the request options passed alongside Cheerio, as I have done above.
  • Tim Malone
    Tim Malone almost 7 years
    @KyleChadha Thanks for posting this. Did you ever manage to take this concept further for cases when the site returns the same React string whether or not you've used a search engine/mobile UA?
  • Tim Malone
    Tim Malone almost 7 years
    @KyleChadha Actually, just found this: stackoverflow.com/questions/29972996/how-to-parse-dom-react
  • Kyle Chadha
    Kyle Chadha almost 7 years
    @TimMalone Hi Tim, didn't have to in my case. This is about 2 years old though, so things may have changed... here's my code on the off chance it's helpful: github.com/kylechadha/lookbook-scraper/blob/master/app/services/…