How to deal with the captcha when doing Web Scraping in Puppeteer?

29,367

Solution 1

This is a reCAPTCHA (version 2, check out demos here). It is shown to you because the owner of the page does not want you to crawl the page automatically.

Your options are the following:

Option 1: Stop crawling or try to use an official API

As the owner of the page does not want you to crawl that page, you could simply respect that decision and stop crawling. Maybe there is a documented API that you can use.

Option 2: Automate/Outsource the captcha solving

There is an entire industry which has people (often in developing countries) filling out captchas for other people's bots. I will not link to any particular site, but you can check out the other answer from Md. Abu Taher for more information on the topic or search for captcha solver.

Option 3: Solve the captcha yourself

For this, let me explain how reCAPTCHA works and what happens when you visit a page using it.


How reCAPTCHA (v2) works

Each site using reCAPTCHA has an ID (the site key), which you can find in the page's source code. Example:

<div class="g-recaptcha form-field" data-sitekey="ID_OF_THE_WEBSITE_LONG_RANDOM_STRING"></div>

When the reCAPTCHA code is loaded it will add a response textarea to the form with no value. It will look like this:

<textarea id="g-recaptcha-response" name="g-recaptcha-response" class="g-recaptcha-response" style="... display: none;"></textarea>

After you solve the challenge, reCAPTCHA writes a very long string into this textarea. When the form is submitted, that string is sent along and can then be checked by the server against the reCAPTCHA service in the backend.
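To make the two elements above more concrete, here is a minimal sketch that reads the site key and checks whether the response textarea has been filled. The function name `recaptchaState` is made up for this example; it only relies on the markup shown above and takes the page's `document` as its argument.

```javascript
// Hypothetical helper (not part of reCAPTCHA or Puppeteer).
// `doc` is assumed to be the page's `document`.
function recaptchaState(doc) {
  const widget = doc.querySelector('.g-recaptcha');
  const response = doc.querySelector('#g-recaptcha-response');
  return {
    // true if the reCAPTCHA container is present on the page
    present: widget !== null,
    // value of the data-sitekey attribute, or null if there is no widget
    siteKey: widget ? widget.getAttribute('data-sitekey') : null,
    // the textarea stays empty until the challenge has been solved
    solved: response !== null && response.value.length > 0
  };
}
```

Pasted into the browser console, `recaptchaState(document)` should report `solved: false` before you tick the checkbox and `solved: true` afterwards, assuming the site uses the default element IDs.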


How to solve the captcha yourself

By copying the value of the textarea field, you can transfer the "solved challenge" from one browser to another (this is also what the solving services do for you). The full process looks like this:

  1. Detect if the page uses reCAPTCHA (e.g. check for .g-recaptcha) in the "crawling" browser
  2. Open a second browser in non-headless mode with the same URL
  3. Solve the captcha yourself
  4. Read the value from: document.querySelector('#g-recaptcha-response').value
  5. Put that value into the first browser: document.querySelector('#g-recaptcha-response').value = '...'
  6. Submit the form
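The steps above could be sketched in Puppeteer roughly as follows. All helper names are invented for this example. `crawlerPage` and `solverPage` are assumed to be Puppeteer `Page` objects from two separately launched browsers (the second one started with `headless: false`), and `waitForHuman` is an assumed callback that resolves once you have solved the captcha by hand.

```javascript
// Hypothetical helpers; each `page` is assumed to be a Puppeteer Page.

// Step 1: detect whether the page contains a reCAPTCHA widget.
async function hasRecaptcha(page) {
  return (await page.$('.g-recaptcha')) !== null;
}

// Step 4: read the token from the browser in which the captcha was solved.
async function readToken(page) {
  return page.$eval('#g-recaptcha-response', el => el.value);
}

// Step 5: write the token into the crawling browser.
async function injectToken(page, token) {
  await page.$eval(
    '#g-recaptcha-response',
    (el, value) => { el.value = value; },
    token
  );
}

// Steps 1-5 combined. Returns false if no reCAPTCHA was detected.
async function transferToken(crawlerPage, solverPage, waitForHuman) {
  if (!(await hasRecaptcha(crawlerPage))) return false;
  await solverPage.goto(crawlerPage.url());   // step 2
  await waitForHuman(solverPage);             // step 3
  await injectToken(crawlerPage, await readToken(solverPage));
  return true; // step 6 (submitting the form) is site-specific
}
```

One possible `waitForHuman` is `page => page.waitForFunction(() => document.querySelector('#g-recaptcha-response').value !== '', { timeout: 0 })`, which polls until the textarea is filled. Whether the transferred token is accepted depends on the site, so treat this as a starting point rather than a guaranteed bypass.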

Further information/reading

There is not much public information from Google about how exactly reCAPTCHA works, as this is a cat-and-mouse game between bot creators and Google's detection algorithms, but there are some resources online with more information:

  • Official docs from Google: Obviously, they just explain the basics and not how it works "in the back"
  • InsideReCaptcha: This is a project from 2014 which tries to "reverse-engineer" reCAPTCHA. Although this is quite old, there is still a lot of useful information on the page.
  • Another question on Stack Overflow: This question contains some useful information about reCAPTCHA, but also many speculative (and very likely outdated) approaches on how to fool a reCAPTCHA.

Solution 2

You should use a combination of the following:

  • Use an API if the target website provides one. It's the most legitimate way.
  • Increase the wait time between scraping requests; do not send mass requests to the server.
  • Change/rotate your IP address frequently.
  • Change the user agent, browser viewport size, and fingerprint.
  • Use third-party solutions for captchas.
  • Solve the captcha yourself; see the answer by Thomas Dondorf. Basically, you wait for the captcha to appear in another browser and solve it there. Third-party solutions do this for you.
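The throttling and rotation items in this list could look roughly like the sketch below. All function names are hypothetical, `page` is assumed to be a Puppeteer `Page`, and the user agent and viewport lists have to be supplied by you; the default delays are arbitrary placeholders.

```javascript
// Hypothetical helpers; `page` is assumed to be a Puppeteer Page.

// Resolve after a random delay between minMs and maxMs milliseconds.
function randomDelay(minMs, maxMs) {
  const ms = minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Pick a random element from a non-empty array.
function pickRandom(items) {
  return items[Math.floor(Math.random() * items.length)];
}

// Apply a randomly chosen user agent and viewport before navigating.
// Both lists are provided by the caller; none are bundled here.
async function applyRandomProfile(page, userAgents, viewports) {
  await page.setUserAgent(pickRandom(userAgents));
  await page.setViewport(pickRandom(viewports));
}

// Visit the URLs one by one with a random pause after each request.
async function politeCrawl(page, urls, handle, minMs = 2000, maxMs = 5000) {
  for (const url of urls) {
    await page.goto(url);
    await handle(page, url);
    await randomDelay(minMs, maxMs);
  }
}
```

A call such as `await politeCrawl(page, urls, async p => console.log(await p.title()))` would then process the pages sequentially instead of firing all requests at once. This does not address IP rotation or deeper fingerprinting.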

Disclaimer: Do not use anti-captcha plugins/services to misuse resources. Resources are expensive.


The basic idea is to use an anti-captcha service (such as 2captcha) to deal with persistent reCAPTCHAs.

You can use the plugin puppeteer-extra-plugin-recaptcha by berstend.

// puppeteer-extra is a drop-in replacement for puppeteer,
// it augments the installed puppeteer with plugin functionality
const puppeteer = require('puppeteer-extra')

// add recaptcha plugin and provide it your 2captcha token
// 2captcha is the builtin solution provider but others work as well.
const RecaptchaPlugin = require('puppeteer-extra-plugin-recaptcha')
puppeteer.use(
  RecaptchaPlugin({
    provider: { id: '2captcha', token: 'XXXXXXX' }, // your 2captcha API key, not the site's data-sitekey
    visualFeedback: true // colorize reCAPTCHAs (violet = detected, green = solved)
  })
)

Afterwards you can run the browser as usual. The plugin will pick up any captcha on the page and attempt to solve it. You still have to find the submit button yourself; it varies from site to site and may not exist at all.

// puppeteer usage as normal
puppeteer.launch({ headless: true }).then(async browser => {
  const page = await browser.newPage()
  await page.goto('https://www.google.com/recaptcha/api2/demo')

  // That's it, a single line of code to solve reCAPTCHAs 🎉
  await page.solveRecaptchas()

  await Promise.all([
    page.waitForNavigation(),
    page.click(`#recaptcha-demo-submit`)
  ])
  await page.screenshot({ path: 'response.png', fullPage: true })
  await browser.close()
})

PS:

  • There are other plugins; I even made a very simple one because captchas are getting harder to solve, even for a human like me. You can read the code here.
  • I am not affiliated in any way with 2Captcha or any other third-party service mentioned above.
  • I created my own solution, similar to the one in the other answer by Thomas Dondorf, but soon gave up because captchas are getting more ridiculous and I do not have the mental energy to solve them.

Solution 3

Proxy servers can be used so that the destination site does not see a large number of requests coming from a single IP address.

(Translated with Google Translate)
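As a minimal sketch of the proxy idea with Puppeteer: Chromium accepts a `--proxy-server` command-line switch, which can be passed through the `args` launch option. The helper names below are hypothetical, and the proxy addresses are placeholders that you would get from your proxy provider.

```javascript
// Hypothetical helper: builds an options object for puppeteer.launch().
// Existing options in `extra` are kept; the proxy switch is appended to args.
function proxyLaunchOptions(proxyServer, extra = {}) {
  return {
    ...extra,
    args: [...(extra.args || []), `--proxy-server=${proxyServer}`]
  };
}

// Hypothetical helper: returns a function that cycles through the given
// proxies in round-robin order, e.g. to switch after a captcha appears.
function createProxyRotator(proxies) {
  let index = 0;
  return () => proxies[index++ % proxies.length];
}
```

With these, a launch could look like `puppeteer.launch(proxyLaunchOptions(nextProxy(), { headless: true }))`, where `nextProxy` comes from `createProxyRotator([...])`. If the proxy requires credentials, Puppeteer's `page.authenticate({ username, password })` can supply them.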

Solution 4

I tried @Thomas Dondorf's suggestion, but I think the problem with the steps described in the "How to solve the captcha yourself" section is that the CAPTCHA token is valid only once. I'll try to explain everything in detail below.

WHAT I'M USING

As the first browser (the one that will not solve the captcha) I'm using Google Chrome, and as the second browser (the one where I solve the captcha and take the token) I'm using Firefox.

STEPS

  1. I manually solve the captcha on this site: https://recaptcha-demo.appspot.com/recaptcha-v2-checkbox.php
  2. I type document.querySelector('#g-recaptcha-response').value in the Google Chrome console, but I get an error (VM22:1 Uncaught TypeError: Cannot read property 'value' of null at :1:48), so I find the token by opening Elements in Google Chrome and searching for g-recaptcha-response with CTRL+F.
  3. I copy the reCAPTCHA token (image: the part of the page's source code that contains the token, after the text highlighted in green).
  4. I type document.querySelector('#g-recaptcha-response').value = '...' in the Firefox console, replacing the "..." with the reCAPTCHA token I just copied.
  5. I get an error (image: error). If you then follow the linked documentation, you'll read that the error occurs because a token can be used only once, and it has of course already been used for the CAPTCHA you just solved to obtain the token itself. So it seems that the token's only purpose is to state that the CAPTCHA has already been solved; it looks like a defense measure against replay attacks, as described in the official reCAPTCHA documentation.
Admin
Author by

Admin

Updated on July 24, 2021

Comments

  • Admin
    Admin almost 3 years

    I'm using Puppeteer for Web Scraping and I have just noticed that sometimes, the website I'm trying to scrape asks for a captcha due to the amount of visits I'm doing from my computer. The captcha form looks like this one:

    captcha

    So, I would need help about how to handle this. I have been thinking about sending the captcha form to the client-side since I use Express and EJS in order to send the values to my index website, but I don't know if Puppeteer can send something like that.

    Any ideas?

  • Admin
    Admin about 5 years
    Sadly, the target website doesn't provide an API to use. I have already searched for it... I have tested your code, but it seems that something is wrong. It doesn't solve the captcha: a red border appears around it and it asks me to prove that I'm not a bot: i.imgur.com/jIVPvuE.png. Is it because my browser uses a language other than English?
  • Admin
    Admin about 5 years
    Also, what should I put in token? I have replaced the XXX with the data-sitekey value. Is that correct?
  • Md. Abu Taher
    Md. Abu Taher about 5 years
    No, you buy credits from 2captcha and use their API key as the token (I'm not affiliated with them). If you do not want to spend money, the only other way is to solve the captcha yourself, which I did not add to my answer, but Thomas Dondorf covered it in another answer. Someone has to solve the captcha, you or other people. :D
  • Sumeet
    Sumeet almost 5 years
    @ThomasDondorf Can you please explain to me how third-party captcha solving works when they open the page containing the captcha with another IP and browser? Doesn't Google track the IP and browser on which the captcha is being solved? And how is it possible to use a response that was solved using another browser, IP, and location?
  • Thomas Dondorf
    Thomas Dondorf almost 5 years
    @sumeet AFAIK the captcha is not bound to the browser or IP address. If you solve the captcha, you can pass, no matter how shady your browser "fingerprint" might be.
  • Boris Verkhovskiy
    Boris Verkhovskiy over 2 years
    If the site lets you make at least one request before showing the CAPTCHA, there is also a 4th option of using a provider of proxies with "residential" IP addresses. They give you a bunch of IP addresses; you make requests from one IP address until you get the CAPTCHA, then switch to a new IP address.