Python's requests triggers Cloudflare's security while urllib does not

python python-3.x web-scraping python-requests

11,674

Solution 1

After some debugging, and thanks to the answers of @TuanGeek, we've found that the issue with the requests library seems to come from a DNS issue on requests' part when dealing with Cloudflare. A simple fix is to connect directly to the host IP, as such:

import requests
from collections import OrderedDict
from requests import Session
import socket

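# Resolve the host with socket.getaddrinfo; asking for IPv4 only keeps the
# sockaddr a 2-tuple (IPv6 results carry four fields and would not unpack here)
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443, family=socket.AF_INET, type=socket.SOCK_STREAM)
(family, socktype, proto, canonname, (address, port)) = answers[0]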

s = Session()
headers = OrderedDict({
    'Accept-Encoding': 'gzip, deflate, br',
    'Host': "grimaldis.myguestaccount.com",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'
})
s.headers = headers
response = s.get(f"https://{address}/guest/accountlogin", headers=headers, verify=False).text
print(response)

Now, this fix didn't work with the HTTP library HTTPX. However, I've found where the issue stems from.

The issue comes from the h11 library (used by HTTPX to handle HTTP/1.1 requests). While urllib automatically fixes the letter case of headers, h11 takes a different approach and lowercases every header name. In theory this shouldn't cause any issues, as servers should handle header names in a case-insensitive manner (and in a lot of cases they do), but the reality is that HTTP is Hard™️ and services such as Cloudflare don't respect RFC 2616 and require headers to be properly capitalized.
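To make the difference concrete, here is a minimal sketch of the two normalisations. The function names are hypothetical helpers, not part of h11 or urllib; they only imitate the behaviour described above with plain string methods, so treat them as an illustration rather than either library's actual code.

```python
def lowercase_header_names(headers):
    """Imitate the lowercasing described for h11: 'User-Agent' -> 'user-agent'."""
    return {name.lower(): value for name, value in headers.items()}


def title_case_header_names(headers):
    """Imitate title-casing: each hyphen-separated word gets a capital letter."""
    return {name.title(): value for name, value in headers.items()}


# Example header set (values shortened for readability)
sent = {"user-agent": "Mozilla/5.0", "dnt": "1"}
print(lowercase_header_names(sent))
print(title_case_header_names(sent))
```

A case-insensitive server treats both outputs identically; a service that inspects the raw bytes can tell them apart.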

Discussions about capitalization have been going on for a while over at h11:

https://github.com/python-hyper/h11/issues/31

And have "recently" started to pop up over on HTTPX's repo as well:

https://github.com/encode/httpx/issues/538

https://github.com/encode/httpx/issues/728

Now the unsatisfactory answer to the issue between Cloudflare and HTTPX is that until something is done over on h11's side (or until Cloudflare miraculously starts respecting RFC 2616), not much can be changed about how HTTPX and Cloudflare handle header capitalization.

Either use a different HTTP library such as aiohttp or requests-futures, try forking h11 and patching the header capitalization yourself, or wait and hope for the issue to be dealt with properly by the h11 team.

Solution 2

This really piqued my interest. Here is the requests solution that I was able to get working.

Solution

I finally narrowed down the problem. When you use requests, it uses a urllib3 connection pool. There seems to be some inconsistency between a regular urllib3 connection and a connection pool. A working solution:

import requests
from collections import OrderedDict
from requests import Session
import socket

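# Resolve the host with socket.getaddrinfo; asking for IPv4 only keeps the
# sockaddr a 2-tuple (IPv6 results carry four fields and would not unpack here)
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443, family=socket.AF_INET, type=socket.SOCK_STREAM)
(family, socktype, proto, canonname, (address, port)) = answers[0]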

s = Session()
headers = OrderedDict({
    'Accept-Encoding': 'gzip, deflate, br',
    'Host': "grimaldis.myguestaccount.com",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'
})
s.headers = headers
response = s.get(f"https://{address}/guest/accountlogin", headers=headers, verify=False).text
print(response)
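When resolving the address with socket.getaddrinfo, note that IPv6 results carry a 4-tuple sockaddr, so a 2-tuple unpacking only works for IPv4 results. Below is a minimal sketch of a helper that asks for IPv4 only. The name resolve_ipv4 is hypothetical, and the example resolves a numeric loopback address purely so that it runs without a DNS lookup.

```python
import socket


def resolve_ipv4(host, port=443):
    """Return the first IPv4 address found for host (hypothetical helper)."""
    answers = socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # For AF_INET, each sockaddr is an (address, port) 2-tuple.
    _family, _socktype, _proto, _canonname, (address, _port) = answers[0]
    return address


# A numeric host needs no DNS query, which keeps the example self-contained.
print(resolve_ipv4("127.0.0.1"))
```

In the solution above, the returned address would take the place of the address variable. An IPv6 address would additionally need square brackets inside the URL.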

Technical Background

So I ran both methods through Burp Suite to compare the requests. Below are the raw dumps of the requests.

using requests

GET /guest/accountlogin HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
Accept-Encoding: gzip, deflate
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Connection: close
Host: grimaldis.myguestaccount.com
Accept-Language: en-GB,en;q=0.5
Upgrade-Insecure-Requests: 1
dnt: 1

using urllib

GET /guest/accountlogin HTTP/1.1
Host: grimaldis.myguestaccount.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: en-GB,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: close
Upgrade-Insecure-Requests: 1
Dnt: 1

The difference is the ordering of the headers. The difference in the dnt capitalization is not actually the problem.

So I was able to make a successful request with the following raw request:

GET /guest/accountlogin HTTP/1.1
Host: grimaldis.myguestaccount.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0

So the Host header has to be sent above User-Agent. If you want to continue to use requests, consider using an OrderedDict to ensure the ordering of the headers.
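To illustrate why ordering is visible to the server at all, here is a small sketch that serialises a request by hand from an ordered list of headers. build_raw_request is a hypothetical helper written for this illustration, not something requests or urllib exposes; it simply reproduces the minimal raw request shown above.

```python
def build_raw_request(path, headers):
    """Serialise a GET request, keeping headers in the order given.

    Hypothetical helper for illustration only.
    """
    lines = [f"GET {path} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers]
    # A blank line terminates the header section.
    return "\r\n".join(lines) + "\r\n\r\n"


ordered_headers = [
    ("Host", "grimaldis.myguestaccount.com"),
    ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0"),
]
raw = build_raw_request("/guest/accountlogin", ordered_headers)
print(raw)
```

Swapping the two tuples yields a request that is semantically the same but differs byte for byte, which is the kind of difference the Burp Suite comparison exposed.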

Solution 3

I encountered the same issue when scraping an ecommerce website (guess dot com). Changing the header order didn't fix it for me. My conclusion: apparently, Cloudflare analyzes the TLS fingerprint of the request and returns a 403 (error 1020) when the fingerprint matches node.js, Python, or curl, which are commonly used for scraping.

The solution is to emulate the fingerprint of a popular browser, and the most obvious way would be to use Puppeteer.js with the puppeteer-extra stealth plugin. But since Puppeteer was not fast enough for my use case (to put it mildly: Puppeteer is insane in terms of resources and sluggishness), I had to build a utility which uses BoringSSL (the SSL library used by Chrome). And since compiling C/C++ code and figuring out the cryptic compilation errors of some TLS library is no fun for most web devs, I wrapped it as an API server, which you can try here: https://rapidapi.com/restyler/api/scrapeninja

Read more on how Cloudflare analyzes TLS: https://blog.cloudflare.com/monsters-in-the-middleboxes/
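As a rough way to see one ingredient of such a fingerprint, the sketch below lists the cipher suites that Python's default TLS context is configured with, using only the standard ssl module. This is an assumption-laden illustration: a real fingerprint typically also depends on the TLS version and extensions offered, and the list printed here depends on the local OpenSSL build.

```python
import ssl

# The default client context is what most Python HTTP libraries start from.
context = ssl.create_default_context()

# get_ciphers() returns one dict per enabled cipher suite; keep only the names.
cipher_names = [cipher["name"] for cipher in context.get_ciphers()]

print(len(cipher_names), "cipher suites enabled; the first few are:")
for name in cipher_names[:5]:
    print(" ", name)
```

A browser enables a different set in a different order, which is one reason a server can distinguish the two clients before any HTTP header is sent.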

Author by

Tom

Updated on June 03, 2022

Comments

Tom almost 2 years

I'm working on an automated web scraper for a restaurant website, but I'm having an issue. The website uses Cloudflare's anti-bot security, which I would like to bypass: not the Under-Attack-Mode, but a captcha test that only triggers when it detects a non-American IP or a bot. I'm trying to bypass it, as Cloudflare's security doesn't trigger when I clear cookies, disable JavaScript, or use an American proxy.

Knowing this, I tried using python's requests library as such:

import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'}
response = requests.get("https://grimaldis.myguestaccount.com/guest/accountlogin", headers=headers).text
print(response)

But this ends up triggering Cloudflare, no matter the proxy I use.

HOWEVER when using urllib.request with the same headers as such:

import urllib.request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'}
request = urllib.request.Request("https://grimaldis.myguestaccount.com/guest/accountlogin", headers=headers)
r = urllib.request.urlopen(request).read()
print(r.decode('utf-8'))

When run with the same American IP, this time it does not trigger Cloudflare's security, even though it uses the same headers and IP used with the requests library.

So I'm trying to figure out what exactly is triggering Cloudflare in the requests library that isn't in the urllib library.

While the typical answer would be "Just use urllib then", I'd like to figure out what exactly is different with requests and how I could fix it: first, to understand how requests works and how Cloudflare detects bots, but also so that I can apply any fix I find to other HTTP libraries (notably asynchronous ones).

EDIT N°2: Progress so far:

Thanks to @TuanGeek we can now bypass the Cloudflare block using requests as long as we connect directly to the host IP rather than the domain name (for some reason, the DNS redirection with requests triggers Cloudflare, but urllib doesn't):

import requests
from collections import OrderedDict
import socket

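# Resolve the host with socket.getaddrinfo; asking for IPv4 only keeps the
# sockaddr a 2-tuple (IPv6 results carry four fields and would not unpack here)
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443, family=socket.AF_INET, type=socket.SOCK_STREAM)
(family, socktype, proto, canonname, (address, port)) = answers[0]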
headers = OrderedDict({
    'Host': "grimaldis.myguestaccount.com",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
})
s = requests.Session()
s.headers = headers
response = s.get(f"https://{address}/guest/accountlogin", verify=False).text

Note: trying to access via HTTP (rather than HTTPS with the verify argument set to False) will trigger Cloudflare's block.

Now this is great, but unfortunately my final goal of making this work asynchronously with the HTTP library HTTPX still isn't met. With the following code, the Cloudflare block is still triggered even though we're connecting directly through the host IP, with proper headers, and with verify set to False:

import trio
import httpx
import socket
from collections import OrderedDict
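# Ask for IPv4 only so the sockaddr unpacks as an (address, port) 2-tuple
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443, family=socket.AF_INET, type=socket.SOCK_STREAM)
(family, socktype, proto, canonname, (address, port)) = answers[0]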
headers = OrderedDict({
    'Host': "grimaldis.myguestaccount.com",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
})
async def asks_worker():
    async with httpx.AsyncClient(headers=headers, verify=False) as s:
        r = await s.get(f'https://{address}/guest/accountlogin')
        print(r.text)
async def run_task():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(asks_worker)
trio.run(run_task)

EDIT N°1: For additional details, here's the raw HTTP request from urllib and requests

REQUESTS:

send: b'GET /guest/nologin/account-balance HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: grimaldis.myguestaccount.com\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 403 Forbidden\r\n'
header: Date: Thu, 02 Jul 2020 20:20:06 GMT
header: Content-Type: text/html; charset=UTF-8
header: Transfer-Encoding: chunked
header: Connection: close
header: CF-Chl-Bypass: 1
header: Set-Cookie: __cfduid=df8902e0b19c21b364f3bf33e0b1ce1981593721256; expires=Sat, 01-Aug-20 20:20:06 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Cache-Control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
header: Expires: Thu, 01 Jan 1970 00:00:01 GMT
header: X-Frame-Options: SAMEORIGIN
header: cf-request-id: 03b2c8d09300000ca181928200000001
header: Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
header: Set-Cookie: __cfduid=df8962e1b27c25b364f3bf66e8b1ce1981593723206; expires=Sat, 01-Aug-20 20:20:06 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Vary: Accept-Encoding
header: Server: cloudflare
header: CF-RAY: 5acb25c75c981ca1-EWR

URLLIB:

send: b'GET /guest/nologin/account-balance HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: grimaldis.myguestaccount.com\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Thu, 02 Jul 2020 20:20:01 GMT
header: Content-Type: text/html;charset=utf-8
header: Transfer-Encoding: chunked
header: Connection: close
header: Set-Cookie: __cfduid=db9de9687b6c22e6c12b33250a0ded3251292457801; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Expires: Thu, 2 Jul 2020 20:20:01 GMT
header: Cache-Control: no-cache, private, no-store
header: X-Powered-By: Undertow/1
header: Pragma: no-cache
header: X-Frame-Options: SAMEORIGIN
header: Content-Security-Policy: script-src 'self' 'unsafe-inline' 'unsafe-eval' https://www.google-analytics.com https://www.google-analytics.com/analytics.js https://use.typekit.net connect.facebook.net/ https://googleads.g.doubleclick.net/ app.pendo.io cdn.pendo.io pendo-static-6351154740266000.storage.googleapis.com pendo-io-static.storage.googleapis.com https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://www.google.com/recaptcha/api.js apis.google.com https://www.googletagmanager.com api.instagram.com https://app-rsrc.getbee.io/plugin/BeePlugin.js https://loader.getbee.io api.instagram.com https://bat.bing.com/bat.js https://www.googleadservices.com/pagead/conversion.js https://connect.facebook.net/en_US/fbevents.js  https://connect.facebook.net/ https://fonts.googleapis.com/ https://ssl.gstatic.com/ https://tagmanager.google.com/;style-src 'unsafe-inline' *;img-src * data:;connect-src 'self' app.pendo.io api.feedback.us.pendo.io; frame-ancestors 'self' app.pendo.io pxsweb.com *.pxsweb.com;frame-src 'self' *.myguestaccount.com https://app.getbee.io/ *;
header: X-Lift-Version: Unknown Lift Version
header: CF-Cache-Status: DYNAMIC
header: cf-request-id: 01b2c5b1fa00002654a25485710000001
header: Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
header: Set-Cookie: __cfduid=db9de811004e591f9a12b66980a5dde331592650101; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Set-Cookie: __cfduid=db9de811004e591f9a12b66980a5dde331592650101; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Set-Cookie: __cfduid=db9de811004e591f9a12b66980a5dde331592650101; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Server: cloudflare
header: CF-RAY: 5acb58a62c5b5144-EWR
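Both dumps report Server: cloudflare and differ mainly in the status line, so a script can tell the two outcomes apart from the status code and headers alone. The following is a minimal sketch of such a check. looks_like_cloudflare_block is a hypothetical helper, and the rule it applies (403 plus a Cloudflare Server header) is only a heuristic assumed from the dumps above, since an origin server could also return a 403 of its own.

```python
def looks_like_cloudflare_block(status_code, headers):
    """Heuristic: treat a 403 answered by Cloudflare as a block page."""
    # Header names are case-insensitive, so normalise them before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    served_by_cloudflare = normalised.get("server", "").lower() == "cloudflare"
    return status_code == 403 and served_by_cloudflare


# Values taken from the two dumps above (only the relevant headers are kept).
blocked = looks_like_cloudflare_block(403, {"Server": "cloudflare", "CF-Chl-Bypass": "1"})
allowed = looks_like_cloudflare_block(200, {"Server": "cloudflare", "X-Powered-By": "Undertow/1"})
print(blocked, allowed)
```

Checking the status code this way avoids mistaking a challenge page for a successful response just because some HTML came back.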

Tom almost 4 years

But how would you go about fixing this? Even with the capitalized Dnt and reorganized headers, requests still triggers Cloudflare's antibot. What's more, with a bit of testing I found that urllib is still able to bypass Cloudflare's detection with just two headers: headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0', 'Accept-Encoding': 'gzip, deflate, br'}. So I'm still guessing there has to be something else going on. EDIT: Actually, it only requires the User-Agent header to bypass it.
TuanGeek almost 4 years

The ordering of the headers matters. I've added the exact solution using requests.
Tom almost 4 years

Your answer still does not work though. While the request goes through, try checking the status code of your solution code, or run the html response through an HTML viewer. The requests goes through, but it still returns a 403 Forbidden response with the captcha challenge.
TuanGeek almost 4 years

Wow, that is weird. I ran the code yesterday and it worked. Back to the drawing board! I wonder if running the request through Burp Suite is affecting it.
TuanGeek almost 4 years

Yeah. Just double-checked. When I run the code through Burp Suite it works, but if I run it without Burp Suite it fails. Which is weird, because Burp Suite should not be modifying the request at all.
TuanGeek almost 4 years

Okay. Updated the solution. Cloudflare seems to be causing issues for requests' DNS queries. I will have to dig into why requests is failing with DNS queries. But the workaround is using socket to grab the IP address and using that address in the request.
Tom almost 4 years

You're right, it does work when using the direct IP. Unfortunately it doesn't seem to work with other HTTP libraries (such as HTTPX, which raises a ConnectionError when trying to connect through the IP), though at least we have a clue to work on.
SilverlightFox over 2 years

I tried (what I thought was) this and got some weird problem where sometimes the headers were out of order. My problem was that I was passing headers to requests.get rather than using Session(). You must use Session()!
winwin almost 2 years

The capitalization trick worked. I laughed hard at it, but all that was required is 'User-Agent' instead of 'user-agent'.