How to bypass Incapsula with Python

Solution 1

First of all, there is no foolproof solution to such problems. Even as a real user, I end up having to solve captchas while answering on StackOverflow, which means a bot will definitely get captchas.

That said, there are a few rules I try to follow to decrease the chances of a captcha:

  • Never use shared proxies for such projects. Using TOR is a big NO.
  • Use Chrome + Selenium + Proxy.
  • Use Chrome with an existing profile. I prefer profiles that have browsing history on different websites, plus cookies and trackers from many other sites, going back months. You don't know how the user/bot evaluation is made, so you want to look as much like a real user as possible.
  • Never scrape at fast rates. Use as many delays as possible, and make them as random as possible.
  • Always use a visible browser and keep monitoring for captchas. When a captcha appears, solve it manually or use DeathByCaptcha or a similar service. Try not to abandon captcha pages, as doing so may raise the bot-probability score assigned to you.
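
The advice above about slow, randomized pacing can be sketched as follows. This is only a minimal illustration and not part of the original answer: the names `random_delay` and `paced`, and the 5 to 20 second default bounds, are assumptions chosen for the example.

```python
import random
import time


def random_delay(min_s=5.0, max_s=20.0, rng=None):
    """Pick a delay in seconds, uniformly between min_s and max_s.

    The default bounds are assumed values, not recommendations from the answer.
    """
    if min_s < 0 or max_s < min_s:
        raise ValueError("need 0 <= min_s <= max_s")
    return (rng or random).uniform(min_s, max_s)


def paced(items, min_s=5.0, max_s=20.0, sleep=time.sleep):
    """Yield items one at a time, sleeping a random interval between them."""
    for index, item in enumerate(items):
        if index:  # no wait before the very first item
            sleep(random_delay(min_s, max_s))
        yield item
```

A loop such as `for url in paced(urls): ...` would then visit each URL with an irregular pause in between. The `sleep` parameter exists only so the pacing can be exercised without actually waiting.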

This is a cat-and-mouse game in which you don't know what defenses the other party has, so you try to play nice and easy.

Solution 2

This is not the best answer, but it gives some points to help understand why web scraping is not that easy, especially when a CDN sits in front of the site.

First, it may be good to check what you will be fighting against: WAF & Bot Mitigation.

Then, for more ideas, this is a good talk: How Attackers Circumvent CDNs to Attack Origin

This doesn't mean web scraping is impossible. The problem reduces to time/speed: the faster you go, the higher the chances that you trigger captchas and, in the worst case, get fully blocked.
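
The trade-off between speed and getting blocked can be sketched as a simple back-off rule. This is a hedged illustration, not a method given in the answer: the marker string is taken from the block page quoted in the question comment ("Incapsula incident ID"), and treating it as a reliable signal is an assumption, as are the names `looks_blocked` and `next_delay` and the 10 second and 600 second limits.

```python
# Assumption: block pages contain this text, as in the page quoted by the asker.
BLOCK_MARKER = "Incapsula incident ID"


def looks_blocked(html):
    """Return True if the page body looks like a block/captcha page."""
    return BLOCK_MARKER in html


def next_delay(current_s, blocked, base_s=10.0, max_s=600.0):
    """Double the wait after a block (capped at max_s); otherwise halve it,
    never going below base_s. All limits are illustrative values."""
    if blocked:
        return min(max(current_s, base_s) * 2, max_s)
    return max(base_s, current_s / 2)
```

For example, a scraper that starts with a 10 second wait and receives a block page would wait 20 seconds before the next request, then 40, and so on up to the cap.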

There are multiple approaches, such as using a different IP per request (see: Make requests using Python over Tor), changing the user agent, etc. But most of them are bound by a set of defined timeouts and query patterns that you may need to discover.

Author: parik

Updated on October 23, 2022

Comments

  • parik
    parik over 1 year

    I use Scrapy and I am trying to scrape this site, which uses Incapsula:

    <meta name="robots" content="noindex,nofollow">
    <script src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3">
    </script>
    

    I already asked a question about this issue 2 years ago, but that method (Incapsula-Cracker) no longer works.

    I tried to understand how Incapsula works, and I tried this to bypass it:

    import re
    from scrapy import Request

    # Methods of the spider class
    def start_requests(self):
        yield Request('https://courses-en-ligne.carrefour.fr', cookies={'store': 92}, dont_filter=True, callback=self.init_shop)

    def init_shop(self, response):
        # The first response embeds a hex-encoded script in `var b="..."`
        RE_ENCODED_FUNCTION = re.compile(r'var b="(.*?)"', re.DOTALL)
        RE_INCAPSULA        = re.compile(r'(_Incapsula_Resource\?SWHANEDL=.*?)"')
        INCAPSULA_URL       = 'https://courses-en-ligne.carrefour.fr/%s'
        encoded_func        = RE_ENCODED_FUNCTION.search(response.text).group(1)
        # Decode two hex digits at a time into characters
        decoded_func        = ''.join(chr(int(encoded_func[i:i+2], 16)) for i in range(0, len(encoded_func), 2))
        incapsula_params    = RE_INCAPSULA.search(decoded_func).group(1)
        incap_url           = INCAPSULA_URL % incapsula_params
        # No callback given, so Scrapy passes the response to parse()
        yield Request(incap_url)

    def parse(self, response):
        print(response.body)
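
The hex-decoding step in the snippet above can be tried in isolation. The following is a minimal sketch under stated assumptions: `decode_hex_script`, `extract_decoded`, and the sample page are hypothetical and built only for illustration, not real Incapsula output. `bytes.fromhex` followed by a Latin-1 decode maps each pair of hex digits to the character with that code point, which is what the `chr(int(..., 16))` loop does.

```python
import re

# Same pattern as in the spider: a hex string assigned to `var b`.
RE_ENCODED = re.compile(r'var b="(.*?)"', re.DOTALL)


def decode_hex_script(encoded):
    """Turn pairs of hex digits into characters (code points 0-255)."""
    return bytes.fromhex(encoded).decode("latin-1")


def extract_decoded(page):
    """Return the decoded script found in `page`, or None if there is none."""
    match = RE_ENCODED.search(page)
    return decode_hex_script(match.group(1)) if match else None


# Hypothetical stand-in for a challenge page, made by hex-encoding a known string.
hidden = 'load("/_Incapsula_Resource?SWHANEDL=123")'
sample_page = '<script>var b="%s";</script>' % hidden.encode("latin-1").hex()

print(extract_decoded(sample_page) == hidden)
```

Returning `None` when the pattern is absent avoids the `AttributeError` that `.search(...).group(1)` raises when the page does not contain the expected script.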
    

    But I'm redirected to a reCAPTCHA page:

    <html style="height:100%">
    <head>
    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
    <meta name="format-detection" content="telephone=no">
    <meta name="viewport" content="initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
    </head>
    <body style="margin:0px;height:100%">
    <iframe src="/_Incapsula_Resource?CWUDNSAI=27&xinfo=3-10784678-0%200NNN%20RT%281523525225370%20394%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c316%2c0%29%20U10000&incident_id=459000960022408474-41333502566401539&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 459000960022408474-41333502566401539
    </iframe>
    </body>
    </html>
    
    • Blender
      Blender about 6 years
      Why don't you just pay for a captcha solving service?