Scrapy retry or redirect middleware


Solution 1

I had the same problem today with a website that used 301/302/303 redirects, but sometimes also a meta-refresh redirect. I built a retry middleware, reusing some chunks from the redirect middlewares:

from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.response import get_meta_refresh
from scrapy import log

class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        url = response.url
        if response.status in [301, 307]:
            log.msg("trying to redirect us: %s" % url, level=log.INFO)
            reason = 'redirect %d' % response.status
            return self._retry(request, reason, spider) or response
        interval, redirect_url = get_meta_refresh(response)
        # handle meta redirect
        if redirect_url:
            log.msg("trying to redirect us: %s" % url, level=log.INFO)
            reason = 'meta'
            return self._retry(request, reason, spider) or response
        hxs = HtmlXPathSelector(response)
        # test for captcha page
        captcha = hxs.select(".//input[contains(@id, 'captchacharacters')]").extract()
        if captcha:
            log.msg("captcha page %s" % url, level=log.INFO)
            reason = 'captcha'
            return self._retry(request, reason, spider) or response
        return response

To use this middleware, it's probably best to disable the existing redirect middlewares for the project in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'YOUR_PROJECT.scraper.middlewares.CustomRetryMiddleware': 120,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
    'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': None,
}
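Note that on Scrapy 1.0 and later the `scrapy.contrib` module paths used above no longer exist; assuming a modern install, the equivalent settings fragment would be (the `YOUR_PROJECT.scraper.middlewares` path is a placeholder for wherever you put the middleware):

```python
# settings.py for Scrapy >= 1.0: the old scrapy.contrib.* middleware
# paths were renamed to scrapy.downloadermiddlewares.*.
DOWNLOADER_MIDDLEWARES = {
    'YOUR_PROJECT.scraper.middlewares.CustomRetryMiddleware': 120,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': None,
}
```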

Solution 2

You can handle 302 responses by adding handle_httpstatus_list = [302] at the beginning of your spider like so:

class MySpider(CrawlSpider):
    handle_httpstatus_list = [302]

    def parse(self, response):
        if response.status == 302:
            # Store response.url somewhere and go back to it later
            pass
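The "store it and come back to it later" bookkeeping can be sketched without any Scrapy machinery; `should_revisit` and `MAX_RETRIES` below are hypothetical names, and inside `parse` you would pair the check with something like `yield response.request.replace(dont_filter=True)` to re-queue the blocked URL:

```python
# Hypothetical helper for the "store the URL and come back later" idea:
# cap retries per URL so a permanently blocked page is eventually dropped.
MAX_RETRIES = 5

def should_revisit(url, retry_counts, max_retries=MAX_RETRIES):
    """Return True if `url` should be re-queued after hitting the block page.

    `retry_counts` is a plain dict mapping url -> attempts so far; it is
    mutated in place each time we decide to revisit.
    """
    count = retry_counts.get(url, 0)
    if count >= max_retries:
        return False  # give up on this URL
    retry_counts[url] = count + 1
    return True
```

With rotating proxies, each re-queued request has a fresh chance of not being redirected, which is the behavior the question asks for.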
Author: Xodarap777
Updated on June 13, 2022

Comments

  • Xodarap777 almost 2 years

    While crawling through a site with scrapy, I get redirected to a user-blocked page about 1/5th of the time. I lose the pages that I get redirected from when that happens. I don't know which middleware to use or what settings to use in that middleware, but I want this:

    DEBUG: Redirecting (302) to (GET http://domain.com/foo.aspx) from (GET http://domain.com/bar.htm)

    To NOT drop bar.htm. I end up with no data from bar.htm when the scraper's done, but I'm rotating proxies, so if it tries bar.htm again (maybe a few more times), I should get it. How do I set the number of tries for that?

    If it matters, I'm only allowing the crawler to use a very specific starting URL and then only follow "next page" links, so it should go in order through a small number of pages - hence why I need it to either retry, e.g., page 34, or come back to it later. The Scrapy documentation says it should retry 20 times by default, but I don't see it retrying at all.

    Also, if it helps: all redirects go to the same page (a "go away" page, the foo.aspx above) - is there a way to tell Scrapy that that particular page "doesn't count", and if it's getting redirected there, to keep retrying? I saw something in the downloader middleware referring to particular HTTP codes in a list - can I add 302 to the "always keep trying this" list somehow?

  • Mariano Ruiz over 5 years

    Just wanted to add that, for a broader solution (not necessarily applicable to this question), redirects also use the HTTP status code 307 (pay attention to the logs), so in that case you can replace handle_httpstatus_list = [302] with handle_httpstatus_list = [302, 307] and if response.status == 302: with if response.status in self.handle_httpstatus_list:
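On the question's point about setting the number of tries: assuming the stock RetryMiddleware is enabled, the retry count and the triggering status codes are plain settings. The values below are illustrative, not defaults; note that 3xx responses are normally consumed by RedirectMiddleware before RetryMiddleware ever sees them, so listing 302 here only has an effect if the redirect middlewares are disabled (as in Solution 1 above):

```python
# settings.py sketch (illustrative values, stock RetryMiddleware assumed).
RETRY_ENABLED = True
RETRY_TIMES = 5  # extra attempts per failed request
# Statuses that trigger a retry; 302 only matters here if
# RedirectMiddleware is disabled so the retry logic can see it.
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429, 302]
```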
    Just wanted to add that for a more wide solution (not necessarily applicable to this question) redirects also use the HTTP status code 307 (pay attention in the logs), so in that case you can replace handle_httpstatus_list = [302] with handle_httpstatus_list = [302, 307] and if response.status == 302: with if response.status in self.handle_httpstatus_list: