How to make Scrapy show user agent per download request in log?


Solution 1

Just FYI.

I've implemented a simple RandomUserAgentMiddleware middleware based on fake-useragent.

Thanks to fake-useragent, you don't need to configure the list of User-Agents - it picks them up based on browser usage statistics from a real-world database.
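For reference, here is a minimal sketch of such a middleware (not alecxe's exact implementation). It assumes the fake-useragent package is installed and uses its UserAgent().random attribute, which returns a random real-world User-Agent string:

# middlewares.py -- a minimal sketch, assuming fake-useragent is installed
from fake_useragent import UserAgent


class RandomUserAgentMiddleware(object):
    def __init__(self):
        # fake-useragent builds its UA pool from real-world usage data
        self.ua = UserAgent()

    def process_request(self, request, spider):
        # Pick a fresh User-Agent for every outgoing request and log it,
        # so each download request shows its UA in the DEBUG log.
        user_agent = self.ua.random
        request.headers.setdefault('User-Agent', user_agent)
        spider.logger.debug('User-Agent: %s %s', user_agent, request)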

Solution 2

You can see it by using this:

def parse(self, response):
    print(response.request.headers['User-Agent'])

You can use the scrapy-fake-useragent Python library. It works well and chooses a user agent based on real-world usage statistics. But be careful: verify that it is actually being applied by using the code above, since it is easy to make a mistake when wiring it up. Read the instructions carefully.
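As a rough sketch, enabling scrapy-fake-useragent in settings.py looks like the following (check the project's README for the exact, current entries):

# settings.py -- a sketch based on the scrapy-fake-useragent setup instructions
DOWNLOADER_MIDDLEWARES = {
    # disable Scrapy's built-in User-Agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # let scrapy-fake-useragent pick a random real-world User-Agent
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}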

Solution 3

You can add logging to the solution you're using:

#!/usr/bin/python
#-*-coding:utf-8-*-
import random

from scrapy import log
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

            # Add desired logging message here.
            spider.log(
                u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request),
                level=log.DEBUG
            )


    # Default user_agent_list (Chrome-based examples shown here); for more
    # user agent strings, see http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    ]
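Note that scrapy.log and the scrapy.contrib.* import paths were removed in later Scrapy releases. A rough equivalent of the same idea for current Scrapy versions, using the standard spider.logger instead of scrapy.log, would be:

import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    # Shortened list; add more strings from useragentstring.com as needed.
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    ]

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)
            # spider.logger is the per-spider logging.Logger in Scrapy 1.0+
            spider.logger.debug('User-Agent: %s %s', request.headers.get('User-Agent'), request)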

Solution 4

EDIT: I came here because I was looking for the functionality to change the user agent.

Based on alecxe's RandomUserAgent, this is what I use to set the user agent only once per crawl, and only from a predefined list (works for me with Scrapy v0.24 & 0.25):

    """
    Choose a user agent from the settings but do it only once per crawl.
    """
    import random
    import scrapy

    SETTINGS = scrapy.utils.project.get_project_settings()


    class RandomUserAgentMiddleware(object):
        def __init__(self):
            super(RandomUserAgentMiddleware, self).__init__()
            self.fixedUserAgent = random.choice(SETTINGS.get('USER_AGENTS'))
            scrapy.log.msg('User Agent for this crawl is: {}'.
                           format(self.fixedUserAgent))

        def process_start_requests(self, start_requests, spider):
            for r in start_requests:
                r.headers.setdefault('User-Agent', self.fixedUserAgent)
                yield r

The actual answer to your question: check the UA by crawling against a local webserver and inspecting its access logs (e.g. /var/log/apache2/access.log on *NIX).
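Since process_start_requests is a spider middleware hook, the class above has to be enabled via SPIDER_MIDDLEWARES. A sketch of the corresponding settings.py entries follows; the module path mybot.middlewares and the USER_AGENTS values are placeholders, not part of the original answer:

# settings.py -- a sketch; module path and UA strings are placeholders
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
]

SPIDER_MIDDLEWARES = {
    'mybot.middlewares.RandomUserAgentMiddleware': 400,
}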


Comments

  • Alok
    Alok almost 2 years

    I am learning Scrapy, a web crawling framework.

    I know I can set USER_AGENT in the settings.py file of the Scrapy project. When I run Scrapy, I can see the USER_AGENT value in the INFO logs.
    This USER_AGENT gets set in every download request to the server I want to crawl.

    But I am using multiple USER_AGENT values chosen at random with the help of this solution. I assume the randomly chosen USER_AGENT is being used, but I want to confirm it. So, how can I make Scrapy show the USER_AGENT per download request, so that I can see its value in the logs?

  • theotheo
    theotheo about 9 years
    But why didn't you add it to PyPI?
  • alecxe
    alecxe almost 9 years
    @theotheo done, please see pypi.python.org/pypi/scrapy-fake-useragent. Thanks for the idea.
  • Javed
    Javed over 7 years
    @alecxe I am using fake-useragent in my project but it is throwing "FakeUserAgentError: Error occurred during getting browser".
  • alecxe
    alecxe over 7 years
    @daved are you still having this problem? From time to time, there are temporary issues related to connecting to the source of real-world user agents.
  • Javed
    Javed over 7 years
    @alecxe yes, this problem was not there 2 weeks ago. Can I do anything about it? Is there any way to avoid it? Thanks.
  • Javed
    Javed over 7 years
    I found that the error occurs only for the Internet Explorer entries: with from fake_useragent import UserAgent; ua = UserAgent(), the calls ua.ie, ua.msie and ua['Internet Explorer'] all raise the error, while other browsers are working.
  • alecxe
    alecxe over 7 years
    @daved interesting! Looks like this issue is exactly about your problem.
  • Javed
    Javed over 7 years
    @alecxe I slightly changed your scrapy_fake_useragent middleware to handle this exception: whenever the exception occurs, I call the random-browser function again recursively until it succeeds, i.e. until Internet Explorer is not the one picked. This way it skips the Internet Explorer entry that throws the exception. Thanks for your reply and for your module.
  • Raheel
    Raheel almost 7 years
    It was working perfectly for me before I added scrapy-splash to my requests. It seems like Splash ignores the random user agent middleware inside Scrapy? Can anyone help with this?
  • Dot4inch
    Dot4inch almost 6 years
    Sorry, how can I use this class inside the spider to effectively rotate the UA?