How to make Scrapy show user agent per download request in log?
Solution 1
Just FYI: I've implemented a simple RandomUserAgentMiddleware based on fake-useragent.
Thanks to fake-useragent, you don't need to configure a list of User-Agents yourself: it picks them based on browser usage statistics from a real-world database.
Solution 2
You can see it inside your spider callback:
def parse(self, response):
    print(response.request.headers['User-Agent'])
Note that header values are bytes under Python 3, so you may want to decode them before printing.
You can use the scrapy-fake-useragent
Python library. It works well and chooses a user agent based on real-world usage statistics. But be careful: verify with the snippet above that it is actually being applied, since it is easy to make a mistake when configuring it. Read the instructions carefully.
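For reference, enabling it typically means turning off Scrapy's built-in UserAgentMiddleware in settings.py so it cannot overwrite the random one. A configuration sketch (the priority value 400 is conventional, not mandatory):

```python
# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in UserAgentMiddleware so it doesn't overwrite ours
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}
```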
Solution 3
You can add logging to the solution you're using:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import random

from scrapy import log
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    # The default user_agent_list covers Chrome, IE, Firefox, Mozilla, Opera
    # and Netscape; more user agent strings can be found at
    # http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    ]

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)
            # Log the User-Agent actually attached to this request.
            spider.log(
                u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request),
                level=log.DEBUG
            )
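One subtlety in the middleware above: `setdefault` only attaches the random User-Agent when the header is not already present, so a UA set explicitly on a request wins. A plain-dict sketch of that behavior (stdlib only, no Scrapy required; the example UA strings are made up):

```python
import random

user_agent_list = ["UA-Chrome/1.0", "UA-Firefox/2.0", "UA-Opera/3.0"]

# Case 1: no User-Agent yet -> the random choice is applied
headers = {}
headers.setdefault('User-Agent', random.choice(user_agent_list))
assert headers['User-Agent'] in user_agent_list

# Case 2: an explicit User-Agent is already set -> it is kept
headers = {'User-Agent': 'MyCustomAgent/1.0'}
headers.setdefault('User-Agent', random.choice(user_agent_list))
assert headers['User-Agent'] == 'MyCustomAgent/1.0'
```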
Solution 4
EDIT: I came here because I was looking for a way to change the user agent.
Based on alecxe's RandomUserAgent, this is what I use to set the user agent only once per crawl, chosen from a predefined list (works for me with Scrapy 0.24 and 0.25):
"""
Choose a user agent from the settings, but do it only once per crawl.
"""
import random

from scrapy import log
from scrapy.utils.project import get_project_settings

SETTINGS = get_project_settings()


class RandomUserAgentMiddleware(object):
    def __init__(self):
        super(RandomUserAgentMiddleware, self).__init__()
        self.fixedUserAgent = random.choice(SETTINGS.get('USER_AGENTS'))
        log.msg('User Agent for this crawl is: {}'.format(self.fixedUserAgent))

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            r.headers.setdefault('User-Agent', self.fixedUserAgent)
            yield r
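Because `process_start_requests` is just a generator over request objects, the choose-once-per-crawl behavior can be exercised without running a crawler. A minimal sketch using a hypothetical `FakeRequest` stand-in (not Scrapy's `Request`), stdlib only:

```python
import random

class FakeRequest:
    """Hypothetical stand-in for scrapy.Request; only carries headers."""
    def __init__(self):
        self.headers = {}

class RandomUserAgentMiddleware:
    def __init__(self, user_agents):
        # Choose once at construction time, i.e. once per crawl.
        self.fixed_user_agent = random.choice(user_agents)

    def process_start_requests(self, start_requests, spider=None):
        for r in start_requests:
            r.headers.setdefault('User-Agent', self.fixed_user_agent)
            yield r

mw = RandomUserAgentMiddleware(['UA-1', 'UA-2'])
reqs = list(mw.process_start_requests([FakeRequest(), FakeRequest()]))
# Every start request carries the same, once-chosen User-Agent.
assert reqs[0].headers['User-Agent'] == reqs[1].headers['User-Agent']
```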
The actual answer to your question: check the UA by pointing your spider at a local webserver and inspecting its access logs (e.g. /var/log/apache2/access.log on *NIX).
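That check doesn't require Apache; the standard library can play the role of the local webserver. A self-contained sketch that records the User-Agent of whatever hits it (names like `UAEchoHandler` are made up for the example):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

seen_agents = []

class UAEchoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Record the User-Agent header the client actually sent.
        seen_agents.append(self.headers.get('User-Agent'))
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'ok')

    def log_message(self, *args):
        pass  # silence default per-request logging

# Bind to an OS-assigned port and serve in the background.
server = HTTPServer(('127.0.0.1', 0), UAEchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = 'http://127.0.0.1:%d/' % server.server_port
req = urllib.request.Request(url, headers={'User-Agent': 'TestAgent/1.0'})
body = urllib.request.urlopen(req).read()
server.shutdown()
```

Point your spider at the printed URL instead of urllib and `seen_agents` will show exactly which UA went out per request.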
Alok
Updated on June 03, 2022
Comments
-
Alok almost 2 years
I am learning Scrapy, a web crawling framework.
I know I can set USER_AGENT in the settings.py file of the Scrapy project. When I run Scrapy, I can see the USER_AGENT's value in the INFO logs. This USER_AGENT gets set in every download request to the server I want to crawl.
But I am using multiple USER_AGENT values randomly, with the help of this solution. I guess this randomly chosen USER_AGENT is working, but I want to confirm it. So, how can I make Scrapy show the USER_AGENT per download request, so I can see its value in the logs? -
theotheo about 9 years But why didn't you add it to PyPI?
-
alecxe almost 9 years @theotheo Done, please see pypi.python.org/pypi/scrapy-fake-useragent. Thanks for the idea.
-
Javed over 7 years @alecxe I am using fake-useragent in my project, but it is throwing "FakeUserAgentError: Error occurred during getting browser".
-
alecxe over 7 years @daved Are you still having this problem? From time to time there are temporary issues related to connecting to the source of real-world user agents.
-
Javed over 7 years @alecxe Yes, this problem was not there 2 weeks ago. Can I do anything about it? Is there any way to avoid it? Thanks.
-
Javed over 7 years I found that the error occurs when requesting an Internet Explorer user agent:
from fake_useragent import UserAgent
ua = UserAgent()
ua.ie                    # error
ua.msie                  # error
ua['Internet Explorer']  # error
Other browsers are working. -
alecxe over 7 years @daved Interesting! Looks like this issue is exactly about your problem.
-
Javed over 7 years @alecxe I slightly changed your scrapy_fake_agent middleware to handle this exception: whenever it occurs, I recursively pick another random browser until the exception no longer occurs, i.e. until Internet Explorer is not the one selected. This way it skips the Internet Explorer entry that throws the exception. Thanks for your reply and for your module.
-
Raheel almost 7 years It was working perfectly for me before I implemented scrapy-splash in my requests. It seems like Splash ignores the random user agent middleware inside Scrapy? Can anyone help with this? -
Dot4inch almost 6 years Sorry, how can I use this class inside the spider to effectively rotate the UA?