Scrapy: Save response.body as html file?
Solution 1
The actual problem is that response.body is a bytes object, while a file opened in text mode only accepts str. You need to convert the bytes to a string first, and there are several ways to do that. You can use
self.html_file.write(response.body.decode("utf-8"))
instead of
self.html_file.write(response.body)
You can also use
self.html_file.write(response.text)
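The difference can be reproduced without Scrapy. The following is a minimal sketch, assuming a hard-coded bytes value named body stands in for response.body and a temporary file stands in for the real output path; both are illustrative and not part of the original spider. The binary-mode variant at the end is an extra option that the answers here do not cover.

```python
import os
import tempfile

# Stand-in for response.body (assumption): Scrapy exposes the body as bytes.
body = "<html><body>caf\u00e9</body></html>".encode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "index.html")

# Writing bytes to a file opened in text mode raises the TypeError from the question.
error_message = None
with open(path, "w", encoding="utf-8") as html_file:
    try:
        html_file.write(body)
    except TypeError as exc:
        error_message = str(exc)

# Fix: decode the bytes to str before writing in text mode.
with open(path, "w", encoding="utf-8") as html_file:
    html_file.write(body.decode("utf-8"))

with open(path, encoding="utf-8") as html_file:
    saved_text = html_file.read()

# Extra option: open the file in binary mode and write the bytes unchanged.
with open(path, "wb") as html_file:
    html_file.write(body)

with open(path, "rb") as html_file:
    saved_bytes = html_file.read()
```

In a spider, body.decode("utf-8") corresponds to response.body.decode("utf-8"), and the already decoded string is available as response.text.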
Solution 2
The correct way is to use response.text, and not response.body.decode("utf-8"). To quote the documentation:
Keep in mind that Response.body is always a bytes object. If you want the unicode version use TextResponse.text (only available in TextResponse and subclasses).
and
text: Response body, as unicode. The same as response.body.decode(response.encoding), but the result is cached after the first call, so you can access response.text multiple times without extra overhead.
Note: unicode(response.body) is not a correct way to convert response body to unicode: you would be using the system default encoding (typically ascii) instead of the response encoding.
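To illustrate why the response's own encoding matters, here is a small sketch that runs without Scrapy. The names body and declared_encoding are stand-ins for response.body and response.encoding, and the Latin-1 page is an assumed example rather than anything from the question.

```python
# Stand-ins for response.body and response.encoding (assumed example values).
declared_encoding = "latin-1"
body = "<p>caf\u00e9</p>".encode(declared_encoding)

# Hard-coding UTF-8 fails here: the byte 0xE9 followed by "<" is not valid UTF-8.
try:
    body.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decoding with the declared encoding succeeds; per the quoted documentation,
# this is the same as what response.text returns.
text = body.decode(declared_encoding)
```

A hard-coded "utf-8" only works when the page really is UTF-8, which is why response.text is the safer choice.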
Solution 3
Taking the answers above into consideration, and making the code as Pythonic as possible by using the with statement, the example can be rewritten like this:
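import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ['google.com']
    start_urls = ['http://google.com/']

    def __init__(self, *args, **kwargs):
        # Let the base Spider class run its own initialisation first
        super().__init__(*args, **kwargs)
        # html_path and header_path are assumed to be defined elsewhere, as in the question
        self.path_to_html = html_path + 'index.html'
        self.path_to_header = header_path + 'index.html'

    def parse(self, response):
        # The with statement closes the file automatically when the block ends
        with open(self.path_to_html, 'w') as html_file:
            html_file.write(response.text)
        yield {
            'url': response.url
        }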
Note that html_file is then only accessible from within the parse method.
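The scoping behaviour of the with statement can be checked in isolation. This is a minimal sketch, assuming a hypothetical helper named save_html and a temporary directory in place of the spider's real paths.

```python
import os
import tempfile


def save_html(text, path):
    """Write text to path and return the file object, which is closed by then."""
    with open(path, "w", encoding="utf-8") as html_file:
        html_file.write(text)
    # Leaving the with block closes the file, even if write() had raised.
    return html_file


path = os.path.join(tempfile.mkdtemp(), "index.html")
html_file = save_html("<html></html>", path)
is_closed = html_file.closed  # True: the handle cannot be reused for more writes

with open(path, encoding="utf-8") as saved:
    contents = saved.read()
```

Because the handle is opened and closed inside a single call, each parse invocation gets a fresh file instead of relying on a handle stored on the spider.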
bonblow
Updated on July 29, 2022
Comments
-
bonblow almost 2 years
My spider works, but I can't save the body of the website I crawl to an .html file. If I write self.html_file.write('test'), it works fine. I don't know how to convert the tuple to a string.
I use Python 3.6
Spider:
class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ['google.com']
    start_urls = ['http://google.com/']

    def __init__(self):
        self.path_to_html = html_path + 'index.html'
        self.path_to_header = header_path + 'index.html'
        self.html_file = open(self.path_to_html, 'w')

    def parse(self, response):
        url = response.url
        self.html_file.write(response.body)
        self.html_file.close()
        yield {
            'url': url
        }
Traceback:
Traceback (most recent call last):
  File "c:\python\python36-32\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\Users\kv\AtomProjects\example_project\example_bot\example_bot\spiders\example.py", line 35, in parse
    self.html_file.write(response.body)
TypeError: write() argument must be str, not bytes