Scrapy: Save response.body as html file?

python django scrapy web-crawler

18,183

Solution 1

Actual problem is you are getting byte code. You need to convert it to string format. there are many ways for converting byte to string format. You can use

 self.html_file.write(response.body.decode("utf-8"))

instead of

  self.html_file.write(response.body)

also you can use

  self.html_file.write(response.text)

Solution 2

The correct way is to use response.text, and not response.body.decode("utf-8"). To quote documentation:

Keep in mind that Response.body is always a bytes object. If you want the unicode version use TextResponse.text (only available in TextResponse and subclasses).

and

text: Response body, as unicode.

The same as response.body.decode(response.encoding), but the result is cached after the first call, so you can access response.text multiple times without extra overhead.

Note: unicode(response.body) is not a correct way to convert response body to unicode: you would be using the system default encoding (typically ascii) instead of the response encoding.

Solution 3

Taking in consideration responses above, and making it as much pythonic as possible adding the use of the with statement, the example should be rewritten like:

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ['google.com']
    start_urls = ['http://google.com/']

    def __init__(self):
        self.path_to_html = html_path + 'index.html'
        self.path_to_header = header_path + 'index.html'

    def parse(self, response):
        with open(self.path_to_html, 'w') as html_file:
            html_file.write(response.text)
        yield {
            'url': response.url
        }

But the html_file will only accessible from the parse method.

18,183

Author by

bonblow

Updated on July 29, 2022

Comments

bonblow almost 2 years

My spider works, but I can't download the body of the website I crawl in a .html file. If I write self.html_fil.write('test') then it works fine. I don't know how to convert the tulpe to string.

I use Python 3.6

Spider:

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ['google.com']
    start_urls = ['http://google.com/']

    def __init__(self):
        self.path_to_html = html_path + 'index.html'
        self.path_to_header = header_path + 'index.html'
        self.html_file = open(self.path_to_html, 'w')

    def parse(self, response):
        url = response.url
        self.html_file.write(response.body)
        self.html_file.close()
        yield {
            'url': url
        }

Tracktrace:

Traceback (most recent call last):
  File "c:\python\python36-32\lib\site-packages\twisted\internet\defer.py", line
 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\Users\kv\AtomProjects\example_project\example_bot\example_bot\spiders
\example.py", line 35, in parse
    self.html_file.write(response.body)
TypeError: write() argument must be str, not bytes