Scrapy: Save response.body as html file?

Solution 1

The actual problem is that response.body is a bytes object, while a file opened in text mode ('w') only accepts str. You need to convert the bytes to a string first. There are several ways to do this. You can use

  self.html_file.write(response.body.decode("utf-8"))

instead of

  self.html_file.write(response.body)

You can also use

  self.html_file.write(response.text)
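
The difference can be reproduced without Scrapy. The following is a minimal sketch, assuming a hypothetical `body` value standing in for response.body and a temporary directory in place of the spider's output path. It triggers the TypeError from the question, then shows the decode fix plus one alternative not mentioned above: opening the file in binary mode and writing the bytes unchanged.

```python
import os
import tempfile

# Hypothetical stand-in for response.body: Scrapy always exposes it as bytes.
body = b"<html><body>caf\xc3\xa9</body></html>"

out_dir = tempfile.mkdtemp()
text_path = os.path.join(out_dir, "text.html")
raw_path = os.path.join(out_dir, "raw.html")

# Text mode ('w') accepts only str, so passing bytes fails as in the question.
try:
    with open(text_path, "w", encoding="utf-8") as f:
        f.write(body)
    error = None
except TypeError as exc:
    error = str(exc)

# Fix 1: decode the bytes to str before writing in text mode.
with open(text_path, "w", encoding="utf-8") as f:
    f.write(body.decode("utf-8"))

# Fix 2: open the file in binary mode ('wb') and write the bytes unchanged.
with open(raw_path, "wb") as f:
    f.write(body)
```

Binary mode keeps the page exactly as it was received, which may be preferable when the file is only an archive copy and the text is never processed in Python.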

Solution 2

The correct way is to use response.text rather than response.body.decode("utf-8"), because the page is not necessarily encoded as UTF-8. To quote the Scrapy documentation:

Keep in mind that Response.body is always a bytes object. If you want the unicode version use TextResponse.text (only available in TextResponse and subclasses).

and

text: Response body, as unicode.

The same as response.body.decode(response.encoding), but the result is cached after the first call, so you can access response.text multiple times without extra overhead.

Note: unicode(response.body) is not a correct way to convert response body to unicode: you would be using the system default encoding (typically ascii) instead of the response encoding.
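
To illustrate why a hardcoded decode("utf-8") is fragile, here is a small sketch that does not need Scrapy. The names `body` and `declared_encoding` are hypothetical stand-ins for response.body and response.encoding, and the page is assumed to be served as ISO-8859-1.

```python
# Hypothetical stand-ins for response.body and response.encoding,
# assuming a page served as ISO-8859-1 rather than UTF-8.
declared_encoding = "iso-8859-1"
body = "<p>caf\u00e9</p>".encode(declared_encoding)

# Hardcoding UTF-8 fails: the byte 0xE9 does not start a valid UTF-8 sequence here.
try:
    body.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decoding with the declared encoding recovers the text. This mirrors what the
# quoted documentation describes for response.text.
text = body.decode(declared_encoding)
```

With response.text, Scrapy picks the encoding for you, so the spider does not have to guess it per site.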

Solution 3

Taking the answers above into consideration, and making the code as Pythonic as possible by using the with statement, the example could be rewritten as:

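import scrapy

# html_path and header_path are assumed to be defined elsewhere, as in the question.

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ['google.com']
    start_urls = ['http://google.com/']

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider run its own initialisation first.
        super().__init__(*args, **kwargs)
        self.path_to_html = html_path + 'index.html'
        self.path_to_header = header_path + 'index.html'

    def parse(self, response):
        # response.text is a str, so it can be written to a text-mode file.
        with open(self.path_to_html, 'w') as html_file:
            html_file.write(response.text)
        yield {
            'url': response.url
        }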

Note that html_file is then only accessible inside the parse method.
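
One detail the example leaves open is the encoding of the output file. In Python 3.6, the version used in the question, open() in text mode without an encoding argument falls back to a platform-dependent default, which on Windows is often not UTF-8, so writing response.text may raise UnicodeEncodeError for some pages. A minimal sketch, assuming a hypothetical helper `save_html` and a sample string in place of response.text:

```python
import os
import tempfile


def save_html(path, text):
    """Write text to path as UTF-8, regardless of the platform default."""
    # newline="" keeps the page's own line endings instead of translating them.
    with open(path, "w", encoding="utf-8", newline="") as html_file:
        html_file.write(text)


# Sample string standing in for response.text (an assumption for illustration).
sample_text = "<html><body>\u4f60\u597d, caf\u00e9</body></html>"
sample_path = os.path.join(tempfile.mkdtemp(), "index.html")
save_html(sample_path, sample_text)
```

Inside parse, the call would then be save_html(self.path_to_html, response.text).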

Author by

bonblow

Updated on July 29, 2022

Comments

  • bonblow almost 2 years

    My spider works, but I can't save the body of the website I crawl to an .html file. If I write self.html_file.write('test'), it works fine. I don't know how to convert the tuple to a string.

    I use Python 3.6

    Spider:

    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ['google.com']
        start_urls = ['http://google.com/']
    
        def __init__(self):
            self.path_to_html = html_path + 'index.html'
            self.path_to_header = header_path + 'index.html'
            self.html_file = open(self.path_to_html, 'w')
    
        def parse(self, response):
            url = response.url
            self.html_file.write(response.body)
            self.html_file.close()
            yield {
                'url': url
            }
    

    Traceback:

    Traceback (most recent call last):
      File "c:\python\python36-32\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "c:\Users\kv\AtomProjects\example_project\example_bot\example_bot\spiders\example.py", line 35, in parse
        self.html_file.write(response.body)
    TypeError: write() argument must be str, not bytes