Scrapy: ERROR: Spider error processing

You are trying to access an element that doesn't exist. The error is in this line:

item['state'] =  hxs.xpath('//*[@id="PAGE"]/div[2]/div[1]/ul/li[2]/a/span/text()').extract()[0].encode('ascii', errors='ignore')

Probably the list returned by

item['state'] =  hxs.xpath('//*[@id="PAGE"]/div[2]/div[1]/ul/li[2]/a/span/text()').extract()

is empty, and you are trying to access its first element. You have two options:
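One way to guard against the empty list is sketched below. This is an illustration under assumptions, not necessarily either of the options meant above: `first_or_default` is a hypothetical helper, and the sample values are made up to stand in for what `.extract()` returns.

```python
def first_or_default(values, default=u''):
    """Return the first item of a list, or `default` if the list is empty."""
    return values[0] if values else default


# Stand-ins for what .extract() returns: a list of unicode strings.
# The sample values are made up for illustration.
matched = [u'\u00cele-de-France']  # the XPath matched one text node
unmatched = []                     # the XPath matched nothing on this page

# Same encoding step as the original line, but no IndexError on an empty list.
state_found = first_or_default(matched).encode('ascii', errors='ignore')
state_missing = first_or_default(unmatched).encode('ascii', errors='ignore')
```

In the spider this would wrap the existing call, for example `first_or_default(hxs.xpath('...').extract())`. Scrapy 1.0 and later also provide `extract_first()` on selector lists, which returns `None` (or a default you pass) instead of raising when nothing matches.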

Author: talmosko

Hi! I am a Software Engineering undergraduate student at Ben Gurion University, Israel. I know Java, CSS, HTML, C, Python, and Linux. Feel free to ask for any help!

Updated on June 04, 2022

Comments

  • talmosko
    talmosko almost 2 years

    I am new to Python and Scrapy. I tried to run existing code, but I get this error for every URL:

    >     2015-07-02 01:52:19 [scrapy] DEBUG: Crawled (200) <GET http://www.tripadvisor.com/ShowUserReviews-g187147-d197524-r281927613-Hotel_Mirific_Opera-Paris_Ile_de_France.html> (referer: http://www.tripadvisor.com/Hotel_Review-g187147-d197524-Reviews-Hotel_Mirific_Opera-Paris_Ile_de_France.html)
    >     2015-07-02 01:52:19 [scrapy] ERROR: Spider error processing <GET http://www.tripadvisor.com/ShowUserReviews-g187147-d197524-r281927613-Hotel_Mirific_Opera-Paris_Ile_de_France.html> (referer: http://www.tripadvisor.com/Hotel_Review-g187147-d197524-Reviews-Hotel_Mirific_Opera-Paris_Ile_de_France.html)
    >     Traceback (most recent call last):
    >       File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    >         yield next(it)
    >       File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    >         for x in result:
    >       File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    >         return (_set_referer(r) for r in result or ())
    >       File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    >         return (r for r in result or () if _filter(r))
    >       File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    >         return (r for r in result or () if _filter(r))
    >       File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/crawl.py", line 67, in _parse_response
    >         cb_res = callback(response, **cb_kwargs) or ()
    >       File "/home/talmosko/Documents/scrapy/tripAdvisor/spiders/tripAdvisor.py", line 30, in parse_item
    >         item['state'] =  hxs.xpath('//*[@id="PAGE"]/div[2]/div[1]/ul/li[2]/a/span/text()').extract()[0].encode('ascii', errors='ignore')
    >     IndexError: list index out of range
    

    This is my code: http://pastebin.com/XzM5DrDD

    What is the problem? It seems like the spider didn't get a response.

    Thanks!

  • talmosko
    talmosko almost 9 years
    So the problem is not that I got nothing as a response?
  • fasouto
    fasouto almost 9 years
    You are writing a scraper: on the same site, some pages may have a piece of information while others don't. I won't check whether every page on TripAdvisor has 'state'.