How to properly use Rules, restrict_xpaths to crawl and parse URLs with scrapy?

You've returned an empty `items` list; you need to append each `item` to `items`.
Alternatively, you can `yield item` inside the loop.
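As a sketch of what the per-`<meta>`-tag loop in `parse_articles` builds, the same name/content extraction can be exercised with the standard library's `html.parser` (the HTML string here is invented for illustration; the real pages come from the feed URLs in the question):

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collect one dict per <meta> tag, mirroring how parse_articles
    builds one exampleItem per selector match."""
    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            # Append each item to the list -- the step missing in the original post.
            self.items.append({"meta_name": d.get("name"),
                               "meta_value": d.get("content")})

# Hypothetical article page with two meta tags.
html = ('<html><head><meta name="author" content="Marc">'
        '<meta name="keywords" content="rss,scrapy"></head></html>')
parser = MetaTagParser()
parser.feed(html)
print(parser.items)
# -> [{'meta_name': 'author', 'meta_value': 'Marc'},
#     {'meta_name': 'keywords', 'meta_value': 'rss,scrapy'}]
```

In a Scrapy callback you would `yield` each of those dicts (or Items) as they are built instead of collecting them in a list.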

Author by Marc

Updated on June 04, 2022

Comments

  • Marc, almost 2 years

    I am trying to program a crawl spider that crawls the RSS feeds of a website and then parses the meta tags of each article.

    The first RSS page displays the RSS categories. I managed to extract the links because each `<a>` tag sits inside a `<td>` tag. It looks like this:

            <tr>
               <td class="xmlLink">
                 <a href="http://feeds.example.com/subject1">subject1</a>
               </td>   
            </tr>
            <tr>
               <td class="xmlLink">
                 <a href="http://feeds.example.com/subject2">subject2</a>
               </td>
            </tr>
    

    Once you click that link it brings you to the articles for that RSS category, which look like this:

       <li class="regularitem">
        <h4 class="itemtitle">
            <a href="http://example.com/article1">article1</a>
        </h4>
      </li>
      <li class="regularitem">
         <h4 class="itemtitle">
            <a href="http://example.com/article2">article2</a>
         </h4>
      </li>
    
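    As a quick check that the spider's `//h4[@class="itemtitle"]/a/@href` expression matches this markup, the fragment above can be wrapped in a root element and queried with the standard library's limited XPath support (the URLs are the placeholder ones from the question):

```python
import xml.etree.ElementTree as ET

# The <li> fragment from the question, wrapped in a single root
# element so ElementTree can parse it.
fragment = """<root>
  <li class="regularitem">
    <h4 class="itemtitle"><a href="http://example.com/article1">article1</a></h4>
  </li>
  <li class="regularitem">
    <h4 class="itemtitle"><a href="http://example.com/article2">article2</a></h4>
  </li>
</root>"""

root = ET.fromstring(fragment)
# ElementTree's XPath subset equivalent of //h4[@class="itemtitle"]/a/@href
links = [a.get("href") for a in root.findall(".//h4[@class='itemtitle']/a")]
print(links)
# -> ['http://example.com/article1', 'http://example.com/article2']
```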

    As you can see, I can get the link with XPath again if I use the `<h4>` tag. I want my crawler to go to the link inside that tag and parse the meta tags for me.

    Here is my crawler code:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from tutorial.items import exampleItem
    
    
    class MetaCrawl(CrawlSpider):
        name = 'metaspider'
        start_urls = ['http://example.com/tools/rss'] # urls from which the spider will start crawling
        rules = [Rule(SgmlLinkExtractor(restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
                 Rule(SgmlLinkExtractor(restrict_xpaths=('//h4[@class="itemtitle"]')), callback='parse_articles')]
    
        def parse_articles(self, response):
            hxs = HtmlXPathSelector(response)
            meta = hxs.select('//meta')
            items = []
            for m in meta:
                item = exampleItem()
                item['link'] = response.url
                item['meta_name'] = m.select('@name').extract()
                item['meta_value'] = m.select('@content').extract()
                items.append(item)
            return items
    

    However this is the output when I run the crawler:

    DEBUG: Crawled (200) <GET http://http://feeds.example.com/subject1> (referer: http://example.com/tools/rss)
    DEBUG: Crawled (200) <GET http://http://feeds.example.com/subject2> (referer: http://example.com/tools/rss)
    

    What am I doing wrong here? I've been reading the documentation over and over again, but I feel like I keep overlooking something. Any help would be appreciated.

    EDIT: Added `items.append(item)`; I had forgotten it in the original post. EDIT: I've tried the following as well, and it resulted in the same output:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from reuters.items import exampleItem
    from scrapy.http import Request
    
    class MetaCrawl(CrawlSpider):
        name = 'metaspider'
        start_urls = ['http://example.com/tools/rss'] # urls from which the spider will start crawling
        rules = [Rule(SgmlLinkExtractor(allow=[r'.*'], restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
                 Rule(SgmlLinkExtractor(allow=[r'.*'], restrict_xpaths=('//h4[@class="itemtitle"]')), follow=True)]
    
    
        def parse(self, response):       
            hxs = HtmlXPathSelector(response)
            meta = hxs.select('//td[@class="xmlLink"]/a/@href')
            for m in meta:
                yield Request(m.extract(), callback=self.parse_link)
    
    
        def parse_link(self, response):       
            hxs = HtmlXPathSelector(response)
            meta = hxs.select('//h4[@class="itemtitle"]/a/@href')
            for m in meta:
                yield Request(m.extract(), callback=self.parse_again)
    
        def parse_again(self, response):
            hxs = HtmlXPathSelector(response)
            meta = hxs.select('//meta')
            items = []
            for m in meta:
                item = exampleItem()
                item['link'] = response.url
                item['meta_name'] = m.select('@name').extract()
                item['meta_value'] = m.select('@content').extract()
                items.append(item)
            return items
    
  • Marc, about 11 years
    Yes, you are right, I forgot to put `items.append(item)` into the post. However, it still gives me the same output. I will edit the post now.