How to properly use Rules, restrict_xpaths to crawl and parse URLs with scrapy?
You've returned an empty items list; you need to append each item to items inside the loop. You can also yield item in the loop instead of building a list.
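A minimal sketch of the corrected callback logic. Plain dicts stand in for exampleItem so it runs outside Scrapy, and the metas list and URL are made-up stand-ins for the selector results; in the real spider the same pattern applies — build the item, then yield it (or append it to the list that is returned after the loop):

```python
def parse_articles(metas, url):
    # metas: (name, content) pairs, as pulled from the //meta selectors
    for name, content in metas:
        item = {}                      # exampleItem() in the real spider
        item['link'] = url
        item['meta_name'] = name
        item['meta_value'] = content
        yield item                     # yield each item instead of returning an empty list

# Hypothetical input, just to show the shape of the output:
items = list(parse_articles([('description', 'An article')],
                            'http://example.com/article1'))
```

Because parse_articles is now a generator, Scrapy would consume the yielded items one by one, so there is no list to forget to append to.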
Marc
Updated on June 04, 2022
Comments
-
Marc, almost 2 years ago
I am trying to program a crawl spider to crawl RSS feeds of a website and then parsing the meta tags of the article.
The first RSS page is a page that displays the RSS categories. I managed to extract the links because the <a> tag is inside a <td class="xmlLink"> tag. It looks like this:
<tr>
  <td class="xmlLink">
    <a href="http://feeds.example.com/subject1">subject1</a>
  </td>
</tr>
<tr>
  <td class="xmlLink">
    <a href="http://feeds.example.com/subject2">subject2</a>
  </td>
</tr>
Once you click that link it brings you to the articles for that RSS category, which look like this:
<li class="regularitem">
  <h4 class="itemtitle">
    <a href="http://example.com/article1">article1</a>
  </h4>
</li>
<li class="regularitem">
  <h4 class="itemtitle">
    <a href="http://example.com/article2">article2</a>
  </h4>
</li>
As you can see, I can get the link with XPath again if I use the <h4 class="itemtitle"> tag. I want my crawler to follow the link inside that tag and parse the meta tags of the article for me.
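The XPath idea can be sanity-checked outside Scrapy. This sketch runs the same restriction ("only links inside <h4 class="itemtitle"> elements") against the snippet above using Python's stdlib ElementTree, whose limited XPath subset happens to cover this case; in the spider itself, Scrapy's selector and restrict_xpaths do this work:

```python
import xml.etree.ElementTree as ET

# The article fragment from the question, wrapped in a root element so it
# parses as a well-formed XML document.
snippet = """<ul>
<li class="regularitem"><h4 class="itemtitle">
<a href="http://example.com/article1">article1</a></h4></li>
<li class="regularitem"><h4 class="itemtitle">
<a href="http://example.com/article2">article2</a></h4></li>
</ul>"""

root = ET.fromstring(snippet)
# Same idea as restrict_xpaths=('//h4[@class="itemtitle"]',): consider only
# the <a> elements that sit inside those <h4> elements.
links = [a.get('href') for a in root.findall('.//h4[@class="itemtitle"]/a')]
```

If the expression matches here, the equivalent restrict_xpaths value should select the same region of the page for the link extractor.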
Here is my crawler code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import exampleItem

class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    start_urls = ['http://example.com/tools/rss']  # urls from which the spider will start crawling
    rules = [
        Rule(SgmlLinkExtractor(restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
        Rule(SgmlLinkExtractor(restrict_xpaths=('//h4[@class="itemtitle"]')), callback='parse_articles'),
    ]

    def parse_articles(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
            item = exampleItem()
            item['link'] = response.url
            item['meta_name'] = m.select('@name').extract()
            item['meta_value'] = m.select('@content').extract()
            items.append(item)
        return items
However this is the output when I run the crawler:
DEBUG: Crawled (200) <GET http://http://feeds.example.com/subject1> (referer: http://example.com/tools/rss)
DEBUG: Crawled (200) <GET http://http://feeds.example.com/subject2> (referer: http://example.com/tools/rss)
What am I doing wrong here? I've been reading the documentation over and over again but I feel like I keep overlooking something. Any help would be appreciated.
EDIT: Added items.append(item); I had forgotten it in the original post.
EDIT: I've tried this as well and it resulted in the same output:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from reuters.items import exampleItem
from scrapy.http import Request

class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    start_urls = ['http://example.com/tools/rss']  # urls from which the spider will start crawling
    rules = [
        Rule(SgmlLinkExtractor(allow=[r'.*'], restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
        Rule(SgmlLinkExtractor(allow=[r'.*'], restrict_xpaths=('//h4[@class="itemtitle"]')), follow=True),
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//td[@class="xmlLink"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback=self.parse_link)

    def parse_link(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//h4[@class="itemtitle"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback=self.parse_again)

    def parse_again(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
            item = exampleItem()
            item['link'] = response.url
            item['meta_name'] = m.select('@name').extract()
            item['meta_value'] = m.select('@content').extract()
            items.append(item)
        return items
-
Marc, about 11 years ago: Yes, you are right, I forgot to put items.append(item) into the post. However, it still gives me the same output. I will edit the post now.