Scrapy XPath all the links on the page

You should have defined a callback for the rule. Here's an example that gets all the links from the twitter.com main page (follow=False):

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field


class MyItem(Item):
    url = Field()


class MySpider(CrawlSpider):
    name = 'twitter.com'
    allowed_domains = ['twitter.com']
    start_urls = ['http://www.twitter.com']

    rules = (Rule(SgmlLinkExtractor(), callback='parse_url', follow=False), )

    def parse_url(self, response):
        item = MyItem()
        item['url'] = response.url
        return item

Then, in the output file, I see:

http://status.twitter.com/
https://twitter.com/
http://support.twitter.com/forums/26810/entries/78525
http://support.twitter.com/articles/14226-how-to-find-your-twitter-short-code-or-long-code
...
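
As later comments note, the scrapy.contrib import paths were deprecated and eventually removed. Here is a minimal sketch of the same spider against modern Scrapy, assuming a release where scrapy.linkextractors and scrapy.spiders are available:

from scrapy.item import Item, Field
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MyItem(Item):
    url = Field()


class MySpider(CrawlSpider):
    name = 'twitter.com'
    allowed_domains = ['twitter.com']
    start_urls = ['http://www.twitter.com']

    # Same rule as above: scrape each extracted link once, don't follow.
    rules = (Rule(LinkExtractor(), callback='parse_url', follow=False),)

    def parse_url(self, response):
        item = MyItem()
        item['url'] = response.url
        return item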

Hope that helps.

Comments

  • B.Mr.W.
    B.Mr.W. almost 2 years

    I am trying to collect all the URLs under a domain using Scrapy. I was trying to use CrawlSpider to start from the homepage and crawl the whole site. For each page, I want to use XPath to extract all the hrefs and store the data as key-value pairs.

    Key: the current URL. Value: all the links on this page.

    class MySpider(CrawlSpider):
        name = 'abc.com'
        allowed_domains = ['abc.com']
        start_urls = ['http://www.abc.com']
    
        rules = (Rule(SgmlLinkExtractor()), )
        def parse_item(self, response):
            hxs = HtmlXPathSelector(response)
            item = AbcItem()
            item['key'] = response.url 
            item['value'] = hxs.select('//a/@href').extract()
            return item 
    

    I defined my AbcItem() as shown below:

    from scrapy.item import Item, Field

    class AbcItem(Item):
    
        # key: url
        # value: list of links existing in the key url
        key = Field()
        value = Field()
        pass
    

    And when I run my code like this:

    nohup scrapy crawl abc.com -o output -t csv &
    

    The spider seems to start crawling, and I can see the nohup.out file being populated with all the configuration logs, but there is no information in my output file, which is what I am trying to collect. Can anyone help me with this? What might be wrong with my spider?
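
    As the accepted answer explains, the Rule above never names a callback, so parse_item is never invoked and the output file stays empty. A minimal sketch of the corrected spider, assuming AbcItem is importable from your project's items module (the module path here is hypothetical):

    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.selector import HtmlXPathSelector

    from myproject.items import AbcItem  # hypothetical items module path


    class MySpider(CrawlSpider):
        name = 'abc.com'
        allowed_domains = ['abc.com']
        start_urls = ['http://www.abc.com']

        # Naming the callback is the fix; follow=True keeps crawling
        # beyond the start page.
        rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True), )

        def parse_item(self, response):
            hxs = HtmlXPathSelector(response)
            item = AbcItem()
            item['key'] = response.url
            item['value'] = hxs.select('//a/@href').extract()
            return item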

  • B.Mr.W.
    B.Mr.W. over 10 years
    Thanks a lot for your answer, the callback makes it work! You used follow=False in your code, so it only scrapes twitter.com, right? No crawling in this case? I am pretty sure we have a similar Rule right now with the follow flag set to False, but it seems like my spider is still crawling... doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules
  • alecxe
    alecxe over 10 years
    Sure, it scrapes just the twitter.com index page, but if I set follow=True it will follow the links and scrape the links there too.
  • The Bumpaster
    The Bumpaster almost 8 years
    And how can I write a filter to search for any HTML element that contains an id/class/data-type/value of a given name, or whatever?
  • Roman
    Roman about 7 years
    For Python 3, use from scrapy.linkextractors import LinkExtractor instead of SgmlLinkExtractor.
  • PlsWork
    PlsWork almost 5 years
    Also, try from scrapy.spiders import CrawlSpider, Rule instead of from scrapy.contrib.spiders import CrawlSpider, Rule.
  • PlsWork
    PlsWork almost 5 years
    Looks like twitter doesn't want to be crawled (see the note on ROBOTSTXT_OBEY after these comments): 2019-05-12 15:18:34 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET http://www.twitter.com>
  • x89
    x89 almost 3 years
    stackoverflow.com/questions/68193300/… Could you take a look here? @alecxe
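
A note on the robots.txt error quoted above: the DEBUG line is Scrapy's robots.txt middleware refusing the request, and recent Scrapy versions enable robots.txt compliance by default in generated projects. A minimal sketch of the relevant settings.py toggle, assuming you are permitted to ignore robots.txt for the site you crawl:

# settings.py
# When True (the default in projects generated by recent Scrapy
# versions), the robots.txt middleware drops any request that the
# target site's robots.txt forbids.
ROBOTSTXT_OBEY = False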