Scrapy get all links from any website


Solution 1

There is an entire section of the Scrapy guide dedicated to broad crawls. I suggest you fine-tune your settings there to do this successfully.
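As a starting point, here is a rough sketch of the kind of settings that page discusses; the values below are illustrative assumptions, not tested recommendations:

# settings.py -- illustrative values only, tune them for your own crawl
ROBOTSTXT_OBEY = True           # respect robots.txt, as the question requires
CONCURRENT_REQUESTS = 100       # broad crawls benefit from higher concurrency
REACTOR_THREADPOOL_MAXSIZE = 20
COOKIES_ENABLED = False         # cookies are rarely needed for a broad crawl
RETRY_ENABLED = False           # don't retry failed pages
DOWNLOAD_TIMEOUT = 15           # give up on slow hosts quickly
LOG_LEVEL = 'INFO'              # DEBUG logging is too verbose at this scale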

To recreate the behaviour you need in Scrapy, you must:

  • set your start URL to your page;
  • write a parse method that follows all links and recursively calls itself, adding the requested URLs to a spider attribute.

An untested example (that can be, of course, refined):

import scrapy


class AllSpider(scrapy.Spider):
    name = 'all'

    start_urls = ['https://yourgithub.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.links = []          # every URL the spider has visited

    def parse(self, response):
        self.links.append(response.url)
        # follow every anchor on the page and parse it with this same method
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse)
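If you need to do something with the collected URLs once the crawl ends, one option (a sketch; the file name is just an example) is to add the spider's closed() hook to the class above:

    def closed(self, reason):
        # called automatically when the crawl finishes; dump every visited URL
        with open('links.txt', 'w') as f:
            f.write('\n'.join(self.links))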

Solution 2

If you want to allow crawling of all domains, simply don't specify allowed_domains, and use a LinkExtractor which extracts all links.

A simple spider that follows all links:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class FollowAllSpider(CrawlSpider):
    name = 'follow_all'

    start_urls = ['https://example.com']
    # no allowed_domains and an unrestricted LinkExtractor,
    # so every extracted link is followed
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        pass
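To try it out quickly, you can run the spider from a plain script with CrawlerProcess; the settings shown here are assumptions you may want to change:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'ROBOTSTXT_OBEY': True,   # the question specifically asks to obey robots.txt
    'LOG_LEVEL': 'INFO',
})
process.crawl(FollowAllSpider)
process.start()  # blocks until the crawl is finished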
Author: Brandon Skerritt

Updated on July 25, 2022

Comments

  • Brandon Skerritt (almost 2 years ago)

    I have the following code for a web crawler in Python 3:

    import requests
    from bs4 import BeautifulSoup
    import re

    def get_links(link):
        return_links = []

        r = requests.get(link)
        soup = BeautifulSoup(r.content, "lxml")

        if r.status_code != 200:
            print("Error. Something is wrong here")
        else:
            # collect every absolute link on the page
            for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
                return_links.append(link.get('href'))

        return return_links

    def recursive_search(links):
        for i in links:
            # extend (not append) so we queue individual URLs, not nested lists
            links.extend(get_links(i))
        recursive_search(links)

    recursive_search(get_links("https://www.brandonskerritt.github.io"))


    The code basically gets all the links off of my GitHub pages website, and then it gets all the links off of those links, and so on until the end of time or an error occurs.

    I want to recreate this code in Scrapy so it can obey robots.txt and be a better web crawler overall. I've researched online and I can only find tutorials / guides / stackoverflow / quora / blog posts about how to scrape a specific domain (allowed_domains=["google.com"], for example). I do not want to do this. I want to create code that will scrape all websites recursively.

    This isn't much of a problem, but all the blog posts etc. only show how to get the links from a specific website (for example, it might be that the links are in list tags). The code I have above works for all anchor tags, regardless of what website it's being run on.

    I do not want to use this in the wild, I need it for demonstration purposes so I'm not going to suddenly annoy everyone with excessive web crawling.

    Any help will be appreciated!

  • JohnAndrews (about 4 years ago)
    When I add deny_extensions=[] and I have a page with 4 PDF links, why does it only return one link?