Scrapy get all links from any website


Solution 1

There is an entire section of the Scrapy guide dedicated to broad crawls. I suggest you fine-tune your settings there to do this successfully.
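As a starting point, here is a rough sketch of the kind of settings that page discusses; the values below are illustrative assumptions, not tested recommendations:

# settings.py -- illustrative values only, tune them for your own crawl
ROBOTSTXT_OBEY = True           # respect robots.txt, as the question requires
CONCURRENT_REQUESTS = 100       # broad crawls benefit from higher concurrency
REACTOR_THREADPOOL_MAXSIZE = 20
COOKIES_ENABLED = False         # cookies are rarely needed for a broad crawl
RETRY_ENABLED = False           # don't retry failed pages
DOWNLOAD_TIMEOUT = 15           # give up on slow hosts quickly
LOG_LEVEL = 'INFO'              # DEBUG logging is too verbose at this scale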

To recreate the behaviour you need in Scrapy, you must:

  • set your start URL to your page;
  • write a parse method that follows all links and recursively calls itself, adding the requested URLs to a spider attribute.

An untested example (that can be, of course, refined):

import scrapy


class AllSpider(scrapy.Spider):
    name = 'all'

    start_urls = ['https://yourgithub.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.links = []          # every URL the spider has visited

    def parse(self, response):
        self.links.append(response.url)
        # follow every anchor on the page and parse it with this same method
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse)
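If you need to do something with the collected URLs once the crawl ends, one option (a sketch; the file name is just an example) is to add the spider's closed() hook to the class above:

    def closed(self, reason):
        # called automatically when the crawl finishes; dump every visited URL
        with open('links.txt', 'w') as f:
            f.write('\n'.join(self.links))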

Solution 2

If you want to allow crawling of all domains, simply don't specify allowed_domains, and use a LinkExtractor which extracts all links.

A simple spider that follows all links:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class FollowAllSpider(CrawlSpider):
    name = 'follow_all'

    start_urls = ['https://example.com']
    # no allowed_domains and an unrestricted LinkExtractor,
    # so every extracted link is followed
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        pass
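To try it out quickly, you can run the spider from a plain script with CrawlerProcess; the settings shown here are assumptions you may want to change:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'ROBOTSTXT_OBEY': True,   # the question specifically asks to obey robots.txt
    'LOG_LEVEL': 'INFO',
})
process.crawl(FollowAllSpider)
process.start()  # blocks until the crawl is finished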
Author: Brandon Skerritt

Updated on July 25, 2022

Comments

  • Brandon Skerritt (almost 2 years ago)

    I have the following code for a web crawler in Python 3:

    import requests
    from bs4 import BeautifulSoup
    import re

    def get_links(link):
        return_links = []

        r = requests.get(link)
        soup = BeautifulSoup(r.content, "lxml")

        if r.status_code != 200:
            print("Error. Something is wrong here")
        else:
            # collect every absolute link on the page
            for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
                return_links.append(link.get('href'))

        return return_links

    def recursive_search(links):
        for i in links:
            # extend (not append) so we queue individual URLs, not nested lists
            links.extend(get_links(i))
        recursive_search(links)

    recursive_search(get_links("https://www.brandonskerritt.github.io"))


    The code basically gets all the links off of my GitHub pages website, and then it gets all the links off of those links, and so on until the end of time or an error occurs.

    I want to recreate this code in Scrapy so it can obey robots.txt and be a better web crawler overall. I've researched online and I can only find tutorials / guides / stackoverflow / quora / blog posts about how to scrape a specific domain (allowed_domains=["google.com"], for example). I do not want to do this. I want to create code that will scrape all websites recursively.

    This isn't much of a problem, but all the blog posts etc. only show how to get the links from a specific website (for example, it might be that the links are in list tags). The code I have above works for all anchor tags, regardless of what website it's being run on.

    I do not want to use this in the wild, I need it for demonstration purposes so I'm not going to suddenly annoy everyone with excessive web crawling.

    Any help will be appreciated!

  • JohnAndrews (about 4 years ago)
    When I add deny_extensions=[] and I have a page with 4 PDF links, why does it only return one link?