Passing arguments to process.crawl in Scrapy (Python)

Solution 1

Pass the spider arguments as keyword arguments to process.crawl, giving it the spider class rather than a pre-built instance:

process.crawl(LinkedInAnonymousSpider, input='inputargument', first='James', last='Bond')
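
For context, a minimal self-contained sketch of that approach, using the project layout from the question below (the argument values are just illustrative):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from linkedin_anonymous_spider import LinkedInAnonymousSpider

process = CrawlerProcess(get_project_settings())
# Pass the spider class; CrawlerProcess instantiates it and forwards
# the keyword arguments to the spider's __init__.
process.crawl(LinkedInAnonymousSpider, first='James', last='Bond')
process.start()  # the script blocks here until the crawl finishes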

Solution 2

You can do it the easy way:

from scrapy import cmdline

cmdline.execute("scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json".split())
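
Note that str.split() tokenizes on whitespace, so this one-liner breaks if an argument value contains a space. A sketch of the same call with an explicit argv list, which avoids that:

from scrapy import cmdline

# Same command as above, but each argv element is spelled out, so
# values containing spaces survive intact.
cmdline.execute([
    "scrapy", "crawl", "linkedin_anonymous",
    "-a", "first=James",
    "-a", "last=Bond",
    "-o", "output.json",
])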

Solution 3

If you have Scrapyd and you want to schedule the spider, do this:

curl http://localhost:6800/schedule.json -d project=projectname -d spider=spidername -d first='James' -d last='Bond'
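
The same request can be sent from Python; here is a sketch using the requests library (the host, port, and project name are the assumptions from the curl example above). Scrapyd forwards any form fields beyond project and spider to the spider as arguments:

import requests

response = requests.post(
    "http://localhost:6800/schedule.json",
    data={
        "project": "projectname",
        "spider": "spidername",
        "first": "James",   # forwarded to the spider like -a first=James
        "last": "Bond",
    },
)
print(response.json())  # e.g. {"status": "ok", "jobid": "..."}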


Comments

  • yusuf

    I would like to get the same result as this command line: scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json

    My script is as follows:

    import scrapy
    from linkedin_anonymous_spider import LinkedInAnonymousSpider
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    
    spider = LinkedInAnonymousSpider(None, "James", "Bond")
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider) ## <-------------- (1)
    process.start()
    

    I found out that process.crawl() in (1) is creating another LinkedInAnonymousSpider in which first and last are None (printed in (2)). If so, there is no point in creating the spider object, so how is it possible to pass the arguments first and last to process.crawl()?

    linkedin_anonymous:

    from logging import INFO
    
    import scrapy
    
    class LinkedInAnonymousSpider(scrapy.Spider):
        name = "linkedin_anonymous"
        allowed_domains = ["linkedin.com"]
        start_urls = []
    
        base_url = "https://www.linkedin.com/pub/dir/?first=%s&last=%s&search=Search"
    
        def __init__(self, input=None, first=None, last=None):
            self.input = input  # source file name
            self.first = first
            self.last = last
    
        def start_requests(self):
            print(self.first)  ## <------------- (2)
            if self.first and self.last: # taking input from command line parameters
                    url = self.base_url % (self.first, self.last)
                    yield self.make_requests_from_url(url)
    
        def parse(self, response): ...
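
    For reference, the spider-side pattern that makes Solution 1 work cleanly is to forward constructor arguments to the base class. A minimal sketch of the spider above with that change (the super().__init__ call is the only addition):

    import scrapy

    class LinkedInAnonymousSpider(scrapy.Spider):
        name = "linkedin_anonymous"
        allowed_domains = ["linkedin.com"]

        def __init__(self, input=None, first=None, last=None, *args, **kwargs):
            # Forward the remaining arguments to scrapy.Spider so built-in
            # attributes (name, start_urls, ...) are initialized correctly.
            super().__init__(*args, **kwargs)
            self.input = input
            self.first = first
            self.last = last

    With this in place, process.crawl(LinkedInAnonymousSpider, first='James', last='Bond') lets CrawlerProcess build the spider itself, which is also why the spider object created by hand in (1) has no effect: Scrapy re-creates the spider from its class.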