Passing arguments to process.crawl in Scrapy Python
Solution 1
Pass the spider arguments in the process.crawl method, giving it the spider class rather than an instance:

process.crawl(LinkedInAnonymousSpider, input='inputargument', first='James', last='Bond')
Solution 2
You can do it the easy way:
from scrapy import cmdline
cmdline.execute("scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json".split())
Solution 3
If you have Scrapyd and want to schedule the spider, run:
curl http://localhost:6800/schedule.json -d project=projectname -d spider=spidername -d first='James' -d last='Bond'
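Scrapyd forwards any extra -d parameters beyond project and spider to the spider as arguments. The same request can be built from Python with only the standard library; a sketch (endpoint and names taken from the curl command above, not executed here since it needs a running Scrapyd instance):

```python
from urllib.parse import urlencode
from urllib.request import Request

# Fields beyond "project" and "spider" are passed through to the
# spider's __init__ as keyword arguments by Scrapyd.
params = {
    "project": "projectname",
    "spider": "linkedin_anonymous",
    "first": "James",
    "last": "Bond",
}
body = urlencode(params).encode()
req = Request("http://localhost:6800/schedule.json", data=body)  # data= makes it a POST
# urllib.request.urlopen(req) would submit the job to Scrapyd.
```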
yusuf
Updated on June 18, 2022

Comments

yusuf, almost 2 years ago:
I would like to get the same result as this command line : scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json
My script is as follows:

import scrapy
from linkedin_anonymous_spider import LinkedInAnonymousSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

spider = LinkedInAnonymousSpider(None, "James", "Bond")
process = CrawlerProcess(get_project_settings())
process.crawl(spider) ## <-------------- (1)
process.start()
I found out that process.crawl() in (1) is creating another LinkedInAnonymousSpider where first and last are None (printed in (2)). If so, there is no point in creating the spider object, so how is it possible to pass the arguments first and last to process.crawl()?
linkedin_anonymous :
from logging import INFO

import scrapy

class LinkedInAnonymousSpider(scrapy.Spider):
    name = "linkedin_anonymous"
    allowed_domains = ["linkedin.com"]
    start_urls = []
    base_url = "https://www.linkedin.com/pub/dir/?first=%s&last=%s&search=Search"

    def __init__(self, input=None, first=None, last=None):
        self.input = input  # source file name
        self.first = first
        self.last = last

    def start_requests(self):
        print(self.first)  ## <------------- (2)
        if self.first and self.last:  # taking input from command line parameters
            url = self.base_url % (self.first, self.last)
            yield self.make_requests_from_url(url)

    def parse(self, response):
        . . .