Running Multiple spiders in scrapy for 1 website in parallel?
Solution 1
I think what you are looking for is something like this:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
You can read more at: running-multiple-spiders-in-the-same-process.
Solution 2
Alternatively, you can run every spider in the project like this. Save the script in the same directory as scrapy.cfg (my Scrapy version is 1.3.3):
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

setting = get_project_settings()
process = CrawlerProcess(setting)

for spider_name in process.spiders.list():
    print("Running spider %s" % (spider_name))
    process.crawl(spider_name, query="dvh")  # query="dvh" is a custom argument passed to your spider

process.start()
Solution 3
A better solution, if you have multiple spiders, is to load the spiders dynamically and run them:
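from scrapy import spiderloader
from scrapy.crawler import CrawlerRunner
from scrapy.utils import project
from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks

settings = project.get_project_settings()
runner = CrawlerRunner(settings)  # was undefined in the original snippet

@inlineCallbacks
def crawl():
    spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    spiders = spider_loader.list()
    classes = [spider_loader.load(name) for name in spiders]
    for my_spider in classes:
        # Each yield waits for that crawl to finish, so the spiders run one after another.
        yield runner.crawl(my_spider)
    reactor.stop()

crawl()
reactor.run()  # blocks until reactor.stop() is called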
(Second solution): Because spiders.list() is deprecated in Scrapy 1.4, Yuda's solution should be converted to something like:
from scrapy import spiderloader
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

settings = get_project_settings()
process = CrawlerProcess(settings)
spider_loader = spiderloader.SpiderLoader.from_settings(settings)

for spider_name in spider_loader.list():
    print("Running spider %s" % (spider_name))
    process.crawl(spider_name)

process.start()
parik
Updated on June 14, 2022
Comments
-
parik almost 2 years
I want to crawl a website with 2 parts and my script is not as fast as I need.
Is it possible to launch 2 spiders, one for scraping the first part and the second one for the second part?
I tried writing 2 different classes and running them with
scrapy crawl firstSpider
scrapy crawl secondSpider
but I think that is not smart.
I read the documentation of scrapyd but I don't know if it's good for my case.
-
vdkotian almost 6 years
What if, say, I have 300 spiders which need to run? Can this implementation hold?
-
Pyd over 2 years
Can we also store the results separately?
-
Kingname over 2 years
This solution will not run all the spiders in parallel. The spiders will be run one by one, from last to first.
-
Adeena Lathiya about 2 years
@ScottStafford How will you run the shell?