ReactorNotRestartable error in while loop with scrapy
Solution 1
By default, CrawlerProcess's .start() will stop the Twisted reactor it creates when all crawlers have finished.
You should call process.start(stop_after_crawl=False) if you create process in each iteration.
Another option is to handle the Twisted reactor yourself and use CrawlerRunner. The docs have an example of doing that.
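Handling the reactor yourself could look roughly like the sketch below, which follows the sequential-crawl pattern from the Scrapy docs. The function name crawl_repeatedly and its times parameter are made up for illustration, and 'my_spider' is the spider name from the question. Depending on your Scrapy version and the TWISTED_REACTOR setting, the matching reactor may also have to be installed first; that step is left out here.

```python
def crawl_repeatedly(spider_name, times):
    """Run the same spider `times` times, one after another, in one reactor run."""
    # Imports live inside the function so the Twisted reactor is only
    # installed when a crawl is actually requested.
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings
    from twisted.internet import defer, reactor

    configure_logging()
    runner = CrawlerRunner(get_project_settings())

    @defer.inlineCallbacks
    def crawl():
        for _ in range(times):
            # runner.crawl() returns a Deferred that fires when the crawl is done
            yield runner.crawl(spider_name)
        reactor.stop()

    crawl()
    reactor.run()  # blocks until reactor.stop(); run it only once per process


# Example call (assumes a Scrapy project with a spider named 'my_spider'):
# crawl_repeatedly("my_spider", times=3)
```

The loop lives inside the reactor instead of around it, so reactor.run() is reached exactly once and ReactorNotRestartable should not come up.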
Solution 2
I was able to solve this problem like this. process.start() should be called only once.
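from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
# Deprecated in later Scrapy releases, where "from pydispatch import dispatcher" replaces it.
from scrapy.xlib.pydispatch import dispatcher

result = None

def set_result(item):
    global result  # without this, the assignment only creates a local variable
    result = item

# Create the process and connect the handler once, outside any loop.
process = CrawlerProcess(get_project_settings())
dispatcher.connect(set_result, signals.item_scraped)
process.crawl('my_spider')
process.start()  # starts the reactor; call this only once per Python process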
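If the goal is only to get the scraped items back into the script, the signal can also be connected through the crawler object instead of the global dispatcher. This is a minimal sketch; crawl_and_collect and collect_item are names invented here, and it still allows only one call per Python process.

```python
def crawl_and_collect(spider_name):
    """Run one crawl and return the list of items it scraped."""
    # Imported lazily so that merely defining this helper does not touch the reactor.
    from scrapy import signals
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    items = []

    # A named function (not an inline lambda) because signal receivers are
    # held by weak reference and must stay alive while the crawl runs.
    def collect_item(item, response, spider):
        items.append(item)

    process = CrawlerProcess(get_project_settings())
    crawler = process.create_crawler(spider_name)
    crawler.signals.connect(collect_item, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks until the crawl finishes; not restartable afterwards
    return items


# Example call (assumes a Scrapy project with a spider named 'my_spider'):
# items = crawl_and_collect("my_spider")
```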
Solution 3
Ref http://crawl.blog/scrapy-loop/
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from twisted.internet.task import deferLater

def sleep(_result, *args, seconds):
    """Non-blocking sleep callback."""
    return deferLater(reactor, seconds, lambda: None)

process = CrawlerProcess(get_project_settings())

def _crawl(result, spider):
    deferred = process.crawl(spider)
    deferred.addCallback(lambda results: print('waiting 100 seconds before restart...'))
    deferred.addCallback(sleep, seconds=100)
    # Schedule the next crawl once the sleep has finished.
    deferred.addCallback(_crawl, spider)
    return deferred

_crawl(None, MySpider)  # MySpider is your scrapy.Spider subclass, defined or imported elsewhere
process.start()
Solution 4
Within a single process, once you call reactor.run() or process.start(), you cannot call them again, because the Twisted reactor cannot be restarted. The reactor stops once the script finishes executing.
So the best option is to use separate subprocesses if you need to run the reactor multiple times.
You can move the body of the while loop into a function (say, execute_crawling) and run that function in a new subprocess each time. Python's multiprocessing.Process class can be used for this. The code is given below.
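from multiprocessing import Process

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

def execute_crawling():
    # Each call runs in a fresh process, so the reactor starts from scratch.
    process = CrawlerProcess(get_project_settings())  # the same can be done with CrawlerRunner
    dispatcher.connect(set_result, signals.item_scraped)  # set_result as defined in the question
    process.crawl('my_spider')
    process.start()

if __name__ == '__main__':
    for k in range(Number_of_times_you_want):
        p = Process(target=execute_crawling)
        p.start()
        p.join()  # this blocks until the process terminates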
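One thing to keep in mind with this approach: a variable set inside the child process is not visible in the parent, so the question's "if result: break" check needs the items sent back explicitly. Below is a rough sketch using a multiprocessing.Queue. The helper names (_crawl_worker, run_crawl_in_subprocess, crawl_until_result) are invented here, and it assumes the scraped items are dict-like and picklable.

```python
import multiprocessing
import time


def _crawl_worker(spider_name, queue):
    """Child-process entry point: crawl once and send the scraped items back."""
    items = []
    try:
        # Scrapy is imported here so the reactor only ever exists in the child.
        from scrapy import signals
        from scrapy.crawler import CrawlerProcess
        from scrapy.utils.project import get_project_settings

        def collect_item(item, response, spider):
            items.append(dict(item))  # assumption: items are dict-like and picklable

        process = CrawlerProcess(get_project_settings())
        crawler = process.create_crawler(spider_name)
        crawler.signals.connect(collect_item, signal=signals.item_scraped)
        process.crawl(crawler)
        process.start()
    finally:
        queue.put(items)  # always report back so the parent never waits forever


def run_crawl_in_subprocess(spider_name):
    """Run one crawl in a fresh process and return the items it scraped."""
    queue = multiprocessing.Queue()
    child = multiprocessing.Process(target=_crawl_worker, args=(spider_name, queue))
    child.start()
    items = queue.get()  # read before join() so a full queue cannot block the child
    child.join()
    return items


def crawl_until_result(run_once, delay_seconds=3, max_attempts=None):
    """Call run_once() until it returns something truthy, pausing between attempts."""
    attempts = 0
    while True:
        result = run_once()
        if result:
            return result
        attempts += 1
        if max_attempts is not None and attempts >= max_attempts:
            return None
        time.sleep(delay_seconds)


# Example call (assumes a Scrapy project with a spider named 'my_spider'):
# if __name__ == "__main__":
#     items = crawl_until_result(lambda: run_crawl_in_subprocess("my_spider"))
```

Keeping the retry loop separate from the process handling means the loop can be tried out with any callable, without Scrapy being involved.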
Solution 5
I would advise you to run the scrapers using the subprocess module:
from subprocess import Popen, PIPE
spider = Popen(["scrapy", "crawl", "spider_name", "-a", "argument=value"], stdout=PIPE)
spider.wait()
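The same idea can be wrapped in a small helper so it is easy to call in a loop. This is only a sketch: build_scrapy_command and run_spider are names made up here, and it assumes the script is started from inside the Scrapy project directory so that "scrapy crawl" can find the spider.

```python
import subprocess


def build_scrapy_command(spider_name, **spider_args):
    """Build the argument list for `scrapy crawl`, with one -a flag per spider argument."""
    command = ["scrapy", "crawl", spider_name]
    for name, value in spider_args.items():
        command += ["-a", f"{name}={value}"]
    return command


def run_spider(spider_name, **spider_args):
    """Run the spider in a separate process and return its exit code."""
    completed = subprocess.run(build_scrapy_command(spider_name, **spider_args))
    return completed.returncode


# Example call, equivalent to the Popen line above but without capturing stdout:
# exit_code = run_spider("spider_name", argument="value")
```

Each call starts a new Python interpreter with its own reactor, so it can be repeated as often as needed.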
k_wit
Updated on August 07, 2021
Comments
-
k_wit almost 3 years
I get a twisted.internet.error.ReactorNotRestartable error when I execute the following code:

from time import sleep
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

result = None

def set_result(item):
    result = item

while True:
    process = CrawlerProcess(get_project_settings())
    dispatcher.connect(set_result, signals.item_scraped)
    process.crawl('my_spider')
    process.start()
    if result:
        break
    sleep(3)

The first time it works, then I get the error. I create the process variable each time, so what's the problem?
-
Ilia w495 Nikitin about 7 years
process.start(stop_after_crawl=False) will block the main process.
-
paul trmbrth about 7 years
@Iliaw495Nikitin, CrawlerProcess.start() will run the reactor and give back control to the thread when the crawl is finished, correct. Is that an issue here? The alternative, scrapy.crawler.CrawlerRunner's .crawl(), "Returns a deferred that is fired when the crawling is finished."
-
Burak Kaymakci over 3 years
Blocking wouldn't be a good idea for AWS Lambda, would it? I have literally spent half a day just trying to figure out how to get this running on AWS Lambda, still nothing.
-
paul trmbrth over 3 years
I have no idea how AWS Lambda works. You may want to post a new question.