Getting scrapy project settings when script is outside of root directory
Solution 1
Thanks to some of the answers already provided here, I realised scrapy wasn't actually importing the settings.py file. This is how I fixed it.
TLDR: Make sure you set the 'SCRAPY_SETTINGS_MODULE' environment variable to the dotted path of your actual settings module (settings.py). I'm doing this in the __init__() method of Scraper.
Consider a project with the following structure.
my_project/
    main.py                 # Where we are running scrapy from
    scraper/
        run_scraper.py      # Call from main goes here
        scrapy.cfg          # deploy configuration file
        scraper/            # project's Python module, you'll import your code from here
            __init__.py
            items.py        # project items definition file
            pipelines.py    # project pipelines file
            settings.py     # project settings file
            spiders/        # a directory where you'll later put your spiders
                __init__.py
                quotes_spider.py    # Contains the QuotesSpider class
Basically, the command scrapy startproject scraper was executed in the my_project folder. I then added a run_scraper.py file to the outer scraper folder, a main.py file to my root folder, and quotes_spider.py to the spiders folder.
My main file:
from scraper.run_scraper import Scraper
scraper = Scraper()
scraper.run_spiders()
My run_scraper.py file:
import os

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from scraper.scraper.spiders.quotes_spider import QuotesSpider


class Scraper:
    def __init__(self):
        settings_file_path = 'scraper.scraper.settings'  # The path seen from root, ie. from main.py
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        self.process = CrawlerProcess(get_project_settings())
        self.spider = QuotesSpider  # The spider you want to crawl

    def run_spiders(self):
        self.process.crawl(self.spider)
        self.process.start()  # the script will block here until the crawling is finished
Also, note that the settings might require a look-over, since the path needs to be according to the root folder (my_project, not scraper). So in my case:
SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'
And repeat for all the settings variables you have!
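If there are many such paths, a small helper can apply the extra package prefix in one place. This is only a sketch: `prefix_paths` is a hypothetical helper, not part of Scrapy, and `ScraperPipeline` is assumed to be the pipeline class name in pipelines.py.

```python
def prefix_paths(value, prefix="scraper"):
    """Prepend `prefix.` to dotted module paths held in a str, list, or dict."""
    if isinstance(value, str):
        return f"{prefix}.{value}"
    if isinstance(value, (list, tuple)):
        return [prefix_paths(item, prefix) for item in value]
    if isinstance(value, dict):
        # Dict settings such as ITEM_PIPELINES map a dotted path to a priority.
        return {prefix_paths(key, prefix): val for key, val in value.items()}
    return value


# Paths as seen from the inner project, rewritten to be valid from my_project/.
SPIDER_MODULES = prefix_paths(["scraper.spiders"])
NEWSPIDER_MODULE = prefix_paths("scraper.spiders")
ITEM_PIPELINES = prefix_paths({"scraper.pipelines.ScraperPipeline": 300})
```

Writing the full paths by hand, as above, works just as well; the helper only saves repetition.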
Solution 2
It should work. Can you share your Scrapy log file?
Edit: your approach will not work because, when you execute the script, Scrapy looks for your settings in this order:
- the module named by the environment variable SCRAPY_SETTINGS_MODULE (ENVVAR in Scrapy's source), if you have set it;
- a scrapy.cfg file in the directory you are executing your script from, if that file points to a valid settings module;
- otherwise, it runs with the vanilla settings provided by Scrapy (your case).
Solution 1: Create a cfg file inside the directory you run the script from (the outside folder) and give it the path to the valid settings.py module.
Solution 2: Make your parent directory a package, so that an absolute path is not required and you can use a relative path,
i.e. python -m cron.project1
Solution 3: You can also try something like this.
Leave the script where it is, inside the project directory, where it is working.
Create a sh file:
- Line 1: cd to the first project's location (root directory)
- Line 2: python script1.py
- Line 3: cd to the second project's location
- Line 4: python script2.py
Now you can execute the spiders via this sh file when requested by Django.
Solution 3
I have used this code to solve the problem:
import os

from scrapy.settings import Settings

settings = Settings()
# SCRAPY_ENV holds the dotted path of the settings module; fall back to the dev settings.
settings_module_path = os.environ.get('SCRAPY_ENV', 'project.settings.dev')
settings.setmodule(settings_module_path, priority='project')
print(settings.get('BASE_URL'))
Solution 4
This could happen because you are no longer "inside" a Scrapy project, so get_project_settings() doesn't know where to find the settings.
You can also specify the settings as a dictionary, as in the example here:
http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
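A minimal sketch of the dictionary approach, assuming the user agent from the question and a hypothetical pipeline path (`scraper.scraper.pipelines.ScraperPipeline`); `run_spider` is an illustrative name, not a Scrapy function.

```python
# Settings passed explicitly, so nothing depends on scrapy.cfg or the working directory.
CUSTOM_SETTINGS = {
    "USER_AGENT": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    # Hypothetical pipeline path: use the dotted path importable from where the script runs.
    "ITEM_PIPELINES": {"scraper.scraper.pipelines.ScraperPipeline": 300},
}


def run_spider(spider_cls, settings=CUSTOM_SETTINGS):
    """Crawl one spider class with explicit settings; blocks until the crawl ends."""
    # Imported here so this module can be loaded without Scrapy installed.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings=settings)
    process.crawl(spider_cls)
    process.start()
```

The trade-off is that every setting the project relies on (pipelines, middlewares, and so on) has to be repeated in the dictionary, since settings.py is not read at all.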
loremIpsum1771
Updated on September 05, 2022
Comments
-
loremIpsum1771 over 1 year
I have made a Scrapy spider that can be successfully run from a script located in the root directory of the project. As I need to run multiple spiders from different projects from the same script (this will be a django app calling the script upon the user's request), I moved the script from the root of one of the projects to the parent directory. For some reason, the script is no longer able to get the project's custom settings in order to pipeline the scraped results into the database tables. Here is the code from the scrapy docs I'm using to run the spider from a script:
def spiderCrawl():
    settings = get_project_settings()
    settings.set('USER_AGENT', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')
    process = CrawlerProcess(settings)
    process.crawl(MySpider3)
    process.start()
Is there some extra module that needs to be imported in order to get the project settings from outside of the project? Or does there need to be some additions made to this code? Below I also have the code for the script running the spiders, thanks.
from ticket_city_scraper.ticket_city_scraper import *
from ticket_city_scraper.ticket_city_scraper.spiders import tc_spider
from vividseats_scraper.vividseats_scraper import *
from vividseats_scraper.vividseats_scraper.spiders import vs_spider

tc_spider.spiderCrawl()
vs_spider.spiderCrawl()
-
loremIpsum1771 almost 9 years
Where do I find the log file? Also, I previously posted a question about running a bash script to run the spiders like you suggested but someone suggested just running the spiders in a celery task.
-
MrPandav over 8 years
log.start(logfile='my.log') ... right before you start your crawl, and it will save the logs in the my.log file.
-
loremIpsum1771 over 8 years
I'm not sure how I can use the dictionary to specify the settings outside of the project. Could you clarify with a code example? Btw, an alternative I was trying was essentially putting the script inside of the project root and then calling the function from within the script from another script in a higher directory (so this higher script can call scripts from other projects). This does not seem to be working though, sorry if it sounds confusing.
-
M.T almost 8 years
Could you specify what your Solution 1 would/should look like, or provide a link to documentation that this works?
-
MrPandav almost 8 years
Basically, copy the cfg file to the directory from which you want to run this project, then change the path in scrapy.cfg: default = <parent-folder2>.<parent-folder1>.project.settings
-
MrPandav almost 8 years
Then in your settings.py file change all the paths so they are relative to the current folder, i.e. change the paths in SPIDER_MODULES, SPIDER_MIDDLEWARES, ITEM_PIPELINES, exporters, etc.
-
MrPandav almost 8 years
SPIDER_MODULES = ['project.spiders'] would become SPIDER_MODULES = ['<parent-folder2>.<parent-folder1>.project.spiders']
-
Marco D.G. over 5 years
Thanks, you saved my day. I had to modify settings.py as the last note explains.
-
Predicate almost 2 years
This is brilliant, thank you very much!