Scrapy crawler in Cron job
Solution 1
I solved this problem by adding scrapy's install directory to PATH in the bash file:
#!/bin/bash
cd /myfolder/crawlers/
PATH=$PATH:/usr/local/bin
export PATH
scrapy crawl my_spider_name
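Before scheduling the script, it can help to see whether scrapy would be found with the sparse PATH that cron jobs usually get. The following is a minimal sketch, assuming cron's default PATH is /usr/bin:/bin (typical, but worth checking on your system); check_in_cron_env is a hypothetical helper name, not something cron or scrapy provides.

```shell
#!/bin/sh
# Sketch: check whether a command resolves under a cron-like environment.
# Assumption: cron's default PATH is /usr/bin:/bin (typical, not guaranteed).
# check_in_cron_env is a hypothetical helper, not part of cron or scrapy.
check_in_cron_env() {
    # $1 = command name, $2 = optional PATH to simulate
    # env -i starts from an empty environment, so only PATH is set.
    env -i PATH="${2:-/usr/bin:/bin}" /bin/sh -c 'command -v "$1" >/dev/null 2>&1' sh "$1"
}

if check_in_cron_env scrapy; then
    echo "scrapy resolves with the simulated cron PATH"
else
    echo "scrapy not found; add its directory to PATH as in the script above"
fi
```

Passing a second argument, for example check_in_cron_env scrapy /usr/bin:/bin:/usr/local/bin, lets you confirm that the extended PATH is enough.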
Solution 2
Adding the following lines via crontab -e runs my scrapy crawl at 5 AM every day. This is a slightly modified version of crocs' answer.
PATH=/usr/bin
0 5 * * * cd project_folder/project_name/ && scrapy crawl spider_name
Without setting $PATH, cron would give me the error "command not found: scrapy". I guess this is because /usr/bin is where the launcher scripts for programs are stored in Ubuntu.
Note that the complete path for my scrapy project is /home/user/project_folder/project_name. I ran the env command in cron and noticed that the working directory is /home/user, hence I skipped /home/user in my crontab above.
The cron log can be helpful while debugging
grep CRON /var/log/syslog
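To narrow the log down to one job, the grep above can be chained with a second filter. This is a small sketch; recent_cron_entries is a hypothetical helper name, and /var/log/syslog is the location from the command above (other distributions may log cron elsewhere).

```shell
#!/bin/sh
# Sketch: show the most recent cron log lines that mention a given pattern.
# recent_cron_entries is a hypothetical helper name.
# Assumption: cron logs to /var/log/syslog, as in the grep command above.
recent_cron_entries() {
    pattern="$1"
    log="${2:-/var/log/syslog}"
    # First keep cron lines, then those mentioning the pattern, newest 20 last.
    grep CRON "$log" | grep -- "$pattern" | tail -n 20
}
```

Calling recent_cron_entries scrapy would then list only the cron lines for the crawl command; reading the log may require sudo depending on file permissions.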
Solution 3
For anyone who used pip3 (or similar) to install scrapy, here is a simple inline solution:
*/10 * * * * cd ~/project/path && ~/.local/bin/scrapy crawl something >> ~/crawl.log 2>&1
Replace:
*/10 * * * * with your cron pattern
~/project/path with the path to your scrapy project (where your scrapy.cfg is)
something with the spider name (use scrapy list in your project to find out)
~/crawl.log with your log file location (in case you want logging)
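If the one-liner grows, the same cd, run, and append-to-log pattern can live in a small script. The sketch below is one possible shape; run_logged is a hypothetical helper name, and the timestamp line is an addition that the one-liner above does not write.

```shell
#!/bin/sh
# Sketch: run a command from a given directory and append stdout and
# stderr to a log file, preceded by a timestamp line.
# run_logged is a hypothetical helper name.
run_logged() {
    dir="$1"
    log="$2"
    shift 2
    # The subshell keeps the cd from changing the caller's directory.
    (
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        cd "$dir" && "$@"
    ) >> "$log" 2>&1
}
```

With the placeholders from the list above, a call would look like run_logged "$HOME/project/path" "$HOME/crawl.log" "$HOME/.local/bin/scrapy" crawl something.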
Solution 4
Another option is to skip the shell script and chain the two commands together directly in the cron job. Just make sure the PATH variable is set before the first scrapy cron job in the crontab list. Run:
crontab -e
to edit and have a look. I have several scrapy crawlers which run at various times. Some every 5 mins, others twice a day.
PATH=/usr/local/bin
*/5 * * * * cd /myfolder/crawlers/ && scrapy crawl my_spider_name_1
0 1,13 * * * cd /myfolder/crawlers/ && scrapy crawl my_spider_name_2
All jobs located after the PATH variable will find scrapy. Here the first one will run every 5 mins and the 2nd twice a day at 1am and 1pm. I found this easier to manage. If you have other binaries to run then you may need to add their locations to the path.
Solution 5
Check where scrapy is installed using the "which scrapy" command. In my case, scrapy is installed in /usr/local/bin.
Open the crontab for editing using crontab -e and add:
PATH=/usr/local/bin:/usr/bin:/bin
*/5 * * * * cd /myfolder/path && scrapy crawl spider_name
It should work. Scrapy runs every 5 minutes.
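The lookup step can be turned into a ready-made PATH line for the crontab. This is a rough sketch: cron_path_line is a hypothetical helper name, and it writes a literal value because common cron implementations do not expand $PATH inside crontab variable assignments. Appending /usr/bin:/bin is an assumption about where the remaining tools live.

```shell
#!/bin/sh
# Sketch: print a crontab PATH line containing the directory of a command.
# cron_path_line is a hypothetical helper name.
# Assumption: /usr/bin and /bin cover the other tools the job needs.
cron_path_line() {
    # command -v prints the full path for an external program and
    # fails if the name cannot be resolved.
    bin="$(command -v "$1")" || return 1
    # A duplicated directory (e.g. /usr/bin twice) is harmless in PATH.
    echo "PATH=$(dirname "$bin"):/usr/bin:/bin"
}
```

Running cron_path_line scrapy in your normal terminal prints the line to paste at the top of the crontab.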
Beka Tomashvili
Updated on June 05, 2022
Beka Tomashvili almost 2 years
I want to run my scrapy crawler from a cron job.
I created a bash file, getdata.sh, in the directory where the scrapy project and its spiders are located:
#!/bin/bash
cd /myfolder/crawlers/
scrapy crawl my_spider_name
My crontab looks like this; I want it to run every 5 minutes:
*/5 * * * * sh /myfolder/crawlers/getdata.sh
But it doesn't work. What's wrong, where is my error?
When I execute the bash file from a terminal with sh /myfolder/crawlers/getdata.sh, it works fine.