Scrapy crawler in Cron job

Solution 1

I solved this problem by including PATH in the bash file:

#!/bin/bash

# cron runs with a minimal PATH, so add the directory that contains scrapy
cd /myfolder/crawlers/
PATH=$PATH:/usr/local/bin
export PATH
scrapy crawl my_spider_name
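
With the PATH exported inside the script, the crontab entry only needs to invoke the script itself. A minimal sketch, reusing the getdata.sh path from the question below and a purely illustrative log location:

*/5 * * * * /bin/bash /myfolder/crawlers/getdata.sh >> /tmp/getdata_cron.log 2>&1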

Solution 2

Adding the following lines in crontab -e runs my scrapy crawl at 5AM every day. This is a slightly modified version of crocs' answer

PATH=/usr/bin
0 5 * * * cd project_folder/project_name/ && scrapy crawl spider_name

Without setting PATH, cron gave me the error "command not found: scrapy". I guess this is because /usr/bin is where the executables that launch programs are stored on Ubuntu.

Note that the complete path for my scrapy project is /home/user/project_folder/project_name. I ran the env command in cron and noticed that the working directory is /home/user; hence I omitted /home/user in my crontab entry above.
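
To see exactly which environment and working directory cron gives your job, a temporary entry that dumps them to a file does the trick (the /tmp path is only an example; remove the entry once you have what you need):

* * * * * env > /tmp/cron-env.txt; pwd >> /tmp/cron-env.txt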

The cron log can be helpful while debugging:

grep CRON /var/log/syslog
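
The syslog entries only show that cron launched the command, not scrapy's own output. To capture that, redirect stdout and stderr in the entry itself, as Solution 3 below also does; the log path here is just an example:

0 5 * * * cd project_folder/project_name/ && scrapy crawl spider_name >> /tmp/spider_name.log 2>&1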

Solution 3

For anyone who used pip3 (or similar) to install scrapy, here is a simple inline solution:

*/10 * * * * cd ~/project/path && ~/.local/bin/scrapy crawl something >> ~/crawl.log 2>&1

Replace:

*/10 * * * * with your cron pattern

~/project/path with the path to your scrapy project (where your scrapy.cfg is)

something with the spider name (use scrapy list in your project to find out)

~/crawl.log with the location of your log file (in case you want logging)
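
Before filling in those placeholders, two quick checks from a normal shell help confirm the values (the path shown is what a pip3 --user install typically produces):

which scrapy     # e.g. /home/user/.local/bin/scrapy
scrapy list      # run inside the project folder to list spider names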

Solution 4

Another option is to skip the shell script and chain the two commands together directly in the cron job. Just make sure the PATH variable is set before the first scrapy cron job in the crontab list. Run:

    crontab -e 

to edit and have a look. I have several scrapy crawlers which run at various times. Some every 5 mins, others twice a day.

    PATH=/usr/local/bin
    */5 * * * * cd /myfolder/crawlers/ && scrapy crawl my_spider_name_1
    0 1,13 * * * cd /myfolder/crawlers/ && scrapy crawl my_spider_name_2

All jobs listed after the PATH variable will find scrapy. Here the first one runs every 5 minutes and the second runs twice a day, at 1am and 1pm. I found this easier to manage. If you have other binaries to run, you may need to add their locations to the path, as sketched below.
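
For example, if the crawlers also call tools that live outside /usr/local/bin, additional directories can be appended to the same PATH line; the extra entries here are purely illustrative:

    PATH=/usr/local/bin:/usr/bin:/bin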

Solution 5

Check where scrapy is installed using the which scrapy command. In my case, scrapy is installed in /usr/local/bin.

Open the crontab for editing using crontab -e and add the lines below. Note that cron does not expand $PATH in variable assignments and does not understand export, so set the full value directly:

PATH=/usr/local/bin:/usr/bin:/bin
*/5 * * * * cd /myfolder/path && scrapy crawl spider_name

It should work. Scrapy runs every 5 minutes.
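
Once saved, a quick check from the shell confirms the entry was installed and that scrapy resolves to the expected location (these commands only read state):

crontab -l       # list the installed crontab entries
which scrapy     # should print /usr/local/bin/scrapy in this case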

Author: Beka Tomashvili

Updated on June 05, 2022

Comments

  • Beka Tomashvili, almost 2 years ago

    I want to execute my scrapy crawler from a cron job.

    I created a bash file getdata.sh in the folder where the scrapy project and its spiders are located:

    #!/bin/bash
    cd /myfolder/crawlers/
    scrapy crawl my_spider_name
    

    My crontab looks like this; I want to execute it every 5 minutes:

     */5 * * * * sh /myfolder/crawlers/getdata.sh 
    

    But it doesn't work. What's wrong? Where is my error?

    When I execute the bash file from the terminal with sh /myfolder/crawlers/getdata.sh, it works fine.