How to perform cron jobs failover?
Solution 1
I think Heartbeat / Pacemaker would be the best solution, since they can take care of a lot of race conditions, fencing, etc. for you in order to ensure the job only runs on one host at a time. It's possible to design something yourself, but it likely won't account for all the scenarios those packages do, and you'll eventually end up reinventing most, if not all, of the wheel.
If you don't really care about such things and you want a simpler setup, I suggest staggering the cron jobs on the servers by a few minutes. Then when the job starts on the primary, it can somehow leave a marker on whatever shared resource the jobs operate on (you don't specify this, so I'm being intentionally vague). If it's a database, the job can update a field in a table; if it's a shared filesystem, it can lock a file.
When the job runs on the second server, it can check for the presence of the marker and abort if it is there.
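For the shared-filesystem case, the check could be sketched roughly like this. The marker path and the freshness window are assumptions for illustration, not part of the original setup:

```shell
#!/bin/sh
# Hypothetical sketch: both servers run this from cron; the secondary is
# staggered a few minutes behind the primary. /shared is an assumed mount.
MARKER=${MARKER:-/shared/markers/nightly-job.ran}
WINDOW=600    # treat a marker younger than 10 minutes as "job already ran"

marker_is_fresh() {
    [ -f "$MARKER" ] || return 1
    age=$(( $(date +%s) - $(stat -c %Y "$MARKER") ))
    [ "$age" -lt "$WINDOW" ]
}

if marker_is_fresh; then
    echo "another host already ran the job; exiting quietly"
else
    touch "$MARKER" 2>/dev/null || true   # leave the marker for the other host
    echo "running the job"
    # ... the actual job goes here ...
fi
```

Note this is only as reliable as the clocks and the stagger window; for anything stricter you are back to real locking or a cluster manager.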
Solution 2
To make a long story short, you have to turn your cron scripts into some kind of cluster-able application. Whether the implementation is lightweight or heavyweight, they still need one thing: the ability to properly resume/restart their action (or recover their state) after a primary-node failover. The trivial case is that they are stateless programs (or "stateless enough" programs) that can simply be restarted at any time and will do just fine. This is probably not your case. Note that for stateless programs you don't need failover at all, because you could simply run them in parallel on all the nodes.
In the normally complicated case, your scripts should live on the cluster's shared storage, should store their state in files there, should change the state stored on disk only atomically, and should be able to continue their action from whatever transient state they detect on startup.
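A minimal sketch of that atomic state handling, assuming a hypothetical /shared mount and a simple two-step job; the write-temp-then-rename trick is what makes the on-disk state change atomic:

```shell
#!/bin/sh
# Sketch: keep the job's state on the cluster's shared storage and update it
# atomically, so whichever node runs next can resume from the recorded step.
# The path below is an assumption.
STATE=${STATE:-/shared/jobstate/report.state}

save_state() {
    # Write to a temp file, then rename: rename() is atomic within one
    # filesystem, so a crash never leaves a half-written state file behind.
    printf '%s\n' "$1" >"$STATE.tmp" && mv "$STATE.tmp" "$STATE" \
        || echo "warning: could not record state in $STATE" >&2
}

last=$(cat "$STATE" 2>/dev/null || echo start)
case "$last" in
    start) echo "running step 1"; save_state step1 ;;
    step1) echo "running step 2"; save_state done ;;
    done)  echo "nothing left to do" ;;
esac
```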
Solution 3
Actually, there is no satisfactory solution in this area. We have tried them all: scripting solutions, cron with Heartbeat/Pacemaker, and more. Until recently, the only real option was a grid solution, which is naturally not what we want, since a grid is a bit more than overkill for this scenario.
That's why I started the CronBalancer project. It works exactly like a normal cron server, except it's distributed, load-balanced and HA (when finished). Currently the first two points are finished (beta), and it works with a standard crontab file.
The HA framework is in place. All that's left is the signaling needed to determine the failover and recovery actions.
http://sourceforge.net/projects/cronbalancer/
Chuck
Solution 4
We use two approaches, depending on the requirements. Both involve having the crons present and running on all machines, but with a bit of sanity checking involved:
If the machines are in a primary/secondary relationship (there may be more than one secondary), then the scripts are modified to check whether the machine they are running on is in the primary state. If not, they simply exit quietly. I don't have an HB setup to hand at the moment, but I believe you can query Heartbeat for this information.
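A hedged sketch of such a guard. `cl_status rscstatus` is Heartbeat's query tool for the local resource state, but the exact strings it reports vary by setup, so treat the comparison below as an assumption; under Pacemaker you would instead compare the output of `crm_resource --resource <name> --locate` against `uname -n`:

```shell
#!/bin/sh
# Guard for a cron script: only do real work on the node that currently
# holds the resources. The "local"/"all" strings are assumptions about
# what `cl_status rscstatus` reports on an active R1-style Heartbeat node.
is_primary() {
    status=$(cl_status rscstatus 2>/dev/null)
    [ "$status" = "local" ] || [ "$status" = "all" ]
}

if is_primary; then
    echo "primary: running the job"
    # ... primary-only work ...
else
    echo "secondary: exiting quietly"
fi
```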
If all machines are eligible primaries (such as in a cluster), then some locking is used, by way of either a shared database or a PID file. Only one machine ever obtains the lock; those which don't exit quietly.
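The lock-file variant can be sketched with flock(1); the lock path is an assumption, and note that flock needs a filesystem with working locks (over plain NFS, a database row lock is usually the safer choice):

```shell
#!/bin/sh
# Sketch of the lock-file approach: whichever cron instance grabs the
# lock first does the work; the others fail the non-blocking flock and
# exit quietly. The lock path is an assumed shared location.
LOCK=${LOCK:-/shared/locks/nightly-job.lock}

run_job_with_lock() {
    (
        flock -n 9 || exit 0    # another node holds the lock: exit quietly
        echo "lock acquired on $(hostname), running job"
        # ... the actual job ...
    ) 9>"$LOCK"
}

run_job_with_lock 2>/dev/null || true   # shared mount may be absent here
```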
Solution 5
I had been using a Nagios event handler as a simple solution.
On the NRPE server:
command[check_crond]=/usr/lib64/nagios/plugins/check_procs -c 1: -C crond
command[autostart_crond]=sudo /etc/init.d/crond start
command[stop_crond]=sudo /etc/init.d/crond stop
Don't forget to add the nagios user to the sudoers file:
nagios ALL=(ALL) NOPASSWD:/usr/lib64/nagios/plugins/, /etc/init.d/crond
and disable requiretty for it:
Defaults:nagios !requiretty
On the Nagios server:
services.cfg
define service{
        use                     generic-service
        host_name               cpc_3.145
        service_description     crond
        check_command           check_nrpe!check_crond
        event_handler           autostart_crond!cpc_2.93
        process_perf_data       0
        contact_groups          admin,admin-sms
}
commands.cfg
define command{
        command_name    autostart_crond
        command_line    $USER1$/eventhandlers/autostart_crond.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $ARG1$
}
autostart_crond.sh
#!/bin/bash
case "$1" in
OK)
        /usr/local/nagios/libexec/check_nrpe -H $4 -c stop_crond
        ;;
WARNING)
        ;;
UNKNOWN)
        /usr/local/nagios/libexec/check_nrpe -H $4 -c autostart_crond
        ;;
CRITICAL)
        /usr/local/nagios/libexec/check_nrpe -H $4 -c autostart_crond
        ;;
esac
exit 0
but I have switched to Pacemaker and Corosync, since it's the best solution to ensure that a resource only runs on one node at a time.
Here are the steps I followed:
Verify that the crond init script is LSB compliant. On my CentOS box, I had to change the exit status from 1 to 0 (starting an already-running service, or stopping an already-stopped one, must succeed) to match the requirements:
start() {
    echo -n $"Starting $prog: "
    if [ -e /var/lock/subsys/crond ]; then
        if [ -e /var/run/crond.pid ] && [ -e /proc/`cat /var/run/crond.pid` ]; then
            echo -n $"cannot start crond: crond is already running.";
            failure $"cannot start crond: crond already running.";
            echo
            #return 1
            return 0
        fi
    fi

stop() {
    echo -n $"Stopping $prog: "
    if [ ! -e /var/lock/subsys/crond ]; then
        echo -n $"cannot stop crond: crond is not running."
        failure $"cannot stop crond: crond is not running."
        echo
        #return 1;
        return 0;
    fi
Then it can be added to Pacemaker with:
# crm configure primitive Crond lsb:crond \
op monitor interval="60s"
crm configure show
node SVR022-293.localdomain
node SVR233NTC-3145.localdomain
primitive Crond lsb:crond \
        op monitor interval="60s"
property $id="cib-bootstrap-options" \
        dc-version="1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"
crm status
============
Last updated: Fri Jun 7 13:44:03 2013
Stack: openais
Current DC: SVR233NTC-3145.localdomain - partition with quorum
Version: 1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
1 Resources configured.
============
Online: [ SVR022-293.localdomain SVR233NTC-3145.localdomain ]
Crond (lsb:crond): Started SVR233NTC-3145.localdomain
Testing failover by stopping Pacemaker and Corosync on 3.145:
[root@3145 corosync]# service pacemaker stop
Signaling Pacemaker Cluster Manager to terminate: [ OK ]
Waiting for cluster services to unload:...... [ OK ]
[root@3145 corosync]# service corosync stop
Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
Waiting for corosync services to unload:. [ OK ]
Then check the cluster status on 2.93:
============
Last updated: Fri Jun 7 13:47:31 2013
Stack: openais
Current DC: SVR022-293.localdomain - partition WITHOUT quorum
Version: 1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
1 Resources configured.
============
Online: [ SVR022-293.localdomain ]
OFFLINE: [ SVR233NTC-3145.localdomain ]
Crond (lsb:crond): Started SVR022-293.localdomain
Falken
Updated on September 17, 2022

Comments

Falken, almost 2 years ago:
Using two Debian servers, I need to set up a strong failover environment for cron jobs that can only be called on one server at a time.
Moving a file in /etc/cron.d should do the trick, but is there a simple HA solution to handle such an action? And if possible, not with Heartbeat ;)