How to perform cron jobs failover?


Solution 1

I think Heartbeat/Pacemaker would be the best solution, since they can take care of a lot of race conditions, fencing, etc. for you, in order to ensure the job only runs on one host at a time. It's possible to design something yourself, but it likely won't account for all the scenarios those packages do, and you'll eventually end up reinventing most, if not all, of the wheel.

If you don't really care about such things and you want a simpler setup, I suggest staggering the cron jobs on the servers by a few minutes. Then, when the job starts on the primary, it can somehow leave a marker on whatever shared resource the jobs operate on (you don't specify this, so I'm being intentionally vague). If it's a database, the job can update a field in a table; if it's on a shared filesystem, it can lock a file.

When the job runs on the second server, it can check for the presence of the marker and abort if it is there.
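A rough sketch of that marker check, assuming a shared filesystem (the marker path, the ten-minute freshness window, and the job body are all placeholders; in practice the marker would live on the shared storage both nodes see):

```shell
#!/bin/sh
# Sketch of the marker-file guard. In practice MARKER would sit on the
# shared filesystem the jobs operate on; /tmp is used here only so the
# sketch runs standalone.
MARKER="${MARKER:-/tmp/nightly-job.marker}"

# The secondary's cron entry fires a few minutes after the primary's.
# If a fresh marker (under 10 minutes old) exists, the primary already
# did the work, so this node aborts quietly.
if [ -f "$MARKER" ] && [ -n "$(find "$MARKER" -mmin -10 2>/dev/null)" ]; then
    echo "job already ran on the primary; exiting"
else
    touch "$MARKER"   # leave the marker for the other node to see
    echo "running job on this node"
    # ... the actual job would go here ...
fi
```

Note that this is exactly the simple setup described above: it tolerates a dead primary, but unlike Pacemaker it does nothing about clock skew or a primary that dies mid-job.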

Solution 2

To make a long story short, you have to turn your cron scripts into some kind of cluster-able application. The implementation can be as lightweight or as heavyweight as you need, but they still need one thing: the ability to properly resume or restart their action (or recover their state) after a primary-node failover. The trivial case is that they are stateless programs (or "stateless enough" programs) that can simply be restarted at any time and will do just fine. This is probably not your case. Note that for stateless programs you don't need failover at all, because you could simply run them in parallel on all the nodes.

In the normal, more complicated case, your scripts should live on the cluster's shared storage, should store their state in files there, should change the state stored on disk only atomically, and should be able to continue their action from any transient state they detect on startup.
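A minimal sketch of that atomic-state discipline (the paths and the state fields are illustrative; in practice the state file would sit on the cluster's shared storage):

```shell
#!/bin/sh
# Atomic state update: write to a temp file, then rename it over the old
# one. rename(2) is atomic on POSIX filesystems, so a node taking over
# after failover never sees a half-written state file.
STATE="${STATE:-/tmp/job.state}"

# The job records how far it got; "step" and "last_item" are made-up
# fields for the sketch.
printf 'step=3\nlast_item=42\n' > "$STATE.tmp"
mv -f "$STATE.tmp" "$STATE"

# On startup (possibly on the other node), recover whatever state is on
# disk and continue the job from there.
grep '^step=' "$STATE"
```

The key property is that the state on disk is always either the old complete version or the new complete version, never a mix, so "resume from whatever is on disk" is well-defined.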

Solution 3

Actually, there is no satisfactory solution in this area. We have tried them all: scripting solutions, cron with Heartbeat/Pacemaker, and more. Until recently, the only real option was a grid solution, which is naturally not what we want, seeing as a grid solution is more than overkill for this scenario.

That's why I started the CronBalancer project. It works exactly like a normal cron server, except that it's distributed, load-balanced, and HA (when finished). Currently the first two points are finished (in beta) and it works with a standard crontab file.

The HA framework is in place; all that's left is the signaling needed to determine the failover and recovery actions.

http://sourceforge.net/projects/cronbalancer/

Chuck

Solution 4

We use two approaches, depending on the requirements. Both involve having the crons present and running on all machines, but with a bit of sanity checking involved:

  1. If the machines are in a primary/secondary relationship (there may be more than one secondary), then the scripts are modified to check whether the machine they are running on is in the primary state. If not, they simply exit quietly. I don't have an HB setup to hand at the moment, but I believe you can query HB for this information.

  2. If all machines are eligible primaries (such as in a cluster), then some locking is used, by way of either a shared database or a PID file. Only one machine ever obtains the lock; the ones that don't exit quietly.
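The lock-file variant of approach 2 can be sketched with flock(1) from util-linux (the lock path is a placeholder; note that a local lock file only serializes jobs on one machine, so across machines the lock file must live on a shared filesystem that supports locking, or you fall back to the shared-database approach):

```shell
#!/bin/sh
# Sketch of lock-based mutual exclusion using flock(1) from util-linux.
# /tmp is used here only so the sketch runs standalone; in a real
# cluster the lock would be on storage all eligible primaries can see.
LOCK="${LOCK:-/tmp/nightly-job.lock}"

(
    # -n: fail immediately instead of blocking; a machine that loses
    # the race exits quietly, as described above.
    flock -n 9 || { echo "another instance holds the lock; exiting"; exit 0; }
    echo "lock acquired; running job"
    # ... the actual job would go here ...
) 9>"$LOCK"
```

The lock is held on file descriptor 9 for the lifetime of the subshell and released automatically when the job finishes or crashes, which avoids the stale-PID-file problem of hand-rolled locking.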

Solution 5

I had been using a Nagios event handler as a simple solution.

On the NRPE server:

command[check_crond]=/usr/lib64/nagios/plugins/check_procs -c 1: -C crond
command[autostart_crond]=sudo /etc/init.d/crond start
command[stop_crond]=sudo /etc/init.d/crond stop

Don't forget to grant the nagios user the necessary sudo rights in the sudoers file:

nagios  ALL=(ALL)   NOPASSWD:/usr/lib64/nagios/plugins/, /etc/init.d/crond

and disable requiretty:

Defaults:nagios !requiretty

On the Nagios server:

services.cfg

define service{
    use                     generic-service
    host_name               cpc_3.145
    service_description     crond
    check_command           check_nrpe!check_crond
    event_handler           autostart_crond!cpc_2.93
    process_perf_data       0
    contact_groups          admin,admin-sms
}

commands.cfg

define command{
    command_name    autostart_crond
    command_line    $USER1$/eventhandlers/autostart_crond.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $ARG1$
}

autostart_crond.sh

#!/bin/bash

case "$1" in
    OK)
        /usr/local/nagios/libexec/check_nrpe -H $4 -c stop_crond
        ;;
    WARNING)
        ;;
    UNKNOWN)
        /usr/local/nagios/libexec/check_nrpe -H $4 -c autostart_crond
        ;;
    CRITICAL)
        /usr/local/nagios/libexec/check_nrpe -H $4 -c autostart_crond
        ;;
esac

exit 0

However, I have since switched to Pacemaker and Corosync, since that's the best solution to ensure that the resource only runs on one node at a time.

Here are the steps I followed:

Verify that the crond init script is LSB compliant. On my CentOS box, I had to change the exit status from 1 to 0 (when starting an already-running service or stopping an already-stopped one) to match the requirements:

start() {
    echo -n $"Starting $prog: " 
    if [ -e /var/lock/subsys/crond ]; then
        if [ -e /var/run/crond.pid ] && [ -e /proc/`cat /var/run/crond.pid` ]; then
            echo -n $"cannot start crond: crond is already running.";
            failure $"cannot start crond: crond already running.";
            echo
            #return 1
            return 0
        fi
    fi

stop() {
    echo -n $"Stopping $prog: "
    if [ ! -e /var/lock/subsys/crond ]; then
        echo -n $"cannot stop crond: crond is not running."
        failure $"cannot stop crond: crond is not running."
        echo
        #return 1;
        return 0;
    fi

Then the resource can be added to Pacemaker:

# crm configure primitive Crond lsb:crond \
        op monitor interval="60s"

crm configure show

node SVR022-293.localdomain
node SVR233NTC-3145.localdomain
primitive Crond lsb:crond \
        op monitor interval="60s"
property $id="cib-bootstrap-options" \
        dc-version="1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"

crm status

============
Last updated: Fri Jun  7 13:44:03 2013
Stack: openais
Current DC: SVR233NTC-3145.localdomain - partition with quorum
Version: 1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
1 Resources configured.
============

Online: [ SVR022-293.localdomain SVR233NTC-3145.localdomain ]

 Crond  (lsb:crond):    Started SVR233NTC-3145.localdomain

Testing failover by stopping Pacemaker and Corosync on 3.145:

[root@3145 corosync]# service pacemaker stop
Signaling Pacemaker Cluster Manager to terminate:          [  OK  ]
Waiting for cluster services to unload:......              [  OK  ]

[root@3145 corosync]# service corosync stop
Signaling Corosync Cluster Engine (corosync) to terminate: [  OK  ]
Waiting for corosync services to unload:.                  [  OK  ]

Then check the cluster status on 2.93:

============
Last updated: Fri Jun  7 13:47:31 2013
Stack: openais
Current DC: SVR022-293.localdomain - partition WITHOUT quorum
Version: 1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
1 Resources configured.
============

Online: [ SVR022-293.localdomain ]
OFFLINE: [ SVR233NTC-3145.localdomain ]

Crond   (lsb:crond):    Started SVR022-293.localdomain

Author: Falken (updated on September 17, 2022)

Comments

  • Falken, almost 2 years ago:

    Using two Debian servers, I need to set up a strong failover environment for cron jobs that can only be called on one server at a time.

    Moving a file in /etc/cron.d should do the trick, but is there a simple HA solution to orchestrate such an action? And if possible, not with Heartbeat ;)