keepalived VRRP_script not failing over

31,274

Solution 1

I had exactly the same issue; however, my problem was not in the firewall or in my Ethernet adapter but in the "weight" setting of the check script.

This was my configuration:

MASTER:

vrrp_instance haproxy {
state MASTER
interface eth0
virtual_router_id 51
priority 150
advert_int 1

BACKUP:

vrrp_instance haproxy {
state BACKUP
interface eth0
virtual_router_id 51
priority 100
advert_int 1

Check_script:

vrrp_script chk_haproxy {
   script "python /root/ha_check.py"
   interval 2     # check every 2 seconds
   weight 2
   rise 2
   fall 2

}

The master refused to release the VIP because, even though the script had failed, its priority was still higher than that of the BACKUP server. The "weight" setting of the check script was too small to cover the gap between the two priorities, so the BACKUP server's priority could never rise above the MASTER server's. To explain further:

According to the keepalived manual, a positive "weight" is added to the priority while the check succeeds.
A negative "weight" is applied to the priority (lowering it by that amount) while the check fails.
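
The rule can be written down as a few lines of arithmetic. This is only a sketch of the rule as described here, not keepalived's actual code; the function name `effective_priority` is made up for illustration.

```python
def effective_priority(base: int, weight: int, script_ok: bool) -> int:
    """Priority a node advertises under the weight rule described above."""
    if weight > 0 and script_ok:
        return base + weight   # positive weight: bonus while the check passes
    if weight < 0 and not script_ok:
        return base + weight   # negative weight: penalty while the check fails
    return base                # otherwise the configured priority is used as-is


# Values from this answer: priority 150, weight 2 (and -60 as the alternative)
print(effective_priority(150, 2, script_ok=True))     # 152
print(effective_priority(150, 2, script_ok=False))    # 150
print(effective_priority(150, -60, script_ok=False))  # 90
```

Note that a positive weight never pushes the priority below the configured value; a failed check only removes the bonus.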

So, according to my configuration:

Server priorities before the script fails:
MASTER: 152
BACKUP: 102 (assuming the check also passes there)
Failover_IP: MASTER

The failover IP is correctly "grabbed" by the master server, since MASTER has a higher priority than BACKUP (152 > 102).

Server priorities after the script fails on MASTER:
MASTER: 150 (a positive weight is only withheld on failure, never subtracted)
BACKUP: 102
Failover_IP: STILL ON MASTER

The failover IP stays on the master server because MASTER still has a higher priority than BACKUP (150 > 102). The MASTER server refused to release the IP, and rightly so, since its priority was higher than the other server's.

The solution in my situation was one of the following:

Option 1: Change the priorities of both servers so that the gap between them is smaller than the weight.
For example:
Master Priority: 150
Backup Priority: 149
Check_script weight: As it is (2).

With the above configuration, when the script succeeds on both servers (meaning all is OK) the priorities would be:
Master: 152
Backup: 151
IP_Location: On Master (152 > 151)

When the script fails on the master:
Master: 150
Backup: 151
IP_Location: On Backup (151 > 150)

Option 2: Change the weight of the script from 2 to -60. A failed check then lowers the master from 150 to 90, which is below the backup's 100.
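
To compare the original configuration with the two fixes above, the check can be reduced to one comparison. This is a simplified model under stated assumptions: both servers run the same check script with the same weight, the check keeps passing on the backup, and ties are not considered. The helper names are hypothetical.

```python
def advertised(base: int, weight: int, script_ok: bool) -> int:
    """Configured priority adjusted by the check-script weight (simplified)."""
    applies = (weight > 0 and script_ok) or (weight < 0 and not script_ok)
    return base + weight if applies else base


def fails_over(master_base: int, backup_base: int, weight: int) -> bool:
    """True if BACKUP outranks MASTER once the check fails on MASTER only."""
    master = advertised(master_base, weight, script_ok=False)
    backup = advertised(backup_base, weight, script_ok=True)
    return backup > master


print(fails_over(150, 100, 2))    # False: 102 vs 150, original configuration
print(fails_over(150, 149, 2))    # True:  151 vs 150, smaller priority gap
print(fails_over(150, 100, -60))  # True:  100 vs 90,  large negative weight
```

In both working cases the magnitude of the weight is larger than the gap between the two configured priorities, which is the condition to check in your own configuration.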

Solution 2

I've had the same issue - two CentOS 7.1 servers with track_script, and failing the vrrp_script on the MASTER would only result in the lone log message "VRRP_Script(chk_script) failed", not in a failover. On the BACKUP server, however, I got a lot of messages of keepalived trying to take over the virtual IP for as long as I had the track_script on the MASTER server fail.

Solution in my case: The firewall (iptables) on the MASTER server wasn't configured correctly to allow VRRP packets / multicast packets, while at the same time the firewall on the other server, the BACKUP, was configured correctly.

I had entered the same iptables rules into both servers as follows:

iptables -A INPUT -i eth0 -d 224.0.0.0/8 -j ACCEPT
iptables -A INPUT -p vrrp -i eth0 -j ACCEPT

This worked on one of the servers (the BACKUP VRRP server) but not the MASTER one because I'd forgotten that the interface wasn't named 'eth0' on the MASTER server, thus the two rules had no effect at all.
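
A quick way to catch this kind of mismatch is to confirm that the interface named in the rules actually exists on each host. The following is a small sketch using Python's standard `socket.if_nameindex()`; the helper name `interface_exists` is made up for illustration, and "eth0" is simply the name used in the rules above.

```python
import socket


def interface_exists(name: str) -> bool:
    """Return True if the kernel reports a network interface with this name."""
    return name in (ifname for _, ifname in socket.if_nameindex())


if __name__ == "__main__":
    for candidate in ("eth0",):  # interface names used in the firewall rules
        status = "present" if interface_exists(candidate) else "MISSING"
        print(f"{candidate}: {status}")
```

Running it on both servers before adding interface-specific rules shows whether the name needs to be adjusted on one of them.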

This explained the behavior I'd observed:

If keepalived cannot see any other VRRP speaker for a certain virtual_router_id, it still believes itself to be the one with the highest priority (thus rightful MASTER) even after a negative weight modification as it never receives VRRP messages with a priority higher than its own (because advertisements of other speakers are blocked by the firewall and can never reach the keepalived process to make it aware of them). That's why you don't see it release the VIP.

The BACKUP server, however, was able to see the adverts of the (now failed) MASTER, found the priority in those packets reduced to a value less than its own, and proceeded to declare itself MASTER and send gratuitous ARPs to claim the VIP. So we ended up in a situation where both servers thought they'd need to serve the VIP as MASTER.
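
The two paragraphs above can be illustrated with a toy model in which each node only compares its own priority against the adverts it actually receives. This is a deliberately simplified sketch, not the VRRP state machine (timers, tie-breaking and nopreempt are ignored), and the priorities 90 and 100 are hypothetical numbers chosen for the example.

```python
from typing import Iterable


def claims_vip(own_priority: int, adverts_seen: Iterable[int]) -> bool:
    """A node claims the VIP unless it hears an advert with a higher or equal priority."""
    return all(own_priority > other for other in adverts_seen)


# Hypothetical values: MASTER dropped to 90 after its check failed, BACKUP is at 100.
master_priority, backup_priority = 90, 100

# MASTER's firewall drops incoming VRRP packets, so it hears no adverts at all.
master_claims = claims_vip(master_priority, adverts_seen=[])
# BACKUP's firewall is fine, so it hears MASTER's reduced priority.
backup_claims = claims_vip(backup_priority, adverts_seen=[master_priority])

print(master_claims, backup_claims)  # True True: both servers hold the VIP
```

With an empty list of received adverts the comparison is trivially won, which is why the node behind the misconfigured firewall never steps down.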

Conclusions:

  • Always check the firewall configuration on all VRRP speakers if you experience strange behavior (no failover, several MASTERs). Keepalived logging isn't quite as helpful as it could be (a simple message "VIP not released because I'm still highest prio" after the "VRRP_Script(chk_script) failed" line would've eased troubleshooting immensely).
  • A track_script is not an on/off type of switch ("if script OK: eligible for VIP election; if NOT OK: completely ineligible for VIP election") - it merely increases / decreases the likelihood of winning the election, and if keepalived only ever observes itself as the only VRRP speaker and never receives any messages of other speakers, there's not much of an election really - you always win.

Author: Nvasion

Updated on September 18, 2022

Comments

  • Nvasion
    Nvasion almost 2 years

    So I am running keepalived on two servers and I can't get it to failover to the other.

    Below I have my config for one of the servers. The only difference between the two is the priority: the master has 110 and the backup has 109.

    But when I stop my process with /etc/init.d/process stop, keepalived doesn't fail over. I just get "VRRP_Script(chk_script) failed" and nothing else. No failover at all.

    vrrp_script chk_script {
    script "/usr/local/bin/failover.sh"
    interval 2
    weight 2
    }
    
    vrrp_instance HAInstance {
    state BACKUP
    interface eth0
    virtual_router_id 8
    priority 109
    advert_int 1
    nopreempt
    vrrp_unicast_bind 10.10.10.8
    vrrp_unicast_peer 10.10.10.9
    virtual_ipaddress {
      10.10.10.10/16 dev eth0
    }
    notify /usr/local/bin/keepalivednotify.sh
    track_script {
      chk_script weight 20
    }
    }
    

    This is my chk_script below. The same problem also happens when I do killall -0 process as my script.

    #!/bin/bash
    SERVICE='process'
    STATUS=$(ps ax | grep -v grep | grep "$SERVICE")
    
    if [ "$STATUS" != "" ]
    then
        exit 0
    else
        exit 1
    fi
    

    Does anyone know a fix for this? Thanks.

    • Jim G.
      Jim G. almost 9 years
      Does your backup instance notice the priority change or log anything? Logs from both would be helpful.
    • Nvasion
      Nvasion almost 9 years
      No, it does not. The only time it notices a change is when the master goes away, such as when I stop keepalived. Stopping the process I am monitoring only shows "VRRP_Script(chk_script) failed" on the master, with nothing on the slave.
  • Oscar
    Oscar about 7 years
    It also seems like not specifying a weight at all means that a failed track_script will trigger the fault state directly
  • Ankur Soni
    Ankur Soni about 6 years
    @Nvasion : Kindly accept this answer as I too got my issue resolved.
  • salvador
    salvador about 4 years
    An RFC (request for comments) is not a good source of information for configuring a piece of software, because that software might not follow the recommendations of that RFC.