keepalived doesn't detect loss of virtual IP

Solution 1

We ran into the same problem and concluded it is an issue with systemd-networkd on Ubuntu 18.04, which now uses netplan. A newer version of keepalived should fix this, as it can detect the removal of the floating IP and trigger a failover; see https://github.com/acassen/keepalived/issues/836.

That newer version of keepalived is not available in 18.04, and rather than trying to backport it we decided to keep the servers that use keepalived on Ubuntu 16.04 and wait for Ubuntu 20.04.
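
A quick way to confirm this behaviour is to restart systemd-networkd by hand and watch the interface; a rough check, using the interface name ens160 from the configuration in the question below:

    # addresses currently on the interface (the VIP should be listed on the MASTER)
    ip -brief addr show dev ens160

    # restart the network daemon, as a systemd/netplan package upgrade would
    systemctl restart systemd-networkd

    # the VIP is gone from the interface; with keepalived older than 2.0.0
    # no failover follows, so also check the daemon's log
    ip -brief addr show dev ens160
    journalctl -u keepalived -n 50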

Solution 2

This issue is fixed in keepalived 2.0.0, released 2018-05-26; see the keepalived changelog:

  • Monitor VIP/eVIP deletion and transition to backup if a VIP/eVIP is removed, unless it is configured with the no-track option.
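
As a rough sketch, a keepalived 2.0.0+ configuration like the one from the question should therefore fail over as soon as the address is removed from the interface, while an address flagged as no-track is excluded from that check (the second, commented-out address below is purely illustrative, and the exact option spelling is version dependent, see keepalived.conf(5)):

    vrrp_instance VI_1 {
        state MASTER
        interface ens160
        virtual_router_id 101
        priority 150
        advert_int 1
        virtual_ipaddress {
            1.2.3.4             # removal of this VIP now triggers a transition to BACKUP
            # 5.6.7.8 no_track  # illustrative: a VIP marked no_track would not be monitored
        }
    }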

Comments

  • dortegaoh
    dortegaoh almost 2 years

    I'm using keepalived to switch a floating IP between two VMs.

    /etc/keepalived/keepalived.conf on VM 1:

    vrrp_instance VI_1 {
        state MASTER
        interface ens160
        virtual_router_id 101
        priority 150
        advert_int 1
        authentication {
            auth_type PASS
            auth_pass secret
        }
        virtual_ipaddress {
            1.2.3.4
        }
    }
    

    /etc/keepalived/keepalived.conf on VM 2:

    vrrp_instance VI_1 {
        state MASTER
        interface ens160
        virtual_router_id 101
        priority 100
        advert_int 1
        authentication {
            auth_type PASS
            auth_pass secret
        }
        virtual_ipaddress {
            1.2.3.4
        }
    }
    

    This basically works fine, with one exception: every time systemd gets updated (the VMs run Ubuntu 18.04) it reloads its network component, which drops the floating IP because that IP is not configured in the system. Since both keepalived instances can still reach each other, neither of them sees anything wrong, neither of them reacts, and the floating IP stays down.

    I found that you can check for the IP with a simple script like this:

    vrrp_script chk_proxyip {
        script "/sbin/ip addr |/bin/grep 1.2.3.4"
    }
    
    vrrp_instance VI_1 {
        # [...]
        track_script {
            chk_proxyip
        }
    }
    

    But I'm not sure whether this is a working approach.

    If I understand it correctly, the following would happen if I configure this script on VM1:

    1. VM1 loses the IP due to a systemd restart
    2. keepalived on VM1 detects the loss of the IP
    3. keepalived switches to FAULT state and stops sending VRRP advertisements
    4. keepalived on VM2 detects the loss of keepalived on VM1 and puts the floating IP up

    At this point the IP is working again on VM2, but VM1 stays in FAULT because the IP never comes back up on VM1. If VM2 goes down (for whatever reason), VM1 wouldn't take over, because it is still in FAULT state.

    How can I ensure that the floating IP is always up on one of the VMs?
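
    An alternative along the same lines would be to keep the check on VM 1 only (on the backup the address is never present anyway) and give it a negative weight. If I understand the weight option correctly, a failing check would then merely lower VM 1's priority below VM 2's instead of forcing VM 1 into FAULT, so VM 1 could still take over if VM 2 dies. A sketch (untested):

    vrrp_script chk_proxyip {
        script "/sbin/ip addr |/bin/grep 1.2.3.4"
        interval 2
        weight -60    # on failure: 150 - 60 = 90, below VM 2's 100, so VM 2 takes over,
                      # but this node never enters FAULT and stays eligible as a backup
    }

    vrrp_instance VI_1 {
        # [...]
        track_script {
            chk_proxyip
        }
    }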

    Further tests:

    I tried pinging the floating IP in the check script instead of checking whether it is configured on the local host:

    vrrp_script chk_proxyip {
        script "/bin/ping -c 1 -w 1 1.2.3.4"
        interval 2
    }
    

    Configuring this script on node 2 resulted in the following:

    1. removed the IP on node 1 for testing
    2. node 2 detected the IP loss and changed from BACKUP to FAULT
    3. node 1 ignored the state change and stayed MASTER

    The result: the IP stayed down.

    Configuring the script on node 1 resulted in the following:

    1. removed the IP on node 1
    2. node 1 detected the IP loss and changed from MASTER to FAULT
    3. node 2 detected the state change on node 1 and changed from BACKUP to MASTER, configuring the floating IP on node 2

    Well, and then ...

    Feb 13 10:11:26 node2 Keepalived_vrrp[3486]: VRRP_Instance(VI_1) Transition to MASTER STATE
    Feb 13 10:11:27 node2 Keepalived_vrrp[3486]: VRRP_Instance(VI_1) Entering MASTER STATE
    Feb 13 10:11:29 node2 Keepalived_vrrp[3486]: VRRP_Instance(VI_1) Received advert with higher priority 150, ours 100
    Feb 13 10:11:29 node2 Keepalived_vrrp[3486]: VRRP_Instance(VI_1) Entering BACKUP STATE
    Feb 13 10:11:32 node2 Keepalived_vrrp[3486]: VRRP_Instance(VI_1) Transition to MASTER STATE
    Feb 13 10:11:33 node2 Keepalived_vrrp[3486]: VRRP_Instance(VI_1) Entering MASTER STATE
    Feb 13 10:11:36 node2 Keepalived_vrrp[3486]: VRRP_Instance(VI_1) Received advert with higher priority 150, ours 100
    Feb 13 10:11:36 node2 Keepalived_vrrp[3486]: VRRP_Instance(VI_1) Entering BACKUP STATE
    Feb 13 10:11:38 node2 Keepalived_vrrp[3486]: VRRP_Instance(VI_1) Transition to MASTER STATE
    Feb 13 10:11:39 node2 Keepalived_vrrp[3486]: VRRP_Instance(VI_1) Entering MASTER STATE
    Feb 13 10:11:41 node2 Keepalived_vrrp[3486]: VRRP_Instance(VI_1) Received advert with higher priority 150, ours 100
    Feb 13 10:11:41 node2 Keepalived_vrrp[3486]: VRRP_Instance(VI_1) Entering BACKUP STATE
    Feb 13 10:11:44 node2 Keepalived_vrrp[3486]: VRRP_Instance(VI_1) Transition to MASTER STATE
    Feb 13 10:11:45 node2 Keepalived_vrrp[3486]: VRRP_Instance(VI_1) Entering MASTER STATE
    Feb 13 10:11:47 node2 Keepalived_vrrp[3486]: VRRP_Instance(VI_1) Received advert with higher priority 150, ours 100
    ...
    

    I had to restart keepalived on node1 to stop the ping pong game between the nodes.

    • c4f4t0r
      c4f4t0r over 5 years
      Why not use a true cluster solution such as Pacemaker?
    • dortegaoh
      dortegaoh over 5 years
      I haven't used Pacemaker before, which is why I went with keepalived. But everything I read about Pacemaker vs. keepalived suggests that keepalived is the better choice for this use case.
    • yagmoth555
      yagmoth555 over 5 years
      Can I ask what service the VIP offers? If it's something like nginx, there are scripts to use with nginx + keepalived for exactly what you're doing, as shown at docs.nginx.com/nginx/deployment-guides/… I guess the goal is to bind the service to the correct IP when the VM gets the VIP.
    • dortegaoh
      dortegaoh over 5 years
      It's HAProxy, which balances between multiple LDAP servers.
  • dortegaoh
    dortegaoh over 5 years
    systemctl is-active network.service always shows inactive, so that doesn't really work. But even if it were active, I don't think the presence or absence of an additional IP that isn't permanently configured in the system would make any difference to that service's state. Thanks anyway for the input.
  • clockworknet
    clockworknet over 5 years
    You said: "Every time systemd gets updated (it's running Ubuntu 18.04) it reloads its network component", so that is the condition you need to test for, not whether the VIP that depends on that condition is present. As I mentioned, I wasn't able to test this because I didn't have a machine whose network I could disturb, so I take your point that you will need to find an alternative test, but that is fundamentally what you need to do.
  • dortegaoh
    dortegaoh over 5 years
    OK, I just tried reinstalling systemd while constantly running systemctl is-active network.service. You are right, the command reports a different result during the update process, but only very briefly. Having keepalived run this command every 2 seconds (or even every second) and hoping it hits the few milliseconds it takes systemd to reconfigure the network is not a check I want to trust for one of my core infrastructure services.
  • c4f4t0r
    c4f4t0r over 5 years
    @Gerald Schneider keepalived doesn't manage split brain :)
  • mp3foley
    mp3foley over 5 years
    @GeraldSchneider have you tried monitoring the age of the files in /var/run/systemd/netif/ to detect systemd-networkd restarting? I see they change when systemd-networkd is restarted.
  • dortegaoh
    dortegaoh over 5 years
    Thanks, this solved the issue.
  • Rufinus
    Rufinus almost 5 years
    The double IPs could be fixed if you give the network mask to the virtual_ipaddress entry, as in 1.2.3.4/24 instead of 1.2.3.4.
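
    For example (assuming the primary address on the interface is also a /24):

        virtual_ipaddress {
            1.2.3.4/24   # mask chosen to match the interface's primary address (assumption)
        }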