keepalived - random re-elections

The problem is that you are using the default state MASTER on the backup nodes. They should be configured with state BACKUP.

  vrrp_instance VIP_61 {
      interface bond0
      virtual_router_id 61
      state BACKUP
      priority 98
      ...

Hope this solves your mystery.
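For completeness, a full backup-node instance block might look like the following (a minimal sketch only; the interface, priority and VIP values are placeholders that should match your own environment):

```
vrrp_instance VIP_61 {
    interface bond0
    virtual_router_id 61
    state BACKUP             # start as BACKUP instead of claiming MASTER on boot
    priority 98              # lower than the master's 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass PASSWORD
    }
    virtual_ipaddress {
        X.X.X.X
    }
}
```

With state BACKUP, the node waits for the current master's advertisements on startup rather than immediately contending for the VIP, which avoids the spurious elections you are seeing.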

Author: milosgajdos

Updated on September 18, 2022

Comments

  • milosgajdos
    milosgajdos almost 2 years

We have set up 3 servers running keepalived. We started noticing some random re-elections occurring which we can't explain, so I came here looking for advice.

    Here is our configuration:

    MASTER:

    global_defs {
      notification_email {
        [email protected]
      }
      notification_email_from keepalived@hostname
      smtp_server example.com:587
      smtp_connect_timeout 30
      router_id some_rate
    }
    
    
    vrrp_script chk_nginx {
      script "killall -0 nginx"
      interval 2
      weight 2
    }
    
    vrrp_instance VIP_61 {
      interface bond0
      virtual_router_id 61
      state MASTER
      priority 100
      advert_int 1
      authentication {
        auth_type PASS
        auth_pass PASSWORD
      }
      virtual_ipaddress {
        X.X.X.X
        X.X.X.X
        X.X.X.X
      }
      track_script {
        chk_nginx
      }
    }
    

    BACKUP1:

    global_defs {
      notification_email {
        [email protected]
      }
      notification_email_from keepalived@hostname
      smtp_server example.com:587
      smtp_connect_timeout 30
      router_id some_rate
    }
    
    
    vrrp_script chk_nginx {
      script "killall -0 nginx"
      interval 2
      weight 2
    }
    
    vrrp_instance VIP_61 {
      interface bond0
      virtual_router_id 61
      state MASTER
      priority 99
      advert_int 1
      authentication {
        auth_type PASS
        auth_pass PASSWORD
      }
      virtual_ipaddress {
        X.X.X.X
        X.X.X.X
        X.X.X.X
      }
      track_script {
        chk_nginx
      }
    }
    

    BACKUP2:

    global_defs {
      notification_email {
        [email protected]
      }
      notification_email_from keepalived@hostname
      smtp_server example.com:587
      smtp_connect_timeout 30
      router_id some_rate
    }
    
    
    vrrp_script chk_nginx {
      script "killall -0 nginx"
      interval 2
      weight 2
    }
    
    vrrp_instance VIP_61 {
      interface bond0
      virtual_router_id 61
      state MASTER
      priority 98
      advert_int 1
      authentication {
        auth_type PASS
        auth_pass PASSWORD
      }
      virtual_ipaddress {
        X.X.X.X
        X.X.X.X
        X.X.X.X
      }
      track_script {
        chk_nginx
      }
    }
    

    Every now and then I can see this happening (grepped in logs):

    MASTER:

    Jan  6 18:30:15 lb-public01 Keepalived_vrrp[24380]: VRRP_Instance(VIP_61) Received lower prio advert, forcing new election
    Jan  6 18:30:16 lb-public01 Keepalived_vrrp[24380]: VRRP_Instance(VIP_61) Received lower prio advert, forcing new election
    Jan  6 18:32:37 lb-public01 Keepalived_vrrp[24380]: VRRP_Instance(VIP_61) Received lower prio advert, forcing new election
    

    BACKUP1:

    Jan  6 18:30:16 lb-public02 Keepalived_vrrp[26235]: VRRP_Instance(VIP_61) Transition to MASTER STATE
    Jan  6 18:30:16 lb-public02 Keepalived_vrrp[26235]: VRRP_Instance(VIP_61) Received higher prio advert
    Jan  6 18:30:16 lb-public02 Keepalived_vrrp[26235]: VRRP_Instance(VIP_61) Entering BACKUP STATE
    Jan  6 18:32:37 lb-public02 Keepalived_vrrp[26235]: VRRP_Instance(VIP_61) forcing a new MASTER election
    Jan  6 18:32:38 lb-public02 Keepalived_vrrp[26235]: VRRP_Instance(VIP_61) Transition to MASTER STATE
    Jan  6 18:32:38 lb-public02 Keepalived_vrrp[26235]: VRRP_Instance(VIP_61) Received higher prio advert
    Jan  6 18:32:38 lb-public02 Keepalived_vrrp[26235]: VRRP_Instance(VIP_61) Entering BACKUP STATE
    

    BACKUP2:

    Jan  6 18:32:36 lb-public03 Keepalived_vrrp[14255]: VRRP_Script(chk_nginx) succeeded
    Jan  6 18:32:37 lb-public03 Keepalived_vrrp[14255]: VRRP_Instance(VIP_61) Transition to MASTER STATE
    Jan  6 18:32:37 lb-public03 Keepalived_vrrp[14255]: VRRP_Instance(VIP_61) Received higher prio advert
    Jan  6 18:32:37 lb-public03 Keepalived_vrrp[14255]: VRRP_Instance(VIP_61) Entering BACKUP STATE
    

    So the MASTER receives a LOWER priority advert and a NEW election is started. Why? It looks like a BACKUP transitions into MASTER for a short period of time (based on the logs) and then falls back to the BACKUP state. I'm quite clueless as to why this is actually happening, so any hints would be more than welcome.

    Also, I found out there is a unicast patch for keepalived; however, it's not clear to me whether it supports more than one unicast peer - in our case we have a cluster of 3 machines, so we need more than one unicast peer.
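    For reference, keepalived versions that include the unicast feature accept multiple peers in a `unicast_peer` block, so a 3-node cluster is not a problem. A sketch with placeholder addresses (assuming a build with unicast support):

        vrrp_instance VIP_61 {
          ...
          unicast_src_ip 10.0.0.1    # this node's own address (placeholder)
          unicast_peer {
            10.0.0.2                 # the other two cluster members (placeholders)
            10.0.0.3
          }
        }

    Each node lists its own address as `unicast_src_ip` and the other members as peers, so adverts are sent point-to-point instead of to the VRRP multicast group.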

    Any hints on these issues would be superamazingly appreciated!

    • milosgajdos
      milosgajdos over 10 years
      Turns out we were running an old version of keepalived. We upgraded to the latest available and now all seems to be good.
    • Jacob Evans
      Jacob Evans about 7 years
      Do you use VMware snapshots?