Both servers running keepalived become master and have the same virtual IP
Solution 1
The problem has been resolved.
The issue was a switch setting. The problem occurred when the switch's multicast filter mode was filter-all, and it went away once the mode was set to forward-all.
Solution 2
Packets are not passing between machines on the em1 interface (causing a split brain scenario as Mike states).
- check your firewall to ensure packets aren't being caught
- check your networking to ensure em1 is the same network on both machines
Here's an example of what one of the packets looks like:
Frame 2: 54 bytes on wire (432 bits), 54 bytes captured (432 bits)
Arrival Time: Jun 1, 2013 03:39:50.709520000 UTC
Epoch Time: 1370057990.709520000 seconds
[Time delta from previous captured frame: 0.000970000 seconds]
[Time delta from previous displayed frame: 0.000970000 seconds]
[Time since reference or first frame: 0.000970000 seconds]
Frame Number: 2
Frame Length: 54 bytes (432 bits)
Capture Length: 54 bytes (432 bits)
[Frame is marked: False]
[Frame is ignored: False]
[Protocols in frame: eth:ip:vrrp]
Ethernet II, Src: 00:25:90:83:b0:07 (00:25:90:83:b0:07), Dst: 01:00:5e:00:00:12 (01:00:5e:00:00:12)
Destination: 01:00:5e:00:00:12 (01:00:5e:00:00:12)
Address: 01:00:5e:00:00:12 (01:00:5e:00:00:12)
.... ...1 .... .... .... .... = IG bit: Group address (multicast/broadcast)
.... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
Source: 00:25:90:83:b0:07 (00:25:90:83:b0:07)
Address: 00:25:90:83:b0:07 (00:25:90:83:b0:07)
.... ...0 .... .... .... .... = IG bit: Individual address (unicast)
.... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
Type: IP (0x0800)
Internet Protocol Version 4, Src: 10.0.10.11 (10.0.10.11), Dst: 224.0.0.18 (224.0.0.18)
Version: 4
Header length: 20 bytes
Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00: Not-ECT (Not ECN-Capable Transport))
0000 00.. = Differentiated Services Codepoint: Default (0x00)
.... ..00 = Explicit Congestion Notification: Not-ECT (Not ECN-Capable Transport) (0x00)
Total Length: 40
Identification: 0x8711 (34577)
Flags: 0x00
0... .... = Reserved bit: Not set
.0.. .... = Don't fragment: Not set
..0. .... = More fragments: Not set
Fragment offset: 0
Time to live: 255
Protocol: VRRP (112)
Header checksum: 0x4037 [correct]
[Good: True]
[Bad: False]
Source: 10.0.10.11 (10.0.10.11)
Destination: 224.0.0.18 (224.0.0.18)
Virtual Router Redundancy Protocol
Version 2, Packet type 1 (Advertisement)
0010 .... = VRRP protocol version: 2
.... 0001 = VRRP packet type: Advertisement (1)
Virtual Rtr ID: 254
Priority: 151 (Non-default backup priority)
Addr Count: 1
Auth Type: No Authentication (0)
Adver Int: 1
Checksum: 0x3c01 [correct]
IP Address: 10.0.0.254 (10.0.0.254)
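The check described in this answer can be roughly automated. The sketch below is only an illustration: the function names vrrp_senders and peer_heard are made up here, and it assumes tcpdump's one-line text output in the form shown in the question (time, "IP", sender, ">", destination). It lists the distinct senders of VRRP advertisements and reports whether any of them is a host other than the local one.

```shell
# Sketch: read tcpdump text output on stdin and print each distinct
# sender of a VRRP advertisement, one per line.
# Expected line shape (assumption, taken from the question's capture):
#   11:01:35.526355 IP zhsq1 > 224.0.0.18: VRRPv2, Advertisement, ...
vrrp_senders() {
    awk '$2 == "IP" && $4 == ">" && /VRRP/ { print $3 }' | sort -u
}

# Sketch: succeed (exit 0) if at least one advertisement came from a
# sender other than the name or address given as the first argument.
peer_heard() {
    vrrp_senders | grep -qvxF -- "$1"
}
```

Possible usage as root on one node, assuming the interface and address from the question: tcpdump -n -c 10 -i em1 'ip proto 112' | peer_heard 10.0.7.60 && echo "peer advertisements are arriving". If the command prints nothing, only the local node's own advertisements were captured, which matches the split-brain symptom.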
Solution 3
In my case I had to allow multicast traffic to 224.0.0.18 through the firewall. For ufw:
ufw allow from 224.0.0.18
ufw allow to 224.0.0.18
This helped me.
Solution 4
In my case, on CentOS/RHEL 8, the only thing I had to do to solve this Keepalived split-brain issue, where both servers held the VIP, was to allow the vrrp protocol with a firewall rich rule. Separately, I had to set a sysctl kernel flag so that HAProxy could bind to the nonlocal VIP.
For sysctl, add net.ipv4.ip_nonlocal_bind = 1 to the /etc/sysctl.conf file and then run sysctl -p to reload the sysctl config. I needed this NOT for the Keepalived split-brain scenario but so that HAProxy could bind to its own IP address for stats (ex: bind 192.168.0.10:1492/stats) and to the VIP (virtual IP) address for load-balancing web traffic (bind 192.168.0.34:80 and bind 192.168.0.34:443). Otherwise, the HAProxy service failed to start, stating that it could not bind to ports 80 and 443 with only the VIP address. I was doing this to avoid having bind :80 and bind :443. Also, it feels like a no-brainer but is easily overlooked: if you cannot reach the stats page, check that you have allowed the port you are using for stats through the firewall.
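As a rough sketch of the sysctl step, the helper below writes a key to a sysctl-style file so that it appears exactly once, even if the step is repeated. The function name ensure_sysctl_line is hypothetical and not part of any tool; only net.ipv4.ip_nonlocal_bind, /etc/sysctl.conf, and sysctl -p come from this answer.

```shell
# Sketch: make sure "key = value" appears exactly once in a sysctl-style
# file. Any existing assignment of the same key is dropped first, so
# running this twice does not leave duplicate lines.
# Arguments: file, key, value (the function name is illustrative).
ensure_sysctl_line() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    if [ -f "$file" ]; then
        # Keep every line whose name (the text before "=") is not the key.
        awk -F= -v k="$key" '{ name = $1; gsub(/[ \t]/, "", name); if (name != k) print }' "$file" > "$tmp"
    fi
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    # Write through cat so the target file keeps its owner and mode.
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

Possible usage as root: ensure_sysctl_line /etc/sysctl.conf net.ipv4.ip_nonlocal_bind 1 && sysctl -p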
For the firewall, execute the following commands:
# firewall-cmd --add-rich-rule='rule protocol value="vrrp" accept' --permanent
# firewall-cmd --reload
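To confirm that the rule is active after the reload, something like the following sketch may help. The function name has_vrrp_rule is made up for illustration, and the sketch assumes that firewall-cmd --list-rich-rules prints the rule in the same textual form in which it was added.

```shell
# Sketch: read rich rules on stdin (one per line) and succeed if one of
# them accepts the vrrp protocol. The function name is illustrative.
has_vrrp_rule() {
    grep 'protocol value="vrrp"' | grep -q 'accept'
}
```

Possible usage as root: firewall-cmd --list-rich-rules | has_vrrp_rule && echo "vrrp is allowed"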
I found these flags and other information directly from RedHat documentation for HAProxy and Keepalived:
Firewall reference: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/load_balancer_administration/s1-lvs-connect-vsa
Nonlocal bind flag reference (this was used for HAProxy though as I was not using Keepalived for load-balancing): https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/load_balancer_administration/s1-initial-setup-forwarding-vsa
Additionally, if HAProxy still fails to bind to ports, you might want to check whether good ol' SELinux is blocking it. For me, on CentOS 8, I had to run semanage port -a -t http_port_t -p tcp 1492 for my HAProxy stats page.
Solution 5
I just came here looking for help with the same issue but none of the other answers helped. I have tracked down my issue though, so will leave it here for future web searchers.
The scenario you are running into, "split-brain" as they call it, is caused, just as another answer said, by a communication failure between the two keepalived instances: specifically, the multicast VRRP advertisements are not getting through.
For me, the actual issue was that I was testing with VMs set up by libvirt with a macvtap network, which by default blocks incoming multicast packets.
The fix for me was to run virsh net-edit mymacvtapnetwork and then change the first line from <network> to <network trustGuestRxFilters='yes'>.
For more info about the trustGuestRxFilters setting, see the links:
- https://superuser.com/questions/944678/how-to-configure-macvtap-to-let-it-pass-multicast-packet-correctly
- https://bugzilla.redhat.com/show_bug.cgi?id=1035253#c15
I was also able to see this like OP did: running tcpdump host 224.0.0.18 showed the VRRP advertisements being sent from each server but not being received by the other.
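A quick way to check whether the attribute is present in the network definition is sketched below. The function name trusts_guest_rx_filters is hypothetical, and mymacvtapnetwork is just the example network name used in this answer.

```shell
# Sketch: read libvirt network XML on stdin and succeed if the <network>
# element carries trustGuestRxFilters='yes' (single or double quotes).
# The function name is illustrative.
trusts_guest_rx_filters() {
    grep -Eq "<network[^>]*trustGuestRxFilters=['\"]yes['\"]"
}
```

Possible usage: virsh net-dumpxml mymacvtapnetwork | trusts_guest_rx_filters && echo "guest RX filters are trusted"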
riverhuang82
Updated on September 18, 2022
Comments
-
riverhuang82 almost 2 years
Both servers started keepalived, and the BACKUP server transitioned to MASTER state immediately, so both are now MASTER.
Both nodes are sending VRRP advertisement messages.
on master server:
[root@zhsq1 ~]# tcpdump -c 3 -i em1 host 224.0.0.18
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on em1, link-type EN10MB (Ethernet), capture size 65535 bytes
11:01:35.526355 IP zhsq1 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 153, authtype simple, intvl 1s, length 20
11:01:36.526497 IP zhsq1 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 153, authtype simple, intvl 1s, length 20
11:01:37.527561 IP zhsq1 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 153, authtype simple, intvl 1s, length 20
on the backup server:
[root@zhsq2 ~]# tcpdump -c 3 -i em1 host 224.0.0.18
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on em1, link-type EN10MB (Ethernet), capture size 65535 bytes
11:11:04.314996 IP zhsq2 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 102, authtype simple, intvl 1s, length 20
11:11:05.315111 IP zhsq2 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 102, authtype simple, intvl 1s, length 20
11:11:06.316175 IP zhsq2 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 102, authtype simple, intvl 1s, length 20
below is the master server log:
May 31 11:00:22 zhsq1 Keepalived[31475]: Starting Keepalived v1.2.7 (05/20,2013)
May 31 11:00:22 zhsq1 Keepalived[31476]: Starting Healthcheck child process, pid=31477
May 31 11:00:22 zhsq1 Keepalived[31476]: Starting VRRP child process, pid=31478
May 31 11:00:22 zhsq1 Keepalived_healthcheckers[31477]: Interface queue is empty
May 31 11:00:22 zhsq1 Keepalived_healthcheckers[31477]: No such interface, em2
May 31 11:00:22 zhsq1 Keepalived_healthcheckers[31477]: Netlink reflector reports IP 10.0.7.60 added
May 31 11:00:22 zhsq1 Keepalived_healthcheckers[31477]: Netlink reflector reports IP fe80::92b1:1cff:fe4c:bea8 added
May 31 11:00:22 zhsq1 Keepalived_healthcheckers[31477]: Registering Kernel netlink reflector
May 31 11:00:22 zhsq1 Keepalived_healthcheckers[31477]: Registering Kernel netlink command channel
May 31 11:00:22 zhsq1 Keepalived_vrrp[31478]: Interface queue is empty
May 31 11:00:22 zhsq1 Keepalived_vrrp[31478]: No such interface, em2
May 31 11:00:22 zhsq1 Keepalived_vrrp[31478]: Netlink reflector reports IP 10.0.7.60 added
May 31 11:00:22 zhsq1 Keepalived_vrrp[31478]: Netlink reflector reports IP fe80::92b1:1cff:fe4c:bea8 added
May 31 11:00:22 zhsq1 Keepalived_vrrp[31478]: Registering Kernel netlink reflector
May 31 11:00:22 zhsq1 Keepalived_vrrp[31478]: Registering Kernel netlink command channel
May 31 11:00:22 zhsq1 Keepalived_vrrp[31478]: Registering gratuitous ARP shared channel
May 31 11:00:22 zhsq1 Keepalived_healthcheckers[31477]: Opening file '/etc/keepalived/keepalived.conf'.
May 31 11:00:22 zhsq1 Keepalived_healthcheckers[31477]: Configuration is using : 4661 Bytes
May 31 11:00:22 zhsq1 Keepalived_vrrp[31478]: Opening file '/etc/keepalived/keepalived.conf'.
May 31 11:00:22 zhsq1 Keepalived_vrrp[31478]: Configuration is using : 63856 Bytes
May 31 11:00:22 zhsq1 Keepalived_vrrp[31478]: Using LinkWatch kernel netlink reflector...
May 31 11:00:22 zhsq1 Keepalived_vrrp[31478]: VRRP sockpool: [ifindex(2), proto(112), fd(11,12)]
May 31 11:00:22 zhsq1 Keepalived_healthcheckers[31477]: Using LinkWatch kernel netlink reflector...
May 31 11:00:22 zhsq1 Keepalived_vrrp[31478]: VRRP_Script(chk_http_port) succeeded
May 31 11:00:23 zhsq1 Keepalived_vrrp[31478]: VRRP_Instance(VI_1) Transition to MASTER STATE
May 31 11:00:24 zhsq1 Keepalived_vrrp[31478]: VRRP_Instance(VI_1) Entering MASTER STATE
May 31 11:00:24 zhsq1 Keepalived_vrrp[31478]: VRRP_Instance(VI_1) setting protocol VIPs.
May 31 11:00:24 zhsq1 Keepalived_vrrp[31478]: VRRP_Instance(VI_1) Sending gratuitous ARPs on em1 for 10.0.7.65
May 31 11:00:24 zhsq1 Keepalived_healthcheckers[31477]: Netlink reflector reports IP 10.0.7.65 added
May 31 11:00:29 zhsq1 Keepalived_vrrp[31478]: VRRP_Instance(VI_1) Sending gratuitous ARPs on em1 for 10.0.7.65
below is the backup server log:
May 31 11:01:50 zhsq2 Keepalived[31250]: Starting Keepalived v1.2.7 (05/20,2013)
May 31 11:01:50 zhsq2 Keepalived[31251]: Starting Healthcheck child process, pid=31252
May 31 11:01:50 zhsq2 Keepalived[31251]: Starting VRRP child process, pid=31253
May 31 11:01:50 zhsq2 Keepalived_healthcheckers[31252]: Interface queue is empty
May 31 11:01:50 zhsq2 Keepalived_healthcheckers[31252]: No such interface, em2
May 31 11:01:50 zhsq2 Keepalived_healthcheckers[31252]: Netlink reflector reports IP 10.0.7.61 added
May 31 11:01:50 zhsq2 Keepalived_healthcheckers[31252]: Netlink reflector reports IP fe80::92b1:1cff:fe4c:b8b7 added
May 31 11:01:50 zhsq2 Keepalived_healthcheckers[31252]: Registering Kernel netlink reflector
May 31 11:01:50 zhsq2 Keepalived_healthcheckers[31252]: Registering Kernel netlink command channel
May 31 11:01:50 zhsq2 Keepalived_vrrp[31253]: Interface queue is empty
May 31 11:01:50 zhsq2 Keepalived_vrrp[31253]: No such interface, em2
May 31 11:01:50 zhsq2 Keepalived_vrrp[31253]: Netlink reflector reports IP 10.0.7.61 added
May 31 11:01:50 zhsq2 Keepalived_vrrp[31253]: Netlink reflector reports IP fe80::92b1:1cff:fe4c:b8b7 added
May 31 11:01:50 zhsq2 Keepalived_vrrp[31253]: Registering Kernel netlink reflector
May 31 11:01:50 zhsq2 Keepalived_vrrp[31253]: Registering Kernel netlink command channel
May 31 11:01:50 zhsq2 Keepalived_vrrp[31253]: Registering gratuitous ARP shared channel
May 31 11:01:50 zhsq2 Keepalived_healthcheckers[31252]: Opening file '/etc/keepalived/keepalived.conf'.
May 31 11:01:50 zhsq2 Keepalived_healthcheckers[31252]: Configuration is using : 4661 Bytes
May 31 11:01:50 zhsq2 Keepalived_vrrp[31253]: Opening file '/etc/keepalived/keepalived.conf'.
May 31 11:01:50 zhsq2 Keepalived_vrrp[31253]: Configuration is using : 63856 Bytes
May 31 11:01:50 zhsq2 Keepalived_vrrp[31253]: Using LinkWatch kernel netlink reflector...
May 31 11:01:50 zhsq2 Keepalived_vrrp[31253]: VRRP_Instance(VI_1) Entering BACKUP STATE
May 31 11:01:50 zhsq2 Keepalived_vrrp[31253]: VRRP sockpool: [ifindex(2), proto(112), fd(11,12)]
May 31 11:01:50 zhsq2 Keepalived_healthcheckers[31252]: Using LinkWatch kernel netlink reflector...
May 31 11:01:50 zhsq2 Keepalived_vrrp[31253]: VRRP_Script(chk_http_port) succeeded
May 31 11:01:54 zhsq2 Keepalived_vrrp[31253]: VRRP_Instance(VI_1) Transition to MASTER STATE
May 31 11:01:55 zhsq2 Keepalived_vrrp[31253]: VRRP_Instance(VI_1) Entering MASTER STATE
May 31 11:01:55 zhsq2 Keepalived_vrrp[31253]: VRRP_Instance(VI_1) setting protocol VIPs.
May 31 11:01:55 zhsq2 Keepalived_vrrp[31253]: VRRP_Instance(VI_1) Sending gratuitous ARPs on em1 for 10.0.7.65
May 31 11:01:55 zhsq2 Keepalived_healthcheckers[31252]: Netlink reflector reports IP 10.0.7.65 added
May 31 11:02:00 zhsq2 Keepalived_vrrp[31253]: VRRP_Instance(VI_1) Sending gratuitous ARPs on em1 for 10.0.7.65
the master server's keepalived conf is below:
vrrp_script chk_http_port {
    script "/opt/nginx/nginx_pid.sh"
    interval 2
    weight 2
}
vrrp_instance VI_1 {
    state MASTER
    #nopreempt
    interface em1
    virtual_router_id 51
    priority 151
    mcast_src_ip 10.0.7.60
    track_interface {
        em1
    }
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    track_script {
        chk_http_port
    }
    virtual_ipaddress {
        10.0.7.65 dev em1
    }
}
the backup server's keepalived conf is below:
vrrp_script chk_http_port {
    script "/opt/nginx/nginx_pid.sh"
    interval 2
    weight 2
}
vrrp_instance VI_1 {
    state BACKUP
    interface em1
    virtual_router_id 51
    priority 100
    mcast_src_ip 10.0.7.61
    track_interface {
        em1
    }
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    track_script {
        chk_http_port
    }
    virtual_ipaddress {
        10.0.7.65 dev em1
    }
}
the chk_http_port file is below:
NGINX_PROCESS=`ps -C nginx --no-header | wc -l`
if [ $NGINX_PROCESS -eq 0 ]; then
    /usr/local/nginx/sbin/nginx
    sleep 3
    if [ `ps -C nginx --no-header | wc -l` -eq 0 ]; then
        killall keepalived
    fi
fi
Please help me.
Thanks a lot.
-
Greg Petersen about 11 years
On both nodes: cat /etc/keepalived/keepalived.conf ?
-
riverhuang82 about 11 years
I have updated the conf info.
-
Greg Petersen about 11 years
cat /opt/nginx/nginx_pid.sh ? Pay attention to the line VRRP_Script(chk_http_port) succeeded in the log file on the BACKUP server.
-
riverhuang82 about 11 years
I have posted the shell file.
-
Mike about 11 years
You have a split brain due to the two keepalived instances not being able to talk to each other.
-
riverhuang82 about 11 years
Thanks a lot. I will check it later. I am sure both servers can ping each other normally.
-
Steve Townsend about 11 years
Nothing runs over ping. Check your firewall to ensure that the appropriate packets (multicast in this case) are allowed.
-
Steve Townsend about 11 years
Basic troubleshooting time. Stop keepalived on both, run tshark on one, see if you get packets when you start keepalived on the other.
-
Steve Townsend over 10 years
Yeah, like I said. Packets are not passing between machines on the em1 interface. :P
-
Wim Deblauwe about 10 years
It is also possible to use tcpdump instead of tshark: cyberciti.biz/faq/linux-unix-verify-keepalived-working-or-not
-
augurar over 8 years
Pretty poor form to accept your own answer when a correct and more complete answer is present.
-
elmonkeylp over 6 years
Agreed with augurar.
-
Steve Townsend about 6 years
Again, the multicast packets aren't making it between the machines. Different underlying root cause, but same intermediate symptom. Those macvtap interfaces just cause so much unexpected trouble.
-
Dan about 6 years
@MikeyB do you recommend something else? I wanted the VMs to acquire IPs from DHCP over my LAN instead of from the default libvirt NAT network. Besides the multicast issue, macvtap seems to be working great, but I have barely used it. Should I avoid it?
-
Steve Townsend about 6 years
Sure, use a bridge network instead of macvtap.
-
Chang Zhao over 2 years
Is 224.0.0.18 your virtual IP?