CentOS 6 interfaces bonding, round-robin instead of active-backup, duplicates frames

centos network-interface duplicate bonding


After some deeper investigation, I found the cause of both the round-robin and the DUP problems. They are in fact related.

round robin (0) instead of active-backup (1)

On CentOS 5+, and apparently on 6.6 in particular, the recommended way to set the bonding parameters is the BONDING_OPTS variable directly in ifcfg-bond0, not the bonding module options (which makes sense):

DEVICE=bond0
...
BONDING_OPTS="mode=1 miimon=100"

(mode may be specified as '1' or as 'active-backup')
After adding the line, everything worked as expected.
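As a sketch, the complete ifcfg-bond0 might then look like the following. It assumes the remaining settings are unchanged from the file shown in the question; the path in the first comment is the usual CentOS location and is not stated in the original post.

```shell
# /etc/sysconfig/network-scripts/ifcfg-bond0 (assumed location)
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
NETWORK=10.1.1.0
NETMASK=255.255.255.0
IPADDR=10.1.1.11
USERCTL=no
NM_CONTROLLED=no
# Bonding parameters are set here rather than in the module options.
# mode=1 is active-backup; miimon=100 polls link state every 100 ms.
BONDING_OPTS="mode=1 miimon=100"
```

The network service has to be restarted (or the host rebooted) before the new mode shows up in /proc/net/bonding/bond0.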

duplicated ping frames

In round-robin mode, both interfaces are used. When the interfaces are connected to two different switches, the first ping replies may be duplicated:

It is not uncommon to observe a short burst of duplicated traffic when the bonding device is first used, or after it has been idle for some period of time. This is most easily observed by issuing a "ping" to some other host on the network, and noticing that the output from ping flags duplicates (typically one per slave).

For example, on a bond in active-backup mode with five slaves all connected to one switch, the output may appear as follows:

    # ping -n 10.0.4.2
    PING 10.0.4.2 (10.0.4.2) from 10.0.3.10 : 56(84) bytes of data.
    64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.7 ms
    64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)

This is not due to an error in the bonding driver, rather, it is a side effect of how many switches update their MAC forwarding tables.

After switching to active-backup, no more DUPs were observed.
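One way to check for duplicates without reading the ping output by eye is to count the lines that ping flags with "DUP!". This is only a minimal sketch; the helper name `count_dups` is made up for illustration.

```shell
# Count how many reply lines ping flagged as duplicates.
# Reads ping output on stdin and prints the count (0 if none).
count_dups() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c 'DUP!' || true
}
```

For example, `ping -c 5 10.1.1.1 | count_dups`, where 10.1.1.1 is a placeholder for some other host on the bond0 network.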

This is explained in detail in the kernel bonding documentation:

https://www.kernel.org/doc/Documentation/networking/bonding.txt


Déjà vu


Updated on September 18, 2022

Comments

Déjà vu almost 2 years
Two interfaces, eth0 and eth1, are part of a network bond, bond0, on CentOS 6.

Everything worked well under CentOS 5. After the upgrade to CentOS 6.6 with the same configuration, the network still works, but:
- despite setting mode=1 (or even mode=active-backup) in the options line of /etc/modprobe.d/bonding.conf, the status in /proc/net/bonding/bond0 always shows load balancing (round-robin), not active-backup as it should.
- on the first ping to a LAN address (one that belongs to the bond0 network) after a reboot, the first frame is flagged DUP! (duplicated). The DUP does not happen again on further pings. This is likely due to round-robin being used instead of active-backup.
/etc/modprobe.d/bonding.conf:
```
alias bond0 bonding
options bond0 mode=1 miimon=100
```
ifcfg-bond0:
```
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
NETWORK=10.1.1.0
NETMASK=255.255.255.0
IPADDR=10.1.1.11
USERCTL=no
NM_CONTROLLED=no
```
ifcfg-eth0:
```
DEVICE=eth0
BOOTPROTO=none
HWADDR=00:22:35:12:26:18
UUID=12fa32c2-e421-47f6-8d25-11414a664318
TYPE=Ethernet
ONBOOT=yes
NM_CONTROLLED=no
MASTER=bond0
SLAVE=yes
USERCTL=no
```
ifcfg-eth1:
```
DEVICE=eth1
BOOTPROTO=none
HWADDR=00:22:35:12:26:19
UUID=12fa32c2-e421-47f6-8d25-11414a664319
TYPE=Ethernet
ONBOOT=yes
NM_CONTROLLED=no
MASTER=bond0
SLAVE=yes
USERCTL=no
```
All updates have been applied. NetworkManager is disabled.

The main problem now seems to be the mode: round-robin instead of active-backup.
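The mode the kernel is actually using can be pulled out of the status file with a small helper. This is a sketch that assumes the usual "Bonding Mode:" line in /proc/net/bonding/bond0; the function name `bond_mode` is made up for illustration.

```shell
# Print the bonding mode reported by the kernel for a bond.
# Optional argument: path to the status file (defaults to bond0).
bond_mode() {
    status_file="${1:-/proc/net/bonding/bond0}"
    # Keep only the text after "Bonding Mode: ".
    sed -n 's/^Bonding Mode: //p' "$status_file"
}
```

Calling `bond_mode` with no argument reads bond0. In the situation described here it prints `load balancing (round-robin)`; with mode 1 in effect it should print `fault-tolerance (active-backup)`.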
- Bratchley over 9 years
  
  It might also be worth it to post the stack trace somehow so we can see what threads are involved in the panic.
- Déjà vu over 9 years
  
  @Bratchley Actually, a number of updates + BIOS seem to have fixed the kernel panic. However, whatever the options in bonding.conf, i.e. mode=1 or mode=active-backup, the status from /proc/net/bonding/bond0 always shows load balancing (round-robin). I'll edit the question.
- Bratchley over 9 years
  
  It might be worth it to post that as a new question since it doesn't relate to the original issue with kernel panics or the DUP message. That would get more eyes on the problem since people see new questions before they see updated questions.
- Déjà vu over 9 years
  
  @Bratchley Added an answer that explains what happened. Thanks.