Bond slave interfaces not getting the same Aggregator ID with LACP

After digging into the documentation and doing some testing, I found out that when a server uses bonding you need to explicitly enable link monitoring of the slave links, for instance with the miimon parameter of the bonding module.

While looking at /proc/net/bonding/bond0, I could see one of the devices with its MII status reported as down, whereas the link was actually up at the physical level.
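
For reference, the per-slave link state and aggregator assignment can be pulled out of that status file like this (bond0 is the bond from the configuration below, nothing else assumed):

# show the per-slave link state and aggregator assignment
grep -E "Slave Interface|MII Status|Aggregator ID" /proc/net/bonding/bond0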

https://access.redhat.com/articles/172483#Link_Monitoring_Modes states that:

It is critical that a link monitoring mode, either the miimon or arp_interval and arp_ip_target parameters be specified. Configuring a bond without a link monitoring mode is not a valid use of the bonding driver
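
On a running system you can quickly check whether any MII monitoring is active by reading the current value back from sysfs (a small sketch; bond0 is the bond from this setup, and 0 means monitoring is disabled):

# prints the polling interval in ms, 0 if MII monitoring is disabled
cat /sys/class/net/bond0/bonding/miimon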

So, to set this in the ifcfg-bond0 file, you pass the parameter through the BONDING_OPTS option:

#/etc/sysconfig/network-scripts/ifcfg-bond0
...
BONDING_OPTS="mode=802.3ad lacp_rate=slow xmit_hash_policy=layer3+4 miimon=100"
...

This forces the driver to poll the links every 100 ms.

Restart the network service to apply the change.
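
For example, depending on the RHEL release (these are the standard commands when NetworkManager does not manage the interface, as is the case here with NM_CONTROLLED=no):

# RHEL 6
service network restart
# RHEL 7
systemctl restart network

Afterwards both slaves should report the same Aggregator ID in /proc/net/bonding/bond0.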

Comments

  • Baptiste Mille-Mathias over 1 year

    I have a bug on some servers where LACP (802.3ad) is not working. All servers have a bonding device bond0 with two Ethernet slaves; each interface is plugged into a different switch, and both switches are configured for LACP.

    Everything seemed to be OK, but a network engineer detected that MLAG (Arista's multi-chassis link aggregation implementation) was not working even though the physical links were up.

    When I looked at /proc/net/bonding/bond0 on the affected servers, I found that each interface had a different Aggregator ID. On healthy servers the Aggregator ID is the same for both slaves.
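
    For illustration, the per-slave 802.3ad section of /proc/net/bonding/bond0 on an affected server looks roughly like this (trimmed to the relevant fields; interface names and IDs are examples):

    Slave Interface: eno49
    MII Status: up
    Aggregator ID: 1

    Slave Interface: eno50
    MII Status: up
    Aggregator ID: 2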

    The issue can be reproduced by switching the switch port off and on: although the physical link comes back up, MLAG stays down. The bug is present on RHEL 6 and 7 (but not all servers are affected).

    Configuration

    #/etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    MACADDR=14:02:ec:44:e9:80
    IPADDR=xxx.xxx.xxx.xxx
    NETMASK=xxx.xxx.xxx.xxx
    BONDING_OPTS="mode=802.3ad lacp_rate=slow xmit_hash_policy=layer3+4"
    BOOTPROTO=none
    ONBOOT=yes
    USERCTL=no
    NM_CONTROLLED=no
    PEERDNS=no
    
    # /etc/sysconfig/network-scripts/ifcfg-eno49 (same for other interface)
    HWADDR=14:02:ec:44:e9:80
    MASTER=bond0
    SLAVE=yes
    BOOTPROTO=none
    ONBOOT=yes
    USERCTL=no
    NM_CONTROLLED=no
    PEERDNS=no
    

    We have a workaround now - bringing the Ethernet interface down and up on the server, as shown below - but this is not ideal.
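
    For the record, the workaround amounts to something like this on the affected server (the interface name is just an example):

    ip link set eno49 down
    ip link set eno49 up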

    To check the LACP protocol traffic, I ran:

    tcpdump -i eno49 -tt -vv -nnn ether host 01:80:c2:00:00:02
    

    I can see a packet every 30 seconds on one interface, but on the other I see a packet every second, as if it were still trying to establish the LACP session.

    Do you have a way to troubleshoot and fix that?

    (Sorry if I did not use the right networking terms; I'm not really skilled in LACP.)

    Thanks