bond slave interfaces not getting the same aggregator id on LACP
After digging into some documentation and doing some testing, I found out that when a server uses bonding, you need to force link monitoring with the miimon parameter of the bonding module.
Looking at /proc/net/bonding/bond0, I could see that one of the devices had its MII status reported as down, whereas at the link level it was actually up.
https://access.redhat.com/articles/172483#Link_Monitoring_Modes states that:

It is critical that a link monitoring mode, either the miimon or arp_interval and arp_ip_target parameters, be specified. Configuring a bond without a link monitoring mode is not a valid use of the bonding driver.
So to set that in the ifcfg-bond0 file, you add the parameter to the BONDING_OPTS options:
#/etc/sysconfig/network-scripts/ifcfg-bond0
...
BONDING_OPTS="mode=802.3ad lacp_rate=slow xmit_hash_policy=layer3+4 miimon=100"
...

This forces the driver to poll the links every 100 ms.
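As a sanity check before restarting, a small sketch (the function name is illustrative, not part of any tool) that verifies a BONDING_OPTS string carries one of the link-monitoring options Red Hat requires, i.e. miimon, or arp_interval together with arp_ip_target:

```shell
# Sketch: return 0 if the given BONDING_OPTS value contains a
# link-monitoring option, 1 otherwise.
has_link_monitoring() {
  # Pad with spaces so each option can be matched as a whole word.
  case " $1 " in
    *" miimon="*) return 0 ;;
  esac
  case " $1 " in
    *" arp_interval="*)
      case " $1 " in
        *" arp_ip_target="*) return 0 ;;
      esac ;;
  esac
  return 1
}

# Example:
#   has_link_monitoring "mode=802.3ad lacp_rate=slow miimon=100" && echo ok
```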
Restart the network service to apply the change.
Baptiste Mille-Mathias
Updated on September 18, 2022

Comments
-
Baptiste Mille-Mathias over 1 year
I have a bug on some servers where LACP (802.3ad) is not working. On all servers I have a bonding device bond0 with two eth slaves; each interface is plugged into a different switch, and both switches are configured with LACP. Everything seems to be OK, but a network engineer detected that MLAG (the Arista LACP implementation) was not working while the physical devices were up.
When I looked at /proc/net/bonding/bond0 on the affected servers, I found that each interface has a different Aggregator ID; on nominal servers the Aggregator ID is the same. The issue can be reproduced by switching the port off and on on the switch: despite the physical link being up, MLAG stays down. The bug is present on RHEL 6 and 7 (but not all servers are affected).
Configuration
#/etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
MACADDR=14:02:ec:44:e9:80
IPADDR=xxx.xxx.xxx.xxx
NETMASK=xxx.xxx.xxx.xxx
BONDING_OPTS="mode=802.3ad lacp_rate=slow xmit_hash_policy=layer3+4"
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
NM_CONTROLLED=no
PEERDNS=no

# /etc/sysconfig/network-scripts/ifcfg-eno49 (same for other interface)
HWADDR=14:02:ec:44:e9:80
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
NM_CONTROLLED=no
PEERDNS=no
We have a workaround now - setting the eth interfaces down and up on the server - but this is not ideal.

To check the LACP protocol, I ran:

tcpdump -i eno49 -tt -vv -nnn ether host 01:80:c2:00:00:02

I can see a packet every 30 seconds on one interface, but on the other I see a packet every 1 second, as if it was trying to establish the LACP session.
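The two intervals match the two LACP timers: LACPDUs roughly every 30 s correspond to the slow rate, roughly every 1 s to the fast rate (which can also show up while a port is still negotiating). A small sketch that averages the gaps between the epoch timestamps tcpdump -tt prints as the first field of each line (read from stdin):

```shell
# Sketch: print the average interval, in seconds, between packets
# captured with `tcpdump -tt` (epoch timestamp is the first field).
lacp_interval() {
  awk 'NR > 1 { sum += $1 - prev; n++ }
       { prev = $1 }
       END { if (n) printf "%.1f\n", sum / n }'
}

# Example: capture 10 LACPDUs, then report the average spacing.
#   tcpdump -c 10 -i eno49 -tt -nn ether host 01:80:c2:00:00:02 | lacp_interval
```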
Do you have a way to troubleshoot and fix that?
(Sorry if I did not use the right networking terms; I'm not really skilled in LACP.)
Thanks