LACP with 2 NICs working when either one is down, not when both are up

The only LACP config I managed to get working in Ubuntu is this:

auto bond0
iface bond0 inet dhcp
  bond-mode 4
  bond-slaves none
  bond-miimon 100
  bond-lacp-rate 1
  bond-updelay 200 
  bond-downdelay 200

auto eth0
iface eth0 inet manual
  bond-master bond0

auto eth1
iface eth1 inet manual
  bond-master bond0

That is, I don't list the slaves with bond-slaves; instead, each slave interface points at the bond with bond-master. I'm not sure exactly what the difference is, but this is the configuration that worked for me.

I don't have any issues with LACP in my setup, although this is with 1GbE networking.

If you're still having problems, try plugging both cables into the same switch and configuring those ports for LACP, just to eliminate the possibility of issues with multi-chassis LACP.
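
For that single-switch test, the Nexus side would look something like the sketch below. This is only a sketch: the interface names, channel-group number and VLAN are placeholders, and the main point is that the member ports join the channel with LACP in active mode rather than as a static channel.

interface port-channel20
  switchport access vlan 11
  spanning-tree port type edge

interface Ethernet1/7
  switchport access vlan 11
  channel-group 20 mode active

interface Ethernet1/8
  switchport access vlan 11
  channel-group 20 mode active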

Comments

  • Tolli
    Tolli almost 2 years

    I'm running into problems with getting a LACP trunk to operate properly on Ubuntu 12.04.2 LTS.

    My setup is a single host connected with two 10GbE interfaces to two separate Nexus 5548 switches, with vPC configured to enable multi-chassis LACP. The Nexus config is as per Cisco guidelines, and the Ubuntu config as per https://help.ubuntu.com/community/UbuntuBonding

    The server is connected to port Ethernet1/7 on each Nexus switch; the ports are configured identically and placed in Port-channel 15. Port-channel 15 is configured as vPC 15, and the vPC output looks good. These are simple access ports, i.e. no 802.1Q trunking involved.

    Diagram:

        +----------+      +----------+      +----------+      +----------+
        | client 1 |------| nexus 1  |------| nexus 2  |------| client 2 |
        +----------+      +----------+      +----------+      +----------+
                               |                  |
                               |    +--------+    |
                               +----| server |----+
                               eth4 +--------+ eth5
    

    When either link is down, both clients 1 and 2 are able to reach the server. However, when I bring the second link up, the client connected to the switch with the newly enabled link is unable to reach the server. See the following table for the state transitions and results:

       port states (down by means of "shutdown")
         nexus 1 eth1/7        up     up    down   up
         nexus 2 eth1/7       down    up     up    up
    
       connectivity
        client 1 - server      OK     OK     OK   FAIL
        client 2 - server      OK    FAIL    OK    OK
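
    For reference, "down" in the table above just means the port was administratively disabled on the relevant Nexus (shutdown under the interface, no shutdown to bring it back up), e.g.:

        configure terminal
        interface Ethernet1/7
          shutdown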
    

    Now, I believe I've isolated the issue to the Linux side. In the up-up state, each Nexus uses its local link to the server to deliver the packets, as verified by looking at the MAC address table. What I can see on the server is that the packets from each client are received on the physical interfaces (packets from client 1 on eth4, packets from client 2 on eth5), using tcpdump -i ethX, but when I run tcpdump -i bond0 I only see traffic from one of the clients (in accordance with what I stated above).

    I observe the same behaviour for ARP and ICMP (IP) traffic: ARP from a client fails when both links are up, and works (along with ping) when one is down; ping fails again when I re-enable the link (packets are still received on the eth interface, but not on bond0).
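
    Concretely, the checks above amount to something like the following (the -n/-e flags and the "arp or icmp" filter are just to make the output easier to read):

        # what arrives on the physical slave interfaces
        tcpdump -n -e -i eth4 arp or icmp
        tcpdump -n -e -i eth5 arp or icmp

        # what is actually delivered to the bond
        tcpdump -n -e -i bond0 arp or icmp

        # aggregator/LACP state as seen by the bonding driver
        cat /proc/net/bonding/bond0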

    To clarify, I'm setting up multiple servers in this configuration, and all show the same symptoms, so it doesn't appear to be hardware related.

    So, figuring out how to fix that is what I'm dealing with; my Googling has not brought me any luck so far.

    Any pointers are highly appreciated.

    /etc/network/interfaces

        auto eth4
        iface eth4 inet manual
        bond-master bond0
    
        auto eth5
        iface eth5 inet manual
        bond-master bond0
    
        auto bond0
        iface bond0 inet static
        address 10.0.11.5
        netmask 255.255.0.0
        gateway 10.0.0.3
        mtu 9216
        dns-nameservers 8.8.8.8 8.8.4.4
        bond-mode 4
        bond-miimon 100
        bond-lacp-rate 1
        #bond-slaves eth4
        bond-slaves eth4 eth5
    

    /proc/net/bonding/bond0 (a little further information):

        Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
    
        Bonding Mode: IEEE 802.3ad Dynamic link aggregation
        Transmit Hash Policy: layer2 (0)
        MII Status: up
        MII Polling Interval (ms): 100
        Up Delay (ms): 0
        Down Delay (ms): 0
    
        802.3ad info
        LACP rate: fast
        Min links: 0
        Aggregator selection policy (ad_select): stable
        Active Aggregator Info:
        Aggregator ID: 1
        Number of ports: 1
        Actor Key: 33
        Partner Key: 1
        Partner Mac Address: 00:00:00:00:00:00
    
        Slave Interface: eth4
        MII Status: up
        Speed: 10000 Mbps
        Duplex: full
        Link Failure Count: 8
        Permanent HW addr: 90:e2:ba:3f:d1:8c
        Aggregator ID: 1
        Slave queue ID: 0
    
        Slave Interface: eth5
        MII Status: up
        Speed: 10000 Mbps
        Duplex: full
        Link Failure Count: 13
        Permanent HW addr: 90:e2:ba:3f:d1:8d
        Aggregator ID: 2
        Slave queue ID: 0
    

    EDIT: Added config from Nexus

        vpc domain 100
          role priority 4000
          system-priority 4000
          peer-keepalive destination 10.141.10.17 source 10.141.10.12
          peer-gateway
          auto-recovery
        interface port-channel15
          description server5
          switchport access vlan 11
          spanning-tree port type edge
          speed 10000
          vpc 15
        interface Ethernet1/7
          description server5 internal eth4
          no cdp enable
          switchport access vlan 11
          channel-group 15
    

    EDIT: Added results from a non-vPC port-channel on nexus 1 for the same server, before and after an IP change (the IP was changed to influence the load-balancing algorithm). This is still using the same settings on the server.

          port states (down by means of "shutdown")
            nexus 1 eth1/7        up     up    down   up
            nexus 1 eth1/14      down    up     up    up <= port moved from nexus 2 eth1/7
    
       connectivity (server at 10.0.11.5, hashing uses Eth1/14)
           client 1 - server      OK     OK     OK   FAIL
           client 2 - server      OK     OK     OK   FAIL
    

    The results after changing the IP are as predicted; bringing up the unused interface causes failures.

       connectivity (server at 10.0.11.15, hashing uses Eth1/7)
           client 1 - server      OK    FAIL    OK    OK
           client 2 - server      OK    FAIL    OK    OK
    
    • Zoredache
      Zoredache over 10 years
      Do you have any other hosts using Virtual Port Channel that are working? It might be useful if you post your VPC config from the switches. Your Linux config looks valid.
    • Tolli
      Tolli over 10 years
      Not on this switch, no, and all my LAG config so far has been between Cisco and/or Arista devices - never touched this on Linux before. Will add VPC config to original question. What I'll try tomorrow is to isolate further by making this a standard port-channel on only one switch, i.e. no VPC. Then it's just a matter of finding out the proper way to trick the LB algorithm to test properly.
    • Zoredache
      Zoredache over 10 years
      You can, and should edit additional details into your question instead of trying to put them in a comment.
    • Tolli
      Tolli over 10 years
      Just figured - hitting Enter was playing tricks on me. :)
    • Tolli
      Tolli over 10 years
    As seen in the latest edit, I reproduced this with a normal port-channel. While connected to a single switch, with the host in mode 4, doing ifdown on one of the interfaces leaves the Nexus still seeing the link as up and including it in the port-channel. What I also did was change from mode 4 to mode 2 (balance-xor). After that change, I'm not experiencing the failures described in the OP. The ifdown issue described here is still the same, though. For the application in question, mode 2 will probably work just as well. However, I would obviously love to get it working properly with mode 4.
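
    For reference, the only change needed on the Linux side for that test is the bond mode in /etc/network/interfaces; balance-xor pairs with a static port-channel on the switch (plain channel-group, no LACP). Something like:

        # excerpt from the bond0 stanza; everything else stays as in the OP
        # mode 2 is the numeric form of balance-xor; bond-lacp-rate only applies to 802.3ad
        bond-mode 2
        bond-miimon 100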
  • Tolli
    Tolli over 10 years
    Using bond-slaves none changes nothing. bond-primary, AFAICT, is only relevant for active-backup, not for mode 4/802.3ad. What version are you running? This guide shows two variants for Ubuntu, and this thread refers to config changes as well. Also, the guide at backdrift.org shows a different syntax for modprobe.d/bonding. I did try mixing all the options (primary and not, different bonding syntax, slaves none and not) - no luck.
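
    For completeness, the modprobe.d style referred to above is roughly the following (module options instead of the per-interface bond-* settings; the values mirror the config in the question):

        # /etc/modprobe.d/bonding.conf
        alias bond0 bonding
        options bonding mode=4 miimon=100 lacp_rate=1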
  • hookenz
    hookenz over 10 years
    Tolli - I'm actually using this setup in mode 4 and it works perfectly. When I first looked at a bonding setup under Ubuntu I was using 10.04, and there seemed to be bugs in the networking scripts supporting it. This setup was the only one that worked for me at the time, and that's why I stuck with it. Have you tried bonding on the same switch rather than across switches?
  • Tolli
    Tolli over 10 years
    Yes, I also tried hooking two ports to the same switch, with the same results. See the latest addition to the OP. I was also having problems with the scripts in 12.04; the most reliable approach for me when changing parameters in this setup is "ifdown eth4; ifdown eth5; ifdown bond0; rmmod bonding; ifup eth4; ifup eth5; ifup bond0" - fun. As much as I hate not solving things, perhaps leaving this in balance-xor mode will be the outcome for now. I'd like to try this on another server in a separate environment as well, though; I have another similar project coming up.
  • hookenz
    hookenz over 10 years
    Does your server happen to have dual 1GbE NICs as well? Try it with those, or with any server you might have lying around that has 1GbE NICs. This will isolate the issue to either the server or the switch. I'm inclined to think it might be a setup issue on your switch. Also, does your switch have the latest firmware installed?
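
    A quick way to compare NICs on the Linux side is to check the driver and firmware per port, e.g.:

        # prints driver, driver version and firmware-version for each port
        ethtool -i eth4
        ethtool -i eth5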