Link Aggregation (Bonding) for bandwidth does not work when Link Aggregation Groups (LAG) are set on a smart switch


Solution 1

As I mentioned in my final edit, the reason I am not able to get higher bandwidth using round-robin bonding when the switch has Link Aggregation Groups set is that switch Link Aggregation Groups do not do round-robin striping of packets within a single TCP connection, whereas Linux bonding does. This is mentioned in the kernel.org docs:

https://www.kernel.org/doc/Documentation/networking/bonding.txt

12.1.1 MT Bonding Mode Selection for Single Switch Topology

This configuration is the easiest to set up and to understand, although you will have to decide which bonding mode best suits your needs. The trade offs for each mode are detailed below:

balance-rr: This mode is the only mode that will permit a single TCP/IP connection to stripe traffic across multiple interfaces. It is therefore the only mode that will allow a single TCP/IP stream to utilize more than one interface's worth of throughput. This comes at a cost, however: the striping generally results in peer systems receiving packets out of order, causing TCP/IP's congestion control system to kick in, often by retransmitting segments.

It is possible to adjust TCP/IP's congestion limits by altering the net.ipv4.tcp_reordering sysctl parameter. The usual default value is 3. But keep in mind TCP stack is able to automatically increase this when it detects reorders.

Note that the fraction of packets that will be delivered out of order is highly variable, and is unlikely to be zero. The level of reordering depends upon a variety of factors, including the networking interfaces, the switch, and the topology of the configuration. Speaking in general terms, higher speed network cards produce more reordering (due to factors such as packet coalescing), and a "many to many" topology will reorder at a higher rate than a "many slow to one fast" configuration.

Many switches do not support any modes that stripe traffic (instead choosing a port based upon IP or MAC level addresses); for those devices, traffic for a particular connection flowing through the switch to a balance-rr bond will not utilize greater than one interface's worth of bandwidth.

If you are utilizing protocols other than TCP/IP, UDP for example, and your application can tolerate out of order delivery, then this mode can allow for single stream datagram performance that scales near linearly as interfaces are added to the bond.

This mode requires the switch to have the appropriate ports configured for "etherchannel" or "trunking."

The last note about having ports configured for "trunking" is odd: when I put the ports in a LAG, all outgoing Tx from the switch goes down a single port. Removing the LAG makes the switch send and receive half and half on each port, but results in many retransmissions, which I assume are due to out-of-order packets. However, I still get an increase in bandwidth.
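
One way to see the same thing from the Linux side is to watch the per-slave counters of the bond while a transfer is running. A minimal sketch, assuming the bond and slave names used in the question (bond0, ens11f0, ens11f1):

    # Per-slave byte counters, refreshed every second; run this during an
    # iperf3 test to see whether traffic is striped across both ports or
    # stuck on a single one
    watch -n 1 'for i in ens11f0 ens11f1; do
        echo "$i rx=$(cat /sys/class/net/$i/statistics/rx_bytes) tx=$(cat /sys/class/net/$i/statistics/tx_bytes)"
    done'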

Solution 2

There are several points in your text that I think I can clarify a bit:

  • The fact that you so casually mention switching between normal and jumbo frames worries me. You cannot mix jumbo frames and regular frames in the same network/netblock: either that entire network transmits jumbo frames or it transmits normal frames, and that means all interfaces in that network.
  • If you have an aggregated link, you have to have it on both sides, on both the switch and the system; otherwise nasty things can and will happen. With luck, in the best case the switch will detect a loop and simply disable one of the links.
  • If you want speed, you probably want link aggregation, not load balancing.
  • A single UDP connection, and especially a single TCP connection, will not scale much past a certain threshold; you need to test multiple simultaneous connections. iperf lets you do that (see the sketch after this list).
  • At those speeds, you might be hitting other limiting factors when dealing with link aggregation on two links vs. one, notably interrupt handling.
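
For the multiple-connections point, a minimal sketch with iperf3 (the host name machine1 is taken from the question; stream count and duration are arbitrary):

    # On the receiving machine
    iperf3 -s

    # On the sending machine: 4 parallel TCP streams for 30 seconds
    iperf3 -c machine1 -P 4 -t 30

With a layer3+4 hash policy (balance-xor or a switch LAG), the parallel streams get different source ports and can spread across both links, which a single stream cannot do outside balance-rr.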

As for the switch, I do not know TP-LINK well, and it is off-topic here to get into switch specifics. I will just leave the idea that, if you are working professionally, you would be better off using more top-tier gear for better results with the more esoteric functionality or in high-performance networks.

See the related questions how to know if my servers should use jumbo frames (MTU) and Can jumbo frames - MTU=9000 be set on VM machines?

As for mixing 9000 and 1500 in the same VLAN/group of interfaces:

If the server transmits a packet to the client that is greater than 1500 bytes in the given configuration, it will simply be dropped and not processed, which is different to fragmentation

From Server Fault:

Make sure that your NICs exist in separate netblocks when doing this. If you use Linux, packets are routed via the first NIC in the system in the netblock, so, even though eth1 has an MTU of 9000, it could end up routing those packets through eth0.

We set up a separate VLAN to our storage network and had to set up a separate netblock on eth1 to avoid that behavior. Increasing the MTU to 9000 easily increased throughput as that particular system deals with streaming a number of rather large files.
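
Two quick checks tie these quotes together, assuming an address layout like the question's 192.168.0.x storage network (the peer address below is illustrative): confirm which interface the kernel actually routes through, and confirm that jumbo frames pass end-to-end without fragmentation.

    # Which interface and source address will be used to reach the peer?
    ip route get 192.168.0.10

    # Does a full-size jumbo frame survive unfragmented?
    # 8972 = 9000 (MTU) - 20 (IP header) - 8 (ICMP header)
    ping -M do -s 8972 -c 3 192.168.0.10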


Comments

  • rveale (almost 2 years ago)

    My question is: why does setting Link Aggregation Groups on the smart switch lower the bandwidth between two machines?

    I have finally achieved higher throughput (bandwidth) between two machines (servers running Ubuntu 18.04 Server) connected via 2 bonded 10G CAT7 cables through a TP-LINK T1700X-16TS smart switch. The cables are connected to a single Intel X550-T2 NIC in each machine (each card has 2 RJ45 ports), which is plugged into a PCI-E x8 slot.

    The first thing I did was create static LAG groups in the switch's configuration, each containing the two ports that a machine was connected to. This ended up being my first mistake.

    On each box, I created a bond containing the two ports on the Intel X550-T2 card. I am using netplan (and networkd), e.g.:

    network:
      version: 2          # the version key is required by netplan
      renderer: networkd
      ethernets:
        ens11f0:
          dhcp4: no
          optional: true
        ens11f1:
          dhcp4: no
          optional: true
      bonds:
        bond0:
          mtu: 9000 #1500
          dhcp4: no
          interfaces: [ens11f0, ens11f1]
          addresses: [192.168.0.10/24]
          parameters:
            mode: balance-rr
            transmit-hash-policy: layer3+4 #REV: only good for xor ?
            mii-monitor-interval: 1
            packets-per-slave: 1
    

    Note the 9000 byte MTU (for jumbo frames) and the balance-rr mode.
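
    A quick way to confirm that the mode and MTU were actually applied after editing the YAML (a small sketch using the interface names above):

    sudo netplan apply                          # re-apply the configuration
    cat /proc/net/bonding/bond0                 # should report "Bonding Mode: load balancing (round-robin)"
    ip link show bond0 | grep -o 'mtu [0-9]*'   # should show mtu 9000 (check ens11f0/ens11f1 too)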

    Given these settings, I can now use iperf (iperf3) to test bandwidth between the machines:

    iperf3 -s             # on machine1 (server)

    iperf3 -c machine1    # on machine2 (client)
    

    I get something like 9.9 Gbits per second (very close to the theoretical max of a single 10G connection).

    Something is wrong though. I'm using round-robin, and I have two 10G cables between the machines (theoretically). I should be able to get 20G bandwidth, right?

    Wrong.

    Weirdly, I next deleted the LAG groups from the smart switch. Now, on the Linux side I have bonded interfaces, but as far as the switch is concerned there are no bonds (no LAG).

    Now I run iperf3 again:

    [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
    [  4]   0.00-1.00   sec  1.77 GBytes  15.2 Gbits/sec  540    952 KBytes       
    [  4]   1.00-2.00   sec  1.79 GBytes  15.4 Gbits/sec  758    865 KBytes       
    [  4]   2.00-3.00   sec  1.84 GBytes  15.8 Gbits/sec  736    454 KBytes       
    [  4]   3.00-4.00   sec  1.82 GBytes  15.7 Gbits/sec  782    507 KBytes       
    [  4]   4.00-5.00   sec  1.82 GBytes  15.6 Gbits/sec  582   1.19 MBytes       
    [  4]   5.00-6.00   sec  1.79 GBytes  15.4 Gbits/sec  773    708 KBytes       
    [  4]   6.00-7.00   sec  1.84 GBytes  15.8 Gbits/sec  667   1.23 MBytes       
    [  4]   7.00-8.00   sec  1.77 GBytes  15.2 Gbits/sec  563    585 KBytes       
    [  4]   8.00-9.00   sec  1.75 GBytes  15.0 Gbits/sec  407    839 KBytes       
    [  4]   9.00-10.00  sec  1.75 GBytes  15.0 Gbits/sec  438    786 KBytes       
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval           Transfer     Bandwidth       Retr
    [  4]   0.00-10.00  sec  17.9 GBytes  15.4 Gbits/sec  6246             sender
    [  4]   0.00-10.00  sec  17.9 GBytes  15.4 Gbits/sec                  receiver
    

    Huh, now I get 15.4 Gbits/sec (sometimes up to 16.0).

    The retransmissions ("Retr" above) worry me (I was getting zero when I had the LAGs set up), but now I am at least getting some bandwidth advantage.

    Note, if I disable jumbo packets or set MTU to 1500, I get only about 4Gbps to 5Gbps.

    Does anyone know why setting the Link Aggregation Groups on the smart switch (which I thought should help) instead limits the performance? And why not setting them (heck, I could have saved my money and bought an unmanaged switch!) lets me send more packets that still get routed correctly?

    What is the point of the switch's LAG groups? Am I doing something wrong somewhere? I would like to increase the bandwidth beyond 16 Gbps if possible.

    edit

    Copying from my comment below (update):

    I verified a real-application 11 Gbps (1.25 GiB/sec) over my bonded connection, using nc (netcat) to copy a 60 GB file from a ramdisk on one system to another. I verified file integrity using a hash; it is the same file on both sides.
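
    A sketch of that kind of transfer (ramdisk path, port, and peer address are illustrative; the nc syntax below is for the OpenBSD netcat that Ubuntu ships):

    # On the receiving machine: listen and write straight into the ramdisk
    nc -l 12345 > /mnt/ramdisk/bigfile

    # On the sending machine: stream the file out of its ramdisk
    nc 192.168.0.10 12345 < /mnt/ramdisk/bigfile

    # Verify integrity on both ends (the hashes must match)
    sha256sum /mnt/ramdisk/bigfile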

    Using only one of the 10G ports at a time (or bonded using balance-xor etc.), I get 1.15 GiB/sec (about 9.9 Gbps). Both iperf and nc use a TCP connection by default. Copying it to the local machine (via loopback) gets a speed of 1.5 GiB/sec. Looking at port usage on the switch, I see roughly equal usage on the sender Tx side (70% in the case of iperf, ~55% in the case of the nc file copy), and equal usage between the 2 bonded ports on the Rx side.

    So, in the current setup (balance-rr, MTU 9000, no LAG groups defined on the switch), I can achieve more than 10Gbps, but only barely.

    Oddly enough, defining LAG groups on the switch now breaks everything (iperf and file transfers now send 0 bytes). It probably just takes time for the switch to figure out the new situation, but I re-ran the tests many times and rebooted/reset the switch several times, so I'm not sure why that is.

    edit 2

    I actually found mention of striping and balance-rr allowing higher than single port bandwidth in the kernel.org docs.

    https://www.kernel.org/doc/Documentation/networking/bonding.txt

    Specifically

    12.1.1 MT Bonding Mode Selection for Single Switch Topology

    This configuration is the easiest to set up and to understand, although you will have to decide which bonding mode best suits your needs. The trade offs for each mode are detailed below:

    balance-rr: This mode is the only mode that will permit a single TCP/IP connection to stripe traffic across multiple interfaces. It is therefore the only mode that will allow a single TCP/IP stream to utilize more than one interface's worth of throughput. This comes at a cost, however: the striping generally results in peer systems receiving packets out of order, causing TCP/IP's congestion control system to kick in, often by retransmitting segments.

    It is possible to adjust TCP/IP's congestion limits by altering the net.ipv4.tcp_reordering sysctl parameter. The usual default value is 3. But keep in mind TCP stack is able to automatically increase this when it detects reorders.

    Note that the fraction of packets that will be delivered out of order is highly variable, and is unlikely to be zero. The level of reordering depends upon a variety of factors, including the networking interfaces, the switch, and the topology of the configuration. Speaking in general terms, higher speed network cards produce more reordering (due to factors such as packet coalescing), and a "many to many" topology will reorder at a higher rate than a "many slow to one fast" configuration.

    Many switches do not support any modes that stripe traffic (instead choosing a port based upon IP or MAC level addresses); for those devices, traffic for a particular connection flowing through the switch to a balance-rr bond will not utilize greater than one interface's worth of bandwidth.

    If you are utilizing protocols other than TCP/IP, UDP for example, and your application can tolerate out of order delivery, then this mode can allow for single stream datagram performance that scales near linearly as interfaces are added to the bond.

    This mode requires the switch to have the appropriate ports configured for "etherchannel" or "trunking."

    So, theoretically, balance-rr will allow me to stripe a single TCP connection's packets. But they may arrive out of order, etc.

    However, it mentions that most switches do not support the striping, which seems to be the case with my switch. Watching traffic during a real file transfer, I see Rx packets (i.e. sending_machine->switch) arrive evenly distributed over both bonded ports. However, Tx packets (switch->receiving_machine) only go out over one of the ports (and reach 90% or more saturation).

    By not explicitly setting up the Link Aggregation groups in the switch, I'm able to achieve higher throughput, but I'm not sure how the receiving machine is telling the switch to send one down one port, next down another, etc.

    Conclusion:

    The switch's Link Aggregation Groups do not support round-robin (i.e. port striping) when sending packets. So, ignoring them allows me to get higher throughput, but the actual writing to memory (ramdisk) seems to hit a memory, CPU-processing, or packet-reordering saturation point.

    I tried increasing/decreasing the reordering limits, as well as the TCP read and write memory buffers, using sysctl, with no change in performance. E.g.

    sudo sysctl -w net.ipv4.tcp_reordering=50
    sudo sysctl -w net.ipv4.tcp_max_reordering=1000

    sudo sysctl -w net.core.rmem_default=800000000
    sudo sysctl -w net.core.wmem_default=800000000
    sudo sysctl -w net.core.rmem_max=800000000
    sudo sysctl -w net.core.wmem_max=800000000

    # tcp_rmem/tcp_wmem take three values (min, default, max); only the max
    # is raised here, min and default stay at the usual kernel defaults
    sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 800000000"
    sudo sysctl -w net.ipv4.tcp_wmem="4096 16384 800000000"
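
    (These sysctl -w changes last only until reboot; to keep them, the same keys can go into a drop-in file, for example:)

    echo "net.ipv4.tcp_reordering = 50"  | sudo tee    /etc/sysctl.d/90-net-tuning.conf
    echo "net.core.rmem_max = 800000000" | sudo tee -a /etc/sysctl.d/90-net-tuning.conf
    sudo sysctl --system                 # reload all sysctl configuration files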
    

    The only change in performance I notice is between machines with:
    1) a stronger processor (slightly higher single-core clock; L3 cache doesn't seem to matter)
    2) faster memory? (or fewer DIMMs for the same amount of memory)

    This seems to imply that I am hitting a bus, CPU, or memory read/write limit. A simple "copy" locally within a ramdisk (e.g. dd if=file1 of=file2 bs=1M) results in an optimal speed of roughly 2.3 GiB/sec at 2.6 GHz, 2.2 GiB/sec at 2.4 GHz, and 2.0 GiB/sec at 2.2 GHz. The second one furthermore has slower memory, but it doesn't seem to matter.
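
    A sketch of that local ramdisk test (mount point and sizes are illustrative; two copies of a 60 GB file need on the order of 120 GB of free RAM):

    sudo mkdir -p /mnt/ramdisk
    sudo mount -t tmpfs -o size=140G tmpfs /mnt/ramdisk
    dd if=/dev/zero of=/mnt/ramdisk/file1 bs=1M count=60000   # ~60 GB test file
    dd if=/mnt/ramdisk/file1 of=/mnt/ramdisk/file2 bs=1M      # dd reports the copy rate when it finishes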

    All TCP copies TO the 2.6 GHz machine's ramdisk from the slower machines go at 1.15 GiB/s; from the 2.4 GHz machine they go at 1.30 GiB/s; from the fastest machine to the middle machine they go at 1.02 GiB/s, and to the slower machine (with faster memory) at 1.03 GiB/s, etc.

    The biggest effect seems to be the single-core CPU clock and the memory clock on the receiving end. I have not compared BIOS settings, but all machines are running the same BIOS versions and use the same motherboards, Ethernet cards, etc. Rearranging CAT7 cables or switch ports does not seem to have an effect.

    I did find

    http://louwrentius.com/achieving-340-mbs-network-file-transfers-using-linux-bonding.html

    which does the same thing with four 1GbE connections. I tried setting up separate VLANs, but it did not work (it did not increase the speed).

    Finally, sending to self using the same method seems to incur a 0.3-0.45 GiB/sec penalty. So, my observed values are not that much lower than the "theoretical" max for this method.

    edit 3 (adding more info for posterity)

    Even with balance-rr and the LAG set on the switch, I just realized that, despite seeing 9.9 Gbps, retries with balance-rr are actually higher than in the case without the LAG! An average of 2500 per second with the groups, versus an average of 1000 without!

    However, with the groups set, I get an average real file-transfer speed, memory to memory, of 1.15 GiB/s (9.9 Gbps). If I only plug in a single port per machine, I see the same speed (1.15 GiB/s) and very few retries. If I switch the mode to balance-xor, I get 1.15 GiB/s (9.9 Gbps) and no resends. So balance-rr mode is trying to stripe on the output-to-switch side of things, and I guess that is causing a lot of out-of-order packets.

    Since my max (real-world) performance for memory-to-memory transfers is similar or higher using the switch LAG and balance-xor, while having fewer resends (congestion), I am using that. However, since the eventual goal is NFS and MPI sends, I will need to somehow find a way to saturate and measure network speed in those situations, which may depend upon how the MPI connections are implemented...

    Final Edit

    I moved back to using balance-rr (with no LAG set on the switch side), since XOR will always hash the same two peers to the same port, so it will only ever use one of the ports. Using balance-rr, if I run 2 or more (RAM to RAM) file transfers simultaneously, I can get a net 18-19 Gbps, quite close to the theoretical max of 20 Gbps.
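
    For reference, a sketch of how that aggregate can be measured with two simultaneous streams, using two iperf3 server/client pairs on different ports (the port numbers are arbitrary):

    # On the receiving machine: two servers on different ports
    iperf3 -s -p 5201 &
    iperf3 -s -p 5202 &

    # On the sending machine: two clients in parallel
    iperf3 -c machine1 -p 5201 -t 30 &
    iperf3 -c machine1 -p 5202 -t 30 &
    wait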

    Final Final Edit (after using for a few months)

    I had to set the LAG groups in the switch after all, because I was getting errors where I could no longer SSH into the machines, I assume because of packets getting confused about where they were supposed to go (some addressing issue). Now I only get the maximum of 10 Gbps per connection, but it is stable.

  • rveale (almost 6 years ago)
    Thank you for the comments, Rui. Clarifications: 1) I don't think it's true that all packets sent must be the same size. I'm pretty sure that if I have a switch X with jumbo frames enabled, and machines A-D connected as A - switchX - B (A and B set to 9000 MTU) and C - switchX - D (C and D set to 1500 MTU), it will work just fine. C and D will communicate with a 1500 byte MTU, and A and B will use 9000 byte packets. 2) Right now, I have it working with aggregation on only one side. I have seen several other reports where that is the case: (45drives.blogspot.com/2015/07/…)
  • rveale (almost 6 years ago)
    Continued: 3) I am doing link aggregation (bonding). By balancing packets sent/received in a round-robin fashion between the two bonded ports, I can theoretically get double the bandwidth. 4) I'm not sure; I've seen people get 40G in a single stream. Why would it stop scaling unless you hit read/write limitations, as you mention in (5), e.g. when copying memory-to-memory between two machines? Regarding the switch: I do research, and it's a compute cluster. All I need is fast links between nodes. Being academic, budget is a limit, so I bought the best switch I could find with the functionality needed.
  • rveale (almost 6 years ago)
    I would appreciate it if you have more information about that. Right now, I have 2 machines set to 1500 and 2 set to 9000 MTU, and I can communicate between all of them. Furthermore, the switch's traffic monitor reports the number of normal-sized and jumbo-sized frames processed on each port. Of course, sending a jumbo packet to a machine with a lower MTU will cause it to be dropped or broken up at the receiving end, I guess...
  • rveale (almost 6 years ago)
    Hm, thank you for those links. At any rate, all the systems on the network now have their MTU set to 9000, since that seems to work best for bandwidth, so this is not an issue.
  • Rui F Ribeiro (almost 6 years ago)
    I doubt you have seen people doing 40 Gbps in a single TCP connection... there are known limitations of the protocol/implementation. We might be giving different names to different things here.
  • rveale (almost 6 years ago)
    I verified 11Gbps (1.25 GiB/sec) over my bonded connection, using nc (netcat) to copy a 60 GB file from a ramdisk on one system to another. Using only one of the 10G ports at a time (or bonded using balance-xor etc.), I get 1.15 GiB/sec (about 9.9 Gbps). Both iperf and nc use a TCP connection by default. Copying it to the local machine (via loopback) gets a speed of 1.5 GiB/sec. Looking at port usage on the switch, I see roughly equal usage on the sender Tx side (70% in the case of iperf, ~55% in the case of the nc file copy), and equal usage between the 2 bonded ports on the Rx side.