Why does my gigabit bond not deliver at least 150 MB/s throughput?


I had a similar problem trying to raise the speed of a DRBD synchronization over two gigabit links some time ago. In the end I managed to get a sync speed of about 150 MB/s. These were the settings that I applied on both nodes:

ifconfig bond0 mtu 9000
ifconfig bond0 txqueuelen 10000
echo 3000 > /proc/sys/net/core/netdev_max_backlog
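
If your distribution has deprecated ifconfig, the same settings can also be applied with iproute2 and sysctl (a sketch, assuming the bond device is named bond0):

ip link set dev bond0 mtu 9000
ip link set dev bond0 txqueuelen 10000
sysctl -w net.core.netdev_max_backlog=3000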

You could also try enabling interrupt coalescence for your network cards, if you haven't already (with ethtool --coalesce).
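
For example, to inspect and set coalescing (a sketch; eth0 and the microsecond values are placeholders to adapt to your hardware):

# show the current coalescing settings
ethtool -c eth0
# delay RX/TX interrupts so several frames are handled per interrupt
ethtool -C eth0 rx-usecs 125 tx-usecs 125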


Comments

  • Nils
    Nils almost 2 years

    I directly connected two PowerEdge 6950 servers back to back (using straight cables, no switch in between) on two different PCIe adapters.

    I get a gigabit link on each of these lines (1000 Mbit/s, full duplex, flow control in both directions).

    Now I am trying to bond these interfaces into bond0 using the rr algorithm on both sides (I want to get 2000 Mbit/s for a single IP session).

    When I tested the throughput by transferring /dev/zero to /dev/null using dd bs=1M and netcat in TCP mode, I got a throughput of 70 MB/s, not, as expected, more than 150 MB/s.
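
    For reference, the test pipeline looked roughly like this (a sketch; host name, port and byte count are placeholders, and the listen syntax differs between netcat variants):

    # on the receiver
    nc -l -p 5001 > /dev/null
    # on the sender
    dd if=/dev/zero bs=1M count=8192 | nc receiver 5001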

    When I use the lines individually, I get about 98 MB/s on each line if each line carries traffic in a different direction. With traffic going in the "same" direction on both lines, I get 70 MB/s and 90 MB/s.

    After reading through the bonding-readme (/usr/src/linux/Documentation/networking/bonding.txt) I found the following section to be useful: (13.1.1 MT Bonding Mode Selection for Single Switch Topology)

        balance-rr: This mode is the only mode that will permit a single
        TCP/IP connection to stripe traffic across multiple
        interfaces. It is therefore the only mode that will allow a
        single TCP/IP stream to utilize more than one interface's
        worth of throughput. This comes at a cost, however: the
        striping often results in peer systems receiving packets out
        of order, causing TCP/IP's congestion control system to kick
        in, often by retransmitting segments.

        It is possible to adjust TCP/IP's congestion limits by
        altering the net.ipv4.tcp_reordering sysctl parameter. The
        usual default value is 3, and the maximum useful value is 127.
        For a four interface balance-rr bond, expect that a single
        TCP/IP stream will utilize no more than approximately 2.3
        interface's worth of throughput, even after adjusting
        tcp_reordering.
    
        Note that this out of order delivery occurs when both the
        sending and receiving systems are utilizing a multiple
        interface bond.  Consider a configuration in which a
        balance-rr bond feeds into a single higher capacity network
        channel (e.g., multiple 100Mb/sec ethernets feeding a single
        gigabit ethernet via an etherchannel capable switch).  In this
        configuration, traffic sent from the multiple 100Mb devices to
        a destination connected to the gigabit device will not see
        packets out of order.  However, traffic sent from the gigabit
        device to the multiple 100Mb devices may or may not see
        traffic out of order, depending upon the balance policy of the
        switch.  Many switches do not support any modes that stripe
        traffic (instead choosing a port based upon IP or MAC level
        addresses); for those devices, traffic flowing from the
        gigabit device to the many 100Mb devices will only utilize one
        interface.
    

    Now I changed that parameter on both connected servers on all lines (4) from 3 to 127.
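
    For the record, the change itself is a runtime sysctl (persist it in /etc/sysctl.conf to survive reboots):

    sysctl -w net.ipv4.tcp_reordering=127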

    After bonding again I get about 100 MB/s but still not more than that.

    Any ideas why?

    Update: Hardware details from lspci -v:

    24:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
            Subsystem: Intel Corporation PRO/1000 PT Dual Port Server Adapter
            Flags: bus master, fast devsel, latency 0, IRQ 24
            Memory at dfe80000 (32-bit, non-prefetchable) [size=128K]
            Memory at dfea0000 (32-bit, non-prefetchable) [size=128K]
            I/O ports at dcc0 [size=32]
            Capabilities: [c8] Power Management version 2
            Capabilities: [d0] MSI: Mask- 64bit+ Count=1/1 Enable-
            Capabilities: [e0] Express Endpoint, MSI 00
            Kernel driver in use: e1000
            Kernel modules: e1000
    

    Update final results:

    8589934592 bytes (8.6 GB) copied, 35.8489 seconds, 240 MB/s

    I changed a lot of TCP/IP and low-level driver options. This includes enlarging the network buffers. This is why dd now shows numbers greater than 200 MB/s: dd terminates while there is still output waiting to be transferred (in send buffers).

    Update 2011-08-05: Settings that were changed to achieve the goal (/etc/sysctl.conf):

    # See http://www-didc.lbl.gov/TCP-tuning/linux.html
    # raise TCP max buffer size to 16 MB. default: 131071
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    # raise autotuning TCP buffer limits
    # min, default and max number of bytes to use
    # Defaults:
    #net.ipv4.tcp_rmem = 4096 87380 174760
    #net.ipv4.tcp_wmem = 4096 16384 131072
    # Tuning:
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216
    # Default: Backlog 300
    net.core.netdev_max_backlog = 2500
    #
    # Oracle-DB settings:
    fs.file-max = 6815744
    fs.aio-max-nr = 1048576
    net.ipv4.ip_local_port_range = 9000 65500
    kernel.shmmax = 2147659776
    kernel.sem = 1250 256000 100 1024
    net.core.rmem_default = 262144
    net.core.wmem_default = 262144
    #
    # Tuning for network-bonding according to bonding.txt:
    net.ipv4.tcp_reordering=127
    

    Special settings for the bond-device (SLES: /etc/sysconfig/network/ifcfg-bond0):

    MTU='9216'
    LINK_OPTIONS='txqueuelen 10000'
    

    Note that setting the biggest possible MTU was the key to the solution.
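
    To verify that jumbo frames actually pass end to end, a non-fragmenting ping can be used (a sketch; the peer address is a placeholder, and 9188 = 9216 minus 20 bytes IPv4 header minus 8 bytes ICMP header):

    ping -M do -s 9188 -c 3 192.168.1.2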

    Tuning of the rx/tx buffers of the involved network cards:

    /usr/sbin/ethtool -G eth2 rx 2048 tx 2048
    /usr/sbin/ethtool -G eth4 rx 2048 tx 2048
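
    The maximum ring sizes the hardware supports can be checked beforehand with the show variant:

    /usr/sbin/ethtool -g eth2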
    
    • Zoredache
      Zoredache almost 13 years
      Have you checked /proc/net/bonding/bond0 to verify that you are actually getting set into balance-rr? Did you see the note in that documentation you pasted about a four-interface bond only giving you 2.3 interfaces' worth of throughput? Given that note, it seems highly unlikely that you will get close to the 2000 Mbit/s you want.
    • Kedare
      Kedare almost 13 years
      I'm not sure that LACP/bonding can divide a single TCP session across multiple physical links.
    • user2751502
      user2751502 almost 13 years
      @Kedare, this isn't LACP, this is the Linux bonding module's own round-robin packet scheduler, which can utilize multiple links for a single TCP session.
    • Steve Townsend
      Steve Townsend almost 13 years
      A better way of testing throughput on a link is to use nuttcp. Test single connections or multiple connections easily.
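
      A minimal nuttcp run, for illustration (the peer address is a placeholder):

      # on the receiver
      nuttcp -S
      # on the sender: 10-second TCP throughput test
      nuttcp -T10 192.168.1.2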
  • Nils
    Nils almost 13 years
    There is no network device involved. These are direct crossover cables.
  • Chopper3
    Chopper3 almost 13 years
    Ah, so you're out of luck for another entirely different reason then; LACP/Etherchannel trunks such as this rely on the variance in the first (and where appropriate second and third) least significant bit of the destination MAC to define which trunk member is used to communicate over to that MAC. Given you'll only have one MAC for the trunk on each end they'll never use more than one link then either.
  • the-wabbit
    the-wabbit almost 13 years
    He's not using EtherChannel / 802.3ad, he is using balance-rr, which, to be exact, does not even require any switch support.
  • Nils
    Nils almost 13 years
    @Chopper3: So the MAC issue should not appear in RR in your opinion?
  • Chopper3
    Chopper3 almost 13 years
    Don't know that well enough to comment, kinda wished you'd mentioned that stuff earlier but never mind.
  • Nils
    Nils almost 13 years
    I already got 160 MB/s using the concurrent single lines. But this drops to 100 MB/s upon bonding. On each single line I get nearly 100 MB/s, so the cables do not seem to be the problem either.
  • SpacemanSpiff
    SpacemanSpiff almost 13 years
    This should reduce CPU load, right? I wonder what the CPU is doing during these tests.
  • user48838
    user48838 almost 13 years
    There does not appear to be any PCIe support for the PowerEdge 6950. Anything "different" with its PCI bus? Notwithstanding, you might look up the IO bus specifications for the PowerEdge 6950.
  • Julien Vehent
    Julien Vehent almost 13 years
    With an MTU of 9000 instead of 1500, you reduce the number of TCP data packets you need to transfer the same amount of data (the payload is bigger). So you do less packet processing, on both sides and in both directions, and send more data.
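
    A rough back-of-the-envelope comparison (assuming 40 bytes of TCP/IPv4 headers per packet):

    # packets needed to move 1 GiB of payload
    echo $(( 1073741824 / (1500 - 40) ))   # MTU 1500: ~735,000 packets
    echo $(( 1073741824 / (9000 - 40) ))   # MTU 9000: ~120,000 packets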
  • Nils
    Nils almost 13 years
    This looks like it is worth a try. The CPUs are pretty idle during transfer. But I still have the feeling that one physical link is waiting for an ACK before the kernel sends the next packet on the other physical link.
  • Julien Vehent
    Julien Vehent almost 13 years
    I'm curious about the result too. Also, try to bind each NIC to a CPU core. A recent kernel should handle that properly, but I'm not sure how it would work with bonding. The idea is to avoid switching from one L2 cache to another for every packet.
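
    One way to pin a NIC to a core is via IRQ affinity masks (a sketch; the IRQ number and CPU mask are placeholders, and irqbalance may need to be stopped so it does not overwrite them):

    # find the NIC's IRQ number
    grep eth2 /proc/interrupts
    # pin that IRQ to CPU 1 (mask 0x2)
    echo 2 > /proc/irq/24/smp_affinity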
  • user842313
    user842313 almost 13 years
    I don't know. It was not needed in my case. Setting those parameters was enough. But I guess if you set it, it won't hurt. Did the transfer rate improve?
  • Nils
    Nils almost 13 years
    I currently can't test that, but it most probably will. Your hint about "coalescence" probably hits the mark. I found an interesting article (in German) about "High Speed Ethernet" settings. The jumbo frames go in the same direction: it is all about reducing the number of PCI interrupts needed to transfer the workload.
  • user842313
    user842313 almost 13 years
    If you suspect some hardware bottleneck like an interrupt limit, a tool like collectd will definitely help, although it requires a bit of setup. See, for example, this graph.
  • Nils
    Nils almost 13 years
    CPU load is not a problem. All offload options are turned on...
  • Nils
    Nils almost 13 years
    There are no network devices involved in this special case (direct crossover lines). This is also the only (real) case where you can use the RR algorithm to get the load shared across all lines for a single session.
  • Nils
    Nils almost 13 years
    I updated the question with the output of lspci. This was not the bottleneck. I get my 200 MB/s now.