RHEL 6.4: Mode 1 channel bonding not failing over


Solution 1

READ. YOUR. CONFIGS.

And when that fails...

READ. ALL. OUTPUTS.

Do you see what's in ifcfg-bond0? No, do you understand what's in ifcfg-bond0?
What in the world of slippery penguins is miimmon=100?
Oh I'm sorry, did you mean miimon=100?

Yeah, I think you meant miimon and not miimmon.

Also, a big giveaway is that when you restart your network service you see this:

service network restart
Shutting down interface bond0:                             [  OK  ]
Shutting down loopback interface:                          [  OK  ]
Bringing up loopback interface:                            [  OK  ]
Bringing up interface bond0:  ./network-functions: line 446: /sys/class/net/bond0/bonding/miimmon: No such file or directory
./network-functions: line 446: /sys/class/net/bond0/bonding/miimmon: No such file or directory
                                                           [  OK  ]

Pay careful attention to everything you type, and when you make your inevitable typing mistake, pay careful attention to every output that you see.
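
For the record, the fix is exactly one character. Assuming everything else in ifcfg-bond0 stays as posted in the question, the corrected option is:

BONDING_OPTS="mode=1 miimon=100"

After another service network restart, the sysfs entry should exist and echo the value back:

cat /sys/class/net/bond0/bonding/miimon
100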

You are a bad person and you should feel bad.

Solution 2

Add the bonding option downdelay=xxxx (in milliseconds), which actually takes a slave out of service after its link has been detected as failed, and set the primary slave to the remaining NIC. If this parameter is not in BONDING_OPTS, the bond detects the failure (because you include miimon=yyyy) but it never fails eth0 over to the other slave. You can watch this happening by looking at the /proc/net/bonding/bondX file.
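
As a sketch only (the millisecond value is illustrative, and the driver rounds downdelay to a multiple of miimon), the resulting options and the file to watch would look something like:

BONDING_OPTS="mode=1 miimon=100 downdelay=200 primary=eth0"

cat /proc/net/bonding/bond0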

Anyway, on RHEL 6.3 (almost the same version as yours) we are having several other bonding problems related to failing back and to duplicate MAC addresses seen from the switch.

good luck.

Solution 3

Try specifying one of the NICs as the primary slave.

DEVICE=bond0
IPADDR=192.168.11.222
GATEWAY=192.168.11.1
NETMASK=255.255.255.0
DNS1=192.168.11.1
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
BONDING_OPTS="mode=1 miimon=100 primary=eth0"

More documentation from RH:

primary= Specifies the interface name, such as eth0, of the primary device. The primary device is the first of the bonding interfaces to be used and is not abandoned unless it fails. This setting is particularly useful when one NIC in the bonding interface is faster and, therefore, able to handle a bigger load. This setting is only valid when the bonding interface is in active-backup mode. Refer to /usr/share/doc/kernel-doc-/Documentation/networking/bonding.txt for more information.
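
If you want to try the setting without another round of config edits, the bonding driver also exposes it through sysfs at runtime (it only takes effect in active-backup mode, and it is not persistent, so keep the BONDING_OPTS change as well):

echo eth0 > /sys/class/net/bond0/bonding/primary
cat /sys/class/net/bond0/bonding/primary
eth0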


Comments

  • Wesley
    Wesley over 1 year

    I'm running RHEL 6.4, kernel-2.6.32-358.el6.i686, on an HP ML 350 G5 with two onboard Broadcom NetXtreme II BCM5708 1000Base-T NICs. My goal is to channel bond the two interfaces into a mode=1 failover pair.

    My problem is that in spite of all evidence that the bond is set up and accepted, pulling the cable out of the primary NIC causes all communication to cease.

    ifcfg-eth0 and ifcfg-eth1

    First, ifcfg-eth0:

    DEVICE=eth0
    HWADDR=00:22:64:F8:EF:60
    TYPE=Ethernet
    UUID=99ea681d-831b-42a7-81be-02f71d1f7aa0
    ONBOOT=yes
    NM_CONTROLLED=yes
    BOOTPROTO=none
    MASTER=bond0
    SLAVE=yes
    

    Next, ifcfg-eth1:

    DEVICE=eth1
    HWADDR=00:22:64:F8:EF:62
    TYPE=Ethernet
    UUID=92d46872-eb4a-4eef-bea5-825e914a5ad6
    ONBOOT=yes
    NM_CONTROLLED=yes
    BOOTPROTO=none
    MASTER=bond0
    SLAVE=yes
    

    ifcfg-bond0

    My bond's config file:

    DEVICE=bond0
    IPADDR=192.168.11.222
    GATEWAY=192.168.11.1
    NETMASK=255.255.255.0
    DNS1=192.168.11.1
    ONBOOT=yes
    BOOTPROTO=none
    USERCTL=no
    BONDING_OPTS="mode=1 miimmon=100"
    

    /etc/modprobe.d/bonding.conf

    I have an /etc/modprobe.d/bonding.conf file that is populated thusly:

    alias bond0 bonding
    

    ip addr output

    The bond is up and I can access the server's public services through the bond's IP address:

    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
        inet6 ::1/128 scope host 
           valid_lft forever preferred_lft forever
    2: eth0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
        link/ether 00:22:64:f8:ef:60 brd ff:ff:ff:ff:ff:ff
    3: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
        link/ether 00:22:64:f8:ef:60 brd ff:ff:ff:ff:ff:ff
    4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP 
        link/ether 00:22:64:f8:ef:60 brd ff:ff:ff:ff:ff:ff
        inet 192.168.11.222/24 brd 192.168.11.255 scope global bond0
        inet6 fe80::222:64ff:fef8:ef60/64 scope link 
           valid_lft forever preferred_lft forever
    

    Bonding Kernel Module

    ...is loaded:

    # cat /proc/modules | grep bond
    bonding 111135 0 - Live 0xf9cdc000
    

    /sys/class/net

    The /sys/class/net filesystem shows good things:

    cat /sys/class/net/bonding_masters 
    bond0
    cat /sys/class/net/bond0/operstate 
    up
    cat /sys/class/net/bond0/slave_eth0/operstate 
    up
    cat /sys/class/net/bond0/slave_eth1/operstate 
    up
    cat /sys/class/net/bond0/type 
    1
    

    /var/log/messages

    Nothing of concern appears in the log file. In fact, everything looks rather happy.

    Jun 15 15:47:28 rhsandbox2 kernel: Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
    Jun 15 15:47:28 rhsandbox2 kernel: bonding: bond0: setting mode to active-backup (1).
    Jun 15 15:47:28 rhsandbox2 kernel: bonding: bond0: setting mode to active-backup (1).
    Jun 15 15:47:28 rhsandbox2 kernel: bonding: bond0: setting mode to active-backup (1).
    Jun 15 15:47:28 rhsandbox2 kernel: bonding: bond0: setting mode to active-backup (1).
    Jun 15 15:47:28 rhsandbox2 kernel: bonding: bond0: Adding slave eth0.
    Jun 15 15:47:28 rhsandbox2 kernel: bnx2 0000:03:00.0: eth0: using MSI
    Jun 15 15:47:28 rhsandbox2 kernel: bonding: bond0: making interface eth0 the new active one.
    Jun 15 15:47:28 rhsandbox2 kernel: bonding: bond0: first active interface up!
    Jun 15 15:47:28 rhsandbox2 kernel: bonding: bond0: enslaving eth0 as an active interface with an up link.
    Jun 15 15:47:28 rhsandbox2 kernel: bonding: bond0: Adding slave eth1.
    Jun 15 15:47:28 rhsandbox2 kernel: bnx2 0000:05:00.0: eth1: using MSI
    Jun 15 15:47:28 rhsandbox2 kernel: bonding: bond0: enslaving eth1 as a backup interface with an up link.
    Jun 15 15:47:28 rhsandbox2 kernel: 8021q: adding VLAN 0 to HW filter on device bond0
    Jun 15 15:47:28 rhsandbox2 kernel: bnx2 0000:03:00.0: eth0: NIC Copper Link is Up, 1000 Mbps full duplex
    Jun 15 15:47:28 rhsandbox2 kernel: bnx2 0000:05:00.0: eth1: NIC Copper Link is Up, 1000 Mbps full duplex
    

    So what's the problem?!

    Yanking the network cable from eth0 causes all communication to go dark. What could the problem be and what further steps should I take to troubleshoot this?

    EDIT:

    Further Troubleshooting:

    The network is a single subnet, single VLAN provided by a ProCurve 1800-8G switch. I have added primary=eth0 to ifcfg-bond0 and restarted networking services, but that has not changed any behavior. I checked /sys/class/net/bond0/bonding/primary both before and after adding primary=eth0, and it has a null value, which I'm not sure is good or bad.

    Tailing /var/log/messages when eth0 has its cable removed shows nothing more than:

    Jun 15 16:51:16 rhsandbox2 kernel: bnx2 0000:03:00.0: eth0: NIC Copper Link is Down
    Jun 15 16:51:24 rhsandbox2 kernel: bnx2 0000:03:00.0: eth0: NIC Copper Link is Up, 1000 Mbps full duplex
    

    I added use_carrier=0 to ifcfg-bond0's BONDING_OPTS to enable the use of MII/ETHTOOL ioctls. After restarting the network service, there was no change in symptoms. Pulling the cable from eth0 causes all network communication to cease. Once again, no errors in /var/log/messages save for the notification that the link on that port went down.

    • Andy Shinn
      Andy Shinn almost 11 years
      Can you add some more information, such as the make/model of the switch it is connected to, any VLAN setup on the switch, the bond slave states, and /var/log/messages after the cable to eth0 is unplugged?
    • Wesley
      Wesley almost 11 years
      @AndyShinn The switch that it is directly connected to is a ProCurve 1800-8G. There are no VLANs on the network. It's a simple single subnet, single VLAN network.
    • Wesley
      Wesley almost 11 years
      @AndyShinn Ah, and also the bond slave states are both reported as up. Tailing /var/log/messages at the time of eth0 being unplugged only shows that the copper link has been unplugged. No messages from the bonding module.
  • Wesley
    Wesley almost 11 years
    Before I edited ifcfg-bond0 I checked /sys/class/net/bond0/bonding/primary and the response was blank. I added primary=eth0 to ifcfg-bond0 and restarted the network service. There is no change in the symptom and no change to /sys/class/net/bond0/bonding/primary. Thanks for the suggestion though!
  • dmourati
    dmourati almost 11 years
    Try adding use_carrier=0? See the RH doc above for details.
  • Wesley
    Wesley almost 11 years
    Done - added the information to the question. There was no change in behavior, but that's a good option to know about.
  • voretaq7
    voretaq7 almost 11 years
    BAD CAT! sprays with hose