bnx2 and e1000e drivers on RHEL 5.3 detect repeated link loss


This is odd. Since you are experiencing loss on both NICs, I would suspect that rules out a NIC-specific firmware issue, a kernel driver issue, or faulty hardware (except with respect to the motherboard), although the logs you have posted are specific to bnx2. Have you verified that other machines with the same hardware profile connected to this same switch are not exhibiting the same problem? You should try hard-coding the NICs to 100 Mbit/full, and the switch ports to match, and, as silly as it sounds, check for faulty cabling. Finally, if resources permit, why not try hooking that machine up to a third-party switch (like a Netgear or something equally innocuous)?
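
If you do pin the speed, on the Linux side something like this does it (a sketch; eth0 stands in for whichever port you lock, and the switch port must be configured to match or you will end up with a duplex mismatch):

    ethtool -s eth0 speed 100 duplex full autoneg off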

If multiple nodes are experiencing link loss simultaneously, I would go as far as to say that you may have a spanning tree error that is repeatedly causing your switch to fail and re-converge. Any more information about the topology would help diagnose the issue.
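
If the switch is a managed Cisco (an assumption on my part; most vendors have an equivalent), the spanning-tree topology-change history is a quick way to confirm or rule this out:

    switch# show spanning-tree summary
    switch# show spanning-tree detail | include ago|occur

A topology-change count that climbs in step with the link drops would point at STP re-convergence.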

Comments

  • nickthecook over 1 year

    UPDATE: The problem was faulty hardware on the switch. Thanks to all of you for the good debugging suggestions. Correct answer given to MattyB for suggesting a different switch to see whether the problem persisted.

    Hello serverfault,

    I am attempting to debug an issue on several nodes that are repeatedly detecting link loss for 1-2 minutes at a time, when there should be no link loss.

    Servers:
    - HP DL360 G5
    - 1 on-board 2-port Broadcom NetXtreme II BCM5708 Gigabit Ethernet (rev 12) (using bnx2 driver)
    - 1 4-port Intel 82571EB Gigabit Ethernet Controller (Copper) (rev 06) (using e1000e driver)

    Facts:
    - On all nodes, both Broadcom ports and one Intel port are connected to the same switch.
    - UPDATE: Link loss is detected on ports on both NICs, Broadcom and Intel
    - All ports are at Gb/s speed, except the Intel ports on two of the nodes, which are at 100Mb/s speed. All speeds set using auto-negotiation.
    - All nodes were recently upgraded from RHEL 5.0 to RHEL 5.3.

    I am currently attempting to get access to the switch to force Gbps/full-duplex links. Is there anything else that could be done to diagnose or fix this issue? What further information would be useful?
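
    (Note that on gigabit copper you cannot simply turn auto-negotiation off; 1000BASE-T requires it. A sketch of restricting the advertisement instead, with eth0 standing in for an affected port:

    ethtool -s eth0 advertise 0x020

    where 0x020 is ethtool's bitmask for 1000baseT/Full only, leaving auto-negotiation on.)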

    EDIT: I've run tcpdump on one of the affected interfaces, and all I can see are LLDP packets, and a single IGMP Group Membership Query. I have also set the switch to force all ports to 1000Mbps links, full duplex. Does this indicate that the problem is internal to the node, and not caused by any settings on the switch?
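
    (A capture along these lines shows the link-layer traffic on a flapping interface, with eth1 standing in for one of the affected ports:

    tcpdump -e -n -i eth1

    where -e prints link-level headers and -n skips name resolution, which keeps the trace readable during the short outages.)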

    ====== Log messages ======
    Oct 29 11:30:36 db1 kernel: bnx2: eth1 NIC Copper Link is Down
    Oct 29 11:30:37 db1 kernel: bnx2: eth0 NIC Copper Link is Down
    Oct 29 11:30:39 db1 kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
    Oct 29 11:30:39 db1 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
    Oct 29 11:31:08 db1 kernel: bnx2: eth0 NIC Copper Link is Down
    Oct 29 11:31:10 db1 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
    Oct 29 12:56:41 db1 kernel: bnx2: eth1 NIC Copper Link is Down
    Oct 29 12:56:41 db1 kernel: bnx2: eth0 NIC Copper Link is Down
    Oct 29 12:58:34 db1 kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex
    Oct 29 12:58:34 db1 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
    Oct 29 12:59:02 db1 kernel: bnx2: eth1 NIC Copper Link is Down
    Oct 29 12:59:03 db1 kernel: bnx2: eth0 NIC Copper Link is Down
    Oct 29 12:59:05 db1 kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
    Oct 29 12:59:05 db1 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
    Oct 29 12:59:34 db1 kernel: bnx2: eth0 NIC Copper Link is Down
    Oct 29 12:59:35 db1 kernel: bnx2: eth1 NIC Copper Link is Down
    Oct 29 12:59:37 db1 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON

    ====== ethtool output for all connected interfaces on one node ======
    [root@db1 ~]# ethtool eth0
    Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: g
        Link detected: yes
    [root@db1 ~]# ethtool eth1
    Settings for eth1:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: g
        Link detected: yes
    [root@db1 ~]# ethtool eth2
    Settings for eth2:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: pumbag
        Wake-on: d
        Current message level: 0x00000001 (1)
        Link detected: yes
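
    (It may also be worth watching the low-level counters on the flapping ports between drops; a sketch, noting that the statistic names differ between bnx2 and e1000e:

    ethtool -S eth0 | grep -i -e err -e crc

    A counter that climbs in step with the link drops would point at cabling or a PHY rather than the switch.)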

  • David Corsalini over 14 years
    I know there are some issues with the bnx2 driver. If your copy of RHEL is on a paid subscription, why not contact Red Hat support or the server vendor's support?
  • James over 14 years
    TSO and TOE are totally different things. TSO good, TOE bad.
  • David Corsalini over 14 years
    In my experience, especially where heavy network traffic is involved (like heavily loaded Checkpoint firewall clusters), TSO is as bad as TOE; see the ethtool sketch after these comments for checking and toggling it.
  • nickthecook over 14 years
    Thanks! The problem was indeed the switch. Apparently it eventually rebooted spontaneously, tipping off the guys on-site to the fact that it was having issues. I had been asking them to look at the switch for weeks, but they swore up and down that it was not a switch problem. :P
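
For anyone landing here with similar symptoms: the driver version and offload settings discussed in the comments above can be checked and toggled with ethtool (a sketch; eth0 stands in for the affected interface, and disabling TSO is only worth trying as an experiment):

    ethtool -i eth0           # driver name, version and firmware
    ethtool -k eth0           # current offload settings, including TSO
    ethtool -K eth0 tso off   # disable TCP segmentation offload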