bnx2 and e1000e drivers on RHEL 5.3 detect repeated link loss


This is odd. Since you are experiencing loss on both NICs, I would suspect that rules out a NIC-specific firmware issue, a kernel driver issue, or faulty hardware (except with respect to the motherboard), although the logs you have posted are specific to bnx2. Have you verified that other machines with the same hardware profile connected to this same switch are not exhibiting the same problem? You should try hard-coding the NICs to 100 Mbit/full, and the switch ports to match, and, as silly as it sounds, check for faulty cabling. Finally, if resources permit, why not try hooking that machine up to a third-party switch (like a Netgear or something equally innocuous)?
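
If you do pin the speed, on the Linux side something like this does it (a sketch; eth0 stands in for whichever port you lock, and the switch port must be configured to match or you will end up with a duplex mismatch):

    ethtool -s eth0 speed 100 duplex full autoneg off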

If multiple nodes are experiencing link loss simultaneously, I would go as far as to say that you may have a spanning tree error that is repeatedly causing your switch to fail and re-converge. Any more information about the topology would help diagnose the issue.
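
If the switch is a managed Cisco (an assumption on my part; most vendors have an equivalent), the spanning-tree topology-change history is a quick way to confirm or rule this out:

    switch# show spanning-tree summary
    switch# show spanning-tree detail | include ago|occur

A topology-change count that climbs in step with the link drops would point at STP re-convergence.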

Comments

  • nickthecook over 1 year

    UPDATE: The problem was faulty hardware on the switch. Thanks to all of you for the good debugging suggestions. Correct answer given to MattyB for suggesting a different switch to see whether the problem persisted.

    Hello serverfault,

    I am attempting to debug an issue on several nodes that are repeatedly detecting link loss for 1-2 minutes at a time, when there should be no link loss.

    Servers:
    - HP DL360 G5
    - 1 on-board 2-port Broadcom NetXtreme II BCM5708 Gigabit Ethernet (rev 12) (using bnx2 driver)
    - 1 4-port Intel 82571EB Gigabit Ethernet Controller (Copper) (rev 06) (using e1000e driver)

    Facts:
    - On all nodes, both Broadcom ports and one Intel port are connected to the same switch.
    - UPDATE: Link loss is detected on ports on both NICs, Broadcom and Intel
    - All ports are at Gb/s speed, except the Intel ports on two of the nodes, which are at 100Mb/s speed. All speeds set using auto-negotiation.
    - All nodes were recently upgraded from RHEL 5.0 to RHEL 5.3.

    I am currently attempting to get access to the switch to force Gbps/full-duplex links. Is there anything else that could be done to diagnose or fix this issue? What further information would be useful?
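
    (Note that on gigabit copper you cannot simply turn auto-negotiation off; 1000BASE-T requires it. A sketch of restricting the advertisement instead, with eth0 standing in for an affected port:

    ethtool -s eth0 advertise 0x020

    where 0x020 is ethtool's bitmask for 1000baseT/Full only, leaving auto-negotiation on.)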

    EDIT: I've run tcpdump on one of the affected interfaces, and all I can see are LLDP packets, and a single IGMP Group Membership Query. I have also set the switch to force all ports to 1000Mbps links, full duplex. Does this indicate that the problem is internal to the node, and not caused by any settings on the switch?
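
    (A capture along these lines shows the link-layer traffic on a flapping interface, with eth1 standing in for one of the affected ports:

    tcpdump -e -n -i eth1

    where -e prints link-level headers and -n skips name resolution, which keeps the trace readable during the short outages.)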

    ====== Log messages ======
    Oct 29 11:30:36 db1 kernel: bnx2: eth1 NIC Copper Link is Down
    Oct 29 11:30:37 db1 kernel: bnx2: eth0 NIC Copper Link is Down
    Oct 29 11:30:39 db1 kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
    Oct 29 11:30:39 db1 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
    Oct 29 11:31:08 db1 kernel: bnx2: eth0 NIC Copper Link is Down
    Oct 29 11:31:10 db1 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
    Oct 29 12:56:41 db1 kernel: bnx2: eth1 NIC Copper Link is Down
    Oct 29 12:56:41 db1 kernel: bnx2: eth0 NIC Copper Link is Down
    Oct 29 12:58:34 db1 kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex
    Oct 29 12:58:34 db1 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
    Oct 29 12:59:02 db1 kernel: bnx2: eth1 NIC Copper Link is Down
    Oct 29 12:59:03 db1 kernel: bnx2: eth0 NIC Copper Link is Down
    Oct 29 12:59:05 db1 kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
    Oct 29 12:59:05 db1 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
    Oct 29 12:59:34 db1 kernel: bnx2: eth0 NIC Copper Link is Down
    Oct 29 12:59:35 db1 kernel: bnx2: eth1 NIC Copper Link is Down
    Oct 29 12:59:37 db1 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON

    ====== ethtool output for all connected interfaces on one node ======
    [root@db1 ~]# ethtool eth0
    Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: g
        Link detected: yes
    [root@db1 ~]# ethtool eth1
    Settings for eth1:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: g
        Link detected: yes
    [root@db1 ~]# ethtool eth2
    Settings for eth2:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: pumbag
        Wake-on: d
        Current message level: 0x00000001 (1)
        Link detected: yes
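
    (It may also be worth watching the low-level counters on the flapping ports between drops; a sketch, noting that the statistic names differ between bnx2 and e1000e:

    ethtool -S eth0 | grep -i -e err -e crc

    A counter that climbs in step with the link drops would point at cabling or a PHY rather than the switch.)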

  • David Corsalini over 14 years
    I know there are some issues with the bnx2 driver. If your copy of RHEL is on a paid subscription, why not contact Red Hat support or the server vendor's support?
  • James over 14 years
    TSO and TOE are totally different things. TSO good, TOE bad.
  • David Corsalini over 14 years
    In my experience, especially where heavy network traffic is involved (like heavily loaded Checkpoint firewall clusters), TSO is as bad as TOE; see the ethtool sketch after these comments for checking and toggling it.
  • nickthecook over 14 years
    Thanks! The problem was indeed the switch. Apparently it eventually rebooted spontaneously, tipping off the guys on-site to the fact that it was having issues. I had been asking them to look at the switch for weeks, but they swore up and down that it was not a switch problem. :P
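
For anyone landing here with similar symptoms: the driver version and offload settings discussed in the comments above can be checked and toggled with ethtool (a sketch; eth0 stands in for the affected interface, and disabling TSO is only worth trying as an experiment):

    ethtool -i eth0           # driver name, version and firmware
    ethtool -k eth0           # current offload settings, including TSO
    ethtool -K eth0 tso off   # disable TCP segmentation offload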