Cluster failover and strange gratuitous ARP behavior


Solution 1

Do you have the latest cluster hotfixes applied? There are some fairly serious known defects.

A transient communication failure causes a Windows Server 2008 R2 failover cluster to stop working
https://support.microsoft.com/kb/2550886

Slow failover operation if no router exists between the cluster and an application server
https://support.microsoft.com/kb/2582281

"This issue occurs because the TCP/IP stack of the application server incorrectly ignores gratuitous Address Resolution Protocol (ARP) requests."

Solution 2

I've started to see machines getting incorrect ARP table entries for several SQL Server instances in a failover cluster.

Client servers are alternately populating their ARP tables with the MAC address of the correct NIC team and with the MAC address of one of the physical NICs on a different cluster node (not necessarily the NIC that supplies the team MAC on that server).

This is causing intermittent connection failures for clients on the same LAN as the SQL Cluster.

This behavior has been observed on both VM clients and physical boxes.

This occurs after a failover and lasts for days.

To mitigate this, I've had to set static ARP entries on the more troublesome clients.
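
For reference, a minimal sketch of pinning such an entry on an affected Windows client looks like this; the interface name, IP address and MAC address below are placeholders, not values from this environment:

    # Hedged sketch - "Local Area Connection", 192.0.2.50 and the MAC are placeholders.
    # Pin the clustered IP to the team MAC so a stray gratuitous ARP can no longer
    # overwrite it on this client (netsh syntax, Vista/2008 and later):
    netsh interface ipv4 add neighbors "Local Area Connection" 192.0.2.50 00-1b-21-aa-bb-cc

    # The classic arp.exe syntax does roughly the same thing:
    arp -s 192.0.2.50 00-1b-21-aa-bb-cc

    # Remove the static entry again once the underlying problem is fixed:
    netsh interface ipv4 delete neighbors "Local Area Connection" 192.0.2.50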

ENVIRONMENT:

  • Windows 2008 R2 SP1 Servers in a failover cluster
  • SQL Server 2008 R2 Instances
  • Teamed Intel Gigabit NICs
  • HP 28XX switches
  • Virtual Machines hosted on Windows Server 2008 R2 SP1 Hyper-V

The Intel NIC team creates a virtual adapter with the MAC address of one of the physical NICs.

I have a suspicion that the Intel NIC teaming software is the culprit, but any other troubleshooting thoughts or solutions would be appreciated.

I'm likely going to rebuild the cluster hosts with Server 2012 and use the in-box NIC teaming there, as I have not seen this issue in my testing on that platform.
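
For what it's worth, a minimal sketch of the in-box (LBFO) teaming setup I have in mind for Server 2012 is below; the team name, member adapter names and the teaming/load-balancing choices are placeholders for my environment, not a general recommendation:

    # Hedged sketch for a Server 2012 rebuild - names and options are placeholders.
    New-NetLbfoTeam -Name "PUBLIC" -TeamMembers "NIC1","NIC2" `
        -TeamingMode SwitchIndependent -LoadBalancingAlgorithm TransportPorts

    # Verify the team and the state of its members afterwards:
    Get-NetLbfoTeam
    Get-NetLbfoTeamMember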

Solution 3

This is purely speculative, but my guess is that there may be some bad interaction with RLB being enabled (which is turned on by default, and which lazerpld, Steven, and Stack Exchange have all now hit, whatever this bug turns out to be). From the Intel teaming whitepaper:

Receive load balancing (RLB) is a subset of ALB. It allows traffic to flow in both Tx and Rx on all adapters in the team. When creating an RLB team in Windows, this feature is turned on by default. It can be disabled via the Intel® PROSet GUI using the team’s Advanced Settings.

In RLB mode, when a client is trying to connect to a team by sending an ARP request message, Intel ANS takes control of the server ARP reply message coming from the TCP stack in response. Intel ANS then copies into the ARP reply the MAC address of one of the ports in the team chosen to service the particular end client, according to the RLB algorithm. When the client gets this reply message, it includes this match between the team IP and given MAC address in its local ARP table. Subsequently, all packets from this end client will be received by the chosen port. In this mode, Intel ANS allocates team members to service end-client connections in a round-robin fashion, as the clients request connections to the server. In order to achieve a fair distribution of end clients among all enabled members in the team, the RLB client table is refreshed at even intervals (default is five minutes). This is the Receive Balancing Interval, which is a preconfigured setting in the registry. The refresh involves selecting new team members for each client as required. Intel ANS initiates ARP Replies to the affected clients with the new MAC address to connect to, and redistribution of receive traffic is complete when all clients have had their ARP tables updated by Intel ANS.

The OS can send out ARP requests at any time, and these are not under the control of the Intel ANS driver. These are broadcast packets sent out through the primary port. Since the request packet is transmitted with the team’s MAC address (the MAC address of the primary port in the team), all end clients that are connected to the team will update their ARP tables by associating the team’s IP address with the MAC address of the primary port. When this happens, the receive load of those clients collapses to the primary port.

To restart Rx load balancing, Intel ANS sends a gratuitous ARP to all clients in the receive hash table that were transmitting to non-primary ports, with the MAC address of the respective team members. In addition, the ARP request sent by the OS is saved in the RLB hash table, and when the ARP reply is received from the end client, the client’s MAC address is updated in the hash table. This is the same mechanism used to enable RLB when the server initiates the connection.

So my theory is that perhaps when Windows clustering releases the virtual IP, the Intel driver doesn't see that the IP has been released and continues to announce it. That said, right now this is just a theory.
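
One way to check this without breaking the team is simply to watch a client's ARP cache across a failover. A rough diagnostic sketch is below (the IP is a placeholder and the parsing assumes the default English arp.exe output):

    # Hedged diagnostic sketch - poll this client's ARP entry for the clustered IP and
    # log every MAC change, to catch the moment it flips away from the expected team MAC.
    $clusterIp = "192.0.2.50"           # placeholder - use the clustered/virtual IP here
    $lastMac   = ""
    while ($true) {
        $entry = (arp -a $clusterIp | Select-String $clusterIp | Select-Object -First 1).Line
        if ($entry) {
            $mac = ($entry -split '\s+' | Where-Object { $_ })[1]   # second column is the MAC
            if ($mac -ne $lastMac) {
                $msg = "{0}  {1} is now at {2}" -f (Get-Date -Format s), $clusterIp, $mac
                Write-Output $msg
                Add-Content -Path .\arp-changes.log -Value $msg
                $lastMac = $mac
            }
        }
        Start-Sleep -Seconds 5
    }

If the MAC recorded after a failover belongs to a physical port on the old node rather than to the team, that would support the stale-gratuitous-ARP theory.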


Comments

  • lazerpld
    lazerpld almost 2 years

    I am experiencing a strange Windows 2008 R2 cluster-related issue that is bothering me. I feel I have come close to understanding what the issue is, but I still don't fully understand what is happening.

    I have a two-node Exchange 2007 cluster running on two 2008 R2 servers. The Exchange cluster application works fine when running on the "primary" cluster node. The problem occurs when failing over the cluster resource to the secondary node.

    When failing over the cluster to the "secondary" node, which for instance is on the same subnet as the "primary", the failover initially works OK and the cluster resource continues to work for a couple of minutes on the new node. This means the receiving node does send out a gratuitous ARP reply packet that updates the ARP tables on the network. But after some amount of time (typically within 5 minutes) something updates the ARP tables again, because all of a sudden the cluster service stops answering pings.

    So basically I start a ping to the Exchange cluster address while it is running on the "primary" node. It works just great. I fail the cluster resource group over to the "secondary" node and I lose only one ping, which is acceptable. The cluster resource still answers for some time after being failed over, and then all of a sudden the ping starts timing out.

    This is telling me that the ARP table is initially updated by the secondary node, but then something (which I haven't found yet) wrongfully updates it again, probably with the primary node's MAC.

    Why does this happen - has anyone experienced the same problem?

    The cluster is NOT running NLB, and the problem stops immediately after failing back to the primary node, where there are no problems.

    Each node is using Intel NIC teaming with ALB. Each node is on the same subnet and has the gateway and so on entered correctly, as far as I can tell.

    Edit:
    I was wondering if it could be related to network binding order, maybe? I have noticed that the only difference I can see from node to node is in the local ARP table: on the "primary" node the ARP table is generated with the cluster address as the source, while on the "secondary" it's generated from the node's own network card.

    Any input on this?

    Edit:
    Ok here is the connection layout.

    Cluster address: A.B.6.208/25
    Exchange application address: A.B.6.212/25

    Node A: 3 physical NICs. Two are teamed using Intel's teaming software, with the address A.B.6.210/25, called "public". The third is used for cluster traffic, called "private", with 10.0.0.138/24.

    Node B: 3 physical NICs. Two are teamed using Intel's teaming software, with the address A.B.6.211/25, called "public". The third is used for cluster traffic, called "private", with 10.0.0.139/24.

    Each node sits in a separate datacenter, and the two are connected together. The end switches are Cisco in DC1 and Nexus 5000/2000 in DC2.

    Edit:
    I have been testing a little more. I have now created an empty application on the same cluster and given it another IP address on the same subnet as the Exchange application. After failing this empty application over, I see exactly the same problem occurring. After one or two minutes, clients on other subnets cannot ping the virtual IP of the application. But while clients on other subnets cannot, another server from another cluster on the same subnet has no trouble pinging it. If I then fail over again, back to the original state, the situation is reversed: now clients on the same subnet cannot ping it, and clients on other subnets can. We have another cluster set up the same way and on the same subnet, with the same Intel network cards, the same drivers and the same teaming settings. There we are not seeing this, so it's somewhat confusing. (A scripted way to create such an empty test group and IP resource is sketched at the end of this comment.)

    Edit:
    OK, done some more research. I removed the NIC teaming from the secondary node, since it didn't work anyway. After some standard problems following that, I finally managed to get it up and running again with the old NIC teaming settings on one single physical network card. Now I am not able to reproduce the problem described above. So it is somehow related to the teaming - maybe some kind of bug?

    Edit:
    Did some more failing over without being able to make it fail, so removing the NIC team looks like it was a workaround. I then tried to re-establish the Intel NIC teaming with ALB (as it was before) and I still cannot make it fail. This is annoying, because now I actually cannot pinpoint the root of the problem. It just seems to be some kind of MS/Intel hiccup - which is hard to accept, because what if the problem reoccurs in 14 days? One strange thing did happen, though: after recreating the NIC team I was not able to rename the team to "PUBLIC", which the old team was called. So something has not been cleaned up in Windows - even though the server HAS been restarted!

    Edit:
    OK, after re-establishing the ALB teaming the error came back. So I am now going to do some thorough testing and I will get back with my observations. One thing is for sure: it is related to Intel 82575EB NICs, ALB and gratuitous ARP.


    I am somehow happy to hear that :) I am now going to find out what causes this by doing intensive testing. Hope to get back with some results. I have not seen these problems with Broadcom.

    @Kyle Brandt: What driver versions do you have on the system where you saw this happen? Please provide both the NIC driver version and the teaming driver version. (A quick WMI query for dumping the installed driver versions is sketched at the end of this comment.)

    I am running 11.7.32.0 and 9.8.17.

    I know for a fact that these drivers are VERY old indeed - but as this problem only occurs periodically, it is very hard to tell whether updating the drivers solves the issue. As of now I have tried the following action plan:

    1. Remove ALB teaming - could not provoke the error.
    2. Re-establish ALB teaming - the issue appeared again.
    3. Try AFT (Adapter Fault Tolerance) - issue gone again.
    4. Install the newest drivers and run ALB teaming again (tried with 11.17.27.0) - issue gone.
    5. Roll the drivers back - this action is now pending, but until now the system works fine.

    Yet again I find it frustratingly hard to troubleshoot this periodic problem, as I now have no idea which of the above steps solved the issue. Most probably it was installing the new drivers - but I don't know that for a fact right now.

    I hope that some of you who are experiencing the same issue can add some notes/ideas/observations so that we can get to the root of this.
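
    For anyone who wants to reproduce the "empty application" test mentioned above, here is a rough sketch using the FailoverClusters PowerShell module. The group name, resource name, IP address, subnet mask and network name are placeholders for my setup, not something prescribed by Microsoft:

        Import-Module FailoverClusters

        # Hedged sketch - group/resource names, IP, mask and network name are placeholders.
        Add-ClusterGroup -Name "ArpTestGroup"
        Add-ClusterResource -Name "ArpTestIP" -Group "ArpTestGroup" -ResourceType "IP Address"
        Get-ClusterResource "ArpTestIP" | Set-ClusterParameter -Multiple @{
            Address    = "192.0.2.60"        # a spare address on the same subnet
            SubnetMask = "255.255.255.128"
            Network    = "Cluster Network 1"
            EnableDhcp = 0
        }
        Start-ClusterGroup "ArpTestGroup"

        # Then bounce it between nodes while watching client ARP caches:
        Move-ClusterGroup "ArpTestGroup" -Node "NodeB"

    And a quick, hedged way to dump the installed NIC and teaming driver versions that were asked about (the "Intel" filter is just an assumption about the device names):

        # List Intel network driver versions via WMI (works on 2008 R2 / PowerShell 2.0):
        Get-WmiObject Win32_PnPSignedDriver |
            Where-Object { $_.DeviceName -like "*Intel*" } |
            Select-Object DeviceName, DriverVersion, DriverDate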

  • lazerpld
    lazerpld almost 12 years
    Dear Greg, I have seen these hotfixes around, but none of the descriptions really fit the situation above. The first doesn't, because I just need to fail the application back for it to work again. The latter doesn't, because the failover is not slow in any way - it fails over just fine. But thanks for the input!
  • longneck
    longneck almost 12 years
    You can configure the NIC team to have any MAC address you want. Have you tried assigning it a different MAC?
  • Jerrish Varghese
    Jerrish Varghese almost 12 years
    I haven't, but the team MAC isn't necessarily the issue; it's that MACs from the underlying NICs on the other node are finding their way into the ARP tables on the clients.
  • Jerrish Varghese
    Jerrish Varghese almost 12 years
    I'll give it a shot though. Thanks!
  • Jerrish Varghese
    Jerrish Varghese almost 12 years
    Both accounts mention Intel NICs.
  • Jerrish Varghese
    Jerrish Varghese almost 12 years
    I hear you... I can't wait to begin migrating to Server 2012.. the in-box LBFO NIC teaming has been pretty solid in testing and really easy to configure.
  • lazerpld
    lazerpld almost 12 years
    Steven, could you maybe also try removing the teaming and just running on one NIC, to see if it solves your problem? Then we would at least have something in common :)
  • Jerrish Varghese
    Jerrish Varghese almost 12 years
    We are working on setting up a maintenance window to try that. Breaking the team will fail the node in the cluster, so we have to wait for a maintenance window to do anything like that.
  • Jerrish Varghese
    Jerrish Varghese almost 12 years
    One test of that theory would be to disable that feature... the problem is that doing so can cause a network interruption, which can initiate a failover...
  • mdpc
    mdpc over 11 years
    It's really not appropriate to answer a question with a "me too" type of answer. Thanks.
  • VictorSilva
    VictorSilva over 11 years
    Are you serious? It's really appropriate to read the full post before you write something, buddy. The author from this post wrote: " I hope that some of you who are experiencing the same issue can add some notes/ideas/observations so that we can the to the root of this." Well, I think my "answer" was an observation for that issue. It has a slight difference from the others. If you really have nothing do add to our discussion, please, read and keep your thoughts for yourself. Thank you