What causes the issue (possibly packet loss) in this scenario?

In my experience Wireshark can return unreliable results on interfaces that are using hardware TCP-Offload. Duplicate packets are one of the symptoms of that.

That said, if you're capturing from a span/mirror port, duplicate ACKs on the wire are a significant problem.

Duplicate ACKs, out-of-order segments, and retransmits are signals that the TCP stack on some device is not behaving correctly. Correlating which network nodes are prone to throwing these errors will help isolate which hosts need further investigation. Any differences between a span/mirror port capture and a Wireshark session running on that specific node should help highlight problems on that node. If you see some, investigate updating the network drivers, as those are frequently the easiest fix for this kind of issue (Broadcom is sadly notorious for this). Second to that, updating the NIC firmware can help as well.
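
The per-host correlation can be scripted. As a minimal sketch, assume each type of flagged packet has been exported to text with something like `tshark -r capture.pcap -Y tcp.analysis.retransmission -T fields -e ip.src -e ip.dst`, giving one tab-separated source/destination pair per line. The file name, the event labels, and the helper names `tally_by_sender` and `worst_senders` are illustrative assumptions, and the sample addresses are made up.

```python
from collections import Counter, defaultdict


def tally_by_sender(lines_by_event):
    """Count flagged packets per sending host.

    lines_by_event maps an event label (e.g. "retransmission") to an
    iterable of "src<TAB>dst" lines, one line per flagged packet.
    """
    totals = defaultdict(Counter)
    for event, lines in lines_by_event.items():
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2 or not fields[0]:
                continue  # skip blank lines and packets without an IP source
            totals[fields[0]][event] += 1
    return totals


def worst_senders(totals, top=5):
    """Return (host, total_flagged) pairs, highest total first."""
    ranked = sorted(
        totals.items(), key=lambda item: sum(item[1].values()), reverse=True
    )
    return [(host, sum(counts.values())) for host, counts in ranked[:top]]


# Made-up sample data standing in for the exported lines.
sample = {
    "retransmission": [
        "10.0.0.5\t10.0.0.9",
        "10.0.0.5\t10.0.0.7",
        "10.0.0.9\t10.0.0.5",
    ],
    "duplicate_ack": ["10.0.0.5\t10.0.0.9", ""],
}
totals = tally_by_sender(sample)
print(worst_senders(totals))
```

Keep in mind that a duplicate ACK is sent by the receiver that noticed a gap, so a host that tops the duplicate-ACK count is often the victim of loss rather than its source.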

If everything there looks healthy, you could just be seeing the normal flailing about wildly that TCP does when there is just plain too much traffic to handle.
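
To tell plain saturation apart from a misbehaving stack, it can help to turn interface byte counters into a utilization figure. This is a rough sketch, assuming you can sample a byte counter twice (from the OS or from the switch) and that you know the link speed; the function name and the sample numbers are hypothetical.

```python
def link_utilization(bytes_start, bytes_end, interval_s, link_bps, counter_bits=64):
    """Fraction of link capacity used between two byte-counter samples.

    counter_bits is the width of the counter, used only to correct for
    a single wrap-around between the two samples.
    """
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval_s and link_bps must be positive")
    delta = bytes_end - bytes_start
    if delta < 0:
        # The counter wrapped around between the two samples.
        delta += 2 ** counter_bits
    return (delta * 8) / (interval_s * link_bps)


# Hypothetical example: 1.1 GB moved in 10 seconds on a 1 Gbit/s link.
print(round(link_utilization(0, 1_100_000_000, 10, 1_000_000_000), 2))
```

A figure that stays close to 1.0 in one direction suggests the link itself is the bottleneck, in which case retransmits and duplicate ACKs are expected side effects rather than the root cause.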

TCP Zero-Window is also a sign of an unhealthy TCP/IP stack, though in my experience it sometimes occurs when two different TCP/IP stacks aren't getting along, as can happen with Windows 2008 and certain older TCP/IP stacks in the Linux space.

Author by

Mr Shoubs

Hello, A Software Developer for BTC Solutions, spend my day keeping a shipload of pirates in order. http://uk.linkedin.com/pub/daniel-shoubridge/13/338/470/

Updated on September 17, 2022

Comments

  • Mr Shoubs
    Mr Shoubs over 1 year

    I'm trying to diagnose a network related problem - please understand these points before suggesting an answer (apologies if more information is required, I will add anything people ask).

    • We have a server only network (5 app server, 4 db servers, few other servers) that appears to be suffering packet loss between servers
    • I can see this happening in Wireshark: there are a lot of TCP Retransmissions, TCP_Out-of-Order, TCP DupACK and I think some TCP_ZeroWindow packets too.
    • There appears to be a lot of Bad Checksums on the IP protocol
    • I think the network adapters have a very constant and high (90-100%) load due to the extra retries caused by this packet loss
    • As the external requests on this network increase (to the app servers) the network performance decreases
    • The app servers generate their own traffic when used by the external requests
    • The external requests come through a core router and the network is on its own segment
    • This high load "magically" disappeared after 1-2 days. I say magically as we were only monitoring the adapters at the time the load dropped. There is still packet loss showing in Wireshark, albeit a lesser amount.
    • Nothing points to a compromised server.
    • Unfortunately we don't have physical access to any of the hardware
    • We can't disrupt the current service

    Given the above, what is the best way to determine what is causing the packet loss (we suspect it is a managed switch)?

    Is there any software that can provide us with empirical evidence of what is causing the issues?

    Thanks in advance

    • joeqwerty
      joeqwerty about 13 years
      How are you seeing the packet loss in Wireshark? What are you seeing in Wireshark? Do you see a large volume of broadcast or heartbeat traffic? Do you see a large volume of duplicate ACKs or TCP Retransmits?
    • Mr Shoubs
      Mr Shoubs about 13 years
      Amended: there are a lot of TCP Retransmissions, TCP_Out-of-Order, TCP DupACK and I think some TCP_ZeroWindow packets too. There is some broadcast and heartbeat traffic caused by our db cluster, but nothing out of the ordinary.
    • joeqwerty
      joeqwerty about 13 years
      OK, in my experience the symptoms you're describing are a result of network congestion. I've seen this happen with load balanced servers due to the volume of heartbeat packets that are generated. I would look to see what the exact volume of heartbeat traffic is and what the volume of general broadcast traffic is (ARP and network broadcasts). A nice tool for visualizing this and for providing analysis of the captured traffic is ColaSoft Capsa. There's a free edition available here: colasoft.com/download
    • Mr Shoubs
      Mr Shoubs about 13 years
      Thanks, I'll take a look. But the heartbeat traffic has always been there - nothing has changed on our servers at all and they've been running without any issues up until the other day. I seriously doubt that this traffic is all of a sudden the cause of the issue.
  • Mr Shoubs
    Mr Shoubs about 13 years
    Lots of useful information for me to investigate there. How can I set up a capture using span/mirror? If you're correct that the retries etc. are a symptom of the congested network, would the network get more congested when more retries are sent due to congestion?
  • Mr Shoubs
    Mr Shoubs about 13 years
    There appears to be a lot of Bad Checksums on the IP protocol too.
  • Deb
    Deb about 13 years
    @MrShoubs Bad checksums are a sign that there is TCP-offload happening on your NICs, below the layer at which Wireshark captures packets. A span or mirror port (sometimes also called a monitor port) is a feature of your network switch that copies all traffic passing through a specific port to a second port. The exact methods vary with the switch; Cisco, HP, Juniper, etc. all do it differently.
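
To illustrate why offload produces "bad checksum" warnings, here is a small sketch of the standard IPv4 header checksum (the ones'-complement sum described in RFC 1071). The sample header is an arbitrary illustrative one, and the zeroed checksum field stands in for a packet captured on the sending host before the NIC has filled the checksum in; that is an assumption about how the capture was taken, not something the thread confirms.

```python
def ipv4_header_checksum(header: bytes) -> int:
    """Compute the IPv4 header checksum (ones'-complement sum, RFC 1071).

    The checksum field itself (bytes 10-11) is treated as zero.
    """
    if len(header) < 20 or len(header) % 2:
        raise ValueError("expected an IPv4 header of even length >= 20 bytes")
    total = 0
    for i in range(0, len(header), 2):
        if i == 10:
            continue  # skip the stored checksum field
        total += (header[i] << 8) | header[i + 1]
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def checksum_ok(header: bytes) -> bool:
    """Compare the stored checksum with a freshly computed one."""
    stored = (header[10] << 8) | header[11]
    return stored == ipv4_header_checksum(header)


# Illustrative 20-byte header with a correct checksum (0xb861).
good = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
# Same header with the checksum field left at zero, as an offloading
# NIC's driver might hand it to the capture layer (assumption).
unfilled = good[:10] + b"\x00\x00" + good[12:]

print(hex(ipv4_header_checksum(good)), checksum_ok(good), checksum_ok(unfilled))
```

If the same packets show valid checksums in a span/mirror port capture, the warnings seen on the host are a capture artifact rather than corruption on the wire.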