Why am I getting long TCP connect latency in a LAN (over a crossover cable!)?


Solution 1

Well, crap. It appears I misread both the tcpdump and Wireshark logs. The delay I was getting was 100 microseconds, not milliseconds!
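
To avoid the same unit mix-up, it can help to compute the delta between two capture timestamps explicitly and print it in both units. The following is a minimal sketch; the two timestamps are made-up values in tcpdump's default `HH:MM:SS.microseconds` format, not figures from the actual capture.

```python
from datetime import datetime

TS_FORMAT = "%H:%M:%S.%f"  # tcpdump's default timestamp layout


def delta_seconds(ts_a: str, ts_b: str) -> float:
    """Return ts_b - ts_a in seconds (both on the same day)."""
    a = datetime.strptime(ts_a, TS_FORMAT)
    b = datetime.strptime(ts_b, TS_FORMAT)
    return (b - a).total_seconds()


def describe(delta_s: float) -> str:
    """Show the same delta in microseconds and in milliseconds."""
    return f"{delta_s * 1e6:.0f} us = {delta_s * 1e3:.3f} ms"


# Hypothetical SYN and SYN/ACK timestamps, for illustration only.
syn = "10:15:02.000120"
syn_ack = "10:15:02.000220"
print(describe(delta_seconds(syn, syn_ack)))  # prints "100 us = 0.100 ms"
```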


Solution 2

The usual suspects:

  • Duplex mismatch

    • check on switch for collisions or errors
    • check on hosts for collisions or errors

    If you see collisions, that end is half duplex and should be set to full. If you see errors, check the other end for collisions. If both ends have errors, you may have a bad cable.

  • DNS timeouts
    • log onto one host and use nslookup to look up the IP of the other. You should get a name or an error very quickly
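
On a Linux host, the collision and error counters mentioned above can be read from sysfs. This is a rough sketch that assumes the standard `/sys/class/net/<iface>/statistics` layout; the interface name `eth0` is a placeholder, and the `diagnose` helper is only an illustration of the rule of thumb given in this answer.

```python
from pathlib import Path

# Counter files exposed by Linux under /sys/class/net/<iface>/statistics
COUNTERS = ("collisions", "rx_errors", "tx_errors", "rx_crc_errors")


def read_counters(iface: str, root: str = "/sys/class/net") -> dict:
    """Read the counters that exist for iface; missing files are skipped."""
    stats_dir = Path(root) / iface / "statistics"
    result = {}
    for name in COUNTERS:
        path = stats_dir / name
        if path.is_file():
            result[name] = int(path.read_text().strip())
    return result


def diagnose(counters: dict) -> str:
    """Apply the rule of thumb: collisions first, then errors."""
    if counters.get("collisions", 0) > 0:
        return "collisions seen: this end may be running half duplex"
    error_keys = ("rx_errors", "tx_errors", "rx_crc_errors")
    if any(counters.get(key, 0) > 0 for key in error_keys):
        return "errors seen: check the other end for collisions, or the cable"
    return "no collisions or errors recorded"


if __name__ == "__main__":
    # "eth0" is an assumed interface name; substitute your own.
    print(diagnose(read_counters("eth0")))
```

The counters are cumulative since the interface came up, so comparing two readings taken a few minutes apart is more telling than a single snapshot.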

Solution 3

What model of Cisco switch are you using? One thing that could be happening: if the switch doesn't know which port your server is on, it has to flood the packet out of all ports, which could take time (though it shouldn't take 100 ms). You can verify this by running tcpdump on another server that isn't one of the two you are testing; if it sees the packets, the switch is flooding. Once the server responds, the switch learns the port-to-MAC assignment and does the forwarding in ASIC. This could be especially prevalent on lower-end Cisco switches.

Also, do you have per-port ACLs? Those could also require CPU switching, which would be orders of magnitude slower than switching in ASIC. Do you have the same problem when running pings, in that the first ping has a 100 ms delay and subsequent pings are <1 ms? If it's a lower-end switch and you only get the delay on TCP/IP, I'd check that there isn't an ACL applied to TCP/IP packets.
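
The same first-versus-subsequent comparison can be made for TCP connects rather than pings. Below is a minimal sketch that times several `connect()` calls in a row. The time measured in user space includes some local overhead, so it only approximates the SYN to SYN/ACK round trip; the `factor` threshold is an arbitrary assumption.

```python
import socket
import time


def connect_times_ms(host: str, port: int, attempts: int = 5,
                     timeout: float = 2.0) -> list:
    """Time `attempts` back-to-back TCP connects, in milliseconds."""
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        # create_connection returns once the three-way handshake completes.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.perf_counter() - start
        samples.append(elapsed * 1000.0)
    return samples


def first_is_outlier(samples: list, factor: float = 10.0) -> bool:
    """True if the first sample exceeds `factor` times the median of the rest."""
    if len(samples) < 2:
        return False
    rest = sorted(samples[1:])
    median = rest[len(rest) // 2]
    return samples[0] > factor * median
```

For example, `first_is_outlier(connect_times_ms("192.0.2.10", 22))` would flag a slow first connect, where the address and port are placeholders for the server under test. A slow first connect followed by fast ones would point toward learning or cache effects (MAC table, ARP) rather than a constant per-connection delay.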

I would also check the switch's CPU load. Even if usage is low, some stupid config that causes it to switch in CPU can easily overload it. We've overloaded high-end switches (10 Gbps backhaul) with traffic in the 100 Mbps range because we were inadvertently sending traffic that had to be switched by the CPU.

Solution 4

Have you checked the cabling? Bad cables and/or punchdowns can result in retries that can greatly increase latency.
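
On Linux, one place where such retries may show up is the TCP retransmission counter in `/proc/net/snmp`. The sketch below parses the `Tcp:` header and value lines from that file; it assumes the usual two-line layout, and link-layer problems may also appear in the interface error counters instead.

```python
def parse_tcp_counters(snmp_text: str) -> dict:
    """Parse the 'Tcp:' header/value line pair from /proc/net/snmp text."""
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        return {}
    names, values = tcp_lines[0][1:], tcp_lines[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def retransmit_ratio(counters: dict) -> float:
    """Fraction of sent segments that were retransmissions."""
    sent = counters.get("OutSegs", 0)
    return counters.get("RetransSegs", 0) / sent if sent else 0.0


if __name__ == "__main__":
    try:
        with open("/proc/net/snmp") as fh:
            print(f"retransmit ratio: {retransmit_ratio(parse_tcp_counters(fh.read())):.4%}")
    except OSError:
        print("/proc/net/snmp not available on this system")
```

A ratio that keeps climbing while the test runs would support the bad-cable theory; a flat one suggests looking elsewhere.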


Author: Craig Vermeer

Updated on September 17, 2022

Comments

  • Craig Vermeer
    Craig Vermeer over 1 year

    I am measuring a time of about 100-150 milliseconds from sending TCP SYN to getting SYN/ACK, between two linux computers connected to the same Cisco switch. Consider:

    • The machines are very powerful, and neither they nor the switch is heavily loaded.
    • From analyzing tcpdump logs on the two machines, I see the problem is not in the endpoints but rather in the network itself (the client sees a 100-150 ms delay, but the server processes the responses in about 10 ms).
    • Only SYN requests are slow. Afterwards, normal TCP packets get an ACK right away.

    So, my questions are:

    • Am I right to think this is way, way too much?
    • What latency should I aim for?
    • What can I do to further diagnose and solve the issue?

    Edit - We've taken the switch out of the equation. The two computers are now connected with a crossover cable, and we're still seeing the problem. Both are on full duplex, 100 Mbps.

  • Craig Vermeer
    Craig Vermeer over 14 years
    I'm comparing the time difference between the two computers, not the absolute times.
  • Russell Heilling
    Russell Heilling over 14 years
    The problem with comparing time deltas is that it's impossible to tell whether there is a big delay in one direction, or whether there is a smaller delay in both directions :(
  • Craig Vermeer
    Craig Vermeer over 14 years
    They are both connected to the same VLAN and the same wires, and I have no reason to suspect an ARP hijack. I am experiencing the problem from all devices to all devices in the network. Does this mean it's a bad switch configuration? Note that the problem is only felt at the TCP connect phase. Other packets get an ACK quickly.
  • Craig Vermeer
    Craig Vermeer over 14 years
    DNS timeouts are irrelevant, we're testing with IP addresses. Duplex settings were verified (see updated question).
  • Craig Vermeer
    Craig Vermeer over 14 years
    Happens over all our environments, many different cables.
  • Craig Vermeer
    Craig Vermeer over 14 years
    Removed the switch from the equation (used a cross cable), and it still happens.
  • chris
    chris over 14 years
    The DNS issue that often causes this problem is that the listening side attempts a reverse lookup of the IP establishing the connection.
  • chris
    chris over 14 years
    So that's fast, right?
  • Craig Vermeer
    Craig Vermeer over 14 years
    Very much so, indeed.
  • chris
    chris over 14 years
    Back in the old days radar caused ships to sometimes crash into each other because the pilot would forget to reset the scale of the readout. "Don't worry, that ship is miles away." "Feet? oops."
  • user234702
    user234702 over 14 years
    That is incredibly weird??? What do your ping results look like? Also, I don't know if it helps, but I once encountered a problem on FreeBSD where a kernel operation was freezing the entire system for a few seconds. We noticed this because we were running VRRP and every day or two the VIP would fail over. We finally matched it to a kernel log entry that had something to do with memory. At this point I would make sure you're not using cheapo NICs, that you have the latest drivers, and maybe look for related kernel problems/logs, since a freeze like that would also affect your tcpdump and make the capture look normal.
  • user1227502
    user1227502 over 11 years
    @chris - interesting. Have you got a source for that anecdote?