ssh will sporadically hang temporarily on fast connection

Solution 1

No packets are received for several seconds and then ~6 are sent back in quick succession.

This is symptomatic of two similar phenomena: network congestion or network discards (usually due to congestion).

In the first case, a router between here and there has a burst of traffic unrelated to your activities, which causes your traffic to be buffered in some intermediate router. Your packets wait their turn until there is bandwidth open to send them on their way. Congestion like this could result from anything from a sudden spike in YouTube traffic (new kitten video!!!) to something like an attempted SYN_ACK attack. In practice, there are far more attempted malfeasant attacks than we'd like because there are a huge number of infected machines out there which will spontaneously throw traffic at a random device somewhere on the planet. Even though SYN_ACK and similar attacks are now quashed shortly after detection, the detection and quashing themselves can keep a router busy for a few seconds.

The second case is that your traffic hits an overloaded device and it does not buffer the traffic, either because it has no extra buffer memory or because buffering often causes its own problems. For example, "I've buffered traffic because the router one hop over was too busy right now so as soon as it becomes available, I'll hit it with my stored traffic, thus making it over-busy…" ad infinitum. In this case, the packets are dropped and your TCP connection will begin its exponential backoff, which will cause a delay on your (sending) side. Historically this was a splendid approach to coping with the very bursty internet. There are a big handful of problems with this core part of the transmission protocol but no great solutions.
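
As a rough illustration of how a few dropped packets can add up to a stall of several seconds, here is a minimal sketch of exponential backoff. The 1 second starting timeout and the 60 second cap are assumptions chosen for illustration (real TCP stacks derive the timeout from measured round-trip times), and `backoff_schedule` is a hypothetical helper, not part of any library.

```python
def backoff_schedule(initial_rto=1.0, retries=5, max_rto=60.0):
    """Return the wait before each successive retransmission.

    The wait doubles after every failed attempt, up to max_rto.
    """
    waits = []
    rto = initial_rto
    for _ in range(retries):
        waits.append(rto)
        rto = min(rto * 2, max_rto)
    return waits


if __name__ == "__main__":
    waits = backoff_schedule()
    print(waits)       # [1.0, 2.0, 4.0, 8.0, 16.0]
    print(sum(waits))  # 31.0 seconds of waiting if all five attempts are lost
```

Under these assumed numbers, three or four consecutive losses already cost 7 to 15 seconds, which is the same order of magnitude as the hangs described in the question.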

Unfortunately, without the devoted assistance of your ISP, telco, and various system administrators, such lag spikes are nearly impossible to diagnose. In all likelihood, the device that is oversubscribed for its peak traffic is located somewhere completely inaccessible to you, and its operator may neither know nor care that it is overloaded.

The Internet Protocols were designed for best effort delivery giving no guarantees that a packet would ever make it to its destination. That it works as well as it does under loads that were never imagined is, to me, a minor miracle. If you need better than the public internet can provide, someone would probably be happy to sell you a dedicated line from you to your destination for some arbitrarily high price. Otherwise, like freeway traffic or random overlong queues at the grocery store, it might just have to be an inconvenience of modern life that you just have to live with.

As a side-note, physical proximity is poorly correlated with topological proximity. For a good time, try traceroute destination-host and marvel at just how many devices your traffic traverses between here and there. It is not unusual for a 1km transfer to go a megameter and 20 devices to get to its destination.

added in response to comment:

I have never noticed the issue occurring when I use PuTTY on a windows partition on the same machine.

Does your statement "on a windows partition" mean "running on windows"? I'll assume it does.

Without more precise data I would first presume that the spikes did happen under PuTTY and you simply did not notice them, but I'm not certain of that. An alternate hypothesis is that the latency spikes are not happening with PuTTY, which apparently does use a different SSH implementation. If you could quantify the lack of latency spikes as you did in the ping graph above, that would help distinguish between network and client issues.

To get more transfer data, I'd use PuTTY's scp client (pscp) to copy large files between your machine and the host in question. You can use wireshark to record inter-packet times.
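
Once you have packet arrival times out of a capture, spotting the hangs is a matter of looking for large gaps between consecutive packets. The following is only a sketch: it assumes you have already exported the timestamps as a plain list of seconds (how you export them is up to you), `find_stalls` is a hypothetical helper, and the 1 second threshold is an arbitrary choice.

```python
def find_stalls(timestamps, threshold=1.0):
    """Return (start_time, gap) for each inter-packet gap above threshold.

    timestamps: packet arrival times in seconds, in ascending order.
    """
    stalls = []
    for earlier, later in zip(timestamps, timestamps[1:]):
        gap = later - earlier
        if gap > threshold:
            stalls.append((earlier, gap))
    return stalls


if __name__ == "__main__":
    # Made-up arrival times, for illustration only.
    sample = [0.0, 0.5, 1.0, 13.0, 13.5]
    print(find_stalls(sample))  # [(1.0, 12.0)]
```

Running the same analysis on a PuTTY capture and an OpenSSH capture would give you numbers to compare rather than impressions.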

There are a couple of flaws with the ping test in your graph. The first is that ping uses ICMP packets, which are quite distinct from TCP, are frequently given lower priority than TCP traffic, and are more likely to be discarded by intermediate routers. As a quick check, those data are useful, but if you want to track TCP connections it is best to use TCP packets, which is why I recommend scp. You could also use the same scp / wireshark combination under unix for comparative purposes.
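
If you want a ping-like probe that travels over TCP instead of ICMP, one rough option is to time how long a TCP connection to the server takes to open. This is a sketch under assumptions: `tcp_connect_time` is a hypothetical helper, the host name and port 22 in the comment are placeholders, and it only measures the TCP handshake, not the behaviour of an established ssh session.

```python
import socket
import time


def tcp_connect_time(host, port, timeout=5.0):
    """Return seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        # create_connection performs the TCP handshake; the socket is
        # closed again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return None


# Example call (placeholder host; port 22 assumed for ssh):
#     tcp_connect_time("server.address.on.campus", 22)
```

Calling this once a second and logging the results gives a trace comparable to the ping graph, but made of TCP traffic.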

The other problem with the ping test is that 60 seconds is too short a period to get a decent picture of periodic behavior. Since you already appear to have summarization tools at hand, 10 minutes would be better than 1 minute and an hour better still.

When testing, I'd vary the data I'm passing between machines. Here's a very quick-and-dirty script to generate one file with a lot of entropy and one with almost none:

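#!/usr/bin/env python2.7
# Writes two 1 MiB test files: random.dat (high entropy) and
# sequen.dat (a repeating 0x00..0xff pattern, very low entropy).

import random

def data_bytes(outf, ordered=False):
    """write a series of ordered or random octets to outf"""
    for block in range(1024):        # 1024 blocks ...
        for char in range(1024):     # ... of 1024 octets each
            if ordered:
                c = char % 0x100
            else:
                c = random.randint(0, 0xff)
            # chr() gives a one-byte str under Python 2.7;
            # under Python 3 this would need bytes([c]) instead.
            outf.write(chr(c))

def main():
    with open('random.dat', 'wb') as outf:
        data_bytes(outf, ordered=False)
    with open('sequen.dat', 'wb') as outf:
        data_bytes(outf, ordered=True)

if __name__ == '__main__':
    main()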

Please forgive me if this bit is patently obvious.
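
The reason for testing both kinds of data is that anything on the path that compresses traffic (ssh's own `-C` option, for instance) will treat them very differently. As a quick sanity check, the sketch below compares how well the two kinds of data compress. It uses Python 3 and zlib purely as a stand-in for whatever compression, if any, is actually in play, and `compression_ratio` is a hypothetical helper.

```python
import os
import zlib


def compression_ratio(data):
    """Compressed size divided by original size (lower = more compressible)."""
    return len(zlib.compress(data)) / len(data)


if __name__ == "__main__":
    size = 1024 * 1024
    random_data = os.urandom(size)                       # high entropy
    ordered_data = bytes(i % 256 for i in range(size))   # repeating 0..255
    print("random:  %.3f" % compression_ratio(random_data))
    print("ordered: %.3f" % compression_ratio(ordered_data))
```

The random data should barely compress at all, while the ordered data should shrink to a small fraction of its size.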

Your anecdotal observation makes this an interesting question. It does need hard data to get further.

Solution 2

On the off-chance that you haven't already tried this, you could try adding a keep-alive for your ssh client. Just add

ServerAliveInterval 30

somewhere in ~/.ssh/config and restart ssh.

Author by

MattLBeck

Updated on September 18, 2022

Comments

  • MattLBeck
    MattLBeck almost 2 years

    I am using Ubuntu 13.04 on my laptop, connected to my router at home. When working from home, I will ssh into the servers on campus, through a vpn, with X11 forwarding.

    ssh -X server.address.on.campus
    

    I have a connection that is usually about 40 Mb/s and I only live a few miles away so the terminal is just as responsive as if I was using ssh on the campus network. However, the difference is that the connection from home has a tendency to "hang" for about 10-15 seconds every few minutes before resuming (all keystrokes I made during the hang are clearly sent because my screen is updated with them after the hang). There is no discernible pattern to the hangs. It usually happens (or is most noticeable) when I am typing something out.

    Does anyone have any ideas how I could mitigate this issue or what might be causing it? Reading around the internet, there are various issues with ssh hanging (usually permanently) but no solutions for my specific issue.

    UPDATE: I still have this issue. As suggested by @Anthon, I left ping running until ssh hung again. I've plotted the results below, and it is quite clear where the temporary hang happens. No packets are received for several seconds and then ~6 are sent back in quick succession.

    [image: plot of the ping results]

    Also: I have never noticed the issue occurring when I use PuTTY on a windows partition on the same machine.

    • Jpark822
      Jpark822 about 11 years
      Is it only the SSH connection which hangs or do other programs freeze at the same time? Goal: debug if it is a local laptop, an SSH or a network problem.
    • MattLBeck
      MattLBeck about 11 years
      It's just the SSH - no other programs are behaving oddly. I could do with having some other server on a different network that I could connect to but I don't know of any off the top of my head.
    • tink
      tink about 11 years
      Have you run a tcpdump? I appreciate that, given you're using vpn and ssh there won't be much meaningful content, but packet sizes, timings and such may still prove useful.
    • Anthon
      Anthon about 11 years
      "No other programs are behaving oddly" + "you could do with some other server"? Did you mean you have not tried any other programs that access the network? You should at least try googling something and see if 'instant search' is responsive.
    • MattLBeck
      MattLBeck about 11 years
      @Anthon Other programs requiring network access (for example web browsers) appear to be fine, although it's quite possible that this issue is much harder to notice when using them. I meant I want to try ssh to a different server to see if the issue is on my end or the campus network's.
    • Anthon
      Anthon about 11 years
      @kikumbob I was specifically referring to the instant search, as you will notice delays caused by a problem (e.g. in your router) better with that than with a normal fetch-me-a-url. Have you tried pinging the server while you experience the problems to see if that has a delay as well?
    • Jpark822
      Jpark822 about 11 years
      WAG : Can it be a low entropy problem? In which case only encrypted connections might stall. No idea how to debug or detect that though.
    • Nils
      Nils over 10 years
      What happens when you do a ping with ping-size of 1500 bytes and set the "do not fragment" flag on that ping?
  • MattLBeck
    MattLBeck almost 11 years
    Thanks for the detailed info on possible causes! One thing I forgot to mention was that I have never noticed the issue when using PuTTY. Any ideas about this difference?
  • msw
    msw almost 11 years
    @kikumbob please see "added" above
  • MattLBeck
    MattLBeck almost 11 years
    Thanks! I will make a proper experiment of this when I have some time.
  • Nils
    Nils almost 11 years
    SSH is a TCP/IP protocol. As such it does not handle the MTU size itself, but relies on the corresponding layers to do their job. Have you any links for that "going crazy"?
  • falco
    falco almost 11 years
    It's my own experience, but here are some threads about similar issues: snailbook.com/faq/mtu-mismatch.auto.html boston-linux-unix-general-discussion-list.996279.n3.nabble.com/… I can't remember where I found a proper explanation, it was a couple of years ago, but I'll try to find some.
  • falco
    falco almost 11 years
    My own experience is when I try to set MTU bigger than 1500, an ssh connection will hang for sure when I run something bigger (e.g. ls or less or similar )
  • falco
    falco almost 11 years
    Here's another link: serverfault.com/questions/146201/… Read the question and the last answers.