Clarification about Linux TCP window size and delays

Solution 1

After doing a little more digging into my traffic, I was able to see that my data was nothing but a sequence of small bursts with small idle periods between them.

With the useful tool ss, I was able to retrieve the current congestion window size of my connection (see the cwnd value in the output):

[user@localhost ~]$ /usr/sbin/ss -i -t -e | grep -A 1 56001

ESTAB 0 0 192.168.1.1:56001 192.168.2.1:45614 uid:1001 ino:6873875 sk:17cd4200ffff8804
     ts sackscalable wscale:8,9 rto:277 rtt:74/1 ato:40 cwnd:36 send 5.6Mbps rcv_space:5792

I ran the tool several times and discovered that the congestion window size was regularly reset to its initial value (10 segments, on my Linux box). The connection was constantly falling back to the slow start phase. During slow start, bursts containing more packets than the window allowed were delayed, waiting for the ACKs of the first packets of the burst.
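
The same congestion window can also be read from inside the application through the Linux-specific TCP_INFO socket option; a minimal sketch, assuming sock_fd is the connected socket:

    #include <stdio.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Print the current congestion window (in segments) and RTT of a connected socket. */
    void print_cwnd(int sock_fd)
    {
        struct tcp_info info;
        socklen_t len = sizeof(info);

        if (getsockopt(sock_fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0)
            printf("cwnd: %u segments, rtt: %u us\n",
                   info.tcpi_snd_cwnd, info.tcpi_rtt);
    }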

The fact that the traffic consists of a sequence of bursts separated by idle periods explains why the congestion window was being reset: by default, Linux drops back to slow start after an idle period.

By deactivating slow start after idle, I was able to get rid of the delays:

[user@host ~]$ cat /proc/sys/net/ipv4/tcp_slow_start_after_idle
0
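
A privileged process can also flip the switch itself by writing to the same /proc entry (equivalent to sysctl -w net.ipv4.tcp_slow_start_after_idle=0); a minimal sketch:

    #include <stdio.h>

    /* Disable slow start after idle so the congestion window survives idle periods. */
    int main(void)
    {
        FILE *f = fopen("/proc/sys/net/ipv4/tcp_slow_start_after_idle", "w");
        if (f == NULL) {
            perror("tcp_slow_start_after_idle");
            return 1;
        }
        fputs("0\n", f);
        return fclose(f) == EOF ? 1 : 0;
    }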

Solution 2

This isn't going to be some subtle thing like a setting somewhere. This is going to be a problem with the protocol layered on top of TCP or a code bug. There's no magic "go faster" switch for TCP except for unusual cases like networks with very high latency or packet loss caused by noise.

The most obvious explanation would be if the code calls write or send with very small chunks. You need to accumulate at least 2KB per send, ideally 16KB. You say you batch the messages, but it's not clear what that means. Do you pass them in one call to write or send? Do you bundle them into a single protocol data unit for the protocol layered on top of TCP? Doing both of these things helps a lot with latency.
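
As a hypothetical illustration of that kind of batching (the buffer size, names and framing below are invented, not taken from the question), the sender could accumulate the pending messages and hand them to the kernel in a single send():

    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* Illustrative batch buffer; 16KB is the target size mentioned above. */
    #define BATCH_MAX (16 * 1024)

    static char   batch[BATCH_MAX];
    static size_t batch_len;

    /* Send whatever has been accumulated so far in one call. */
    int flush_batch(int fd)
    {
        if (batch_len == 0)
            return 0;
        if (send(fd, batch, batch_len, 0) < 0)   /* partial sends ignored for brevity */
            return -1;
        batch_len = 0;
        return 0;
    }

    /* Append one application message to the batch, flushing when it is full. */
    int queue_message(int fd, const void *msg, size_t len)
    {
        if (len > BATCH_MAX)                     /* oversized message: send it directly */
            return flush_batch(fd) < 0 ? -1 : (send(fd, msg, len, 0) < 0 ? -1 : 0);
        if (batch_len + len > BATCH_MAX && flush_batch(fd) < 0)
            return -1;
        memcpy(batch + batch_len, msg, len);
        batch_len += len;
        return 0;
    }

The application would call flush_batch() once no more messages are immediately available, so nothing sits in the buffer waiting.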

Also, get rid of TCP_NODELAY. It can reduce throughput. It's only for applications that weren't designed to work with TCP or for applications that cannot predict which side will need to transmit next.

Unless, of course, you are in fact layering a protocol on top of TCP where you don't know which side is going to transmit next (like telnet, for example). Then it can make sense to set TCP_NODELAY. Significant expertise is required to make that kind of protocol work with low latency. If that's your situation, post more details about the protocol you're layering on top of TCP, what its protocol data unit sizes look like, and what determines which side transmits when.
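
For reference, TCP_NODELAY is a per-socket flag toggled with setsockopt(); a minimal sketch, assuming sock_fd is an already connected TCP socket:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* flag = 1 disables Nagle's algorithm (TCP_NODELAY on); flag = 0 re-enables it. */
    int set_tcp_nodelay(int sock_fd, int flag)
    {
        return setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY,
                          &flag, sizeof(flag));
    }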

If you do in fact batch the messages available at one time and pass them in a single call to write or send, then most likely the problem is that the other side is not sending an application-layer acknowledgement for each batch. These acknowledgements improve latency by giving TCP ACKs data packets to piggyback on. Your protocol should include them so that the two sides alternate, which helps keep latency down.
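
As a sketch of the idea (the 4-byte format and the function name are invented for illustration, not part of any particular protocol), the receiving side could answer each batch with a tiny application-level acknowledgement, which gives its TCP ACK a data packet to piggyback on:

    #include <stdint.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <arpa/inet.h>

    /* Hypothetical app-level ack: echo back the sequence number of the last
     * processed batch as a 4-byte network-order integer. */
    int send_app_ack(int sock_fd, uint32_t batch_seq)
    {
        uint32_t wire = htonl(batch_seq);
        return send(sock_fd, &wire, sizeof(wire), 0) == (ssize_t)sizeof(wire) ? 0 : -1;
    }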

Comments

  • jlafaye
    jlafaye almost 2 years

    I have delays when sending data through a TCP channel that I am not able to understand. The link is a 1Gb link with an end-to-end latency of roughly 40ms. In my current setup, latency (the time for a message to go from the sender's user space to the receiver's user space) can reach 100ms.

    The sender socket is configured with the TCP_NODELAY option. The send buffer (SO_SNDBUF) is configured to be 8MB. The receive buffer (SO_RCVBUF) is also configured to be 8MB. TCP window scaling is enabled.

    update-1: I am using the zeromq 3.1.1 middleware to carry data. Socket configuration, including the TCP_NODELAY flag, is performed by the middleware. Some options are accessible, such as the rx and tx buffer sizes, but not TCP_NODELAY. As far as I have understood, TCP_NODELAY is activated to ensure that the data is sent as soon as possible. In the meantime, actual socket sends and the decision to send a message are performed in two separate threads. Proper batching is done if several messages are available at the time the first message in the batch is to be sent.

    I ran a capture with tcpdump from which the frames below have been extracted. After the initial TCP handshake, the sender (172.17.152.124) starts sending data. The initial window size is 5840 bytes for the receiver & 5792 bytes for the sender.

    My problem is that the sender sends two frames (#6 and #7) then stops, waiting for an ack to come back from the receiver. As far as I can see, the window size of the receiver is not reached and the transfer should not stop (384 bytes outstanding with an initial receive window size of 5840 bytes). I am starting to think that I have not correctly understood how TCP works. Can someone help clarify?

    update-2: My data payload consists of a magic number followed by a timestamp. I have isolated the delayed packets by comparing the timestamps of the payloads with the timestamps recorded by tcpdump. The payload timestamp of frame #9 is very close to those of frames #6 and #7, and clearly earlier than the timestamp of the ack received in frame #8.

    update-1: The fact that frame #9 is not sent immediately can be explained by the slow start of the TCP channel. In fact, the problem also appears once the connection has been running for several minutes, so slow start does not seem to be the general explanation.

    1. 20:53:26.017415 IP 172.17.60.9.39943 > 172.17.152.124.56001: Flags [S], seq 2473022771, win 5840, options [mss 1460,sackOK,TS val 4219180820 ecr 0,nop,wscale 8], length 0

    2. 20:53:26.017423 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [S.], seq 2948065596, ack 2473022772, win 5792, options [mss 1460,sackOK,TS val 186598852 ecr 219180820,nop,wscale 9], length 0

    3. 20:53:26.091940 IP 172.17.60.9.39943 > 172.17.152.124.56001: Flags [.], ack 1, win 23, options [nop,nop,TS val 4219180894 ecr 186598852], length 0

    4. 20:53:26.091958 IP 172.17.60.9.39943 > 172.17.152.124.56001: Flags [P.], seq 1:15, ack 1, win 23, options [nop,nop,TS val 4219180895 ecr 186598852], length 14

    5. 20:53:26.091964 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [.], ack 15, win 12, options [nop,nop,TS val 186598927 ecr 4219180895], length 0

    6. 20:53:26.128298 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [P.], seq 1:257, ack 15, win 12, options [nop,nop,TS val 186598963 ecr 4219180895], length 256

    7. 20:53:26.128519 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [P.], seq 257:385, ack 15, win 12, options [nop,nop,TS val 186598963 ecr 4219180895], length 128

    8. 20:53:26.202465 IP 172.17.60.9.39943 > 172.17.152.124.56001: Flags [.], ack 257, win 27, options [nop,nop,TS val 4219181005 ecr 186598963], length 0

    9. 20:53:26.202475 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [.], seq 385:1833, ack 15, win 12, options [nop,nop,TS val 186599037 ecr 4219181005], length 1448

    10. 20:53:26.202480 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [P.], seq 1833:2305, ack 15, win 12, options [nop,nop,TS val 186599037 ecr 4219181005], length 472

    If this matters, both ends are Linux RHEL5 boxes, with 2.6.18 kernels and network cards are using e1000e drivers.

    update-3: Content of /etc/sysctl.conf:

    [jlafaye@localhost ~]$ cat /etc/sysctl.conf | grep -v "^#" | grep -v "^$" 
    net.ipv4.ip_forward = 0
    net.ipv4.conf.default.rp_filter = 1
    net.ipv4.conf.default.accept_source_route = 0
    kernel.sysrq = 0
    kernel.core_uses_pid = 1
    net.ipv4.tcp_syncookies = 1
    kernel.msgmnb = 65536
    kernel.msgmax = 65536
    kernel.shmmax = 68719476736
    kernel.shmall = 4294967296
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.core.rmem_default = 1048576
    net.core.wmem_default = 1048576
    net.ipv4.tcp_rmem = 65536 4194304 16777216
    net.ipv4.tcp_wmem = 65536 4194304 16777216 
    net.core.netdev_max_backlog = 10000 
    net.ipv4.tcp_window_scaling = 1
    net.ipv4.tcp_mem = 262144 4194304 16777216
    kernel.shmmax = 68719476736
    
    • Sumeet Kashyap
      Sumeet Kashyap about 12 years
      How exactly do you know that packet 9 should have been sent immediately after 6/7? I notice it doesn't have the PUSH flag set.
    • ewwhite
      ewwhite about 12 years
      Can we see your /etc/sysctl.conf?
    • jlafaye
      jlafaye about 12 years
      Ticket edited with the content of /etc/sysctl.conf
  • Sumeet Kashyap
    Sumeet Kashyap about 12 years
    I disagree: there's nothing wrong with small writes if you want the data to be sent immediately. He's trying to design a very low latency system.
  • David Schwartz
    David Schwartz about 12 years
    There most certainly is. If you issue a small write when there's other data you want to send immediately, you are senselessly delaying the data you didn't include in the small write. (As the OP's problem demonstrates.)
  • Sumeet Kashyap
    Sumeet Kashyap about 12 years
    He has said "A proper batching is done if several messages are available at the time the first message in the batch is to be sent". Not knowing anything about the origin of the data, we can't say whether new data may need to be sent immediately after you've just sent a packet.
  • David Schwartz
    David Schwartz about 12 years
    It's not clear that means the messages are aggregated into a single call to write or send. However, if they are, then the problem is probably in the design of the protocol layered on top of TCP -- most likely the absence of application-layer acknowledgements. These give ACKs packets to piggyback on and are essential for low-latency applications. (Answer updated.)
  • David Schwartz
    David Schwartz about 12 years
    Application-layer acknowledgements would have solved this problem too (by getting the ACKs back faster), and that wouldn't require any system-level tuning. (If you can always tune the systems this software runs on, then this is simpler. But if you can't, you should add application-layer acknowledgements.) Nice detective work, by the way.
  • jlafaye
    jlafaye about 12 years
    Now that you mention the detective work: I voluntarily excluded some parts of the investigation, including running a kernel under KVM and adding gdb watchpoints on the congestion window size value.
  • Tim
    Tim almost 10 years
    @DavidSchwartz: Although application-layer acks would have solved the OP's problem, they won't solve all instances of this problem. I had this happen using two simplex ActiveMQ channels where channels were used alternately with one channel always idle while the other is used. By the time the next message was sent on a given channel, the congestion window had already closed due to the idle period, and no amount of application-level acking would have changed it. So your suggestion is a good one, but not applicable to all situations (whereas the OP's solution is applicable in all cases).