Why am I seeing retransmissions across the network using iperf3?

Solution 1

It appears that something (the NIC or the kernel?) is slowing down traffic when it's being output to the bond0 interface. In the Linux bridge (pod) case, the "NIC" is simply a veth, which (when I tested mine) peaked at around 47 Gbit/s. So when iperf3 is asked to send packets out the bond0 interface, it overruns the interface and ends up with dropped packets (though it's unclear why the drops show up on the receiving host).

I confirmed that if I apply a tc qdisc and class to slow the pod interface down to 10 Gbit/s, there is no loss when simply running iperf3 to the other pod.

# root HTB qdisc; unclassified traffic falls into the default class 1:10
tc qdisc add dev eth0 root handle 1:0 htb default 10
# cap that class at 10Gbit
tc class add dev eth0 parent 1:0 classid 1:10 htb rate 10Gbit

This was enough to ensure that an iperf3 run without a bandwidth setting didn't incur retransmissions from overrunning the NIC. I'll be looking for a way to use tc to slow down only the flows that would egress the NIC.
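
For reference, the eth0 above is the interface inside the pod's network namespace, so the tc commands have to run in that namespace. A rough sketch of one way to do that from the host, assuming Docker is the container runtime and CONTAINER_ID is just a placeholder (neither is specified in the original setup):

# look up the container's PID on the host (runtime-specific; Docker shown only as an example)
PID=$(docker inspect -f '{{.State.Pid}}' CONTAINER_ID)

# apply the shaping inside that pod's network namespace
nsenter -t $PID -n tc qdisc add dev eth0 root handle 1:0 htb default 10
nsenter -t $PID -n tc class add dev eth0 parent 1:0 classid 1:10 htb rate 10Gbit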

update: Here's how to slow down traffic for everything but the local bridged subnet.

# root HTB qdisc; unclassified traffic falls into class 1:10
tc qdisc add dev eth0 root handle 1:0 htb default 10
# class 1:5 is effectively unshaped (80Gbit); class 1:10 caps everything else at 10Gbit
tc class add dev eth0 classid 1:5 htb rate 80Gbit
tc class add dev eth0 classid 1:10 htb rate 10Gbit
# steer traffic destined for the local bridged subnet into class 1:5
tc filter add dev eth0 parent 1:0 protocol ip u32 match ip dst 10.81.18.4/24 classid 1:5
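
To confirm traffic is actually landing in the intended classes, the per-filter and per-class counters can be checked. This is just a sanity check, not part of the fix:

# show hit counters for the u32 filter (local-subnet traffic should match here)
tc -s filter show dev eth0

# show bytes/packets accounted to each class (1:5 = local subnet, 1:10 = everything else)
tc -s class show dev eth0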

Solution 2

Author of kube-router here. kube-router relies on the bridge CNI plug-in to create kube-bridge. It's standard Linux networking, nothing specifically tuned for pod networking. kube-bridge's MTU is left at the default value of 1500. We have an open bug to set it to a more sensible value.

https://github.com/cloudnativelabs/kube-router/issues/165

Do you think the errors you're seeing are due to the lower MTU?
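
To rule the MTU in or out, the bridge and bond MTUs can be compared directly. A minimal sketch; the interface names come from the question, and raising the bridge to 9000 only makes sense if every attached veth and the switch path support jumbo frames:

# check the current MTU on the bridge and the bond
ip link show kube-bridge
ip link show bond0

# align the bridge MTU with the bond (only if the rest of the path supports it)
ip link set dev kube-bridge mtu 9000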

Comments

  • dlamotte
    dlamotte over 1 year

    I'm seeing retransmissions between two pods in a kubernetes cluster I'm setting up. I'm using kube-router https://github.com/cloudnativelabs/kube-router for the networking between the hosts. Here's the setup:

    host-a and host-b are connected to the same switches and are on the same L2 network. Both connect to those switches with LACP 802.3ad bonds, and the bonds are up and functioning properly.

    pod-a and pod-b are on host-a and host-b respectively. I'm running iperf3 between the pods and see retransmissions.

    root@pod-b:~# iperf3 -c 10.1.1.4
    Connecting to host 10.1.1.4, port 5201
    [  4] local 10.1.2.5 port 55482 connected to 10.1.1.4 port 5201
    [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
    [  4]   0.00-1.00   sec  1.15 GBytes  9.86 Gbits/sec  977   3.03 MBytes
    [  4]   1.00-2.00   sec  1.15 GBytes  9.89 Gbits/sec  189   3.03 MBytes
    [  4]   2.00-3.00   sec  1.15 GBytes  9.90 Gbits/sec   37   3.03 MBytes
    [  4]   3.00-4.00   sec  1.15 GBytes  9.89 Gbits/sec  181   3.03 MBytes
    [  4]   4.00-5.00   sec  1.15 GBytes  9.90 Gbits/sec    0   3.03 MBytes
    [  4]   5.00-6.00   sec  1.15 GBytes  9.90 Gbits/sec    0   3.03 MBytes
    [  4]   6.00-7.00   sec  1.15 GBytes  9.88 Gbits/sec  305   3.03 MBytes
    [  4]   7.00-8.00   sec  1.15 GBytes  9.90 Gbits/sec   15   3.03 MBytes
    [  4]   8.00-9.00   sec  1.15 GBytes  9.89 Gbits/sec  126   3.03 MBytes
    [  4]   9.00-10.00  sec  1.15 GBytes  9.86 Gbits/sec  518   2.88 MBytes
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval           Transfer     Bandwidth       Retr
    [  4]   0.00-10.00  sec  11.5 GBytes  9.89 Gbits/sec  2348             sender
    [  4]   0.00-10.00  sec  11.5 GBytes  9.88 Gbits/sec                  receiver
    
    iperf Done.
    

    The catch here that I'm trying to debug is that I don't see retransmissions when I run the same iperf3 across host-a and host-b directly (not over the bridge interface that kube-router creates). So, the network path looks something like this:

    pod-a -> kube-bridge -> host-a -> L2 switch -> host-b -> kube-bridge -> pod-b
    

    Removing the kube-bridge from the equation results in zero retransmissions. I have tested host-a to pod-b and seen the same retransmissions.

    I have been running dropwatch and seeing the following on the receiving host (the iperf3 server by default):

    % dropwatch -l kas
    Initalizing kallsyms db
    dropwatch> start
    Enabling monitoring...
    Kernel monitoring activated.
    Issue Ctrl-C to stop monitoring
    2 drops at ip_rcv_finish+1f3 (0xffffffff87522253)
    1 drops at sk_stream_kill_queues+48 (0xffffffff874ccb98)
    1 drops at __brk_limit+35f81ba4 (0xffffffffc0761ba4)
    16991 drops at skb_release_data+9e (0xffffffff874c6a4e)
    1 drops at tcp_v4_do_rcv+87 (0xffffffff87547ef7)
    1 drops at sk_stream_kill_queues+48 (0xffffffff874ccb98)
    2 drops at ip_rcv_finish+1f3 (0xffffffff87522253)
    1 drops at sk_stream_kill_queues+48 (0xffffffff874ccb98)
    3 drops at skb_release_data+9e (0xffffffff874c6a4e)
    1 drops at sk_stream_kill_queues+48 (0xffffffff874ccb98)
    16091 drops at skb_release_data+9e (0xffffffff874c6a4e)
    1 drops at __brk_limit+35f81ba4 (0xffffffffc0761ba4)
    1 drops at tcp_v4_do_rcv+87 (0xffffffff87547ef7)
    1 drops at sk_stream_kill_queues+48 (0xffffffff874ccb98)
    2 drops at skb_release_data+9e (0xffffffff874c6a4e)
    8463 drops at skb_release_data+9e (0xffffffff874c6a4e)
    2 drops at skb_release_data+9e (0xffffffff874c6a4e)
    2 drops at skb_release_data+9e (0xffffffff874c6a4e)
    2 drops at tcp_v4_do_rcv+87 (0xffffffff87547ef7)
    2 drops at ip_rcv_finish+1f3 (0xffffffff87522253)
    2 drops at skb_release_data+9e (0xffffffff874c6a4e)
    15857 drops at skb_release_data+9e (0xffffffff874c6a4e)
    1 drops at sk_stream_kill_queues+48 (0xffffffff874ccb98)
    1 drops at __brk_limit+35f81ba4 (0xffffffffc0761ba4)
    7111 drops at skb_release_data+9e (0xffffffff874c6a4e)
    9037 drops at skb_release_data+9e (0xffffffff874c6a4e)
    

    The sending side sees drops too, but nothing like the amounts we are seeing here (1-2 max per line of output, which I hope is normal).

    Also, I'm using 9000 MTU (on the bond0 interface to the switch and on the bridge).

    I'm running CoreOS Container Linux Stable 1632.3.0. Linux hostname 4.14.19-coreos #1 SMP Wed Feb 14 03:18:05 UTC 2018 x86_64 GNU/Linux

    Any help or pointers would be much appreciated.

    update: tried with 1500 MTU, same result. Significant retransmissions.

    update2: it appears that iperf3 -b 10G ... yields no issues, both between pods and directly between hosts (2x 10Gbit NICs in an LACP bond). The issues arise when using iperf3 -b 11G between pods, but not between hosts. Is iperf3 being smart about the NIC speed on the hosts but unable to do so on the local bridged veth?

  • dlamotte
    dlamotte about 6 years
    See my first update in the original question, and my answer as well; I changed the default MTU from 1500 to 9000. I tested both 1500 and 9000 MTU with the same result. At this point, I'm not sure which networking provider actually fixes the underlying problem. If I had to guess, the problem is that the pod's interface is "too fast" on the bridge network, which isn't unique to kube-router.
  • fixer1234
    fixer1234 over 5 years
    Welcome to Super User. It isn't clear how this answers the question. The site is a knowledge base of solutions and relies on answers being solutions to what was asked in the question. General discussion that might be helpful, but tangential, like the first paragraph, can be posted in a comment with a little more rep. The second paragraph, about your own experience, might be something that could go in a new question, if you are seeking an answer, but is discouraged in our Q&A format.