Mysterious “fragmentation required” rejections from gateway VM


Solution 1

I finally got to the bottom of this. It turned out to be an issue with VMware's implementation of TCP segmentation offloading in the virtual NIC of the target server.

The server's TCP/IP stack would send one large block along to the NIC, with the expectation that the NIC would break this into TCP segments restricted to the link's MTU. However, VMware decided to leave this in one large segment until - well, I'm not sure when.

It seems it actually remained one large segment when it reached the gateway VM's TCP/IP stack, which is what elicited the rejection.
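
(Incidentally, this deferral is easy to observe on a Linux sender: a local capture taps the stack above the point where the NIC would do the splitting, so with TSO enabled you can see "segments" far larger than any legal Ethernet frame. The interface name below is just a placeholder.)

    # Captured frames bigger than a full 1514-byte Ethernet frame suggest
    # segmentation is being deferred to the (virtual) NIC
    tcpdump -i eth0 -nn 'tcp and len > 1514'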

An important clue was buried in the resulting ICMP packet: the IP header of the rejected packet indicated a size of 2960 bytes - way larger than the actual packet it appeared to be rejecting. This is also exactly the size a TCP segment would be on the wire if it had combined the data from both of the segments sent thus far.
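
For anyone chasing the same clue: these rejects are easy to isolate with tcpdump on the gateway, and the verbose output prints the embedded IP header of the packet being rejected, including its claimed total length and the advertised next-hop MTU (the interface name is illustrative):

    # ICMP type 3 (destination unreachable), code 4 (fragmentation needed and DF set)
    tcpdump -i eth0 -nn -v 'icmp[icmptype] == 3 and icmp[icmpcode] == 4'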

One thing that made the issue very hard to diagnose was that the transmitted data actually was split into 1500-byte frames as far as WireShark running on another VM (connected to the same vSwitch on a separate, promiscuous port group) could see. I'm really not sure why the gateway VM saw one packet while the WireShark VM saw two. FWIW, the gateway doesn't have large receive offload enabled - I could understand if it did. The WireShark VM is running Windows 7.
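
(To check the receive-offload state on a Linux gateway like this one, ethtool will show it; eth0 stands in for whichever interface faces the servers, and the exact field names vary a little between ethtool versions.)

    # Look for large-receive-offload and generic-receive-offload in the output
    ethtool -k eth0 | grep -i receive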

I think VMware's logic in delaying the segmentation is so that if the data is to go out a physical NIC, the NIC's actual hardware offload can be leveraged. It does seem buggy, however, that it would fail to segment before sending into another VM, and inconsistently, for that matter. I've seen this behaviour mentioned elsewhere as a VMware bug.

The solution was simply to turn off TCP segmentation offloading in the target server. The procedure varies by OS but fwiw:

In Windows, open the network connection's properties, click "Configure..." beside the adapter (on the General or Networking tab, depending on version), and look on the Advanced tab of the adapter properties. On Server 2003 R2 the setting is listed as "IPv4 TCP Segmentation Offload"; on Server 2008 R2 it's "Large Send Offload (IPv4)."
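
(Had the target been a Linux guest, the rough equivalent would be the commands below - assuming the interface is eth0, and bearing in mind the change doesn't survive a reboot unless you persist it somewhere.)

    # Turn off TCP segmentation offload and its generic software counterpart
    ethtool -K eth0 tso off
    ethtool -K eth0 gso off
    # Verify
    ethtool -k eth0 | grep -i segmentation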

This solution is a bit of a kludge and could conceivably impact performance in some environments, so I'll still accept any better answer.

Solution 2

You can't drop ICMP fragmentation required messages. They're required for pMTU discovery, which is required for TCP to work properly. Please LART the firewall administrator.

By the transparency rule, a packet-filtering router acting as a firewall which permits outgoing IP packets with the Don't Fragment (DF) bit set MUST NOT block incoming ICMP Destination Unreachable / Fragmentation Needed errors sent in response to the outbound packets from reaching hosts inside the firewall, as this would break the standards-compliant usage of Path MTU discovery by hosts generating legitimate traffic. -- Firewall Requirements - RFC2979 (emphasis in original)

This is a configuration that has been recognized as fundamentally broken for more than a decade. ICMP is not optional.
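
On a Linux packet filter, for instance, letting these messages through is one rule per chain - it just has to come before whatever DROP is currently eating them (iptables shown; adapt to your firewall):

    # Permit "fragmentation needed" errors so Path MTU discovery can work
    iptables -I INPUT -p icmp --icmp-type fragmentation-needed -j ACCEPT
    iptables -I FORWARD -p icmp --icmp-type fragmentation-needed -j ACCEPT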

Solution 3

I had the same symptoms and the problem turned out to be this kernel bug: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=754294
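
(If you suspect the same thing, it's quick to check the running kernel and whether GRO is active on the interface in question - eth0 below is just an example - before trying the ethtool workaround mentioned in the comments.)

    uname -r
    ethtool -k eth0 | grep -i generic-receive-offload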


Comments

  • WookieeKushin almost 2 years

    I've been troubleshooting a severe WAN speed issue. I fixed it, but for the benefit of others:

    Via WireShark, logging, and simplifying the config I narrowed it down to some strange behaviour from a gateway doing DNAT to servers on the internal network. The gateway (a CentOS box) and servers are both running in the same VMware ESXi 5 host (and this turns out to be significant).

    Here is the sequence of events that happened - quite consistently - when I attempted to download a file from an HTTP server behind the DNAT, using a test client connected directly to the WAN side of the gateway (bypassing the actual Internet connection normally used here):

    1. The usual TCP connection establishment (SYN, SYN ACK, ACK) proceeds normally; the gateway remaps the server's IP correctly both ways.

    2. The client sends a single TCP segment with the HTTP GET and this is also DNATted correctly to the target server.

    3. The server sends a 1460 byte TCP segment with the 200 response and part of the file, via the gateway. The size of the frame on the wire is 1514 bytes - 1500 in payload. This segment should cross the gateway but doesn't.

    4. The server sends a second 1460 byte TCP segment, continuing the file, via the gateway. Again, the link payload is 1500 bytes. This segment doesn't cross the gateway either and is never accounted for.

    5. The gateway sends an ICMP Type 3 Code 4 (destination unreachable - fragmentation needed) packet back to the server, citing the packet sent in Event 3. The ICMP packet indicates the next hop MTU is 1500. This appears to be nonsensical, as the network is 1500-byte clean and the link payloads in 3 and 4 already were within the stated 1500 byte limit. The server understandably ignores this response. (Originally, ICMP had been dropped by an overzealous firewall, but this was fixed.)

    6. After a considerable delay (and in some configurations, duplicate ACKs from the server), the server decides to resend the segment from Event 3, this time alone. Apart from the IP identification field and checksum, the frame is identical to the one in Event 3. They are the same length and the new one still has the Don't Fragment flag set. However, this time, the gateway happily passes the segment on to the client - in one piece - instead of sending an ICMP reject.

    7. The client ACKs this, and the transfer continues, albeit excruciatingly slowly, since subsequent segments go through roughly the same pattern of being rejected, timing out, being resent and then getting through.
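
    (To put a number on "excruciatingly slowly", one hypothetical way to measure the effective throughput from the test client - the URL below is made up - is:)

        # Prints the average download speed in bytes/second
        curl -o /dev/null -s -w '%{speed_download}\n' http://192.0.2.10/testfile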

    The client and server work together normally if the client is moved to the LAN so as to access the server directly.

    This strange behaviour varies unpredictably based on seemingly irrelevant details of the target server.

    For instance, on Server 2003 R2, the 7 MB test file would take over 7 hours to transmit if Windows Firewall was enabled (even if it allowed HTTP and all ICMP); with Windows Firewall disabled the issue would not appear at all - paradoxically, the gateway would never send the rejection in the first place. On the other hand, on Server 2008 R2, disabling Windows Firewall had no effect whatsoever, but the transfer, while still impaired, would go much faster than on Server 2003 R2 with the firewall enabled. (I think this is because 2008 R2 uses smarter timeout heuristics and TCP fast retransmission.)

    Even more strangely, the problem would disappear if WireShark were installed on the target server. As such, to diagnose the issue I had to install WireShark on a separate VM to watch the LAN side network traffic (probably a better idea anyway for other reasons.)

    The ESXi host is version 5.0 U2.

  • WookieeKushin about 11 years
    I wholeheartedly agree: a proper config should allow the ICMP replies. I can hopefully get the local firewalls involved changed. My concern, however, is that - at least if what I've read is true - it's very common for Internet firewalls to disregard this requirement. So even if we fix this locally, it may fail in production due to outside firewalls we have no control over. In a similar setup I had to use MSS clamping so I was figuring on having to do that here too. I'm still mystified as to how some packets worked while others didn't when they were all going over the same direct link.
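
    For reference, by MSS clamping I mean the usual iptables mangle rule on the forwarding gateway (illustrative; adjust the chain and interfaces to the actual setup):

        # Clamp the TCP MSS of forwarded connections to the path MTU
        iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu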
  • David Schwartz about 11 years
    @Kevin: It is almost impossible to understand how complex, invalid arrangements will fail. It is many, many times harder than understanding how complex, valid arrangements succeed. Perhaps you can tell by fixing the firewall and then looking at what happens differently.
  • WookieeKushin about 11 years
    Thank you for the good suggestion. I did get the firewall policy changed. It wasn't the root cause (as I'd figured), but it did get more information onto the wire that was useful in finding the root cause.
  • Shade almost 10 years
    +1 using Ubuntu 12.04 with 3.2.0-67-generic. The workaround is to set GRO off: ethtool -K eth0 gro off
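
    (One way to make that setting persist across reboots on an ifupdown-managed system such as Ubuntu 12.04 - assuming the interface is configured in /etc/network/interfaces - is a post-up hook:)

        # /etc/network/interfaces - add the post-up line to the existing iface stanza
        iface eth0 inet dhcp        # (whatever the existing stanza already says)
            post-up ethtool -K eth0 gro off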