Why is our firewall (Ubuntu 8.04) rejecting the final packet (FIN, ACK, PSH) with a RST

17,301

Solution 1

A similar situation is described at http://www.spinics.net/lists/netfilter/msg51408.html: some packets which should have been processed by NAT somehow got marked as INVALID instead of ESTABLISHED, and went to the INPUT chain. You should add some rules with -m state --state INVALID to check for this, and the answer at http://www.spinics.net/lists/netfilter/msg51409.html suggests that such INVALID packet should always be DROPped, because NAT is not performed on them properly, therefore addresses in them may be wrong.

If your problematic packets are really marked as INVALID, adding iptables -I INPUT -m state --state INVALID -j DROP probably will work around the problem (the broken packet will not get to the local process and will not cause the RST response, then TCP will recover from the lost packet after a timeout). Then you can try to debug the problem further, as described in http://www.spinics.net/lists/netfilter/msg51411.html:

echo 255 >/proc/sys/net/netfilter/nf_conntrack_log_invalid

(In that particular case the problem was caused by some broken networking hardware along the path, probably combined with some TCP checksum offload brokenness.)

Solution 2

I have seen this behavior on other firewall types and the behavior was so identical I figured I'd throw it out there.

The issue I had was that the firewall was NAT'ing into the same space as ephemeral ports on the box. This would cause this exact behavior if the two collided because the kernel was now assuming the connection was meant for the local machine. To this end there are a couple things you can check. First are you specifying the outbound port config in iptables (using --to-ports)? Or have you modified the ephemeral port range on the machine:

$ cat /proc/sys/net/ipv4/ip_local_port_range

To diagnose you can setup your capture and see if you see any other requests using the same external fw ip,port combo within 3*MSL time before the RST(~180s I think).

While I'm not confident that is the answer yet, if I were in this situation I would rule that out first and then look at a couple other things.

Is this easy to reproduce? Is it possible to capture more diagnostics from the firewall box and see the problem occur? I would try to capture:

$ netstat -anp
$ cat /proc/net/ip_conntrack

every second or so while trying to reproduce and see if there is something binding locally to the port and what the masquerade table looked like during the problem.

If you firewall the RST outbound does the eventual ACK from the internal client cause the connection to succeed?

Last thing, are you seeing all the logs? Have you already checked dmesg? Have you setup *.* on the firewall box in the syslog configuration to a file to make sure?

Let me know what you find! I really appreciate the amount of information you provided in the question, thanks.

Share:
17,301

Related videos on Youtube

ernelli
Author by

ernelli

Currently employed as systems architect, has been working with software and hardware development for 15 years. The interface is key

Updated on September 18, 2022

Comments

  • ernelli
    ernelli over 1 year

    Background, for a long time we have had problems with our firewall that sometimes keep HTTP requests hanging partially loaded until TCP times out.

    After tracing the traffic on the firewall I noticed that it occurs only during certain timing conditions, e.g when the webserver has sent the entire response before the client has sent its second ACK on the payload. [SYN, SYN/ACK, ACK] has been exchanged, REQUEST has been sent and ACK'ed and the first RESPONSE packet has been received and ACK'ed, then the webserver sends the rest of the response body in one shot (8 packets including the last FIN, PSH) and before the client has ACK'ed any of those, the Firewall REJECTS with a RST towards the webserver and keeps the client hanging infinite.

    Here is the entire wireshark trace with packets from both sides of the firewall. 192.168.126.161 is the private NAT'et IP address of the client. 172.16.1.2 is the webserver IP (not showing the real public IP) and 10.1.1.1 is the firewall external IP (not showing the real public IP)

    2105 0.086275 192.168.126.161  172.16.1.2       TCP 37854 > http [SYN] Seq=0 Win=5840 Len=0 MSS=1460 SACK_PERM=1 TSV=89375083 TSER=0
    2106 0.000066 10.1.1.1         172.16.1.2       TCP 37854 > http [SYN] Seq=0 Win=5840 Len=0 MSS=1460 SACK_PERM=1 TSV=89375083 TSER=0
    2107 0.002643 172.16.1.2       10.1.1.1         TCP http > 37854 [SYN, ACK] Seq=0 Ack=1 Win=32768 Len=0 MSS=1460
    2108 0.007705 172.16.1.2       192.168.126.161  TCP http > 37854 [SYN, ACK] Seq=0 Ack=1 Win=32768 Len=0 MSS=1460
    2109 0.006301 192.168.126.161  172.16.1.2       TCP 37854 > http [ACK] Seq=1 Ack=1 Win=5840 Len=0
    2110 0.000025 10.1.1.1         172.16.1.2       TCP 37854 > http [ACK] Seq=1 Ack=1 Win=5840 Len=0
    2111 0.000007 192.168.126.161  172.16.1.2       HTTP GET /test/style.css HTTP/1.1 
    2112 0.000015 10.1.1.1         172.16.1.2       HTTP GET /test/style.css HTTP/1.1 
    2113 0.001536 172.16.1.2       10.1.1.1         TCP http > 37854 [ACK] Seq=1 Ack=111 Win=32658 Len=0
    2114 0.000014 172.16.1.2       192.168.126.161  TCP http > 37854 [ACK] Seq=1 Ack=111 Win=32658 Len=0
    2115 0.002274 172.16.1.2       10.1.1.1         HTTP HTTP/1.1 200 OK  (text/css)
    2116 0.000025 172.16.1.2       192.168.126.161  HTTP HTTP/1.1 200 OK  (text/css)
    2117 0.005689 192.168.126.161  172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=1461 Win=8760 Len=0
    2118 0.000024 10.1.1.1         172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=1461 Win=8760 Len=0
    2119 0.001536 172.16.1.2       10.1.1.1         HTTP Continuation or non-HTTP traffic
    2120 0.000026 172.16.1.2       192.168.126.161  HTTP Continuation or non-HTTP traffic
    2121 0.000007 172.16.1.2       10.1.1.1         HTTP Continuation or non-HTTP traffic
    2122 0.000023 172.16.1.2       192.168.126.161  HTTP Continuation or non-HTTP traffic
    2123 0.000313 172.16.1.2       10.1.1.1         HTTP Continuation or non-HTTP traffic
    2124 0.000030 172.16.1.2       192.168.126.161  HTTP Continuation or non-HTTP traffic
    2125 0.000007 172.16.1.2       10.1.1.1         HTTP Continuation or non-HTTP traffic
    2126 0.000023 172.16.1.2       192.168.126.161  HTTP Continuation or non-HTTP traffic
    2127 0.000009 172.16.1.2       10.1.1.1         HTTP Continuation or non-HTTP traffic
    2128 0.000023 172.16.1.2       192.168.126.161  HTTP Continuation or non-HTTP traffic
    2129 0.001108 172.16.1.2       10.1.1.1         HTTP Continuation or non-HTTP traffic
    2130 0.000035 172.16.1.2       192.168.126.161  HTTP Continuation or non-HTTP traffic
    2131 0.000008 172.16.1.2       10.1.1.1         HTTP Continuation or non-HTTP traffic
    2132 0.000022 172.16.1.2       192.168.126.161  HTTP Continuation or non-HTTP traffic
    2133 0.000007 172.16.1.2       10.1.1.1         HTTP Continuation or non-HTTP traffic
    REJECT-->
    2134 0.000089 10.1.1.1         172.16.1.2       TCP 37854 > http [RST] Seq=111 Win=0 Len=0
    CLIENT FIRST ACK-->
    2135 0.002421 192.168.126.161  172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=2921 Win=11680 Len=0
    2136 0.000033 10.1.1.1         172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=2921 Win=11680 Len=0
    2137 0.000007 192.168.126.161  172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=4381 Win=14600 Len=0
    2138 0.000014 10.1.1.1         172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=4381 Win=14600 Len=0
    2139 0.000008 192.168.126.161  172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=5841 Win=17520 Len=0
    2140 0.000014 10.1.1.1         172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=5841 Win=17520 Len=0
    2141 0.000007 192.168.126.161  172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=7301 Win=20440 Len=0
    2142 0.000013 10.1.1.1         172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=7301 Win=20440 Len=0
    2143 0.000007 192.168.126.161  172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=8761 Win=23360 Len=0
    2144 0.000015 10.1.1.1         172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=8761 Win=23360 Len=0
    2145 0.000007 192.168.126.161  172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=10221 Win=26280 Len=0
    2146 0.000013 10.1.1.1         172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=10221 Win=26280 Len=0
    2147 0.001059 192.168.126.161  172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=11681 Win=29200 Len=0
    2148 0.000018 10.1.1.1         172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=11681 Win=29200 Len=0
    

    I have been digging and logging the packet traversal according to this chart and it seems that the last incoming packet 2133 gets past raw-PREROUTING, conntrack, mangle-PREROUTING but then is lost. I have no REJECT rules in my iptables, I log all DROP rules and none of them show where packet 2133 gets lost.

    I would love to use the TRACE target on the incoming filter, but unfortunately ubuntu 8.04 is not shipped with support for the TRACE target.

    So I believe that some internal implicit routing/conntrack/mangling rules applies that for some reason resets the connection. Maybe the traffic trigger some DOS protection, but I have no idea where to configure/analyse that. The most frustrating thing is that a packet is rejected and nothing gets logged...

    Also requesting this file works 100% from windows hosts, but it fails on certain Linux hosts and 99.9% of all requests get through but sometimes the timing of the packets triggers this behaviour in our firewall.

    EDIT Ok, now I have added tons of logging in iptables and it seems like the following happens (still dont know why!)

    For packets successfully traversing the firewall the following steps are taken, table/step references from here

    Table 3-3 step
    
    2     raw-pre
          conntrack
    3     mangle-pre
    4     [nat-pre]
    5     routing-decision -> destination forward
    6     mangle-fwd
    7     filter-fwd
    8     mangle-post
    9     [nat-post]
    

    Packet 2133 that gets rejected traverses these steps:

    Table 3-1 steps for the incoming FIN,ACK packet 2133
    2     raw-pre
          conntrack
    3     mangle-pre
    4     [nat-pre]
    5     routing-decision -> destination local
    6     mangle-input
    7     filter-input
    8     local process emits RST -> webserver
    
    Table 3-2 steps for the outgoing RST packet 2134 in response to 2133
    1     raw-out
    2     routing decision
          conntrack
    3     mangle-out
          reroute-check
    4     [nat-out]
    5     filter-out
    6     mangle-post
    7     nat-post
    

    The thing that is strange is that the routing decision for packet 2133 in step 5 now is different than the routing decision for the other packets. When analysing requests that works, e.g does not get stuck, even the last FIN gets properly routed. It seems like a bug in the kernel or that the routing decision is stateful in some way.

    EDIT

    One thing that could cause these problems is the following fact, the traffic is routed between the firewall and the local LAN, so the client LAN is not directly connected to the firewall through L2.

                    +---------------------------+       +------------------+                         +------------------------+
                    |                           |       |      Router      |   (   Lab network    )  |                        |
    ( Internet ) -- + eth1                 eth0 +-------+                  +-- (                  ) -+ Client 192.168.126.161 |
                    | 10.1.1.1   192.168.60.254 |       |                  |   ( 192.168.126.0/24 )  |                        |
                    +---------------------------+       +------------------+                         +------------------------+
    

    In this picture, 10.1.1.1 represents the external IP address of the firewall, all other addresses are the real IP addresses used.

    Here is the routing table on the firewall:

    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
    10.1.1.0        0.0.0.0         255.255.255.240 U     0      0        0 eth1
    192.168.126.0   0.0.0.0         255.255.255.0   U     0      0        0 eth0
    192.168.60.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0
    0.0.0.0         10.1.1.15       0.0.0.0         UG    0      0        0 eth1
    

    Note that 10.1.1.0 and default gw 10.1.1.15 are made up, the rest is exactly the same as used. I had to manually add the 192.168.126.0/24 route to reach the lab network from eth0 (192.168.60.254).

    Here are some extensive logs on the packet traversal for the last packet 2133 that gets rejected due to being routed to the local host (e.g the firewall).

    [16406874.374588] raw pre IN=eth1 OUT= MAC=00:02:b3:b9:ff:b5:00:90:1a:10:06:88:08:00 SRC=172.16.1.2 DST=10.1.1.1 LEN=1004 TOS=0x00 PREC=0x00 TTL=55 ID=13739 DF PROTO=TCP SPT=80 DPT=53497 WINDOW=5840 RES=0x00 ACK PSH FIN URGP=0 
    [16406874.374625] mangle pre IN=eth1 OUT= MAC=00:02:b3:b9:ff:b5:00:90:1a:10:06:88:08:00 SRC=172.16.1.2 DST=10.1.1.1 LEN=1004 TOS=0x00 PREC=0x00 TTL=55 ID=13739 DF PROTO=TCP SPT=80 DPT=53497 WINDOW=5840 RES=0x00 ACK PSH FIN URGP=0 
    [16406874.374667] mangle in IN=eth1 OUT= MAC=00:02:b3:b9:ff:b5:00:90:1a:10:06:88:08:00 SRC=172.16.1.2 DST=10.1.1.1 LEN=1004 TOS=0x00 PREC=0x00 TTL=55 ID=13739 DF PROTO=TCP SPT=80 DPT=53497 WINDOW=5840 RES=0x00 ACK PSH FIN URGP=0 
    [16406874.374699] filter in IN=eth1 OUT= MAC=00:02:b3:b9:ff:b5:00:90:1a:10:06:88:08:00 SRC=172.16.1.2 DST=10.1.1.1 LEN=1004 TOS=0x00 PREC=0x00 TTL=55 ID=13739 DF PROTO=TCP SPT=80 DPT=53497 WINDOW=5840 RES=0x00 ACK PSH FIN URGP=0 
    [16406874.374780] mangle out IN= OUT=eth1 SRC=10.1.1.1 DST=172.16.1.2 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=53497 DPT=80 WINDOW=0 RES=0x00 RST URGP=0 
    [16406874.374807] mangle post IN= OUT=eth1 SRC=10.1.1.1 DST=172.16.1.2 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=53497 DPT=80 WINDOW=0 RES=0x00 RST URGP=0 
    [16406874.378813] mangle pre IN=eth0 OUT= MAC=00:02:b3:b9:ff:b4:00:90:1a:10:0c:dd:08:00 SRC=192.168.126.161 DST=172.16.1.2 LEN=40 TOS=0x00 PREC=0x00 TTL=63 ID=35424 DF PROTO=TCP SPT=53497 DPT=80 WINDOW=11680 RES=0x00 ACK URGP=0 
    [16406874.378863] mangle fwd IN=eth0 OUT=eth1 SRC=192.168.126.161 DST=172.16.1.2 LEN=40 TOS=0x00 PREC=0x00 TTL=62 ID=35424 DF PROTO=TCP SPT=53497 DPT=80 WINDOW=11680 RES=0x00 ACK URGP=0 
    

    Once again, our fw external IP has been replaced with 10.1.1.1 and the ip of webserver outside the NAT'ed network is replaced with 172.16.1.2

    EDIT Breaking News!

    Ok last try was to DROP the RST packet, very very interesting, I added an iptables rule that dropped all RST packets destinated to the webserver we have problem requesting files from. And then it worked e.g the last FIN, ACK, PSH packet 2133 in the log above is dropped, but since the RST is dropped the webserver has time to get all the ACK's ant then decides to retransmitt the last packet, packet 2133 once again, and now it goes through the firewall since the contrack module now has seen ACK's coming back from the client and allows the last ACK, FIN packet with the final payload.

    So it is definitely a timing/window problem, this particular file, with the timing of the ACK's from the client, triggers something in conntrack that rejects the final packet from the webserver.

    So far, googling and reading Kernel doc's reveal nothing that can cause this behaviour, next step will be to read the kernel source code for the routing/conntrack module.

    PROBLEM SOLVED

    Well, at least now we know exactly what happens and have a workaround that solves the problem.

    Sergey pointed to the very valuable -m state --state INVALID matching rule which helped a lot in the debugging, I now realize that an iptables setup without an explicit rule for INVALID packets is not complete so strange behavior seems to happen sometimes.

    When enabling logging in the conntrack module for what causes the invalid packet, what happens its pretty obvious and I had my suspicion regarding this.

    [16659529.322465] nf_ct_tcp: SEQ is over the upper bound (over the window of the receiver) IN= OUT= SRC=172.16.1.2 DST=10.1.1.1 LEN=1004 TOS=0x00 PREC=0x00 TTL=55 ID=40874 DF PROTO=TCP SPT=80 DPT=55498 SEQ=658735108 ACK=1194081763 WINDOW=5840 RES=0x00 ACK PSH FIN URGP=0 
    

    Once again, 172.16.1.2 is the external webserver (which behaves incorrect) and 10.1.1.1 is the external address of the firewall.

    The webserver pushes more data over the wire than the client has advertised in the receive window (conntrack is state-full and verifies this), it seems that it is when the FIN packet arrives that conntrack bails out since the receive window is actually exceeded much earlier.

    I believe that it might be caused by incorrect TCP offloading in the network card on the webserver. When I started to analyse this I took captures on the webserver and according to the tcpdump/wireshark traces jumbo frames were written by the TCP layer in the kernel which was then segmented into smaller frames with MTU=1500 by the network card. So obviously this needs to be adressed in the webserver since it is not correct TCP behaviour to send more data than the receiver has adverties in its receive window.

    Both Polynomial and Sergey provided valuable input, but Sergey pointed me to the exact behavior of the conntrack/NAT module regarding packet traversal.

    • Ladadadada
      Ladadadada over 12 years
      Do you have any REJECT statements in your iptables config ? If so, see if you can use logging to figure out which rule it is.
    • ernelli
      ernelli over 12 years
      Nope, no reject rules it seems like the REJECT happens outside iptablesm it happens during the routing decision
    • ernelli
      ernelli over 12 years
      When it works, apart from all packets being forwarded correctly to the client and no RST packets being returned, the client sends ACK's before the final FIN from the server. I have examined a dump from a large file (300k) and then there is no problems. A trace from a small file also works, final packet with FIN is forwarded. Can you please elaborate on "firewall doesn't support sliding windows"
    • polynomial
      polynomial over 12 years
      Can you expand on the connection between the router and the lab client? You have /24 netmasks on the 60 and 126 so it isn't clear how the lab clients are able to send traffic out? Is their netmask not a /24? Is some sort of proxy arp going on? Is there an alias on eth0:1 for the 126.0/24?
  • ernelli
    ernelli over 12 years
    Thanks for your effort, I can reproduce the error 100%, all requests through the firewall works except very few. Those who dont work seem to hit a precise size/timing issue. I can add multiple Wireshark traces for bigger/smaller requests between same client/webserver through the firewall which all works, but the trace above is the one that gets stuck. I have added new info above which might sched som light on the problem.
  • ernelli
    ernelli over 12 years
    /proc/sys/net/ipv4/ip_local_port_range is set to [32768 61000] netstat does not show any locally bound ports. ip_conntrack shows the connection as establised, e.g belives that it is still open since the last FIN has not been forwarded to the client. The status is packets=10 bytes=12084 [ASSURED] mark=0 secmark=0 use=1