NFS mount "hanging" when accessing from a server on a different subnet


Solution 1

Just wanted to update on this in case anyone runs into the same problem.

Essentially it comes down to the state rules in pf. By default pf keeps state and uses S/SA as the flag mask. However, it seems that the NFS server implementation on OS X attempts to start a conversation back to the client using a non-standard set of TCP flags. This was causing the connection to fail because pf simply dropped those packets. I gathered this by running tcpdump on both the lan and farm interfaces. After tweaking the state flags on the rules between the subnets, the connection was established correctly.
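For illustration only, a relaxed rule along these lines loosens the default S/SA mask so pf will create state on packets that don't begin with a plain SYN (the macros $lan_if, $nfs_server, and $farm_net are hypothetical placeholders, not from the actual config):

```
# default "keep state" is equivalent to "flags S/SA keep state";
# "flags any" tells pf to create state regardless of TCP flags
pass in on $lan_if proto tcp from $nfs_server to $farm_net flags any keep state
```

Note that relaxing the flag mask weakens pf's protection against spoofed non-SYN packets, so it's best scoped to the specific hosts involved.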

However, pf seemed to continue filtering out some packets due to some other form of internal normalization, and no amount of tweaking the options that I tried managed to get it to work.

In the end, I created another interface on the file server and placed it directly on the farm vlan, bypassing the router altogether.

Solution 2

I haven't used pf, but I think it was one of the first stateful filters. Maybe it's tracking the 'connections' and dropping them?

I'd look for any state-dependent filter rule. In Linux's iptables usually the filter starts with a

ACCEPT all state RELATED,ESTABLISHED

because that way it won't have to recheck all the relevant rules for every packet after the first one. But since NFS is UDP-based and tolerates long periods of silence (even hours), maybe the router is losing the ESTABLISHED state, and the new packets aren't valid as the start of a connection.
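In iptables terms, that stateful shortcut looks something like the following (a generic sketch, not the poster's actual ruleset):

```
# accept packets belonging to connections the firewall already tracks
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# anything else must match an explicit rule further down the chain
```

If the state table entry times out during a long idle period, subsequent packets fall through to the later rules and may be dropped.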

Check if there's a 'keepalive' parameter to make the client send heartbeat packets after a minute or so of silence. If not, try NFS over TCP, which does have keepalive packets.
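Forcing NFS over TCP on a Linux client would look roughly like this (server name and paths are placeholders; the exact option spelling depends on the nfs-utils version):

```
# mount the export over TCP instead of UDP
mount -t nfs -o proto=tcp,hard,intr barstar.lan.foo.com:/export/home /home
```

The same can be made persistent by adding proto=tcp to the options field of the corresponding fstab entry.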

Solution 3

The first thing to do is ensure that the firewall is actually the culprit.

To do this, set your default block rules to log. On my firewalls, that's two lines at the top of the filter rules:

block in log
block out log

Wait for the NFS mount to hang again and check your log interface:

tcpdump -eeni pflog0 'host <client ip> and host <nfs server ip>'

If you're seeing these packets blocked at the firewall, please post your pf.conf. If not, we need to start looking beyond the firewall.
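If nothing shows up on pflog0, it can also help to inspect pf's state table directly to see whether a state exists for the stuck connection, and whether the state-mismatch counter is climbing (IP is a placeholder):

```
# list current states involving the NFS server
pfctl -ss | grep <nfs server ip>
# show global counters, including "state-mismatch" drops
pfctl -si
```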


Updated on September 17, 2022

Comments

  • Rahim over 1 year

    Here's a problem which I am at a loss to diagnose:

    Our user home directories are served via NFS from an Apple XServe running Mac OS X 10.5.7. Normally they are exported to our default office subnet, "lan". Recently I have been building a new subnet, "farm". The computers on "farm" run the same OS (openSUSE 11.1, and Gentoo) as the ones on "lan", and the software versions are the same.

    The problem is that when my users have been using a machine on "farm" for some time (5 minutes, sometimes 30, sometimes a full hour) the NFS mount seems to just hang. Attempting to do an ls on the directory or anything else (such as a login, etc) that tries to access the user home directory just gets stuck. Mounts to other NFS servers from the "hung" machine seem to work as expected.

    There is nothing in the logs of either the client or the server that indicates any problem. The same types of clients work just fine from the default "lan" subnet.

    I've tried all sorts of different configurations of the NFS server and client (disabling/enabling kerberos, different mount options) but nothing appears to make any difference.

    I'm strongly suspecting some network-level problems between these two subnets, perhaps some mangling by firewall/router (OpenBSD with pf as the packet filter). The connection between the two sets of machines is fairly simple: x serve --> switch --> router --> switch --> clients

    I'm pretty much at a loss as to debugging methods to try next, or what the possible solution may be. Any ideas as to how to approach this problem from this point?

    Update:

    Still haven't been able to resolve this. I thought I had nipped it in the bud when I disabled scrub on the internal interfaces, but the problem has manifested itself again. What's strange is that pf seems to still be modifying some packets.
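For reference, one way to disable normalization selectively in an older pf.conf is to scrub only on the external interface, leaving the internal vlans untouched ($ext_if is a placeholder macro; exact syntax varies with the pf version):

```
# normalize (reassemble fragments) only on the external interface,
# so traffic routed between the lan and farm vlans is not rewritten
scrub in on $ext_if all fragment reassemble
```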

    An example conversation, on the farm vlan side:

    09:17:39.165860 node001.farm.foo.com.769 > barstar.lan.foo.com.nfsd: S 2887472382:2887472382(0) win 5840 <mss 1460,sackOK,timestamp 236992843 0,nop,wscale 6> (DF)
    09:17:39.166124 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: . ack 43 win 65535 <nop,nop,timestamp 316702204 236992843> (DF)
    09:17:54.164490 node001.farm.foo.com.769 > barstar.lan.foo.com.nfsd: S 2887472385:2887472385(0) win 5840 <mss 1460,sackOK,timestamp 236996593 0,nop,wscale 6> (DF)
    09:17:54.164760 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: R 1441270809:1441270809(0) ack 43 win 65535 (DF)
    09:17:54.164776 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: R 4243886205:4243886205(0) ack 46 win 0 (DF)
    09:17:54.164989 node001.farm.foo.com.769 > barstar.lan.foo.com.nfsd: S 2887472388:2887472388(0) win 5840 <mss 1460,sackOK,timestamp 236996593 0,nop,wscale 6> (DF)
    09:17:57.164066 node001.farm.foo.com.769 > barstar.lan.foo.com.nfsd: S 2887472388:2887472388(0) win 5840 <mss 1460,sackOK,timestamp 236997343 0,nop,wscale 6> (DF)
    09:17:57.164330 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: . ack 49 win 65535 <nop,nop,timestamp 316702384 236997343> (DF)
    09:18:03.163468 node001.farm.foo.com.769 > barstar.lan.foo.com.nfsd: S 2887472388:2887472388(0) win 5840 <mss 1460,sackOK,timestamp 236998843 0,nop,wscale 6> (DF)
    09:18:03.163732 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: . ack 49 win 65535 <nop,nop,timestamp 316702444 236998843> (DF)
    

    and the same on the lan vlan:

    09:17:39.165876 node001.farm.foo.com.769 > barstar.lan.foo.com.nfsd: S 2887472382:2887472382(0) win 5840 <mss 1460,sackOK,timestamp 236992843 0,nop,wscale 6> (DF)
    09:17:39.166110 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: . ack 1 win 65535 <nop,nop,timestamp 316702204 236992843> (DF)
    09:17:54.164505 node001.farm.foo.com.769 > barstar.lan.foo.com.nfsd: S 2887472385:2887472385(0) win 5840 <mss 1460,sackOK,timestamp 236996593 0,nop,wscale 6> (DF)
    09:17:54.164740 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: R 1:1(0) ack 1 win 65535 (DF)
    09:17:54.164745 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: R 2802615397:2802615397(0) ack 4 win 0 (DF)
    09:17:54.165003 node001.farm.foo.com.769 > barstar.lan.foo.com.nfsd: S 2887472388:2887472388(0) win 5840 <mss 1460,sackOK,timestamp 236996593 0,nop,wscale 6> (DF)
    09:17:54.165239 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: S 449458819:449458819(0) ack 2887472389 win 65535 <mss 1460,nop,wscale 3,nop,nop,timestamp 316702354 236996593,sackOK,eol> (DF)
    09:17:55.123665 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: S 449458819:449458819(0) ack 2887472389 win 65535 <mss 1460,nop,wscale 3,nop,nop,timestamp 316702363 236996593,sackOK,eol> (DF)
    09:17:57.124839 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: S 449458819:449458819(0) ack 2887472389 win 65535 <mss 1460,nop,wscale 3,nop,nop,timestamp 316702383 236996593,sackOK,eol> (DF)
    09:17:57.164082 node001.farm.foo.com.769 > barstar.lan.foo.com.nfsd: S 2887472388:2887472388(0) win 5840 <mss 1460,sackOK,timestamp 236997343 0,nop,wscale 6> (DF)
    09:17:57.164316 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: . ack 1 win 65535 <nop,nop,timestamp 316702384 236997343> (DF)
    09:18:01.126690 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: S 449458819:449458819(0) ack 2887472389 win 65535 <mss 1460,nop,wscale 3,nop,nop,timestamp 316702423 236997343,sackOK,eol> (DF)
    09:18:03.163483 node001.farm.foo.com.769 > barstar.lan.foo.com.nfsd: S 2887472388:2887472388(0) win 5840 <mss 1460,sackOK,timestamp 236998843 0,nop,wscale 6> (DF)
    09:18:03.163717 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: . ack 1 win 65535 <nop,nop,timestamp 316702444 236998843> (DF)
    

    I should also mention that we have other NFS traffic going through this same machine, but from a different NFS server. We've been using that for years and have not had any problems there. Similarly, these XServes have been serving NFS to Linux clients on their own subnet for a long while as well and continue to do so.

  • Matt Simmons almost 15 years
    Agreed. My first thought was that something somewhere has been killing the connection, so look to the router.