NFS v3 versus v4
NFS 4.1 (minor version 1) is designed to be a faster and more efficient protocol and is recommended over previous versions, especially NFS 4.0.
Its improvements include client-side caching and, although not relevant in this scenario, parallel NFS (pNFS). The major change is that the protocol is now stateful.
http://www.netapp.com/us/communities/tech-ontap/nfsv4-0408.html
Judging by NetApp's performance documentation, it appears to be the recommended protocol for NetApp filers. The delegation technology is similar to opportunistic locking in Windows Vista and later.
NFSv4 differs from previous versions of NFS by allowing a server to delegate specific actions on a file to a client to enable more aggressive client caching of data and to allow caching of the locking state. A server cedes control of file updates and the locking state to a client via a delegation. This reduces latency by allowing the client to perform various operations and cache data locally. Two types of delegations currently exist: read and write. The server has the ability to call back a delegation from a client should there be contention for a file.

Once a client holds a delegation, it can perform operations on files whose data has been cached locally to avoid network latency and optimize I/O. The more aggressive caching that results from delegations can be a big help in environments with the following characteristics:
- Frequent opens and closes
- Frequent GETATTRs
- File locking
- Read-only sharing
- High latency
- Fast clients
- Heavily loaded server with many clients
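Whether delegations matter at all depends on which version a client actually negotiated, which `nfsstat -m` will show. The sample output below is hypothetical (your real flags will differ); it is included only to illustrate pulling out the version flag:

```shell
# Hypothetical sample of `nfsstat -m` output for the mount in this
# question; on a live client you would run `nfsstat -m` directly.
sample='/test from toto:/test
 Flags: vers=4,proto=tcp,sec=sys,hard,nointr,rsize=1048576,wsize=1048576'

# Pull out the negotiated protocol version from the Flags line.
printf '%s\n' "$sample" | grep -o 'vers=[0-9]'
```

This prints `vers=4` for the sample above, confirming the client is on v4 rather than silently falling back to v3.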
Kyle Hailey
Updated on September 18, 2022

Comments
-
Kyle Hailey, almost 2 years ago:
I am wondering why NFS v4 would be so much faster than NFS v3 and if there are any parameters on v3 that could be tweaked.
I mount a file system:

```
sudo mount -o 'rw,bg,hard,nointr,rsize=1048576,wsize=1048576,vers=4' toto:/test /test
```

and then run:

```
dd if=/test/file of=/dev/null bs=1024k
```
I can read 200-400 MB/s, but when I change the version to vers=3, remount, and rerun the dd, I only get 90 MB/s. The file I'm reading from is an in-memory file on the NFS server. Both sides of the connection are Solaris and have 10 GbE NICs. I avoid any client-side caching by remounting between all tests. I used dtrace on the server to measure how fast data is being served via NFS. For both v3 and v4 I changed nfs4_bsize and nfs3_bsize from the default 32K to 1M (on v4 I maxed out at 150 MB/s with 32K). I've tried tweaking
- nfs3_max_threads
- clnt_max_conns
- nfs3_async_clusters
to improve the v3 performance, but no go.
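For reference, these Solaris client tunables are typically set in /etc/system and take effect after a reboot. The values below are illustrative, not recommendations; the bsize values mirror the 1M used in the tests above:

```
* NFS client transfer sizes used in the tests above (1 MB)
set nfs:nfs3_bsize = 1048576
set nfs:nfs4_bsize = 1048576

* Illustrative (not recommended) values for the v3 tunables above
set nfs:nfs3_max_threads = 32
set rpcmod:clnt_max_conns = 8
set nfs:nfs3_async_clusters = 4
```

On a running system the same kernel variables can also be patched live with `mdb -kw` instead of rebooting.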
On v3, if I run four parallel dd's, the throughput goes down from 90 MB/s to 70-80 MB/s, which leads me to believe the problem is some shared resource; if so, I'm wondering what it is and whether I can increase that resource.

dtrace code to get window sizes:
```
#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option defaultargs

inline string ADDR = $$1;

dtrace:::BEGIN
{
    TITLE = 10;
    title = 0;
    printf("starting up ...\n");
    self->start = 0;
}

tcp:::send, tcp:::receive
/ self->start == 0 /
{
    walltime[args[1]->cs_cid] = timestamp;
    self->start = 1;
}

tcp:::send, tcp:::receive
/ title == 0 && ( ADDR == NULL || args[3]->tcps_raddr == ADDR ) /
{
    printf("%4s %15s %6s %6s %6s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
        "cid", "ip", "usend", "urecd", "delta", "send", "recd",
        "ssz", "sscal", "rsz", "rscal", "congw", "conthr", "flags", "retran");
    title = TITLE;
}

tcp:::send
/ ( ADDR == NULL || args[3]->tcps_raddr == ADDR ) /
{
    nfs[args[1]->cs_cid] = 1;    /* this is an NFS thread */
    this->delta = timestamp - walltime[args[1]->cs_cid];
    walltime[args[1]->cs_cid] = timestamp;
    this->flags = "";
    this->flags = strjoin((( args[4]->tcp_flags & TH_FIN  ) ? "FIN|"  : ""), this->flags);
    this->flags = strjoin((( args[4]->tcp_flags & TH_SYN  ) ? "SYN|"  : ""), this->flags);
    this->flags = strjoin((( args[4]->tcp_flags & TH_RST  ) ? "RST|"  : ""), this->flags);
    this->flags = strjoin((( args[4]->tcp_flags & TH_PUSH ) ? "PUSH|" : ""), this->flags);
    this->flags = strjoin((( args[4]->tcp_flags & TH_ACK  ) ? "ACK|"  : ""), this->flags);
    this->flags = strjoin((( args[4]->tcp_flags & TH_URG  ) ? "URG|"  : ""), this->flags);
    this->flags = strjoin((( args[4]->tcp_flags & TH_ECE  ) ? "ECE|"  : ""), this->flags);
    this->flags = strjoin((( args[4]->tcp_flags & TH_CWR  ) ? "CWR|"  : ""), this->flags);
    this->flags = strjoin((( args[4]->tcp_flags == 0      ) ? "null " : ""), this->flags);
    printf("%5d %14s %6d %6d %6d %8d \ %-8s %8d %6d %8d %8d %8d %12d %s %d \n",
        args[1]->cs_cid % 1000,
        args[3]->tcps_raddr,
        args[3]->tcps_snxt - args[3]->tcps_suna,
        args[3]->tcps_rnxt - args[3]->tcps_rack,
        this->delta / 1000,
        args[2]->ip_plength - args[4]->tcp_offset,
        "",
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        this->flags,
        args[3]->tcps_retransmit);
    this->flags = 0;
    title--;
    this->delta = 0;
}

tcp:::receive
/ nfs[args[1]->cs_cid] && ( ADDR == NULL || args[3]->tcps_raddr == ADDR ) /
{
    this->delta = timestamp - walltime[args[1]->cs_cid];
    walltime[args[1]->cs_cid] = timestamp;
    this->flags = "";
    this->flags = strjoin((( args[4]->tcp_flags & TH_FIN  ) ? "FIN|"  : ""), this->flags);
    this->flags = strjoin((( args[4]->tcp_flags & TH_SYN  ) ? "SYN|"  : ""), this->flags);
    this->flags = strjoin((( args[4]->tcp_flags & TH_RST  ) ? "RST|"  : ""), this->flags);
    this->flags = strjoin((( args[4]->tcp_flags & TH_PUSH ) ? "PUSH|" : ""), this->flags);
    this->flags = strjoin((( args[4]->tcp_flags & TH_ACK  ) ? "ACK|"  : ""), this->flags);
    this->flags = strjoin((( args[4]->tcp_flags & TH_URG  ) ? "URG|"  : ""), this->flags);
    this->flags = strjoin((( args[4]->tcp_flags & TH_ECE  ) ? "ECE|"  : ""), this->flags);
    this->flags = strjoin((( args[4]->tcp_flags & TH_CWR  ) ? "CWR|"  : ""), this->flags);
    this->flags = strjoin((( args[4]->tcp_flags == 0      ) ? "null " : ""), this->flags);
    printf("%5d %14s %6d %6d %6d %8s / %-8d %8d %6d %8d %8d %8d %12d %s %d \n",
        args[1]->cs_cid % 1000,
        args[3]->tcps_raddr,
        args[3]->tcps_snxt - args[3]->tcps_suna,
        args[3]->tcps_rnxt - args[3]->tcps_rack,
        this->delta / 1000,
        "",
        args[2]->ip_plength - args[4]->tcp_offset,
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        this->flags,
        args[3]->tcps_retransmit);
    this->flags = 0;
    title--;
    this->delta = 0;
}
```
Output looks like this (not from this particular situation):
```
  cid              ip  usend  urecd   delta   send    recd     ssz  sscal      rsz  rscal    congw  conthr      flags  retran
  320 192.168.100.186    240      0     272    240 \          49232     0  1049800      5  1049800    2896  ACK|PUSH|       0
  320 192.168.100.186    240      0     196        /     68   49232     0  1049800      5  1049800    2896  ACK|PUSH|       0
  320 192.168.100.186      0      0   27445      0 \          49232     0  1049800      5  1049800    2896  ACK|            0
   24 192.168.100.177      0      0  255562        /     52   64060     0    64240      0    91980    2920  ACK|PUSH|       0
   24 192.168.100.177     52      0     301     52 \          64060     0    64240      0    91980    2920  ACK|PUSH|       0
```
Some headers:

```
usend - unacknowledged send bytes
urecd - unacknowledged received bytes
ssz   - send window
rsz   - receive window
congw - congestion window
```
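As a sanity check on those windows: a single TCP connection's throughput is bounded by roughly window size divided by round-trip time. Plugging in the ~64 KB receive window visible on the v3 connection in the sample output, and an assumed (not measured) RTT of 0.7 ms, gives a ceiling in the same ballpark as the observed 90 MB/s:

```shell
# Window-limited throughput bound: window / RTT.
# 64240 bytes is the rwnd on the v3 connection in the sample output;
# the 0.7 ms RTT is an assumption for illustration, not measured.
awk 'BEGIN {
    window = 64240        # bytes
    rtt    = 0.0007       # seconds (assumed)
    printf "%.0f MB/s\n", window / rtt / 1048576
}'
```

By contrast, the v4 connection's ~1 MB window (rsz 1049800) would allow well over 1 GB/s at the same assumed RTT, so the window would not be the bottleneck there.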
I'm planning on taking snoops of the dd runs over v3 and v4 and comparing them. I already did this once, but there was too much traffic, and I used a disk file instead of a cached file, which made comparing timings meaningless. I'll run other snoops with cached data and no other traffic between the boxes. TBD
Additionally, the network team says there is no traffic shaping or bandwidth limiting on the connections.
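For reproducibility, the whole experiment can be sketched as one loop. This is a sketch only (the function is defined, not run, below); the host toto, mount options, mount point /test, and /test/file are the ones from the question, and it must be run as root on the client:

```shell
# Sketch of the benchmark loop: remount between runs to defeat
# client-side caching, then do a sequential read over each version.
bench_nfs() {
    for vers in 3 4; do
        umount /test 2>/dev/null
        mount -o "rw,bg,hard,nointr,rsize=1048576,wsize=1048576,vers=$vers" \
            toto:/test /test
        dd if=/test/file of=/dev/null bs=1024k
    done
}
```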
-
Phil Hollenback, almost 13 years ago: Well, for one thing, NFSv4 runs over TCP by default instead of UDP.
-
Kyle Hailey, almost 13 years ago: AFAIK, Solaris, unlike Linux, mounts over TCP by default even on v3. For the v3 tests I also explicitly set "proto=tcp" in some of the tests, but performance on v3 was the same with or without it.
-
polynomial, almost 13 years ago: Have you already enabled jumbo frames on the switching infrastructure and server NICs?
-
Kyle Hailey, almost 13 years ago: Yes, jumbo frames are set up and verified. With dtrace I can see the packet sizes.
-
Zubair, almost 13 years ago: You might want to review documentation on the protocol differences between the two versions and see if anything jumps out at you.
-
janneb, almost 13 years ago: Actually, Linux also defaults to mounting with TCP.
-
pfo, almost 13 years ago: You need to provide the results you've measured with DTrace for anyone to make more sense of your problem. NFSv4 should not provide more throughput or any performance advantage; if anything, it should be marginally slower. In fact, Sun used to recommend in 2010 that one should use NFSv3 if performance is the main goal. As a side note: what is the file system being exported?
-
Kyle Hailey, almost 13 years ago: I've used both ZFS and UFS for the tests; the results were the same in both cases.
-
Kyle Hailey, almost 13 years ago: The dtrace code is now posted with the question.
-
Kyle Hailey, almost 13 years ago: Thanks for the pointers on NFS 4.1, though AFAIK we are on 4.0.