NFS v3 versus v4


NFS 4.1 (minor version 1) is designed to be a faster and more efficient protocol and is recommended over previous versions, especially 4.0.

Its improvements include client-side caching and, although not relevant in this scenario, parallel NFS (pNFS). The major change is that the protocol is now stateful.
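
As an illustration, on clients that support it, NFS 4.1 can usually be requested explicitly at mount time. This is only a sketch; the exact option spelling varies by operating system and kernel version, and the server and export names are placeholders:

    # Linux-style client mount requesting NFSv4 minor version 1
    mount -t nfs -o vers=4.1 server:/export /mnt/export
    # older kernels spell the minor version as a separate option
    mount -t nfs -o vers=4,minorversion=1 server:/export /mnt/export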

http://www.netapp.com/us/communities/tech-ontap/nfsv4-0408.html

Judging by NetApp's performance documentation, I think it is the recommended protocol when using their filers. The delegation technology is similar to opportunistic locking in Windows Vista and later.

NFSv4 differs from previous versions of NFS by allowing a server to delegate specific actions on a file to a client, to enable more aggressive client caching of data and to allow caching of the locking state. A server cedes control of file updates and the locking state to a client via a delegation. This reduces latency by allowing the client to perform various operations and cache data locally. Two types of delegations currently exist: read and write. The server has the ability to call back a delegation from a client should there be contention for a file.

Once a client holds a delegation, it can perform operations on files whose data has been cached locally to avoid network latency and optimize I/O. The more aggressive caching that results from delegations can be a big help in environments with the following characteristics:

  • Frequent opens and closes
  • Frequent GETATTRs
  • File locking
  • Read-only sharing
  • High latency
  • Fast clients
  • Heavily loaded server with many clients
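
A rough way to check whether delegations are actually being granted in a setup like this (a sketch only; nfsstat output fields differ between Solaris releases, so treat the grep pattern as illustrative):

    # on the client: confirm which NFS version each mount negotiated
    nfsstat -m

    # on the server: look for NFSv4 delegation-related operation counters
    nfsstat -s | grep -i deleg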

Comments

  • Kyle Hailey (almost 2 years ago)

    I am wondering why NFS v4 would be so much faster than NFS v3 and if there are any parameters on v3 that could be tweaked.

    I mount a file system

    sudo mount  -o  'rw,bg,hard,nointr,rsize=1048576,wsize=1048576,vers=4'  toto:/test /test
    

    and then run

     dd if=/test/file  of=/dev/null bs=1024k
    

    I can read at 200-400 MB/s, but when I change the version to vers=3, remount, and rerun the dd, I only get 90 MB/s. The file I'm reading from is an in-memory file on the NFS server. Both sides of the connection are Solaris and have 10 GbE NICs. I avoid any client-side caching by remounting between all tests. I used DTrace on the server to measure how fast data is being served via NFS. For both v3 and v4 I changed:

     nfs4_bsize
     nfs3_bsize
    

    from the default of 32K to 1M (on v4 I maxed out at 150 MB/s with 32K). I've also tried tweaking

    • nfs3_max_threads
    • clnt_max_conns
    • nfs3_async_clusters

    to improve the v3 performance, but no go.
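
    For reference, these Solaris tunables are typically applied roughly like this (the 1M value matches the change described above; /etc/system changes need a reboot):

      # /etc/system: raise the maximum NFS client transfer size to 1 MB
      set nfs:nfs3_bsize=1048576
      set nfs:nfs4_bsize=1048576

      # or patch the live kernel with mdb (the 0t prefix means decimal)
      echo "nfs3_bsize/W 0t1048576" | mdb -kw
      echo "nfs4_bsize/W 0t1048576" | mdb -kw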

    On v3, if I run four parallel dd's the throughput goes down from 90 MB/s to 70-80 MB/s, which leads me to believe the bottleneck is some shared resource; if so, I'm wondering what it is and whether I can increase that resource.
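
    The parallel read test looks roughly like this (same file and mount point as above):

      # four concurrent sequential readers against the same NFS mount
      for i in 1 2 3 4; do
          dd if=/test/file of=/dev/null bs=1024k &
      done
      wait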

    DTrace code to get window sizes:

    #!/usr/sbin/dtrace -s
    #pragma D option quiet
    #pragma D option defaultargs
    
    inline string ADDR=$$1;
    
    dtrace:::BEGIN
    {
           TITLE = 10;
           title = 0;
           printf("starting up ...\n");
           self->start = 0;
    }
    
    tcp:::send, tcp:::receive
    /   self->start == 0  /
    {
         walltime[args[1]->cs_cid]= timestamp;
         self->start = 1;
    }
    
    tcp:::send, tcp:::receive
    /   title == 0  &&
         ( ADDR == NULL || args[3]->tcps_raddr == ADDR  ) /
    {
          printf("%4s %15s %6s %6s %6s %8s %8s %8s %8s %8s  %8s %8s %8s  %8s %8s\n",
            "cid",
            "ip",
            "usend"    ,
            "urecd" ,
            "delta"  ,
            "send"  ,
            "recd"  ,
            "ssz"  ,
            "sscal"  ,
            "rsz",
            "rscal",
            "congw",
            "conthr",
            "flags",
            "retran"
          );
          title = TITLE ;
    }
    
    tcp:::send
    /     ( ADDR == NULL || args[3]->tcps_raddr == ADDR ) /
    {
        nfs[args[1]->cs_cid]=1; /* this is an NFS thread */
        this->delta= timestamp-walltime[args[1]->cs_cid];
        walltime[args[1]->cs_cid]=timestamp;
        this->flags="";
        this->flags= strjoin((( args[4]->tcp_flags & TH_FIN ) ? "FIN|" : ""),this->flags);
        this->flags= strjoin((( args[4]->tcp_flags & TH_SYN ) ? "SYN|" : ""),this->flags);
        this->flags= strjoin((( args[4]->tcp_flags & TH_RST ) ? "RST|" : ""),this->flags);
        this->flags= strjoin((( args[4]->tcp_flags & TH_PUSH ) ? "PUSH|" : ""),this->flags);
        this->flags= strjoin((( args[4]->tcp_flags & TH_ACK ) ? "ACK|" : ""),this->flags);
        this->flags= strjoin((( args[4]->tcp_flags & TH_URG ) ? "URG|" : ""),this->flags);
        this->flags= strjoin((( args[4]->tcp_flags & TH_ECE ) ? "ECE|" : ""),this->flags);
        this->flags= strjoin((( args[4]->tcp_flags & TH_CWR ) ? "CWR|" : ""),this->flags);
        this->flags= strjoin((( args[4]->tcp_flags == 0 ) ? "null " : ""),this->flags);
        printf("%5d %14s %6d %6d %6d %8d \ %-8s %8d %6d %8d  %8d %8d %12d %s %d  \n",
            args[1]->cs_cid%1000,
            args[3]->tcps_raddr  ,
            args[3]->tcps_snxt - args[3]->tcps_suna ,
            args[3]->tcps_rnxt - args[3]->tcps_rack,
            this->delta/1000,
            args[2]->ip_plength - args[4]->tcp_offset,
            "",
            args[3]->tcps_swnd,
            args[3]->tcps_snd_ws,
            args[3]->tcps_rwnd,
            args[3]->tcps_rcv_ws,
            args[3]->tcps_cwnd,
            args[3]->tcps_cwnd_ssthresh,
            this->flags,
            args[3]->tcps_retransmit
          );
        this->flags=0;
        title--;
        this->delta=0;
    }
    
    tcp:::receive
    / nfs[args[1]->cs_cid] &&  ( ADDR == NULL || args[3]->tcps_raddr == ADDR ) /
    {
        this->delta= timestamp-walltime[args[1]->cs_cid];
        walltime[args[1]->cs_cid]=timestamp;
        this->flags="";
        this->flags= strjoin((( args[4]->tcp_flags & TH_FIN ) ? "FIN|" : ""),this->flags);
        this->flags= strjoin((( args[4]->tcp_flags & TH_SYN ) ? "SYN|" : ""),this->flags);
        this->flags= strjoin((( args[4]->tcp_flags & TH_RST ) ? "RST|" : ""),this->flags);
        this->flags= strjoin((( args[4]->tcp_flags & TH_PUSH ) ? "PUSH|" : ""),this->flags);
        this->flags= strjoin((( args[4]->tcp_flags & TH_ACK ) ? "ACK|" : ""),this->flags);
        this->flags= strjoin((( args[4]->tcp_flags & TH_URG ) ? "URG|" : ""),this->flags);
        this->flags= strjoin((( args[4]->tcp_flags & TH_ECE ) ? "ECE|" : ""),this->flags);
        this->flags= strjoin((( args[4]->tcp_flags & TH_CWR ) ? "CWR|" : ""),this->flags);
        this->flags= strjoin((( args[4]->tcp_flags == 0 ) ? "null " : ""),this->flags);
        printf("%5d %14s %6d %6d %6d %8s / %-8d %8d %6d %8d  %8d %8d %12d %s %d  \n",
            args[1]->cs_cid%1000,
            args[3]->tcps_raddr  ,
            args[3]->tcps_snxt - args[3]->tcps_suna ,
            args[3]->tcps_rnxt - args[3]->tcps_rack,
            this->delta/1000,
            "",
            args[2]->ip_plength - args[4]->tcp_offset,
            args[3]->tcps_swnd,
            args[3]->tcps_snd_ws,
            args[3]->tcps_rwnd,
            args[3]->tcps_rcv_ws,
            args[3]->tcps_cwnd,
            args[3]->tcps_cwnd_ssthresh,
            this->flags,
            args[3]->tcps_retransmit
          );
        this->flags=0;
        title--;
        this->delta=0;
    }
    

    Output looks like this (not from this particular situation):

    cid              ip  usend  urecd  delta     send     recd      ssz    sscal      rsz     rscal    congw   conthr     flags   retran
      320 192.168.100.186    240      0    272      240 \             49232      0  1049800         5  1049800         2896 ACK|PUSH| 0
      320 192.168.100.186    240      0    196          / 68          49232      0  1049800         5  1049800         2896 ACK|PUSH| 0
      320 192.168.100.186      0      0  27445        0 \             49232      0  1049800         5  1049800         2896 ACK| 0
       24 192.168.100.177      0      0 255562          / 52          64060      0    64240         0    91980         2920 ACK|PUSH| 0
       24 192.168.100.177     52      0    301       52 \             64060      0    64240         0    91980         2920 ACK|PUSH| 0
    

    Some of the column headers:

    usend - unacknowledged send bytes
    urecd - unacknowledged received bytes
    ssz - send window
    rsz - receive window
    congw - congestion window
    

    I'm planning on taking snoop captures of the dd's over v3 and v4 and comparing them. I have already done this once, but there was too much traffic, and I used a file on disk instead of a cached file, which made comparing timings meaningless. I will run further snoop captures with cached data and no other traffic between the boxes. TBD
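
    The captures would look roughly like this (the interface name is a placeholder, and port 2049 assumes the default NFS port):

      # capture NFS traffic to/from the server during a v3 run, then a v4 run
      snoop -d nxge0 -o /tmp/dd_v3.snoop host toto and port 2049
      snoop -d nxge0 -o /tmp/dd_v4.snoop host toto and port 2049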

    Additionally, the network guys say there is no traffic shaping or bandwidth limiting on the connections.

    • Phil Hollenback (almost 13 years ago)
      Well, for one thing NFSv4 runs on TCP by default instead of UDP.
    • Kyle Hailey (almost 13 years ago)
      AFAIK Solaris, unlike Linux, mounts over TCP by default even on v3. For the v3 tests I also explicitly specified "proto=tcp" in some of the tests, but performance on v3 was the same with or without "proto=tcp".
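
      For reference, the explicit v3-over-TCP mount looked like this (same options as above, with vers=3 and proto=tcp pinned):

        sudo mount -o 'rw,bg,hard,nointr,rsize=1048576,wsize=1048576,vers=3,proto=tcp' toto:/test /test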
    • polynomial (almost 13 years ago)
      Have you already enabled jumbo frames on the switching infrastructure and server NICs?
    • Kyle Hailey (almost 13 years ago)
      Yes, jumbo frames are set up and verified. With DTrace I can see the packet sizes.
    • Zubair (almost 13 years ago)
      You might want to review documentation about the protocol differences between the two and see if anything jumps out at you.
    • janneb (almost 13 years ago)
      Actually, Linux also defaults to mounting with TCP.
    • pfo (almost 13 years ago)
      You need to provide the results of what you've measured with DTrace for anyone to make more sense of your problem. NFSv4 should not provide more throughput or any performance benefit; if anything, it should be marginally slower. In fact, Sun used to recommend around 2010 that one should use NFSv3 if performance is the main goal. As a side note: what is the file system that is exported?
    • Kyle Hailey (almost 13 years ago)
      I've used both ZFS and UFS for the tests; results were the same in both cases.
    • Kyle Hailey (almost 13 years ago)
      DTrace code is now posted with the question.
  • Kyle Hailey (almost 13 years ago)
    Thanks for the pointers on NFS 4.1, though AFAIK we are on 4.0.