What is causing `input/output` errors when reading from NFS v4 on CentOS?

Solution 1

The problem appears to be related to duplicate local IPs behind Docker hosts. When Docker assigns two containers the same internal IP (e.g. 172.17.0.4), the NFS server can't figure out which client to respond to, which in some cases takes out both clients. It's apparently a long-standing issue in the RHEL implementation: I was able to find a bug report documenting this in CentOS 6, and it is still affecting me in CentOS 7.3.
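To check whether this applies to you, something like the following might help. It is only a sketch: it assumes you have already collected each container's internal IP per Docker host (for example with `docker inspect -f '{{.NetworkSettings.IPAddress}}' <container>`), and the host and container names in `inventory` are made up for illustration.

```python
from collections import defaultdict


def find_shared_ips(containers_by_host):
    """Return {ip: [(host, container), ...]} for IPs used by more than one container."""
    seen = defaultdict(list)
    for host, containers in containers_by_host.items():
        for name, ip in containers.items():
            seen[ip].append((host, name))
    # Keep only the addresses that more than one container claims.
    return {ip: users for ip, users in seen.items() if len(users) > 1}


# Hypothetical inventory: two Docker hosts whose default bridge networks
# handed out the same internal address to different containers.
inventory = {
    "docker-host-a": {"nginx": "172.17.0.3", "php-fpm": "172.17.0.4"},
    "docker-host-b": {"php-fpm": "172.17.0.4"},
}

shared = find_shared_ips(inventory)
```

Any address that shows up in `shared` is used by more than one NFS client container and would be a candidate for the conflict described above.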

Solution 2

I found this while searching for a solution to my own input/output error issues with a shared NFS mount. I was mounting a shared NFS drive on several machines, reading and writing with PHP, and I was getting sporadic but frequent errors like this. I don't know if what I did fixed it, but on the off chance it helps someone else with the same problem ...

So, I was creating worker servers by cloning them. This resulted in them all having the same hostname. I didn't think anything of that; the hostname wasn't something that affected what I was doing, as far as I could tell. I changed the hostnames to all be unique and made sure the /etc/hosts file included the hostname pointing to 127.0.0.1, and the NFS errors haven't come back since.
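If you want to verify the /etc/hosts part on each cloned machine, a rough check could look like this. It is a sketch under my own assumptions: `parse_hosts` and `points_to_loopback` are helper names I made up, and `worker-01` is a placeholder hostname.

```python
def parse_hosts(text):
    """Map each hostname or alias in hosts-file text to the addresses it is listed under."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            mapping.setdefault(name, []).append(address)
    return mapping


def points_to_loopback(hosts_text, hostname):
    """True if the hostname is listed on a 127.0.0.1 line."""
    return "127.0.0.1" in parse_hosts(hosts_text).get(hostname, [])


# Made-up hosts file content for illustration.
sample_hosts = """\
127.0.0.1   localhost worker-01   # worker-01 is a placeholder hostname
::1         localhost
"""
```

On a real machine you would pass in the actual file and hostname, e.g. `points_to_loopback(open("/etc/hosts").read(), socket.gethostname())`, and compare hostnames across workers to confirm each one is unique.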

Updated on September 18, 2022

Comments

  • editor
    editor almost 2 years

    We're seeing apps like nginx and php-fpm error out occasionally (and temporarily) while opening good files from a connected NFS mount:

    php-fpm error example:

    2017/05/20 22:53:09 [error] 55#0: *6575 FastCGI sent in stderr: "PHP message: PHP Warning:  getimagesize(/www/newspaperfoundation.org/html/wp-content/blogs.dir/22/files/2017/05/19-highest-honors-1.jpg): failed to open stream: Input/output error in /www/newspaperfoundation.org/html/wp-content/plugins/mashsharer/includes/header-meta-tags.php on line 271" while reading response header from upstream, client:
    192.168.255.34, server: www.dailyrepublic.com, request: "GET /solano-news/fairfield/highest-honors-commends-students-with-4-0-and-higher-grade-point-average/ HTTP/1.1", upstream: "fastcgi://172.17.0.3:9001", host: "www.dailyrepublic.com"
    

    nginx error example:

    2017/05/20 23:22:32 [crit] 56#0: *712 open() "/www/newspaperfoundation.org/html/wp-content/blogs.dir/24/files/2017/05/Tandem1W-550x550.jpg" failed (5: Input/output error), client: 192.168.255.34, server: www.davisenterprise.com, request: "GET /files/2017/05/Tandem1W-550x550.jpg HTTP/1.1", host: "www.davisenterprise.com", referrer: "http://www.davisenterprise.com/"
    

    During a temporary error, I can ls and see the file exists with correct permissions. The image eventually becomes OK after a long while. Other files return OK without input/output errors.

    There's not much logging I can find to document the issue. But with rpcdebug enabled, I see a lot of messages like these around the time of the errors:

    May 20 16:10:07 tomentella kernel: NFSD: nfsd4_open filename 19tommeyerW.jpg op_openowner           (null)
    May 20 16:10:07 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 5 #2: 18: status 10011
    May 20 16:10:07 tomentella kernel: nfsv4 compound returned 10011
    May 20 16:10:07 tomentella kernel: nfsd_dispatch: vers 4 proc 1
    May 20 16:10:07 tomentella kernel: nfsv4 compound op #1/5: 22 (OP_PUTFH)
    May 20 16:10:07 tomentella kernel: nfsd: fh_verify(36: 01070001 008c0312 00000000 3c639297 604b0f25 ce691899)
    May 20 16:10:07 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 5 #1: 22: status 0
    May 20 16:10:07 tomentella kernel: nfsv4 compound op #2/5: 18 (OP_OPEN)
    May 20 16:10:07 tomentella kernel: NFSD: nfsd4_open filename 19tommeyerW.jpg op_openowner           (null)
    May 20 16:10:07 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 5 #2: 18: status 10011
    May 20 16:10:07 tomentella kernel: nfsv4 compound returned 10011
    May 20 16:10:08 tomentella kernel: nfsd_dispatch: vers 4 proc 1
    May 20 16:10:08 tomentella kernel: nfsv4 compound op #1/4: 22 (OP_PUTFH)
    May 20 16:10:08 tomentella kernel: nfsd: fh_verify(36: 01070001 008c0312 00000000 3c639297 604b0f25 ce691899)
    May 20 16:10:08 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 4 #1: 22: status 0
    May 20 16:10:08 tomentella kernel: nfsv4 compound op #2/4: 15 (OP_LOOKUP)
    

    In particular, I feel like I only see this message for files that are erroring out:

    May 20 16:10:07 tomentella kernel: NFSD: nfsd4_open filename 19tommeyerW.jpg op_openowner           (null)
    

    Any ideas on what might be causing the input/output errors?

    Client mounts using the following:

    mount.nfs4 -v -o proto=tcp $NFSMASTERHOST:/srv/data /srv/data

    CentOS 7 with updated packages. The error is new, and there have been few server changes recently. I suspect my recent update to system packages may have been the trigger.

    Because the problem goes in and out for some images, I'm able to somewhat watch the logs and compare/contrast. Here's an example of it going from OK to bad when grepping on a particular image name:

    May 20 18:38:37 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner           (null)
    May 20 18:38:37 tomentella kernel: NFSD: nfsd4_open_confirm on file Ron-Thomas-web-150x150.jpg
    May 20 18:38:37 tomentella kernel: NFSD: nfsd4_close on file Ron-Thomas-web-150x150.jpg
    May 20 18:39:08 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner           (null)
    May 20 18:39:08 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner           (null)
    May 20 18:39:10 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner           (null)
    May 20 18:39:10 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner           (null)
    May 20 18:39:11 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner           (null)
    May 20 18:39:11 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner           (null)
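
Comparing by eye gets tedious, so the tallying could be scripted. The sketch below counts the `nfsd4_open`, `nfsd4_open_confirm` and `nfsd4_close` messages per filename and reports files with more opens than closes. The regular expression is based only on the message shapes in the excerpt above, and treating "more opens than closes" as a sign of trouble is an assumption drawn from that excerpt, not a documented rule.

```python
import re
from collections import Counter

# Matches the three NFSD message shapes seen in the excerpt.
EVENT = re.compile(
    r"NFSD: nfsd4_(open_confirm|open|close) (?:filename|on file) (\S+)"
)


def tally_events(log_lines):
    """Count open / open_confirm / close messages per filename."""
    counts = {}
    for line in log_lines:
        match = EVENT.search(line)
        if match:
            event, filename = match.groups()
            counts.setdefault(filename, Counter())[event] += 1
    return counts


def opens_without_close(counts):
    """Return {filename: opens - closes} for files with more opens than closes."""
    return {
        name: c["open"] - c["close"]
        for name, c in counts.items()
        if c["open"] > c["close"]
    }


# First four lines of the excerpt: one complete open/confirm/close cycle,
# then an open that is not followed by a close.
sample_log = [
    "May 20 18:38:37 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner           (null)",
    "May 20 18:38:37 tomentella kernel: NFSD: nfsd4_open_confirm on file Ron-Thomas-web-150x150.jpg",
    "May 20 18:38:37 tomentella kernel: NFSD: nfsd4_close on file Ron-Thomas-web-150x150.jpg",
    "May 20 18:39:08 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner           (null)",
]

suspects = opens_without_close(tally_events(sample_log))
```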
    

    Here's the nfsstat output:

    tomentella ★ ~ $ nfsstat
    Server rpc stats:
    calls      badcalls   badclnt    badauth    xdrcall
    94437487   6          6          0          0       
    
    Server nfs v4:
    null         compound     
    503       0% 94436978 99% 
    
    Server nfs v4 operations:
    op0-unused   op1-unused   op2-future   access       close        commit       
    0         0% 0         0% 0         0% 11213689  3% 2631554   0% 3377      0% 
    create       delegpurge   delegreturn  getattr      getfh        link         
    579       0% 0         0% 0         0% 88581315 31% 32460559 11% 0         0% 
    lock         lockt        locku        lookup       lookup_root  nverify      
    365       0% 0         0% 365       0% 30058556 10% 0         0% 0         0% 
    open         openattr     open_conf    open_dgrd    putfh        putpubfh     
    2771686   0% 0         0% 74326     0% 0         0% 92969992 32% 0         0% 
    putrootfh    read         readdir      readlink     remove       rename       
    2435      0% 1999675   0% 1917567   0% 350       0% 12404     0% 5072      0% 
    renew        restorefh    savefh       secinfo      setattr      setcltid     
    1226801   0% 0         0% 5072      0% 0         0% 18315216  6% 121025    0% 
    setcltidconf verify       write        rellockowner bc_ctl       bind_conn    
    121105    0% 0         0% 115189    0% 365       0% 0         0% 0         0% 
    exchange_id  create_ses   destroy_ses  free_stateid getdirdeleg  getdevinfo   
    0         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
    getdevlist   layoutcommit layoutget    layoutreturn secinfononam sequence     
    0         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
    set_ssv      test_stateid want_deleg   destroy_clid reclaim_comp 
    0         0% 0         0% 0         0% 0         0% 0         0% 
    
    Client rpc stats:
    calls      retrans    authrefrsh
    0          0          0       
    
    • kofemann
      kofemann about 7 years
      Error code 10011 corresponds to NFS4ERR_EXPIRED. Possible causes include a client freeze for longer than 60 sec, network issues, unexpected time jumps, or a bug in the server or client.
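
To spot these in a large rpcdebug log, the per-operation result lines can be decoded with a small script. This is a sketch, not a complete decoder: the operation names are limited to the ones that appear in the question's log, the status table holds only a handful of `nfsstat4` values from RFC 7530, and `failed_ops` is a name made up for this example.

```python
import re

# Operation numbers as they appear in the question's log excerpt.
OP_NAMES = {15: "OP_LOOKUP", 18: "OP_OPEN", 22: "OP_PUTFH"}

# A few NFSv4 status codes (nfsstat4 values from RFC 7530); not a complete table.
STATUS_NAMES = {
    0: "NFS4_OK",
    10008: "NFS4ERR_DELAY",
    10011: "NFS4ERR_EXPIRED",
    10013: "NFS4ERR_GRACE",
    10022: "NFS4ERR_STALE_CLIENTID",
    10023: "NFS4ERR_STALE_STATEID",
}

# Matches result lines such as "... opcnt 5 #2: 18: status 10011".
RESULT = re.compile(r"#(\d+): (\d+): status (\d+)")


def failed_ops(log_lines):
    """Return (operation, status) name pairs for every non-zero status in the log."""
    failures = []
    for line in log_lines:
        match = RESULT.search(line)
        if not match:
            continue
        _position, op, status = (int(g) for g in match.groups())
        if status != 0:
            failures.append(
                (OP_NAMES.get(op, f"op {op}"), STATUS_NAMES.get(status, f"status {status}"))
            )
    return failures


# Lines taken from the question's log.
sample_log = [
    "May 20 16:10:07 tomentella kernel: nfsv4 compound op #1/5: 22 (OP_PUTFH)",
    "May 20 16:10:07 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 5 #1: 22: status 0",
    "May 20 16:10:07 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 5 #2: 18: status 10011",
]

failures = failed_ops(sample_log)
```

For the sample lines, the only failure reported is the OPEN operation returning NFS4ERR_EXPIRED, which matches what the question's log shows.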