How do I find out what nfsd processes are actually doing?


Solution 1

In this kind of situation, I have often found it very useful to capture the NFS traffic (e.g., with tcpdump or Wireshark) and look at it to see whether there is a specific reason for the high load.

For example, you can use something like:

tcpdump -w filename.cap "port 2049"

to save only NFS traffic (on port 2049) to a capture file. You can then open that file on a PC with Wireshark and analyze it in more detail. The last time I had a similar problem, it was a bunch of computation jobs from the same user, who was over disk quota: the clients (18 different machines) were trying over and over to write, driving the load on the old NFS server very high.
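If you just want a rough idea of which clients are generating the traffic before opening the capture in Wireshark, standard command-line tools are enough. This is only a sketch: the capture file name and packet count are arbitrary, and it assumes NFS is on the standard port 2049.

# Capture 10,000 packets of NFS traffic with full payloads
tcpdump -s 0 -c 10000 -w nfs.cap "port 2049"

# Count packets per source address (the source is field 3 of
# tcpdump's "IP src > dst:" output; the trailing .port is stripped)
tcpdump -nn -r nfs.cap | awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn | head

The addresses at the top of that list are usually the clients worth looking at first.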

Solution 2

A couple of tools for you (example invocations below):

  • lsof shows you the open file handles
  • iotop shows per-process I/O statistics in a top-like manner
  • nethogs shows per-process network traffic
  • strace allows you to see what a process is doing
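
As a rough sketch of how these might be used on an NFS server, the following invocations could be a starting point; the directory, interface name, and PID are placeholders, not values from the question:

# Recursively list open files under a directory (can be slow on large trees)
lsof +D /srv/nfs

# Show only processes that are actually doing I/O right now
iotop -o

# Per-process network traffic on a given interface
nethogs eth0

# Trace file-related syscalls of a suspect userspace process
strace -f -e trace=file -p 1234

One caveat: nfsd runs as kernel threads, so lsof and strace will not reveal anything useful when pointed at the nfsd PIDs themselves (which also explains the ptrace error and the near-empty lsof output in the question); these tools are most useful against ordinary userspace processes, e.g. on the client side.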


Comments

  • BT643, almost 2 years ago

    When I view top on one of our servers there are a lot of nfsd processes consuming CPU:

    PID   USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    2769  root      20   0     0    0    0 R   20  0.0   2073:14 nfsd
    2774  root      20   0     0    0    0 S   19  0.0   2058:44 nfsd
    2767  root      20   0     0    0    0 S   18  0.0   2092:54 nfsd
    2768  root      20   0     0    0    0 S   18  0.0   2076:56 nfsd
    2771  root      20   0     0    0    0 S   17  0.0   2094:25 nfsd
    2773  root      20   0     0    0    0 S   14  0.0   2091:34 nfsd
    2772  root      20   0     0    0    0 S   14  0.0   2083:43 nfsd
    2770  root      20   0     0    0    0 S   12  0.0   2077:59 nfsd
    

    How do I find out what these are actually doing? Can I see a list of files being accessed by each PID, or any more info?

    We're on Ubuntu Server 12.04.

    I tried nfsstat but it's not giving me much useful info about what's actually going on.

    Edit - Additional stuff tried based on comments/answers:

    Running lsof -p against each of the PIDs shows the following (here for 2774):

    COMMAND  PID USER   FD      TYPE DEVICE SIZE/OFF NODE NAME
    nfsd    2774 root  cwd       DIR    8,1     4096    2 /
    nfsd    2774 root  rtd       DIR    8,1     4096    2 /
    nfsd    2774 root  txt   unknown                      /proc/2774/exe
    

    Does that mean no files are being accessed?


    When I try to view a process with strace -f -p 2774, it gives me this error:

    attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
    Could not attach to process.  If your uid matches the uid of the target
    process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
    again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
    

    Running tcpdump | grep nfs shows tons of activity between two of our servers over NFS, but as far as I'm aware there shouldn't be any. Lots of entries like:

    13:56:41.120020 IP 192.168.0.20.nfs > 192.168.0.21.729: Flags [.], ack 4282288820, win 32833, options [nop,nop,TS val 627282027 ecr 263985319,nop,nop,sack 3 {4282317780:4282319228}{4282297508:4282298956}{4282290268:4282291716}], len
    
    • Ale, over 9 years ago
      Answer posted :) I'm glad you solved the problem; NFS can be very tricky to debug, especially when there is a lot of activity but no actual disk access (like my over-quota user).
  • BT643, over 9 years ago
    For some reason I'm getting "attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted. Could not attach to process. If your uid matches the uid of the target process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf" even though I'm running it as root?
  • BT643, over 9 years ago
    Sorry, I thought I'd move my comment here before I realised you'd replied :) Thanks! I was able to track down the cause with tcpdump! It was a stuck PHP script that happened to be accessing an NFS share on our second server. I don't think it was actually doing anything, which is why it didn't really show in top, iotop, etc., but the number of stuck processes on that mount seemed to be causing issues :) Thanks again!