How do I find out what nfsd processes are actually doing?


Solution 1

In this kind of situation, I have often found it very useful to capture the NFS traffic (e.g., with tcpdump or Wireshark) and look at it to see whether there is a specific reason for the high load.

For example, you can use something like:

tcpdump -w filename.cap "port 2049"

to save only NFS traffic (on port 2049) to a capture file. You can then open that file on a PC with Wireshark and analyze it in more detail. The last time I had a similar problem, it was a bunch of computation jobs from the same user, who was over disk quota: the clients (18 different machines) were trying over and over to write, driving the load on the old NFS server very high.
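If you just want a rough idea of which clients are generating the traffic before opening the capture in Wireshark, standard command-line tools are enough. This is only a sketch: the capture file name and packet count are arbitrary, and it assumes NFS is on the standard port 2049.

# Capture 10,000 packets of NFS traffic with full payloads
tcpdump -s 0 -c 10000 -w nfs.cap "port 2049"

# Count packets per source address (the source is field 3 of
# tcpdump's "IP src > dst:" output; the trailing .port is stripped)
tcpdump -nn -r nfs.cap | awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn | head

The addresses at the top of that list are usually the clients worth looking at first.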

Solution 2

A couple of tools for you (example invocations below):

  • lsof shows you the open file handles
  • iotop shows per-process I/O statistics in a top-like manner
  • nethogs shows per-process network traffic
  • strace allows you to see what a process is doing
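
As a rough sketch of how these might be used on an NFS server, the following invocations could be a starting point; the directory, interface name, and PID are placeholders, not values from the question:

# Recursively list open files under a directory (can be slow on large trees)
lsof +D /srv/nfs

# Show only processes that are actually doing I/O right now
iotop -o

# Per-process network traffic on a given interface
nethogs eth0

# Trace file-related syscalls of a suspect userspace process
strace -f -e trace=file -p 1234

One caveat: nfsd runs as kernel threads, so lsof and strace will not reveal anything useful when pointed at the nfsd PIDs themselves (which also explains the ptrace error and the near-empty lsof output in the question); these tools are most useful against ordinary userspace processes, e.g. on the client side.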


Comments

  • BT643, almost 2 years ago

    When I view top on one of our servers there are a lot of nfsd processes consuming CPU:

    PID   USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    2769  root      20   0     0    0    0 R   20  0.0   2073:14 nfsd
    2774  root      20   0     0    0    0 S   19  0.0   2058:44 nfsd
    2767  root      20   0     0    0    0 S   18  0.0   2092:54 nfsd
    2768  root      20   0     0    0    0 S   18  0.0   2076:56 nfsd
    2771  root      20   0     0    0    0 S   17  0.0   2094:25 nfsd
    2773  root      20   0     0    0    0 S   14  0.0   2091:34 nfsd
    2772  root      20   0     0    0    0 S   14  0.0   2083:43 nfsd
    2770  root      20   0     0    0    0 S   12  0.0   2077:59 nfsd
    

    How do I find out what these are actually doing? Can I see a list of files being accessed by each PID, or any more info?

    We're on Ubuntu Server 12.04.

    I tried nfsstat but it's not giving me much useful info about what's actually going on.

    Edit - Additional stuff tried based on comments/answers:

    Running lsof -p against each of the PIDs shows the following (here for 2774):

    COMMAND  PID USER   FD      TYPE DEVICE SIZE/OFF NODE NAME
    nfsd    2774 root  cwd       DIR    8,1     4096    2 /
    nfsd    2774 root  rtd       DIR    8,1     4096    2 /
    nfsd    2774 root  txt   unknown                      /proc/2774/exe
    

    Does that mean no files are being accessed?


    When I try to view a process with strace -f -p 2774, it gives me this error:

    attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
    Could not attach to process.  If your uid matches the uid of the target
    process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
    again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
    

    Running tcpdump | grep nfs shows tons of activity between two of our servers over NFS, but as far as I'm aware there shouldn't be any. Lots of entries like:

    13:56:41.120020 IP 192.168.0.20.nfs > 192.168.0.21.729: Flags [.], ack 4282288820, win 32833, options [nop,nop,TS val 627282027 ecr 263985319,nop,nop,sack 3 {4282317780:4282319228}{4282297508:4282298956}{4282290268:4282291716}], len
    
    • Ale, over 9 years ago
      Answer posted :) I'm glad you solved the problem; NFS can be very tricky to debug, especially when there is a lot of activity but no actual disk access (like my over-quota user).
  • BT643, over 9 years ago
    For some reason I'm getting "attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted. Could not attach to process. If your uid matches the uid of the target process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf" even though I'm running it as root?
  • BT643, over 9 years ago
    Sorry, I thought I'd move my comment here before I realised you'd replied :) Thanks! I was able to track down the cause with tcpdump! It was a stuck PHP script that happened to be accessing an NFS share on our second server. I don't think it was actually doing anything, which is why it didn't really show in top, iotop, etc., but the number of stuck processes on that mount seemed to be causing issues :) Thanks again!