What could cause this particular make process to go into uninterruptible deep sleep state?

process nfs top load-average


A process in 'D' state is normally (but not always) "blocked on I/O wait". This can happen if a disk is busy and suffering high service times, for example. Processes in D state count towards the load average, even though they're not using real CPU resources.
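To see which processes are currently in 'D' state next to the load average, something like the following sketch can help. It assumes a Linux /proc filesystem; the function names are illustrative and not part of any standard tool.

```python
import os


def parse_stat_state(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' instead of on whitespace.
    left, right = stat_text.rsplit(")", 1)
    pid, comm = left.split("(", 1)
    return int(pid), comm, right.split()[0]


def d_state_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in 'D' state."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_state(f.read())
        except (OSError, ValueError):
            continue  # the process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found


def load_averages(path="/proc/loadavg"):
    """Return the 1, 5 and 15 minute load averages as floats."""
    with open(path) as f:
        return tuple(float(x) for x in f.read().split()[:3])


if __name__ == "__main__" and os.path.isdir("/proc"):
    print("load averages:", load_averages())
    for pid, comm in d_state_processes():
        print(f"D state: pid={pid} comm={comm}")
```

If the load average is far above the CPU count while little CPU is in use, a long list of 'D' state processes here would be consistent with the load coming from blocked I/O rather than from computation.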

In the case of NFS, a process can spend a lot of time in 'D' state waiting for the NFS server to respond.

By default an NFS client waits up to 60 seconds for a response before retrying (see the timeo option in man nfs; the value is given in tenths of a second, and the default over TCP is 600). This means a process may be in I/O wait for at least 60 seconds if there is a problem.

What happens then will depend on the retrans setting and the hard/soft settings.

If the filesystem is mounted hard then retries happen indefinitely; if mounted soft then the I/O request eventually fails. Either way, the failure isn't immediate, because of the timeo and retrans options.
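To check which of these settings a client is actually using, the mount options can be read back from /proc/mounts. The sketch below is a minimal example, assuming a Linux client whose NFS mounts list hard or soft, timeo and retrans among their options; the function name and the sample line in the comment are hypothetical.

```python
import os


def nfs_mount_options(mounts_text):
    """Map each NFS mount point to its hard/soft, timeo and retrans options."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        info = {"mode": None, "timeo": None, "retrans": None}
        for opt in fields[3].split(","):
            if opt in ("hard", "soft"):
                info["mode"] = opt
            elif opt.startswith("timeo="):
                # timeo is expressed in tenths of a second (see man nfs).
                info["timeo"] = int(opt.split("=", 1)[1])
            elif opt.startswith("retrans="):
                info["retrans"] = int(opt.split("=", 1)[1])
        result[fields[1]] = info
    return result


# Hypothetical example of an input line:
#   server:/export /mnt/build nfs4 rw,hard,proto=tcp,timeo=600,retrans=2 0 0
if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as f:
        for mount_point, info in nfs_mount_options(f.read()).items():
            print(mount_point, info)
```

A mount reported as hard is one where a blocked process will stay in 'D' state until the server answers, however long that takes.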

Clients can see NFS issues for a number of reasons; a common one is network bandwidth (especially if you're on a WiFi network). Another one is volume of requests (if you run things in parallel then you could be causing a bottleneck). The server itself may be suffering from poor disk performance and so responding slowly to NFS requests, or the server may not be running enough daemon threads to handle the volume of requests.


GP92

Updated on September 18, 2022

Comments

GP92 almost 2 years
I am trying to understand 'D' state correctly.

In my case, the following process went to 'D' state:
```
make -f freac/CMakeFiles/freac_objs.dir/build.make freac/CMakeFiles/freac_objs.dir/build
```
It is using an NFS share.

Also, the load keeps increasing. load_avg is now at 1600 (40 CPUs). I think 40 is the acceptable limit for 40 processors.

OK, leaving that aside, there are three things I want to know:
1. Why does the load increase when a process is in 'D' state?
2. Why does a process go to 'D' state if access to an NFS share is troublesome, instead of the process getting killed completely?
3. What could cause a sudden issue in accessing an NFS share? (Could it be due to the network in most cases?)
Thanks!
- phemmer about 8 years
  
  The metric is just called "load", not "cpu load". It's not tied to the cpu. See unix.stackexchange.com/a/116865/4358
- GP92 about 8 years
  
  @Patrick Yes, sorry I understand that. Old habits.
GP92 about 8 years

Thanks! That explains my actual question! :) It seems soft mounts are not recommended, as they may corrupt data. So it's better to use a hard mount even though we sometimes face these hangs. But it still doesn't answer my first question: why can't it kill the process instead of taking it to D state? What can it achieve by taking it to D state that it can't achieve by killing it?
Stephen Harris about 8 years

All processes go into D state when doing I/O and waiting for a device to respond, whether it's a local disk or an NFS server or anything else. It's the normal process flow. If a program was killed instead of going into D state then you'd never get anything done :-) The problem with NFS is that D state times can be extended (because it depends on network I/O and remote servers and retry windows...) so you see it frequently with NFS, but it's not limited to NFS and can occur elsewhere.
GP92 about 8 years

What I understand is that the process should pick up and continue from where it left off when the I/O is available (i.e., NFS is accessible). However, I am not sure whether NFS caused this. I couldn't think of any other reason, nor can I find any info in the logs.
Stephen Harris about 8 years

There may not actually be a problem; if you're doing a lot of I/O then you may just be seeing the results of a slow (compared to local disk) filesystem. If you strace the process you might see it doing things. If there is a problem then it'll typically show up as "NFS server not responding" type messages.
GP92 about 8 years

Yes, I did find this message in the logs: NFS server not responding. But not in this case. It was observed on some other servers before, and the rest was the same: the process hung and we rebooted. But I don't know how long the NFS server was not responding. And here I can't find any such messages, only these: kernel: INFO: task make:27163 blocked for more than 120 seconds.
GP92 about 8 years

So I guess it may not be an NFS issue in my case.
GP92 about 8 years

Yes, sure, thanks! I already have a similar one, asked from a different perspective. If you could, please also take a look at it: unix.stackexchange.com/questions/287910/…. I will create a new question after clearing up the rest of my confusion. :)