Bash Script to find zombie processes?
You don't have to reboot because of zombie processes. Here's why:
- A process becomes a zombie when it has finished, but its parent has not yet called wait(2) to collect its return code.
- A zombie takes no physical or virtual resources apart from an entry in the kernel's process table.
- Once the parent calls wait(2), the zombie is reaped and its process table entry is removed.
- If the zombie becomes an orphan, i.e. its parent dies, init (PID 1) inherits the process and reaps it by calling wait(2).
As you can see, it's only a matter of time until wait(2) is called and the zombie is reaped. If zombies keep accumulating over time, treat that as a programming flaw: fix the code (or ask for it to be fixed) rather than rebooting, which is unnecessary and should not be done.
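The lifecycle described above can be reproduced with a small experiment. This is only a sketch, assuming a Linux system with procps ps (for the --ppid option) and a sleep that accepts fractional seconds; demo_zombie is a hypothetical name chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: create a short-lived zombie and list it while its parent is alive.
# demo_zombie is a hypothetical helper, not part of any tool.
demo_zombie() {
    # The subshell starts a child (sleep 0.2) and then replaces itself
    # with "sleep 2", which never calls wait(2). Once the child exits,
    # it stays a zombie for as long as "sleep 2" is still running.
    ( sleep 0.2 & exec sleep 2 ) &
    local parent=$!

    sleep 1    # give the child time to exit

    # List the children of the parent and keep only those in state Z.
    ps -o pid,ppid,state,cmd --ppid "$parent" | awk '$3=="Z"'

    # When the parent exits, the zombie is handed to init, which is
    # expected to reap it.
    wait "$parent"
}
```

Calling demo_zombie should print a single line for the defunct sleep, with the PPID of the process that failed to wait for it.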
To find zombie processes, look at the STATE of each process; if it is Z, the process is a zombie:
ps -eo pid,ppid,state,cmd | awk '$3=="Z"'
Here I have selected only the PID, PPID, STATE and COMMAND fields.
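Since only the parent can reap a zombie, the PPID column is usually the interesting one. As a rough sketch (zombie_parents is a hypothetical helper name, and it assumes the same four-column layout as the ps command above), the filter can be narrowed to zombies of one program and print each zombie's PID next to its parent's PID:

```shell
#!/usr/bin/env bash
# Sketch: read "PID PPID STATE CMD..." lines on stdin and print
# "zombie_pid parent_pid" for each zombie whose command contains NAME.
# zombie_parents is a hypothetical name used for illustration only.
zombie_parents() {
    local name=$1
    awk -v name="$name" '
        $3 == "Z" {
            cmd = $0
            # Drop the first three fields so only the command text remains.
            sub(/^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]*/, "", cmd)
            if (index(cmd, name) > 0) print $1, $2
        }'
}

# Usage against the live process table:
#   ps -eo pid,ppid,state,cmd | zombie_parents ethminer
```

The second number on each output line is the process to investigate (or restart), because that is the process that has not called wait(2).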
cannabeatz
Control Systems Programmer and budding cryptocurrency mining node admin.
Updated on September 18, 2022
Comments
-
cannabeatz over 1 year
So recently I've noticed that I have a process that will randomly crash and become a zombie with a PPID of 1 (init). I've been told that the only way to fix this is to reboot the PC (or send SIGCHLD to init, which is dicey/useless, from what I understand).
Essentially, what I'm looking to do is write a bash script that will just look for a zombie process and, if there is one, reboot the PC.
Currently, I use this script to monitor the process itself:
ps auxw | grep ethminer | grep -v grep > /dev/null
if [ $? != 0 ]
then
    sudo reboot
fi
Now, this script seems to work fine when ethminer is either RUNNING or NOT RUNNING: it reboots the machine if it does not see ethminer in the process table, and it does nothing if it does see it.
However (from my admittedly loose understanding), since there is no exit code when the process becomes a zombie, if [ $? != 0 ] doesn't get any input, and therefore doesn't do anything.
Is there any way I can fix/modify this script so it does what I want it to do? Or am I way off track here?
Thanks!
-
cannabeatz over 7 years
OK, that makes sense. Thanks! Let me explain a bit better: the machine in question is a cryptocurrency mining rig, so ideally the miner process runs 24/7. The reason I'm concerned about this process being a zombie isn't necessarily resource utilization; it is that I can't figure out how to make the process restart without removing the zombie, since it already has an entry in the process table. Init doesn't seem to be properly reaping the orphan zombie either; it's been two days now and the zombie still persists.
-
cannabeatz over 7 years
So I guess the real question here is "Why is this process's parent dying?". My workaround is just sort of a band-aid.
-
heemayl over 7 years
@cannabeatz Are you sure the current parent of the process is init, i.e. that its original parent died? Please post the output of the command I gave.
-
cannabeatz over 7 years
I am relatively sure of this: the PPID is 1 and the process shows up as ethminer <defunct>. I'd paste in the actual output of the command, but I just rebooted the rig again and apparently it didn't restart correctly, so now I've lost remote access. I'll have to get back with more info when I'm physically in front of the machine. Thanks for the info!
-
heemayl over 7 years
@cannabeatz init reaps its children regularly; that is by design. If it does not in your case, consider it a bug in init itself. What's your init? Also try sending SIGCHLD: kill -SIGCHLD 1. If that does not work, file a bug report. And of course, first make sure that the current parent really is init, because bugs of this sort are very, very rare for init.
-
cannabeatz over 7 years
Alright, I'll give that a shot. I'm inclined to believe that my issue isn't a bug and is probably the result of some sort of usage error or poor observation on my part; I'll try to collect more info and will update when I do. Thanks again!
-
cannabeatz over 7 years
I was able to restore remote access to my machine, and I ran the command you gave me. Turns out ethminer's PPID was NOT 1; it was another process (PID 6837, if it matters). This process's name is "sudo". Now I'm REALLY confused. From what I understand, when I run a command with sudo, two processes are started (sudo and whatever the command is called). So this means my parent sudo process is dying for some reason?
-
cannabeatz over 7 years
Alright, if I kill the parent process whose PID is listed as the PPID when I run the above command, the PPID of the zombie becomes 1 (init). Very confused at this point.
-
heemayl over 7 years
@cannabeatz This confirms what I said: there could not be such a major bug in init. What confuses you? All the info you need is in my answer.
-
cannabeatz over 7 years
While I appreciate all the information, I haven't really found an answer to my issue. Allow me to explain what I am observing: running ps -eo pid,ppid,state,cmd | awk '$3=="Z"' to find the PPID of "ethminer" tells me ethminer's PPID is "6837". I kill that process with sudo kill 6837, and run ps -eo pid,ppid,state,cmd | awk '$3=="Z"' again to see if I successfully killed the zombie. No luck; the output of the command is now
6840 1 Z [ethminer] <defunct>
which shows init has once again become the parent of ethminer.