Bash Script to find zombie processes?

6,392

You don't have to reboot because of zombie processes. Here's why:

  • A process becomes a zombie when it has finished, but its parent has not yet called wait(2) to collect its exit status

  • A zombie does not consume any physical or virtual resources other than an entry in the kernel's process table

  • Once the parent calls wait(2), the zombie is properly reaped and its process table entry is removed

  • If the zombie becomes an orphan, i.e. if its parent dies, then init (PID 1) inherits the process and reaps it by calling wait(2)

As you can see, it is only a matter of time until wait(2) is called and the zombie is reaped. If zombies accumulate over time, treat that as a programming flaw in the parent: fix the code (or ask for it to be fixed) rather than rebooting, which is unnecessary.
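
The lifecycle described above can be reproduced on a test machine. This is only a minimal sketch, assuming Linux with the procps version of ps (the --ppid option is not portable); the sleep durations and variable names are arbitrary choices for illustration. A subshell starts a short-lived child and then replaces itself, via exec, with a longer sleep that never calls wait(2), so the child stays in state Z until that parent exits.

```shell
#!/usr/bin/env bash
# Start a subshell that forks `sleep 1` and then execs `sleep 3`.
# After the exec, the process with PID $parent is `sleep 3`, which
# never calls wait(2) on the child it inherited from the subshell.
( sleep 1 & exec sleep 3 ) &
parent=$!

# Give the child time to exit; it has finished but is not yet reaped.
sleep 2

# Ask ps for the state of every child of $parent (header suppressed).
zombie_state=$(ps -o state= --ppid "$parent" | tr -d ' ')
echo "child state while parent is alive: ${zombie_state}"

# Once the parent exits, the zombie is orphaned and handed to init
# (or a subreaper), which is expected to reap it.
wait "$parent"
```

No reboot is involved at any point: the zombie only exists for as long as its parent stays alive without reaping it.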


To find zombie processes, check the STATE of each process; if it is Z, the process is a zombie:

ps -eo pid,ppid,state,cmd | awk '$3=="Z"'

Here I have selected only a few fields, namely PID, PPID, STATE and COMMAND.
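
Since the real fix belongs in the parent, it can help to print which parent each zombie is waiting on. The following is a rough sketch, assuming procps ps; filter_zombies and report_zombies are hypothetical helper names introduced here for illustration, not standard commands.

```shell
#!/usr/bin/env bash
# filter_zombies: read "PID PPID STATE CMD..." lines on stdin and
# print "PID PPID" for every line whose state field is Z.
filter_zombies() {
    awk '$3 == "Z" { print $1, $2 }'
}

# report_zombies: for each zombie on the system, print its PID together
# with the PID and command name of the parent that has not reaped it.
report_zombies() {
    # Separate -o options with empty header text suppress the header line.
    ps -e -o pid= -o ppid= -o state= -o cmd= | filter_zombies |
    while read -r pid ppid; do
        parent_cmd=$(ps -o comm= -p "$ppid")
        printf 'zombie %s: parent %s (%s)\n' "$pid" "$ppid" "${parent_cmd:-unknown}"
    done
}

report_zombies
```

If the reported parent is a long-running program rather than init, that program is the one failing to call wait(2), and restarting or fixing it is the place to start.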

Author: cannabeatz

Control Systems Programmer and budding cryptocurrency mining node admin.

Updated on September 18, 2022

Comments

  • cannabeatz
    cannabeatz over 1 year

    So recently I've noticed that I have a process that will randomly crash and become a zombie with a PPID of 1 (init). I've been told that the only way to fix this is to reboot the PC (or send SIGCHLD to init, which is dicey/useless, from what I understand).

    Essentially, what I'm looking to do is write a bash script that will just look for a zombie process and, if there is one, reboot the PC.

    Currently, I use this script to monitor the process itself:

     ps auxw | grep ethminer | grep -v grep > /dev/null
    
     if [ $? != 0 ]
     then
        sudo reboot
     fi
    

    Now, this script seems to work fine when ethminer is either RUNNING or NOT RUNNING: it reboots the machine if it does not see ethminer in the process table, and it does nothing if it does see it.

    However (from my admittedly loose understanding), since there is no exit code when the process becomes a zombie, if [ $? != 0 ] doesn't get any input and therefore doesn't do anything.

    Is there any way I can fix/modify this script so it does what I want it to do? Or am I way off track here?

    Thanks!

  • cannabeatz
    cannabeatz over 7 years
    OK, that makes sense. Thanks! Let me explain a bit better: the machine in question is a cryptocurrency mining rig, so ideally the miner process should run 24/7. The reason I'm concerned about this process being a zombie isn't necessarily resource utilization; it is that I can't figure out how to make the process restart without removing the zombie, since it already has an entry in the process table. Init doesn't seem to be properly reaping the orphaned zombie either; it's been two days now and the zombie still persists.
  • cannabeatz
    cannabeatz over 7 years
    So I guess the real question here is "Why is this process's parent dying?". My workaround is just sort of a band-aid.
  • heemayl
    heemayl over 7 years
    @cannabeatz Are you sure the current parent of the process is init, i.e. that its original parent died? Send me the output of the command I have given.
  • cannabeatz
    cannabeatz over 7 years
    I am relatively sure of this; the PPID is 1 and the process shows up as ethminer<defunct>. I'd paste in the actual output of the command, but I just rebooted the rig again and apparently it didn't restart correctly, so now I've lost remote access. I'll have to get back with more info when I'm physically in front of the machine. Thanks for the info!
  • heemayl
    heemayl over 7 years
    @cannabeatz init reaps its children routinely; that is part of its design. If it is not doing so in your case, consider it a bug in init itself. What's your init? Also try sending SIGCHLD: kill -SIGCHLD 1. If that does not work, file a bug report. But first make sure that the current parent really is init, because this sort of bug is very, very rare in init.
  • cannabeatz
    cannabeatz over 7 years
    Alright, I'll give that a shot. I'm inclined to believe that my issue isn't a bug and is probably the result of some sort of usage error or poor observation on my part; I'll try to collect more info and will update when I do. Thanks again!
  • cannabeatz
    cannabeatz over 7 years
    I was able to restore remote access to my machine, and I ran the command you gave me. Turns out ethminer's PPID was NOT 1; it was another process (PID 6837, if it matters). This process's name is "sudo". Now I'm REALLY confused. From what I understand, when I run a command with sudo, two processes are started (sudo and whatever the command is called). So this means my parent sudo process is dying for some reason?
  • cannabeatz
    cannabeatz over 7 years
    Alright, if I kill the parent process (the PPID listed when I run the above command), the PPID of the zombie becomes 1 (init). Very confused at this point.
  • heemayl
    heemayl over 7 years
    @cannabeatz This confirms what I said: there could not be such a major bug in init. What confuses you? All the info you need is in my answer.
  • cannabeatz
    cannabeatz over 7 years
    While I appreciate all the information, I haven't really found an answer to my issue. Allow me to explain what I am observing: running ps -eo pid,ppid,state,cmd | awk '$3=="Z"' to find the PPID of "ethminer" tells me ethminer's PPID is 6837. I kill that process with sudo kill 6837, and run ps -eo pid,ppid,state,cmd | awk '$3=="Z"' again to see if I successfully killed the zombie. No luck; the output of ps -eo pid,ppid,state,cmd | awk '$3=="Z"' is now 6840 1 Z [ethminer] <defunct>, which shows init has once again become the parent of ethminer.