"service mysqld stop" times out (then found out that "mysqld dead but subsys locked")

7,355

First things first: a very well-done, systematic and thorough debugging, good job.

On my RHEL 5.6 box I always get a return code of 1 if I try to kill a non-existing pid. I tried as both root and a non-privileged user, both with full path and a with just the command name. I also get only terse kill XXX: No such process, with no elaborate error messages.

It may be a good idea to run rpm -Vv util-linux and see if somebody didn't replace /bin/kill with a new and improved version. Even if rpm verification says the file is pristine, I'd try renaming /bin/kill and copying over a binary from a working machine. If the file replacement helps and you don't uncover a legitimate the source of the change, then regardless of output of rpm verification I'd assume the machine was compromised.

Share:
7,355

Related videos on Youtube

lamp_scaler
Author by

lamp_scaler

Updated on September 18, 2022

Comments

  • lamp_scaler
    lamp_scaler over 1 year

    I installed mysql and server via yum on my 64bit CentOS 5 server. It starts up fine but when I try to stop it it stalls out and then I have to "Ctrl-C" it. Then I run a "service mysqld status" and it shows:

    mysqld dead but subsys locked
    

    I run ps aux and mysql is nowhere to be found. Starting mysqld again via "service mysqld start" works fine. Attempting to stop it creates the same problem.

    I then realized that /var/lock/subsys/mysqld still exist. When running mysqld I checked /var/run/mysqld/mysqld.pid and it matched with the pid of the running service.

    I tried reinstalling mysql and deleting all files and configs but to no avail.

    What to do?

    EDIT:

    I added some echo statements in the /etc/init.d/mysqld file, particularly in the stop function:

    stop(){
            if [ ! -f "$mypidfile" ]; then
                # not running; per LSB standards this is "ok"
                action $"Stopping $prog: " /bin/true
                return 0
            fi  
            echo "beginning stop sequence"
            MYSQLPID=`cat "$mypidfile"`
            if [ -n "$MYSQLPID" ]; then
                /bin/kill "$MYSQLPID" >/dev/null 2>&1
                echo "killing pid $MYSQLPID"
                ret=$?
                if [ $ret -eq 0 ]; then
                    echo "return code $ret after kill attempt"
                    TIMEOUT="$STOPTIMEOUT"
                    echo "timeout is set to $STOPTIMEOUT"
                    while [ $TIMEOUT -gt 0 ]; do
                        /bin/kill -0 "$MYSQLPID" >/dev/null 2>&1 || break
                        sleep 1
                        let TIMEOUT=${TIMEOUT}-1
                        echo "timeout is now $TIMEOUT"
                    done
                    if [ $TIMEOUT -eq 0 ]; then
                        echo "Timeout error occurred trying to stop MySQL Daemon."
                        ret=1
                        action $"Stopping $prog: " /bin/false
                    else
                        echo "attempting to del lockfile: $lockfile"
                        rm -f $lockfile
                        rm -f "$socketfile"
                        action $"Stopping $prog: " /bin/true
                    fi
                else
                    action $"Stopping $prog: " /bin/false
                fi
            else
                # failed to read pidfile, probably insufficient permissions
                action $"Stopping $prog: " /bin/false
                ret=4
            fi
            return $ret
    }
    

    This is the result I get when I try to stop the service:

    [root@server]# service mysqld stop
    beginning stop sequence
    killing pid 9145
    return code 0 after kill attempt
    timeout is set to 60
    timeout is now 59
    timeout is now 58
    timeout is now 57
    timeout is now 56
    timeout is now 55
    timeout is now 54
    timeout is now 53
    timeout is now 52
    timeout is now 51
    timeout is now 50
    timeout is now 49
    

    From looking at the code it seems to me that it will never break out of that while loop and will not be able to delete the lock file. Am I interpreting this wrong? I checked the same file on my other server and it uses the same code. I am dumbfounded.

    EDIT: In the while loop part

     /bin/kill -0 "$MYSQLPID" >/dev/null 2>&1 || break
    

    For some reason it is not recognizing the return code. When service mysqld stop is called, the process has already been killed but not sure why it isn't allowing the loop to break out.

    EDIT: Further testing shows some weird behavior between calling /bin/kill and just calling kill, they apparently return different codes, why??????:

    [root@server]# /bin/kill 25200
    kill 25200: No such process
    [user@server]# echo ${?}
    0
    [root@server]# kill 25200
    -bash: kill: (25200) - No such process
    [root@server]# echo ${?}
    1
    

    EDIT: I logged on as a non-root user and tried executing "kill" and "/bin/kill" with a surprising results:

    [notroot@server ~]$ kill -0 23232
    -bash: kill: (23232) - No such process
    [notroot@server ~]$ echo $?
    1
    [notroot@server ~]$ /bin/kill -0 23232
    kill 23232: No such process
    (No info could be read for "-p": geteuid()=501 but you should be root.)
    [notroot@server ~]$ echo $?
    0
    

    The "No info could be read" error does not show up in my other servers when executing kill and bin/kill as a nonroot user.

    EDIT: Added logging described by quanta and also checked out the mysql log:

    After a start and stop, mysql log shows this:

    110918 00:11:28 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
    110918  0:11:28 [Note] Plugin 'FEDERATED' is disabled.
    110918  0:11:28  InnoDB: Initializing buffer pool, size = 16.0M
    110918  0:11:28  InnoDB: Completed initialization of buffer pool
    110918  0:11:29  InnoDB: Started; log sequence number 0 44233
    110918  0:11:29 [Warning] 'user' entry 'root@server' ignored in --skip-name-resolve mode.
    110918  0:11:29 [Warning] 'user' entry '@server' ignored in --skip-name-resolve mode.
    110918  0:11:29 [Note] Event Scheduler: Loaded 0 events
    110918  0:11:29 [Note] /usr/libexec/mysqld: ready for connections.
    Version: '5.1.58-ius'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Distributed by The IUS Community Project
    110918  0:11:34 [Note] /usr/libexec/mysqld: Normal shutdown
    
    110918  0:11:34 [Note] Event Scheduler: Purging the queue. 0 events
    110918  0:11:34  InnoDB: Starting shutdown...
    110918  0:11:39  InnoDB: Shutdown completed; log sequence number 0 44233
    110918  0:11:39 [Note] /usr/libexec/mysqld: Shutdown complete
    
    110918 00:11:39 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
    

    Then in the tmp/mysql.log:

    kill 23080: No such process
    kill 23080: No such process
    kill 23080: No such process
    kill 23080: No such process
    kill 23080: No such process
    kill 23080: No such process
    kill 23080: No such process
    kill 23080: No such process
    kill 23080: No such process
    kill 23080: No such process
    

    I stopped the stop process midway so I don't have to wait for the timeout. Looks like the process was killed. The problem, I think, is still with the different return codes from "kill" and "/bin/kill"

    • Greg Petersen
      Greg Petersen over 12 years
      Replace >/dev/null at the first attempt to kill with >/tmp/mysql.log and the second in the while loop with >>/tmp/mysql.log and try again to see what the logs say. Also take a look at /var/log/mysqld.log.
    • lamp_scaler
      lamp_scaler over 12 years
      Added the results above. I still believe something weird is happening with the kill and /bin/kill not returning the same exit codes.
    • Greg Petersen
      Greg Petersen over 12 years
      The first attempt returns 0 because you had switched to normal user from root. Are you always get return status 0 when trying /bin/kill with normal user? What is the output when running in debug mode sh -x /etc/init.d/mysqld stop?
    • lamp_scaler
      lamp_scaler over 12 years
      As you see from one of my edits above, running /bin/kill as a non-root user returns this error: (No info could be read for "-p": geteuid()=501 but you should be root.). But this does not happen in my other server when I run /bin/kill as a non-root user.
    • fxmtor
      fxmtor over 12 years
      Bash has a builtin command kill. This is helpful in case the system is swamped enough that it can't spawn any more processes, then you can still kill rogue processes if you're logged in. Thus, when you run kill(as opposed to /bin/kill), then you're running the builtin. This should explain the difference between running kill and /bin/kill, where you call a specific binary.
  • lamp_scaler
    lamp_scaler over 12 years
    Looks like something is up with /bin/kill:. S.5....T /bin/kill
  • lamp_scaler
    lamp_scaler over 12 years
    Did a quick yum update util-linux and all is well now. You, sir, are a gentleman AND a scholar.
  • Paweł Brodacki
    Paweł Brodacki over 12 years
    Thank you for your courteous words. I think rpm -aV may be in order. util-linux may be not the only affected package, and whoever changed the kill binary doesn't seem to manipulate rpm's base as well.