Debugging JBoss 100% CPU usage


Solution 1

You can send a SIGQUIT signal to the running JVM to get each thread's stack trace printed to stdout. This doesn't kill the process, though I believe it does pause all threads briefly while the stack traces are being printed.
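As a quick sketch (the pgrep pattern is an assumption; adjust it to match however your JBoss java process is launched):

```shell
# Find the JBoss java process ID. Assumption: only one java process on the
# box; otherwise match on your JBoss install path or main class instead.
PID=$(pgrep -n java)

# Signal 3 (SIGQUIT) asks HotSpot to dump every thread's stack to the JVM's
# stdout; the process keeps running afterwards. The dump lands wherever
# JBoss's stdout goes, e.g. its console log.
kill -QUIT "$PID"
```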

Then correlate the thread IDs listed with your preferred tool for per-thread CPU utilization: prstat -L on Solaris, top -H on Linux. Note that the thread IDs (the nid field) in the Java stack traces are printed in hexadecimal, so you'll have to convert them when comparing against the decimal IDs in top or prstat output.
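For example, to go from a decimal thread ID reported by top -H to the hexadecimal nid form in the dump (the TID value here is hypothetical):

```shell
# Hypothetical decimal LWP/thread id as shown by `top -H` or `prstat -L`
TID=28567

# Java thread dumps show the same id in hex, as "nid=0x..."
printf 'nid=0x%x\n' "$TID"

# You can then search the saved thread dump for that thread, e.g.:
# grep "nid=0x6f97" threaddump.txt
```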

Solution 2

I do a thread dump. However, on my production systems this cannot be done unless the JVM is started with certain parameters that we would never enable in production. In that case, I use the JMX console's jboss.system:type=ServerInfo MBean and invoke its listThreadDump() operation.

The thread dump output is mostly meaningless to me when I haven't written the code, but the person who wrote it may be able to make sense of it. In cases where thread dumps don't help, I prefer running "strace -fp <PID of JBoss' java process> -o outfile.txt" for another view of what is happening at the system-call level. It's a bit like drinking from a firehose, but sometimes it helps.
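Since the raw log really is a firehose, one first-pass triage (a sketch; it assumes the outfile.txt produced by the command above, where each line starts with a thread ID followed by a syscall) is to count how often each syscall appears:

```shell
# Tally syscalls in the strace log and show the most frequent ones.
# Assumes lines of the form "12345 futex(0x..., FUTEX_WAIT, ...) = 0",
# i.e. thread id, then "syscallname(args...) = result".
awk '{ name = $2; sub(/\(.*/, "", name); count[name]++ }
     END { for (n in count) print count[n], n }' outfile.txt | sort -rn | head
```

Alternatively, strace's own -c flag (strace -c -fp <PID>) prints a ready-made per-syscall time/count summary when you detach.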

Author: Nate

Updated on September 18, 2022
Comments

  • Nate
    Nate over 1 year

    We are using JBoss to run two of our WARs. One is our web app, the other is our web service. The web app accesses a database on another machine and makes requests to the web service. The web service makes JMS requests to other machines, aggregates the data, and returns it.

    At our biggest client, about once a month the JBoss Java process takes 100% of all CPUs. The machine running JBoss has 8 CPUs. Our web app is still accessible during this time, but pages take about 3 minutes to load. Restarting JBoss restores everything to normal.

    The database machine and all the other machines are fine, only the machine running JBoss is affected. Memory usage is normal. Network utilization is normal. There are no suspect error messages in the JBoss logs.

    I have set up a test environment as close as possible to the client's production environment and I've done load testing with as much as 2x the number of concurrent users. I have not gotten my test environment to replicate the problem.

    Where do we go from here? How can we narrow down the problem?

    Currently the only plan we have is to wait until the problem occurs in production on its own, then do some debugging to determine the cause. So far people have just restarted JBoss when the problem occurred, to minimize downtime. Next time it happens, they will get a developer to take a look. The question is: what can that developer do to determine the cause?

    We could set up a separate JBoss instance on the same box and install the web app separately from the web service. That way, when the problem next occurs, we will know which WAR has the problem (assuming it is our code). This doesn't narrow it down much, though.

    Should I enable JMX remote? That way, the next time the problem occurs I can connect with VisualVM and see which threads are taking the CPU and what the hell they are doing. However, is there a significant downside to enabling JMX remote in a production environment?

    Is there another way to see what threads are eating the CPU and to get a stacktrace to see what they are doing?

    Any other ideas?

    Thanks!

    • bart van deenen
      bart van deenen about 14 years
      I think that this question will probably get more attention at stackoverflow.com than here.
  • Mircea Vutcovici
    Mircea Vutcovici about 14 years
    Java is multithreaded, and strace will not attach to all threads if the process is already started. However, it is possible to find the thread that uses 100% CPU by using top -Hp <java_PID>, and then, after finding the PID of that thread, run strace -fp <thread_pid> -o outfile.txt
  • Nate
    Nate about 14 years
    I have the tools ready to debug the problem next time it happens in production. This client uses Windows (don't ask). I ended up using CDB, a Windows tool, for getting all-time thread CPU usage and native IDs. I have a script to run this twice with 10 seconds between runs; the threads that change the most are the culprits. Then I run jstack from the JDK to get the thread stack traces, including native IDs. Now we just need production to chowder again! :)
  • Dave
    Dave over 5 years
    @curious_george, what's the command you type on the Jboss/Wildfly CLI tool that gives you the thread dump? I tried "listThreadDump()" as you list, but got the error, "'listThreadDump()' is not a valid operation name."
  • Beehive Software Consultants
    Beehive Software Consultants over 5 years
    @Dave: Sorry, I can't remember. I made the comment over 8 years ago.