Tomcat suddenly shuts down on its own

ubuntu-18.04 unexpected-shutdown tomcat9

8,042

As I remarked in the comments, my first idea would be that a shutdown command (SHUTDOWN) was sent on Tomcat's shutdown port (by default listening on port 8005 of the interface localhost). The usage of this method is recommended in Tomcat's documentation.

However Ubuntu (following Debian) disabled the shutdown port by default in version 8.5.32-2. Newer versions of Tomcat perform a clean shutdown, when the JVM receives a SIGTERM, so enabling the shutdown port is just a security risk.

So the main candidates for your unexpected shutdown are:

the Linux OutOfMemory killer, which logs to the kernel log (journalctl -k),
a crash of the JVM. While the virtual machine itself is pretty stable, I have seen libtcnative bring it down. Check Tomcat9's main log (/var/log/catalina.<date>.log), which under your Ubuntu is also nicely collected by systemd-journald (journalctl -u tomcat9) and sent to syslog.
Java's own OutOfMemoryError, which should also be in Tomcat9's main log.

Update: based on your syslog excerpts, someone got access to the user app. A lot of user sessions for user app where created which used up the system's resources and caused many services to be shut down.

8,042

Amos

Updated on September 18, 2022

Comments

Amos almost 2 years

I'm running tomcat 9.0.26 on Ubuntu 18.04. Suddenly I noticed that tomcat is down. Never happened before without me shutting it down. I looked at the trace and saw the following (this seems to start the shutdown process):

29-Jan-2020 07:18:27.927 INFO [Thread-4] org.apache.coyote.AbstractProtocol.pause Pausing ProtocolHandler ["https-jsse-nio2-8443"]
29-Jan-2020 07:18:27.936 INFO [Thread-4] org.apache.coyote.AbstractProtocol.pause Pausing ProtocolHandler ["http-nio-8009"]
29-Jan-2020 07:18:27.946 INFO [Thread-4] org.apache.catalina.core.StandardService.stopInternal Stopping service [Catalina]

Starting here, it's part of the shutdown process itself, not the cause, though maybe related:

29-Jan-2020 07:18:28.145 WARNING [Thread-4] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [app] appears to have started a thread named [grpc-default-worker-ELG-1-1] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:

this part of the above appears to have started a thread named [grpc-default-worker-ELG-1-1] changes between exceptions: grpc-default-worker-ELG-1-2 Gax-2 Gax-3 and so on...

After that I see several lines like the following:

29-Jan-2020 07:18:28.154 SEVERE [Thread-4] org.apache.catalina.loader.WebappClassLoaderBase.checkThreadLocalMapForLeaks The web application [app] created a ThreadLocal with key of type [java.lang.ThreadLocal] (value [java.lang.ThreadLocal@117138f9]) and a value of type [io.grpc.netty.shaded.io.netty.util.internal.InternalThreadLocalMap] (value [io.grpc.netty.shaded.io.netty.util.internal.InternalThreadLocalMap@2b40bbea]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.

and lastly:

29-Jan-2020 07:18:28.262 INFO [Thread-4] org.apache.coyote.AbstractProtocol.stop Stopping ProtocolHandler ["https-jsse-nio2-8443"]
29-Jan-2020 07:18:28.305 INFO [Thread-4] org.apache.coyote.AbstractProtocol.stop Stopping ProtocolHandler ["http-nio-8009"]
29-Jan-2020 07:18:28.316 INFO [Thread-4] org.apache.coyote.AbstractProtocol.destroy Destroying ProtocolHandler ["https-jsse-nio2-8443"]
29-Jan-2020 07:18:28.323 INFO [Thread-4] org.apache.coyote.AbstractProtocol.destroy Destroying ProtocolHandler ["http-nio-8009"]
NOTE: Picked up JDK_JAVA_OPTIONS:  --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.rmi/sun.rmi.transport=ALL-UNNAMED

I'd appreciate any hints on where or what to look for. The shutting down without previous notice makes me wonder what happened there. Thanks

Update: I found this in /var/log/apache2 log, at the same time as the shutdown:

[Wed Jan 29 07:18:29.141852 2020] [mpm_prefork:notice] [pid 568] AH00169: caught SIGTERM, shutting down
[Wed Jan 29 07:38:12.843389 2020] [suexec:notice] [pid 513] AH01232: suEXEC mechanism enabled (wrapper: /usr/lib/apache2/suexec)
[Wed Jan 29 07:38:13.060250 2020] [mpm_prefork:notice] [pid 539] AH00163: Apache/2.4.29 (Ubuntu) mod_fcgid/2.3.9 OpenSSL/1.1.1 configured -- resuming normal operations
[Wed Jan 29 07:38:13.060350 2020] [core:notice] [pid 539] AH00094: Command line: '/usr/sbin/apache2'

In mail.log I see the following (last line's date time is suspicious), postfix is down on purpose):

Jan 29 07:38:08 app postgrey[271]: Process Backgrounded
Jan 29 07:38:08 app postgrey[271]: 2020/01/29-07:38:08 postgrey (type Net::Server::Multiplex) starting! pid(271)
Jan 29 07:38:08 app postgrey[271]: Resolved [localhost]:10023 to [::1]:10023, IPv6
Jan 29 07:38:08 app postgrey[271]: Resolved [localhost]:10023 to [127.0.0.1]:10023, IPv4
Jan 29 07:38:08 app postgrey[271]: Binding to TCP port 10023 on host ::1 with IPv6
Jan 29 07:38:08 app postgrey[271]: Binding to TCP port 10023 on host 127.0.0.1 with IPv4
Jan 29 07:38:08 app postgrey[271]: Setting gid to "122 122"
Jan 29 07:38:09 app postgrey[271]: Setting uid to "118"
Jan 29 07:38:11 app dovecot: master: Dovecot v2.2.33.2 (d6601f4ec) starting up for imap, pop3 (core dumps disabled)
Jan 29 07:38:12 app opendkim[502]: OpenDKIM Filter v2.11.0 starting (args: -P /var/run/opendkim/opendkim.pid -p inet:[email protected])

this is the only time and place that I see a reference to postgrey and I have no idea what it is.

Update 2: this is where things started falling apart:

Jan 29 07:13:56 app named[461]: client @0x7f6cec025fa0 35.172.217.55#37793 (wWW.MUARaPeNimBunGiLiR.DeSA.Id): query (cache) 'wWW.MUARaPeNimBunGiLiR.DeSA.Id/A/IN' denied
Jan 29 07:13:56 app named[461]: client @0x7f6cec025fa0 35.172.217.55#43632 (WWW.mUarapEnImbunGIlIr.dEsA.iD): query (cache) 'WWW.mUarapEnImbunGIlIr.dEsA.iD/A/IN' denied
Jan 29 07:13:56 app named[461]: client @0x7f6cec025fa0 35.172.217.55#50618 (www.MuarAPEnImBUNgIliR.DEsA.id): query (cache) 'www.MuarAPEnImBUNgIliR.DEsA.id/A/IN' denied
Jan 29 07:13:56 app named[461]: client @0x7f6cec025fa0 35.172.217.55#40648 (WWW.muAraPeniMbungiLir.desA.iD): query (cache) 'WWW.muAraPeniMbungiLir.desA.iD/A/IN' denied
Jan 29 07:13:56 app named[461]: client @0x7f6cec025fa0 35.172.217.55#59108 (Www.mUARaPeniMbuNGiLIR.deSA.iD): query (cache) 'Www.mUARaPeniMbuNGiLIR.deSA.iD/A/IN' denied
Jan 29 07:13:56 app named[461]: client @0x7f6cec025fa0 35.172.217.55#42231 (wWW.mUARapenIMBUNgIlIr.DESA.ID): query (cache) 'wWW.mUARapenIMBUNgIlIr.DESA.ID/A/IN' denied
Jan 29 07:13:56 app named[461]: client @0x7f6cec025fa0 35.172.217.55#43692 (www.MUarAPeNimbUNGIliR.DeSA.ID): query (cache) 'www.MUarAPeNimbUNGIliR.DeSA.ID/A/IN' denied
Jan 29 07:13:56 app named[461]: client @0x7f6cec025fa0 35.172.217.55#47881 (Www.mUAraPeNIMBungiLiR.dESA.id): query (cache) 'Www.mUAraPeNIMBungiLiR.dESA.id/A/IN' denied
Jan 29 07:13:56 app named[461]: client @0x7f6cec025fa0 35.172.217.55#36046 (www.mUaraPeNimBungIliR.Desa.id): query (cache) 'www.mUaraPeNimBungIliR.Desa.id/AAAA/IN' denied
Jan 29 07:13:56 app named[461]: client @0x7f6cec025fa0 35.172.217.55#49623 (wWw.MuArapenimbUNgILIR.dESA.iD): query (cache) 'wWw.MuArapenimbUNgILIR.dESA.iD/AAAA/IN' denied
Jan 29 07:13:56 app named[461]: client @0x7f6cec025fa0 35.172.217.55#42436 (Www.muARaPeniMBuNGIliR.DeSa.id): query (cache) 'Www.muARaPeniMBuNGIliR.DeSa.id/AAAA/IN' denied
Jan 29 07:13:56 app named[461]: client @0x7f6cec025fa0 35.172.217.55#41600 (WWw.mUARAPeNiMbunGiLIr.DEsA.iD): query (cache) 'WWw.mUARAPeNiMbunGiLIr.DEsA.iD/AAAA/IN' denied
Jan 29 07:13:56 app named[461]: client @0x7f6cec025fa0 35.172.217.55#40947 (WWW.mUAraPeNIMbUNgiLir.DeSA.iD): query (cache) 'WWW.mUAraPeNIMbUNgiLir.DeSA.iD/AAAA/IN' denied
Jan 29 07:13:56 app named[461]: client @0x7f6cec025fa0 35.172.217.55#47416 (WWW.MUARApENiMBunGilIR.dESA.Id): query (cache) 'WWW.MUARApENiMBunGilIR.dESA.Id/AAAA/IN' denied
Jan 29 07:13:56 app named[461]: client @0x7f6cec025fa0 35.172.217.55#50269 (Www.mUarAPENImBuNgilir.dESA.id): query (cache) 'Www.mUarAPENImBuNgilir.dESA.id/AAAA/IN' denied
Jan 29 07:13:56 app named[461]: client @0x7f6cec025fa0 35.172.217.55#34727 (WWw.MUArApEniMBUngIliR.DesA.id): query (cache) 'WWw.MUArApEniMBUngIliR.DesA.id/AAAA/IN' denied
then a lot of logs similar to this: Jan 29 07:17:14 app systemd[1]: Started Session c5470 of user app.
Jan 29 07:18:26 app systemd[1]: Removed slice system-serial\x2dgetty.slice.
then a lot of logs similar to this: Jan 29 07:18:26 app systemd[1]: Stopped Session c2888 of user app.
Jan 29 07:18:27 app systemd[875]: Closed GnuPG cryptographic agent and passphrase cache.
Jan 29 07:18:27 app systemd[875]: Reached target Shutdown.
Jan 29 07:18:27 app systemd[875]: Starting Exit the Session...
Jan 29 07:18:27 app systemd[1]: Stopped Session c3432 of user app.
Jan 29 07:18:27 app systemd[875]: Received SIGRTMIN+24 from PID 13436 (kill).
Jan 29 07:38:06 app systemd-sysctl[85]: Couldn't write '1' to 'kernel/kptr_restrict', ignoring: Permission denied
Jan 29 07:38:06 app systemd-sysctl[85]: Couldn't write '1' to 'fs/protected_hardlinks', ignoring: Permission denied
Jan 29 07:38:06 app systemd-sysctl[85]: Couldn't write '1' to 'fs/protected_symlinks', ignoring: Permission denied
Jan 29 07:38:06 app systemd-sysctl[85]: Couldn't write '176' to 'kernel/sysrq', ignoring: Permission denied
Jan 29 07:38:06 app systemd-sysctl[85]: Couldn't write '1' to 'net/ipv4/tcp_syncookies', ignoring: No such file or directory
Jan 29 07:38:06 app systemd-sysctl[85]: Couldn't write '1' to 'kernel/yama/ptrace_scope', ignoring: Permission denied
Jan 29 07:38:06 app keyboard-setup.sh[67]: Couldn't get a file descriptor referring to the console
Jan 29 07:38:06 app systemd-sysctl[85]: Couldn't write '65536' to 'vm/mmap_min_addr', ignoring: Operation not permitted
Jan 29 07:38:06 app keyboard-setup.sh[67]: Couldn't get a file descriptor referring to the console
Jan 29 07:38:06 app systemd-sysctl[85]: Couldn't write 'fq_codel' to 'net/core/default_qdisc', ignoring: No such file or directory
Jan 29 07:38:06 app keyboard-setup.sh[67]: Couldn't get a file descriptor referring to the console
Jan 29 07:38:06 app keyboard-setup.sh[67]: message repeated 4 times: [ Couldn't get a file descriptor referring to the console]
Jan 29 07:38:06 app keyboard-setup.sh[67]: setupcon: We are not on the console, the console is left unconfigured.

Why do I see 100s of Stopped Session <session> of user...? and I mean 100s! is it open sessions? is it suppose to be like that? (overload?)

Also this line Jan 29 07:18:27 app systemd[875]: Received SIGRTMIN+24 from PID 13436 (kill). is VERY suspicious. Will I ever know what was PID 13436?

Piotr P. Karwasz over 4 years

If you didn't disable the shutdown port, shutting down Tomcat9 may be as simple as echo SHUTDOWN | nc <server> 8009
Amos over 4 years

And it has to be from inside the vps or also from outside? How can I know who initiate this command?
Piotr P. Karwasz over 4 years

By default the shutdown port is bound to localhost only and listening on port 8005. If the server gets stopped this way you'll get a log entry: A valid shutdown command was received via the shutdown port. Stopping the Server instance.. You can't know who issued it, so check the attributes of the <Server> element in the server.xml file.
Piotr P. Karwasz over 4 years

SIGRTMIN+24 is the signal a process sends to stop a user systemd instance.
Amos over 4 years

@PiotrP.Karwasz I decided to start my VPS from scratch, it looks like the virus got away cpu-wise but it still has remains that must affect things including this behavior
Piotr P. Karwasz over 4 years

Good idea, it's always easier to reinstall a compromised machine from scratch. One never knows which libraries and executables where substituted.
Amos over 4 years

I wouldn't say easier but definitely a better option :) I hope the downtime will be bearable. Thanks again for your support, moral and technical :)
Amos over 4 years

@PiotrP.Karwasz the new VPS has the same behavior. I read this thegeekstuff.com/2011/05/openssh-options and changed the values to 150 and 2 respectively. Now I will see if it helps in anyway. Before that I posted a question: serverfault.com/questions/1002204/…
Amos over 4 years

I did some more digging and from reading over the net, it seems to be an expected behavior and people just want to remove the logging instead of understanding why it opens/close so many times. Also no one makes a notice to the sessions not being cleared from the list... It looks like I'm missing something basic and I don't understand what it is.
Amos over 4 years

A more detailed and related question on this matter, hopefully the wisdom of serverfault will show me the way :) serverfault.com/questions/1002341/…
Piotr P. Karwasz over 4 years

Just to check: you didn't user the same weak root password and got infected again? The interesting part is why the sessions never close.
Amos over 4 years

From reading on the net, people seem to expect this behavior and just ask how to remove the logging itself (it's coming from PAM). No one checked the sessions list maybe...
Piotr P. Karwasz over 4 years

Usually the systemd session list does not grow, as sessions exit normally.

Amos over 4 years

Thank you very much! Please read my update in my original post, there was a sigterm. Can you shed some more light about it or tell me where do you recommend me to read about it?
Amos over 4 years

My main question is: how can I know what happened and what triggered this sigterm. Thanks.
Piotr P. Karwasz over 4 years

Look in the kernel logs (journalctl -k) at about the same time. Looks like the OOM Killer killed both Apache and Tomcat.
Amos over 4 years

I executed journalctl -k and got -- Logs begin at Sun 2019-02-10 00:49:46 UTC, end at Fri 2020-01-31 21:34:15 UTC. -- -- No entries --. How can I check (another method) if the problem was out of memory? in kern.log I only have this (from long time ago): Nov 4 19:22:36 domain t of memory [16157] Nov 4 19:22:36 domain t message repeated 4 times: [ of memory [16157]]
Amos over 4 years

Please see one more section in updated. from mail.log... the hour is suspicious...
Piotr P. Karwasz over 4 years

For journalctl, maybe you don't have permissions to read it (you must be root or either in the adm or systemd-journal group). So most of your services restarted. If you don't find evidence for an out of memory, check /var/log/unattended-upgrades.log.
Amos over 4 years

I don't have such log file. I do have a lot of things (for example, a lot of "Stopped Session c3044 of user..." and "Couldn't write '1' to 'kernel/kptr_restrict', ignoring: Permission denied" (and more) in syslog around that time but I can't post it all here. I wonder if there is one file that contains the reason that started all this. There are a lot of log files around :) Also, did you see my addition about postgrey? (and I'm root)
Piotr P. Karwasz over 4 years

Postgrey did restart as all other services on your server. Try adding the /var/log/syslog entries from 7:15 when everything started to crash to around 7:35. At 7:38 all your services where relaunched by systemd (except Tomcat).
Amos over 4 years

Look at update2 - I think there are some interesting stuff... I wish I knew what they meant but they look very suspicious. Thanks!
Piotr P. Karwasz over 4 years

I updated the answer.
Amos over 4 years

The username is my domain's name. The id is 1000, the "real name" column says "root" and lastly: this user's login is currently disabled (I disabled it before all this). If I understand you correctly, someone used this user to do some stuff that eventually crashed the system? How can I know if I can delete this user?
Piotr P. Karwasz over 4 years

Let us continue this discussion in chat.