How to investigate an unexpected Linux server shutdown?

troubleshooting debian-squeeze unexpected-shutdown

107,234

Solution 1

First, I must ask: "shutdowns"? Do you mean that the machine reboots, or does it actually halt? If it halts, it is either misconfigured (perhaps in the BIOS) or something is actively shutting down the machine (e.g. init 0).

If not, your primary candidates are /var/log/syslog and /var/log/kern.log, as your problem sounds like a kernel panic or a software-triggered hardware fault. Of course, if the server runs some service (e.g. apache), its logs may give you a clue too.

Often, in situations like this, there are log entries generated, but because the machine is having difficulties, it won't manage to write the entries to disk. If the box is colocated, chances are that it is connected to a serial console by the colo partner. That is where I would look if I did not find anything suspicious in the above logs.

If the machine is not connected to a serial console and there is nothing in the log, you may want to consider sending syslog to a different box via network. Perhaps the network interface survives a bit longer, and the log messages can be read on the syslog server. Have a look at rsyslog or syslog-ng.
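As a rough illustration of the remote-logging idea, here is a minimal rsyslog sketch. The host name loghost.example.com and the file name are placeholders, and the receiver lines use the legacy directive syntax on the assumption that an older rsyslog (as shipped with Debian 6) is in use; check the documentation for your version before relying on it.

```
# On the failing server, e.g. in /etc/rsyslog.d/remote.conf (file name is an assumption)
# Forward every facility and priority to the remote collector.
# A single @ sends over UDP; @@ would send over TCP instead.
*.* @loghost.example.com:514

# On the collector (loghost.example.com is a placeholder), in /etc/rsyslog.conf
# Load the UDP input module and listen on port 514.
$ModLoad imudp
$UDPServerRun 514
```

UDP is used here because it needs no connection setup, so a struggling machine has a slightly better chance of getting its last messages out. Restart rsyslog on both machines after changing the configuration.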

UPDATE:

I agree with @Johann below. The most likely cause of a halt is the processor temperature watchdog. Try checking or plotting the temperature in the box via lm-sensors or smartctl (usually the easiest). I find that collectd is unparalleled at keeping track of a large number of variables over time; it can read from IPMI, lm-sensors and hddtemp. Also, some BIOSes log temperature halt events.

Solution 2

First, you want to check /var/log/syslog. If you are not sure what to look for, you can start by looking for the words error, panic and warning.

grep -i error /var/log/syslog
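To look for all three keywords in one pass, something along these lines should work. The function name scan_log is just a hypothetical helper for illustration; it defaults to /var/log/syslog but accepts any log file.

```shell
# Hypothetical helper: scan a log file for common failure keywords.
# Defaults to /var/log/syslog when no file is given.
scan_log() {
    log_file="${1:-/var/log/syslog}"
    # -i: ignore case, -n: prefix matches with line numbers,
    # -E: extended regex so | means "or"
    grep -inE 'error|panic|warning' "$log_file"
}
```

For example, `scan_log /var/log/kern.log` runs the same search against the kernel log. If the crash happened before the last log rotation, the rotated files (e.g. /var/log/syslog.1) are worth scanning too.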

If you have system graphs available (e.g. Munin), check them and look for abnormal patterns. If you do not have Munin installed, it might be an idea to install it (apt-get install munin munin-node).

You should also check root-mail for any interesting messages that might be related to your system crash.

Other log files you should check are application error logs, e.g. /var/log/apache2/error.log or similar. They might contain information leading you to the problem.

Solution 3

In my experience, an "unexpected halt" is almost always caused by overheating. Check your temperatures and fan speeds via lm_sensors and make sure that they are good.

Recently we had the same pattern: a server halted about one hour after support manually started it. After that hour, the CPU temperature hit the threshold configured in the BIOS (iirc 60 or 70°C) and the system halted. All this trouble was caused by a broken CPU fan. After replacing the fan, everything returned to normal.
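If lm_sensors is not set up yet, a quick first look is sometimes possible through the kernel's thermal sysfs interface. This is only a sketch: it assumes the kernel exposes zones under /sys/class/thermal, which not all hardware does, and print_zone_temps is a hypothetical helper name.

```shell
# Hypothetical helper: print each thermal zone's temperature in degrees Celsius.
# Assumes zones are exposed under /sys/class/thermal (not true on all hardware).
print_zone_temps() {
    base="${1:-/sys/class/thermal}"
    for zone in "$base"/thermal_zone*; do
        # Skip zones without a readable temp file (also covers "no zones at all")
        [ -r "$zone/temp" ] || continue
        millideg=$(cat "$zone/temp")
        # sysfs reports millidegrees Celsius; integer division is enough here
        echo "$(basename "$zone"): $((millideg / 1000)) C"
    done
}
```

Appending this output with a timestamp to a file every minute, for example from cron, gives a crude trend you can compare against the BIOS threshold after the next halt.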

Solution 4

There are a number of log files in the /var/log directory (and its subdirectories), including

/var/log/boot

and

/var/log/boot.log

Start with the files above.

Solution 5

You can find out whether the system knew that it was going down with the following commands:

sudo last -1x reboot
sudo last -1x shutdown

If they return nothing, the system did not record going down, so the cause could be a loss of power or something else external.

If they do return entries, search the logs around the reboot/shutdown time.
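The same check can be automated roughly as follows. This is a heuristic sketch, not a definitive test: classify_last_boot is a hypothetical helper that reads the output of `last -x reboot shutdown` (newest entry first) and looks at what was recorded just before the most recent boot.

```shell
# Hypothetical helper: reads `last -x reboot shutdown` output on stdin
# (newest entry first) and reports how the previous run appears to have ended.
classify_last_boot() {
    awk '
        # The first reboot line is the current boot; remember it and move on
        $1 == "reboot" && !seen { seen = 1; next }
        # A shutdown record right before it means the system knew it was going down
        seen && $1 == "shutdown" { print "clean"; found = 1; exit }
        # Another reboot record with no shutdown in between suggests an abrupt stop
        seen && $1 == "reboot" { print "unclean"; found = 1; exit }
        END { if (!found) print "unknown" }
    '
}
```

Usage would be `last -x reboot shutdown | classify_last_boot`. A result of "unknown" usually just means the wtmp file was rotated and does not reach back far enough.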



alfish

Updated on September 18, 2022

Comments

alfish almost 2 years

On a new Xeon 55XX server with 4 SSDs in RAID 10 running Debian 6, I have experienced 2 random shutdowns within two weeks of the server being built. Looking at bandwidth logs before the shutdowns does not indicate anything unusual. The server load is usually very low (about 1) and it is colocated far away. There seems to have been no power outage when the server went down.

I know that I should look in /var/log, but I am not sure which logs I should investigate and what I should look for. So I would appreciate your hints.
cherouvim over 11 years

Did you find what was the problem?
alfish about 12 years

The machine went off, and returned to life just after I asked the support to manually start it.
pkhamre about 12 years

If temperature is the issue, install munin to track temperature-data over time to spot trends.
Grant about 12 years

+1 to temperature issues. Had the same thing on one of my servers in a datacenter - turns out they forgot to connect one of the CPU fans when they built the system.
Pierre.Vriens about 8 years

And look for "what"?
asdmin about 8 years

That depends on the type of failure that occurred. In most cases, the root cause is a kernel crash, a power failure or an overheating-induced CPU shutdown, which means there is nobody left to write an entry to the log files and flush it to disk, so there will be no messages there at all.
psv almost 4 years

Not sure why this is down-voted, but imo this is the best advice for finding out whether the system was shut down properly or not.