How to determine cause of system crash?

linux centos server-crashes

84,749

Solution 1

If you have crashkernel/kdump installed and enabled, you should be able to examine the crashed kernel with relative ease using the crash utility. For example, presuming that your crashed kernel dumps are saved under /var/crash: crash /var/crash/2009-07-17-10\:36/vmcore /usr/lib/debug/lib/modules/`uname -r`/vmlinux. The vmlinux file with debug symbols is provided by the kernel-debuginfo package and has to match the kernel that crashed.

Have a look here and here for more details.
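As a rough sketch of the steps above, the helpers below check that memory was reserved for the crash kernel, find the newest dump, and print the matching crash command. The function names are made up for illustration, /var/crash is only the location assumed in this answer (kdump can be configured to write elsewhere), and using `uname -r` assumes the machine rebooted into the same kernel version that crashed.

```shell
#!/bin/sh
# Hypothetical helpers; nothing here is part of kdump or crash itself.

# Return 0 if the kernel was booted with a crashkernel= reservation.
# The optional argument (a file to read instead of /proc/cmdline) exists
# only so the check can be tried against a sample file.
crashkernel_reserved() {
    grep -q 'crashkernel=' "${1:-/proc/cmdline}"
}

# Print the most recently modified <dump_dir>/*/vmcore, or return 1 if
# there is none. Relies on the -nt test, which bash and dash both support.
latest_vmcore() {
    dump_dir=${1:-/var/crash}
    newest=""
    for core in "$dump_dir"/*/vmcore; do
        [ -f "$core" ] || continue
        if [ -z "$newest" ] || [ "$core" -nt "$newest" ]; then
            newest=$core
        fi
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Print (but do not run) the crash command line for the newest dump.
crash_command() {
    core=$(latest_vmcore "$1") || return 1
    printf 'crash %s /usr/lib/debug/lib/modules/%s/vmlinux\n' \
        "$core" "$(uname -r)"
}
```

Running crashkernel_reserved before the next crash tells you whether a dump can be captured at all; after a crash, crash_command prints the line to paste into a root shell.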

Solution 2

You could check the dmesg file at /var/log/dmesg, which holds the kernel messages. The messages log mainly records service and application messages; if you hit a kernel error, the services and applications may simply stop running (and stop logging), but the kernel error is still reported in dmesg.
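A minimal sketch of how the logs could be searched for typical crash signatures. The function names and the pattern list are assumptions chosen for illustration, not an exhaustive catalogue. On a systemd-based system such as CentOS 7, `journalctl -k -b -1` shows the kernel messages of the previous boot, but only when the journal is kept on disk; with the default Storage=auto setting that means /var/log/journal must exist.

```shell
#!/bin/sh
# Hypothetical helpers for looking through kernel logs after a crash.

# Print (with line numbers) the lines of a log file that look like a
# kernel failure. Returns non-zero when nothing matches.
scan_kernel_log() {
    grep -nE 'Kernel panic|Oops|BUG:|Call Trace|Out of memory|soft lockup|Hardware Error|Machine check' "$1"
}

# Return 0 if the journal directory exists. The argument is only there so
# the check can be pointed at another path when trying it out.
journal_is_persistent() {
    [ -d "${1:-/var/log/journal}" ]
}

# Show kernel messages from the boot before the current one.
previous_boot_kernel_log() {
    if journal_is_persistent; then
        journalctl -k -b -1 --no-pager
    else
        echo "journal is not persistent; create /var/log/journal and restart systemd-journald" >&2
        return 1
    fi
}
```

For example, scan_kernel_log /var/log/messages or scan_kernel_log /var/log/dmesg.old lists any suspicious lines. An abrupt power loss or hardware hang may leave nothing at all in these files, which is itself a hint that the cause is below the kernel.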

Solution 3

Run the BIOS memory test
Run the BIOS hard drive test
Check the SMART drive log: smartctl /dev/sda -a
Run SMART drive self-tests
Leave dmesg -wH running in a window
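The SMART steps in the list above could be scripted roughly as follows. This is a sketch, not a definitive procedure: the function names are invented, the device names passed in are up to you, and SMARTCTL is a hypothetical override variable that lets you dry-run the loop with another command such as echo. By default the real smartctl is called, which normally needs root.

```shell
#!/bin/sh
# Hypothetical wrappers around smartctl (from the smartmontools package).

# Print the health verdict, error log, and self-test history per drive.
smart_report() {
    for dev in "$@"; do
        echo "== $dev =="
        "${SMARTCTL:-smartctl}" -H "$dev"            # overall health verdict
        "${SMARTCTL:-smartctl}" -l error "$dev"      # errors logged by the drive
        "${SMARTCTL:-smartctl}" -l selftest "$dev"   # results of earlier self-tests
    done
}

# Start a long (extended) self-test on each drive. The test runs inside
# the drive in the background; read the result later with smart_report.
smart_start_long_test() {
    for dev in "$@"; do
        "${SMARTCTL:-smartctl}" -t long "$dev"
    done
}
```

With three mirrored disks you might call smart_start_long_test /dev/sda /dev/sdb /dev/sdc, wait for the estimated duration that smartctl prints, and then run smart_report on the same devices.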


Nahydrin


Updated on September 18, 2022

Comments

Nahydrin almost 2 years

My server crashes about once a week and does not leave any kind of clue as to what's causing it. I have checked /var/log/messages: it just stops recording at some point and resumes with the boot (POST) information after I perform a hard reboot.

Is there something I can check or software I can install that can determine the cause?

I'm running CentOS 7.

Here is the only error/problem in my /var/log/dmesg: https://paste.netcoding.net/cosisiloji.log

[    3.606936] md: Waiting for all devices to be available before autodetect
[    3.606984] md: If you don't use raid, use raid=noautodetect
[    3.607085] md: Autodetecting RAID arrays.
[    3.608309] md: Scanned 6 and added 6 devices.
[    3.608362] md: autorun ...
[    3.608412] md: considering sdc2 ...
[    3.608464] md:  adding sdc2 ...
[    3.608516] md: sdc1 has different UUID to sdc2
[    3.608570] md:  adding sdb2 ...
[    3.608620] md: sdb1 has different UUID to sdc2
[    3.608674] md:  adding sda2 ...
[    3.608726] md: sda1 has different UUID to sdc2
[    3.608944] md: created md2
[    3.608997] md: bind<sda2>
[    3.609058] md: bind<sdb2>
[    3.609116] md: bind<sdc2>
[    3.609175] md: running: <sdc2><sdb2><sda2>
[    3.609548] md/raid1:md2: active with 3 out of 3 mirrors
[    3.609623] md2: detected capacity change from 0 to 98520989696
[    3.609685] md: considering sdc1 ...
[    3.609737] md:  adding sdc1 ...
[    3.609789] md:  adding sdb1 ...
[    3.609841] md:  adding sda1 ...
[    3.610005] md: created md1
[    3.610055] md: bind<sda1>
[    3.610117] md: bind<sdb1>
[    3.610175] md: bind<sdc1>
[    3.610233] md: running: <sdc1><sdb1><sda1>
[    3.610714] md/raid1:md1: not clean -- starting background reconstruction
[    3.610773] md/raid1:md1: active with 3 out of 3 mirrors
[    3.610854] md1: detected capacity change from 0 to 20970405888
[    3.610917] md: ... autorun DONE.
[    3.610999] md: resync of RAID array md1
[    3.611054] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[    3.611119] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
[    3.611180] md: using 128k window, over a total of 20478912k.
[    3.611244]  md1: unknown partition table
[    3.624786] EXT3-fs (md1): error: couldn't mount because of unsupported optional features (240)
[    3.627095] EXT2-fs (md1): error: couldn't mount because of unsupported optional features (244)
[    3.630284] EXT4-fs (md1): INFO: recovery required on readonly filesystem
[    3.630341] EXT4-fs (md1): write access will be enabled during recovery
[    3.819411] EXT4-fs (md1): orphan cleanup on readonly fs
[    3.836922] EXT4-fs (md1): 24 orphan inodes deleted
[    3.836975] EXT4-fs (md1): recovery complete
[    3.840557] EXT4-fs (md1): mounted filesystem with ordered data mode. Opts: (null)

Nahydrin about 7 years

I checked dmesg and dmesg.old, both only contain the startup information (about 4.8 seconds). The only "problem" I can see is the startup disk or raid drives appear to have something wrong but the system fixes it and works regardless. Check main post for link.
Nahydrin about 7 years

I've run SMART drive tests on all 3 drives, and they report no problems. I have dmesg -wH running in a window (I assume it keeps running until the next crash, and that I can still read the output over SSH afterwards). I do not have physical access to the machine; should I ask my host to run the BIOS memory and hard drive tests?
Nahydrin about 7 years

I have repaired the /dev/md1 not found error when running grub2-probe and installed and configured crashkernel/kdump and will report back if/when it crashes again.