How to troubleshoot a hardware problem on Linux?
Solution 1
Try booting memtest86+ from bootable media and see what it says about your memory and memory subsystem integrity.
Also, the last job that cron started may have been logged to /var/log/syslog or /var/log/messages.
If not, and you are debugging this issue on an ongoing basis, you could set up auditd and a cron job that runs ps to keep a continuous log of system activity and of which jobs are running.
Solution 2
Kernel drivers report device problems to the kernel log, which you can read with dmesg and which may also be written to a separate file such as kern.log.
For serious problems, a POST diagnostics board may be used.
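To avoid reading all of the dmesg or kern.log output by hand, you can filter it for lines that look hardware related. This is only a sketch: the keyword list is my own guess at useful terms, not an official list, and you will probably want to tune it for your hardware.

```python
import re

# Illustrative keyword list (an assumption, not exhaustive).
HW_PATTERN = re.compile(
    r"\b(?:machine check|mce|hardware error|i/o error|ecc|nmi)\b",
    re.IGNORECASE,
)


def suspicious_lines(log_text: str) -> list:
    """Return the lines of log_text that match HW_PATTERN, in original order."""
    return [line for line in log_text.splitlines() if HW_PATTERN.search(line)]
```

Feed it the text of /var/log/kern.log, or the output of dmesg captured with `subprocess.run(["dmesg"], capture_output=True, text=True).stdout` (on some systems dmesg needs root).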
Solution 3
On most Linux distributions today, you should be able to get an MCE (Machine Check Exception) log, which can be decoded to find the actual hardware errors (http://freshmeat.net/projects/mcelog/). You can also set up a kernel crash dump (kdump): a second capture kernel is kept loaded alongside the Linux kernel you run daily and takes over when it crashes, so you can capture the incident and debug the cause:
- http://www.faqs.org/docs/Linux-HOWTO/Linux-Crash-HOWTO.html
- http://kerneltrap.org/node/5758
- http://www.mjmwired.net/kernel/Documentation/kdump/
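Before relying on a crash dump, it is worth checking that a capture kernel is actually loaded. A minimal sketch, assuming your kernel exposes the `/sys/kernel/kexec_crash_loaded` file (it reads `1` when a crash kernel is loaded); the function name is mine.

```python
def crash_kernel_loaded(path: str = "/sys/kernel/kexec_crash_loaded") -> bool:
    """Return True if the kernel reports that a kdump capture kernel is loaded."""
    try:
        with open(path) as f:
            return f.read().strip() == "1"
    except OSError:
        # File missing or unreadable: treat it as "not loaded".
        return False
```

If this returns False, a panic will not produce a dump, and the kdump setup (see the links above) needs another look.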
Solution 4
Logs are the first place to look, as kmarsh says. But if the logs don't tell you much after a serious HW failure, it doesn't matter what OS you use; it just takes some old-school trial and error.
First determine whether it really is a hardware issue by running a live CD; otherwise a driver issue could be misdiagnosed as a hardware failure.
HW lockups are random, but frequent. I'd start by removing graphics cards (use on-board or backup cards), network cards or (gasp) modems if you have any, one at a time, until you pinpoint the culprit. Run with one memory stick at a time (if you have two) or swap in other sticks while testing.
Your PSU could also be failing. Sometimes adding a new card draws enough power to starve the CPU if your PSU isn't powerful enough, causing random failures.
If nothing else gives a lead, it could be your mainboard (usually corrosion if it's more than two years old, depending on the humidity where you live) or the CPU.
Use software to monitor CPU temperature; overheating can cause lockups too.
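If you'd rather not install a monitoring tool, the kernel's thermal zones can be read directly. A minimal sketch, assuming your sensors show up under /sys/class/thermal (values there are in millidegrees Celsius; on some machines the CPU sensor only appears under /sys/class/hwmon instead). The function names and the 85 °C limit are my own assumptions; check your CPU's specification for a sensible threshold.

```python
import glob
import os

# Standard sysfs location for thermal zones; not every machine exposes its CPU here.
THERMAL_ROOT = "/sys/class/thermal"


def read_temperatures(root: str = THERMAL_ROOT) -> dict:
    """Map each thermal zone name to its temperature in degrees Celsius."""
    temps = {}
    for zone in sorted(glob.glob(os.path.join(root, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "temp")) as f:
                millidegrees = int(f.read().strip())
        except (OSError, ValueError):
            continue  # sensor missing or unreadable; skip it
        temps[os.path.basename(zone)] = millidegrees / 1000.0
    return temps


def too_hot(temps: dict, limit_c: float = 85.0) -> list:
    """Return the zone names whose reading exceeds limit_c (an assumed threshold)."""
    return [name for name, value in temps.items() if value > limit_c]
```

Logging `read_temperatures()` every minute or so gives you a temperature history to compare against the times of the lockups.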
After trying everything under the sun, with no luck, it might be time for a new PC ;)
Jack
Updated on September 17, 2022
Comments
-
Jack almost 2 years
Just to note, I am not having a problem at the moment, but I have had one previously, so it sparked my curiosity...
When a computer locks up suddenly, so that caps lock flashes incessantly and the only possibility is to restart... how do you troubleshoot what is causing it? On Windows there would be some errors in the event log... on Linux it seems there is no opportunity for anything to be written to the log, making it hard to troubleshoot...
In this case, how would you troubleshoot the problem on Linux?
-
kmarsh about 14 years
Sudden H/W lockups rarely get logged by any operating system.
-
Jack about 14 years
Well, they do on Windows, even if it is vague...
-
quack quixote about 14 years
Not always. It depends on the problem; if it's a true hardware freeze, the first indication Windows will give (in the error logs) is that it's rebooting. (BSoDs are not true hardware lockups in this sense.)
-
Marius Gedminas almost 14 years
Flashing caps lock indicates a kernel panic (which is more or less the same as the BSoD on Windows). It's not necessarily a hardware problem; it could be a bug in the kernel/drivers.
-
Jack about 14 years
No no... as I said, I am not having any problem at the moment. I just want to know the equivalent way to see hardware problems on Linux, as I can on Windows.
-
Kevin M about 14 years
Driver errors can happen on a live CD just as on a fully installed system. It all depends on what drivers the system is using. If you use only generic drivers and it still happens, THEN it would be a HW issue.