How to track down the cause of Windows Server 2008 crashing?


Solution 1

2009-07-06 - I'm thinking it's the hard drive.

I ran chkdsk, and halfway through it crashed with the same symptoms as before. I'm using a Solid State Drive (SSD), the "PQI DK9128GD6R000A03 128GB SATA 2.5" SSD", with an MTBF of 1,500,000 hours. Despite an MTBF of roughly 171 years, it seems to have died after 2 weeks of normal use! To check my theory, I copied the VMware files to a standard hard drive and ran chkdsk, and it worked like a charm. I'll see if the system survives a week of uptime, and if it does I can officially defenestrate my PQI SSD.

2009-07-07 - System crashed again. Back to the drawing board.

2009-07-08 - Rolled back a further 20 days to before I installed the SSD. We'll see if it crashes again (it did).

2009-07-09 - uninstalled OpenVPN, upgraded to the latest version of Skype, upgraded SQL 2008 to SP1, removed TeamViewer. We'll see if it crashes again (it did, in the middle of an Acronis backup).

2009-07-09 - I suspect that the amount of virtual memory available to the VMware machine that runs the server is too small; I've got it at 4GB at the moment. Increasing it (this had no effect).

2009-07-09 - discovered that if the VMware container running Windows Server 2008 crashes with 100% CPU utilization, and I pause/restart it, then it uncrashes and resumes operation! This tends to point to a problem with VMware or its host OS (which is XP), rather than a problem within Windows Server 2008 itself. Getting very close to the heart of the problem now.

2009-07-09 - Windows Server 2008 only crashes when the host OS is under very heavy load. Increased the number of CPUs it can utilize to 2; this seems to have fixed the problem.

In conclusion:

  1. Original problem was caused by a bad hard drive with bad sectors (it was actually a 128GB SSD from PQI - wouldn't expect a Solid State Drive (SSD) to fail two weeks after purchase but this one did).
  2. Next problem was caused by the host OS that was running VMware coming under high load. Fixed this by allocating more RAM and increasing the size of the page file.
  3. If it happens again, I have a workaround (just pause/restart VMware v6.5 to "unfreeze" Windows Server 2008 running inside of it).

Problem solved, thanks guys!

Solution 2

You can also use the "Reliability and Performance Monitor" that is available under Windows Server 2008.

It automatically keeps a record of the reliability of the server, and assigns it a "reliability score" out of 10. This score starts at 10, and drops if the server experiences any crashes or unexpected shutdowns.

It even keeps a record of which programs were installed, and when, so you can diagnose if an installed program seemed to cause more faults.

You can also set it up to continuously log the CPU usage of programs, to see which program is causing the 100% CPU utilization.
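As a rough illustration of the continuous CPU logging idea, the counters that Performance Monitor shows can also be written to a CSV file with the built-in `typeperf` tool and summarized afterwards. The sketch below assumes a log collected with the command shown in the comment; the file name `cpu_log.csv` and the function name `busiest_processes` are made up for this example.

```python
import csv
import io

# Assumed collection command, run on the server (one sample every 5 seconds):
#   typeperf "\Process(*)\% Processor Time" -si 5 -f CSV -o cpu_log.csv
# Each column header then looks like: \\HOST\Process(name)\% Processor Time


def busiest_processes(csv_text, top=3):
    """Return the `top` processes ranked by their average CPU counter value."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, samples = rows[0], rows[1:]
    totals = {}
    counts = {}
    for row in samples:
        # Column 0 is the timestamp, so skip it.
        for column, value in zip(header[1:], row[1:]):
            start = column.find("Process(")
            if start == -1:
                continue
            name = column[start + len("Process("):column.find(")", start)]
            if name in ("_Total", "Idle"):
                continue  # aggregate and idle counters are not real culprits
            try:
                number = float(value)
            except ValueError:
                continue  # blank cell: no sample was available for this interval
            totals[name] = totals.get(name, 0.0) + number
            counts[name] = counts.get(name, 0) + 1
    averages = {name: totals[name] / counts[name] for name in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:top]


def busiest_from_file(path, top=3):
    """Convenience wrapper that reads a typeperf CSV log from disk."""
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return busiest_processes(handle.read(), top)
```

Calling `busiest_from_file("cpu_log.csv")` after a hang should list the processes that were using the most CPU on average while the log was running.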


Solution 3

If there is a crash dump such as c:\windows\memory.dmp, you can use WinDbg to analyze it. Usually you want to look for third-party drivers in the dump. Step-by-step instructions can be found here.
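For repeated crashes it can help to script the first pass over each dump. The sketch below assumes the console debugger `cdb.exe` from the same Debugging Tools package is on the PATH and that symbols are already configured; it runs `!analyze -v` against a dump and pulls out the lines that usually name the suspect driver. The function names are invented for this example.

```python
import subprocess

# Fields from the "!analyze -v" report that usually identify the faulting module.
FIELDS = ("MODULE_NAME", "IMAGE_NAME", "FAILURE_BUCKET_ID")


def build_cdb_command(dump_path, cdb_exe="cdb.exe"):
    """Build a command line that opens the dump, runs !analyze -v, then quits."""
    return [cdb_exe, "-z", dump_path, "-c", "!analyze -v; q"]


def summarize_analysis(output):
    """Extract the interesting 'KEY: value' lines from the debugger output."""
    summary = {}
    for line in output.splitlines():
        key, separator, value = line.partition(":")
        key = key.strip()
        if not separator:
            continue
        if key in FIELDS:
            summary[key] = value.strip()
        elif key.lower() == "probably caused by":
            summary["Probably caused by"] = value.strip()
    return summary


def analyze_dump(dump_path):
    """Run the debugger on one dump file and return the summary dictionary."""
    result = subprocess.run(
        build_cdb_command(dump_path),
        capture_output=True,
        text=True,
        check=False,
    )
    return summarize_analysis(result.stdout)
```

If the module reported this way is not a Microsoft component, that driver is a reasonable first suspect, though the full `!analyze -v` output in WinDbg is still worth reading.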

Solution 4

You have two options:

  • Look at records to try and figure out what caused past problems
  • Look for signs of things that could lead to the CPU spikes in an attempt to replicate the problem

Logs are a good start for looking back at the history of the system, if you know the time when the problems started or the logs are quiet enough for you to notice a pattern leading up to the pegged CPU. If the system blue-screens (BSOD), you can load the .dmp files into WinDbg.

If you're looking for things that could lead to the CPU spikes:

  • Process Explorer from Sysinternals: look for odd processes or open handles to files or network shares that don't exist anymore. It may point you in the right direction to replicate the problem.
  • Windows Reliability and Performance Monitor / Perfmon: You can see how each process is acting in regards to Disk/CPU/Memory/Network usage as well as hundreds of other counters. They may give you a clue as to what is running away with the VM before it happens.

Once you have a good candidate for the problem, you can turn on Process Monitor from Sysinternals. It records every file and registry interaction made by every process on the system in real time. It can even be configured to load at boot and capture everything until you next run the GUI (be warned that this is A LOT of data, so it's only advisable if you can replicate the problem quickly after boot).
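Because a Process Monitor capture is so large, a quick per-process tally can show where to start reading. The sketch below assumes the capture was exported from the Process Monitor GUI as CSV with a "Process Name" column (part of the default column set, as far as I know); the function names and the file name are hypothetical.

```python
import csv
import io
from collections import Counter


def events_per_process(csv_text, top=5):
    """Count captured events per process and return the noisiest ones first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row["Process Name"] for row in reader)
    return counts.most_common(top)


def events_per_process_from_file(path, top=5):
    """Read an exported CSV from disk; utf-8-sig tolerates a leading BOM."""
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return events_per_process(handle.read(), top)
```

For example, `events_per_process_from_file("capture.csv")` returns a list of `(process name, event count)` pairs, which is often enough to spot a process that is hammering the disk or registry.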

There are a bunch of rabbit holes that a root cause analysis can take you down; feel free to let us know how it goes.

Solution 5

The System event log. The Application Event log. Google the message of the BSOD. Check the disk's integrity with chkdsk.
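The event log check can be scripted with the built-in `wevtutil` tool. The sketch below is only an illustration: it builds a query for recent entries with a given Event ID (6008 in the System log is the "previous system shutdown was unexpected" record) and returns the text output. The function names and defaults are made up for this example.

```python
import subprocess


def build_event_query(log_name="System", event_ids=(6008,), count=20):
    """Build a wevtutil command that lists the newest matching events as text."""
    id_filter = " or ".join(f"EventID={event_id}" for event_id in event_ids)
    return [
        "wevtutil", "qe", log_name,
        f"/q:*[System[({id_filter})]]",  # XPath filter on the event's System block
        f"/c:{count}",                   # maximum number of events to return
        "/rd:true",                      # newest events first
        "/f:text",                       # human-readable output
    ]


def recent_events(log_name="System", event_ids=(6008,), count=20):
    """Run the query on the local machine and return wevtutil's text output."""
    result = subprocess.run(
        build_event_query(log_name, event_ids, count),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Running `print(recent_events())` on the server lists the most recent unexpected shutdowns, whose timestamps can then be matched against the other logs.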


Author by

Contango


Updated on September 17, 2022

Comments

  • Contango
    Contango over 1 year

    I have Windows Server 2008 running under VMware.

    Recently, it's started to crash roughly every day, with continuous 100% CPU utilization, and no response in the GUI.

    Is there a step-by-step technique to track down the source of this problem?

    What logs would I look at?

    p.s. The problem appeared around the time I tried to uninstall Acronis, and it blue screened. However, I'm not sure if the current faults are related to Acronis at all.

  • Contango
    Contango almost 15 years
    It's not exactly every 24 hours; it seems to be random. It's roughly every day.
  • Contango
    Contango almost 15 years
    It's not a BSOD. The VMware machine is hanging at 100% CPU, with a frozen mouse.
  • Contango
    Contango almost 15 years
    It's not the BSOD. Thanks for the tip though - this will come in useful in the future.
  • Contango
    Contango almost 15 years
    Problem is solved now. Thank you for your comments, it helped point me in the right direction.
  • Contango
    Contango almost 15 years
    Problem is solved. Thanks for your comment - it helped me work out that the original problem was a bad SSD drive (it always crashed when it got to the same sector when I was running chkdsk).
  • MikeKulls
    MikeKulls about 8 years
    On that page, the link to the debugging tool is buried near the bottom. This is the direct link: go.microsoft.com/fwlink/p/?LinkId=536682