Solving random crashes

Solution 1

Try Valgrind (it's free, open-source):

The Valgrind distribution currently includes six production-quality tools: a memory error detector, two thread error detectors, a cache and branch-prediction profiler, a call-graph generating cache profiler, and a heap profiler. It also includes two experimental tools: a heap/stack/global array overrun detector, and a SimPoint basic block vector generator. It runs on the following platforms: X86/Linux, AMD64/Linux, PPC32/Linux, PPC64/Linux, and X86/Darwin (Mac OS X).

Valgrind Frequently Asked Questions

The Memcheck part of the package is probably the place to start:

Memcheck is a memory error detector. It can detect the following problems that are common in C and C++ programs.

  • Accessing memory you shouldn't, e.g. overrunning and underrunning heap blocks, overrunning the top of the stack, and accessing memory after it has been freed.

  • Using undefined values, i.e. values that have not been initialised, or that have been derived from other undefined values.

  • Incorrect freeing of heap memory, such as double-freeing heap blocks, or mismatched use of malloc/new/new[] versus free/delete/delete[].

  • Overlapping src and dst pointers in memcpy and related functions.

  • Memory leaks.

Solution 2

First, you are lucky that your process crashes multiple times in a short time period; that should make the problem easier to track down.

Proceed as follows:

  • Get a crash dump
  • Isolate a set of potentially suspicious functions
  • Tighten up state checking
  • Repeat

Get a crash dump

First, you really need to get a crash dump.

If you don't get crash dumps when it crashes, start by writing a test that produces reliable crash dumps.

Re-compile the binary with debug symbols (`-g` with GCC), or make sure that you can analyze the crash dump with debug symbols available.

Find suspicious functions

Given that you have a crash dump, open it in gdb or your favorite debugger, and remember to show all threads (in gdb: `thread apply all bt`)! The thread gdb shows first might not be the buggy one.

Looking at where gdb says your binary crashed, isolate some set of functions you think might cause the problem.

Looking at multiple crashes and isolating code sections that are commonly active in all of the crashes is a real time-saver.

Tighten up state checking

A crash usually happens because of some inconsistent state. The best way to proceed is often to tighten the state requirements. You do this as follows.

For each function you think might cause the problem, document what legal state the input or the object must have on entry to the function. (Doing the same for the legal state on exit from the function is useful, but less important.)

If the function contains a loop, document the legal state it needs to have at the beginning of each loop iteration.

Add asserts for all such expressions of legal state.

Repeat

Then repeat the process. If it still crashes outside of your asserts, tighten them further. At some point the process will crash on an assert rather than at random, and you can then concentrate on figuring out what made your program go from a legal state on entry to the function to an illegal state at the point where the assert fired.

If you pair the asserts with verbose logging it should be easier to follow what the program does.

Solution 3

If all else fails (particularly if performance under the debugger is unacceptable), fall back to extensive logging. Start with the entry points -- is the app transactional? Log each transaction as it comes in. Log all the constructor calls for your key objects. Since the crash is so intermittent, log calls to all the functions that might not get called every day.

You'll at least start narrowing down where the crash could be.

Solution 4

Start the program under a debugger (gdb works with both GCC and MinGW) and wait until it crashes under the debugger. At the point of the crash you will be able to see which specific action is failing, and you can look at the assembly code, registers, and memory state - this will often help you find the cause of the problem.

Solution 5

Where I work, crashing programs usually generate a core dump file that can be loaded in windbg.

We then have an image of the memory at the time the program crashed. There's not much you can do with it, but at least it gives you the last call stack. Once you know which function crashed, you might be able to track down the problem, or at least reduce it to a more reproducible test case.

Author: speeder

Updated on February 16, 2020

Comments

  • speeder
    speeder about 4 years

    I am getting random crashes in my C++ application: it may not crash for a month and then crash 10 times in an hour; sometimes it may crash on launch, and sometimes only after several hours of operation (or not crash at all).

    I use GCC on GNU/Linux and MinGW on Windows, so I can't use the Visual Studio JIT Debug...

    I have no idea how to proceed; looking randomly through the code would not work. The code is HUGE (a good part of it was not my work, and it contains a fair amount of legacy code), and I don't have a clue how to reproduce the crash.

    EDIT: Lots of people mentioned that... how do I make a core dump, minidump, or whatever dump? This is the first time I need post-mortem debugging.

    EDIT2: Actually, DrMingw captured a call stack, but no memory info... Unfortunately, the call stack didn't help me much, because near the end it suddenly goes into some library (or something) that I don't have debug info for, resulting only in some hexadecimal numbers... So I still need a decent dump that gives more information (especially about what was in memory... specifically, what was at the location that gave the "access violation" error).

    Also, my application uses Lua and Luabind; maybe the error is being caused by a .lua script, but I have no idea how to debug that.

  • ereOn
    ereOn over 13 years
    Does valgrind finally run on Windows? I've been looking for that for years now.
  • ereOn
    ereOn over 13 years
    It could be a race condition or anything else that results in undefined behavior. We don't have enough information to do educated guesses.
  • fhd
    fhd over 13 years
    Yeah, that's why the core dump should help. He said he didn't want to look at the source code randomly and I agree. The core dump should get him started.
  • Nicholas Knight
    Nicholas Knight over 13 years
    @ereOn: Unfortunately no, it does not, but the OP is also using Linux, so it should be an option for him. Only Linux and OS X are really supported right now, though there are unofficial ports for FreeBSD and NetBSD. see valgrind.org/info/platforms.html
  • ereOn
    ereOn over 13 years
    I used to do the same, but I noticed that often, logging causes the program to do I/O, which sometimes prevents some bugs/race conditions from happening. I believe logging is a more effective technique when you have a bug that occurs deterministically.
  • Jason Orendorff
    Jason Orendorff over 13 years
    +1. Valgrind can often hand you the line number of your bug for zero effort. It's like magic.
  • paxdiablo
    paxdiablo over 13 years
    Yeah, ya gotta love those Heisenbugs.
  • paxdiablo
    paxdiablo over 13 years
    I'll +1 this as well. Having only recently started using this, I find it damn-near indispensable.
  • Nordic Mainframe
    Nordic Mainframe over 13 years
    Valgrind is great but unfortunately won't catch errors on Windows/MinGW because it does not exist there. Possible replacements: stackoverflow.com/questions/413477/…
  • Nordic Mainframe
    Nordic Mainframe over 13 years
    Could you give some details? My latest info regarding mingw is that mingw-gcc binaries can't generate core dumps and windbg has very little to say about mingw binaries because they use the stabs debugging format which windbg doesn't understand.
  • Mitch Wheat
    Mitch Wheat over 13 years
    @Luther Blissett : poster is also running on Linux
  • Nordic Mainframe
    Nordic Mainframe over 13 years
    @Mitch: My comment does not deny that.
  • ereOn
    ereOn over 13 years
    @Luther Blissett, unfortunately, the core dump files seem to be generated by the system (I work for a very big company and I'm not part of the team that actually set this up). However, I'm sure that my test binaries (created with mingw) are "core-dumped" on crashes, and I highly doubt the team in charge added a special case for this.
  • JustBoo
    JustBoo over 13 years
    I believe these are called (MS term) "Minidumps." windbg has a setting to read these "post-mortem" and can reveal "stuff."
  • speeder
    speeder over 13 years
    My question is exactly because I can't use, for example, Valgrind all the time... Valgrind makes the program SLOOOOOOW, INCREDIBLY SLOOOOW. And it may take hours to crash, or months... I can't work an entire month with the program running under Valgrind...
  • speeder
    speeder over 13 years
    I can't do that: the performance under the debugger is too slow to make the program useful at all, and it may take a LOOOOONG time before crashing. So it would require me to use GDB all the time, and for this project that is totally unreasonable.
  • speeder
    speeder over 13 years
    Valgrind is useless for this project, because it runs so slow that, until some error happens that valgrind can catch, I might be dead of old age...
  • speeder
    speeder over 13 years
    Ooooh... That stuff actually worked! There is one particular crash that I know how to cause (but I don't know how to fix), which I used to test DrMingw. Too bad it offers no information about memory, only about the call stack... :(
  • sharptooth
    sharptooth over 13 years
    @speeder: I personally have never seen any difference in speed when running under a debugger. I don't mean step-by-step; I mean just run it and leave it running until it crashes.
  • speeder
    speeder over 13 years
    I usually don't either, but my program's debug-build binary is 140Mb, it also loads another 100Mb of data (a good part generated on the fly), and GDB itself, when loaded with my program, takes another 200Mb... This results in the OS going nuts with page files; also my memory is not the greatest out there (in fact it is quite old, and 2GB in total...)
  • sharptooth
    sharptooth over 13 years
    @speeder: You could do the following: compile the program with .pdb and with optimizations. This way it will not be as bloated as a full debug version and you will still be able to see the callstack when it crashes under debugger.
  • Douglas Leeder
    Douglas Leeder over 13 years
    Unfortunately valgrind or some other memory checker is the best thing I can suggest. Otherwise you pretty much have to rewrite the application.
  • Donal Fellows
    Donal Fellows over 13 years
    I hate Heisenbugs/Schrödingbugs. Getting rid of them so that behavior is predictable (possibly leading to a crash, but then with a known cause) is very important, since that almost always leads shortly after to fully working code…
  • leander
    leander over 13 years
    +1 for Application Verifier. I'd actually start there, if valgrind is too slow.
  • Петър Петров
    Петър Петров about 9 years
    Nothing stops you from debugging a true Release Mode build. You only need a PDB file; one will be generated from a Debug build, and you can steal it from there... but better, set up your Release build to generate debug info. Note, however, that optimizations will break your debugging experience: variables will be missing and so on, but you can still get a lot of help this way.
  • Петър Петров
    Петър Петров about 9 years
    Sometimes a crash is there just because of a "mis-build". A Clean & Rebuild will fix this issue.