Heap corruption under Win32; how to locate?

c++ windows multithreading debugging memory

37,826

Solution 1

I have same problems in my work (we also use VC6 sometimes). And there is no easy solution for it. I have only some hints:

Try with automatic crash dumps on production machine (see Process Dumper). My experience says Dr. Watson is not perfect for dumping.
Remove all catch(...) from your code. They often hide serious memory exceptions.
Check Advanced Windows Debugging - there are lots of great tips for problems like yours. I recomend this with all my heart.
If you use STL try STLPort and checked builds. Invalid iterator are hell.

Good luck. Problems like yours take us months to solve. Be ready for this...

Solution 2

We've had pretty good luck by writing our own malloc and free functions. In production, they just call the standard malloc and free, but in debug, they can do whatever you want. We also have a simple base class that does nothing but override the new and delete operators to use these functions, then any class you write can simply inherit from that class. If you have a ton of code, it may be a big job to replace calls to malloc and free to the new malloc and free (don't forget realloc!), but in the long run it's very helpful.

In Steve Maguire's book Writing Solid Code (highly recommended), there are examples of debug stuff that you can do in these routines, like:

Keep track of allocations to find leaks
Allocate more memory than necessary and put markers at the beginning and end of memory -- during the free routine, you can ensure these markers are still there
memset the memory with a marker on allocation (to find usage of uninitialized memory) and on free (to find usage of free'd memory)

Another good idea is to never use things like strcpy, strcat, or sprintf -- always use strncpy, strncat, and snprintf. We've written our own versions of these as well, to make sure we don't write off the end of a buffer, and these have caught lots of problems too.

Solution 3

Run the original application with ADplus -crash -pn appnename.exe When the memory issue pops-up you will get a nice big dump.

You can analyze the dump to figure what memory location was corrupted. If you are lucky the overwrite memory is a unique string you can figure out where it came from. If you are not lucky, you will need to dig into win32 heap and figure what was the orignal memory characteristics. (heap -x might help)

After you know what was messed-up, you can narrow appverifier usage with special heap settings. i.e. you can specify what DLL you monitor, or what allocation size to monitor.

Hopefully this will speedup the monitoring enough to catch the culprit.

In my experience, I never needed full heap verifier mode, but I spent a lot of time analyzing the crash dump(s) and browsing sources.

P.S: You can use DebugDiag to analyze the dumps. It can point out the DLL owning the corrupted heap, and give you other usefull details.

Solution 4

You should attack this problem with both runtime and static analysis.

For static analysis consider compiling with PREfast (cl.exe /analyze). It detects mismatched delete and delete[], buffer overruns and a host of other problems. Be prepared, though, to wade through many kilobytes of L6 warning, especially if your project still has L4 not fixed.

PREfast is available with Visual Studio Team System and, apparently, as part of Windows SDK.

Solution 5

Is this in low memory conditions? If so it might be that new is returning NULL rather than throwing std::bad_alloc. Older VC++ compilers didn't properly implement this. There is an article about Legacy memory allocation failures crashing STL apps built with VC6.

View more solutions

37,826

Author by

Josh

A child prodigy, Josh is gracing StackOverflow with his copious spare time in an attempt to bring some balance to the Force.

Updated on September 02, 2020

Comments

Josh over 3 years
I'm working on a multithreaded C++ application that is corrupting the heap. The usual tools to locate this corruption seem to be inapplicable. Old builds (18 months old) of the source code exhibit the same behaviour as the most recent release, so this has been around for a long time and just wasn't noticed; on the downside, source deltas can't be used to identify when the bug was introduced - there are a lot of code changes in the repository.

The prompt for crashing behaviuor is to generate throughput in this system - socket transfer of data which is munged into an internal representation. I have a set of test data that will periodically cause the app to exception (various places, various causes - including heap alloc failing, thus: heap corruption).

The behaviour seems related to CPU power or memory bandwidth; the more of each the machine has, the easier it is to crash. Disabling a hyper-threading core or a dual-core core reduces the rate of (but does not eliminate) corruption. This suggests a timing related issue.

Now here's the rub:
When it's run under a lightweight debug environment (say Visual Studio 98 / AKA MSVC6) the heap corruption is reasonably easy to reproduce - ten or fifteen minutes pass before something fails horrendously and exceptions, like an alloc; when running under a sophisticated debug environment (Rational Purify, VS2008/MSVC9 or even Microsoft Application Verifier) the system becomes memory-speed bound and doesn't crash (Memory-bound: CPU is not getting above 50%, disk light is not on, the program's going as fast it can, box consuming 1.3G of 2G of RAM). So, I've got a choice between being able to reproduce the problem (but not identify the cause) or being able to idenify the cause or a problem I can't reproduce.

My current best guesses as to where to next is:
1. Get an insanely grunty box (to replace the current dev box: 2Gb RAM in an E6550 Core2 Duo); this will make it possible to repro the crash causing mis-behaviour when running under a powerful debug environment; or
2. Rewrite operators new and delete to use VirtualAlloc and VirtualProtect to mark memory as read-only as soon as it's done with. Run under MSVC6 and have the OS catch the bad-guy who's writing to freed memory. Yes, this is a sign of desperation: who the hell rewrites new and delete?! I wonder if this is going to make it as slow as under Purify et al.
And, no: Shipping with Purify instrumentation built in is not an option.

A colleague just walked past and asked "Stack Overflow? Are we getting stack overflows now?!?"

And now, the question: How do I locate the heap corruptor?

Update: balancing new[] and delete[] seems to have gotten a long way towards solving the problem. Instead of 15mins, the app now goes about two hours before crashing. Not there yet. Any further suggestions? The heap corruption persists.

Update: a release build under Visual Studio 2008 seems dramatically better; current suspicion rests on the STL implementation that ships with VS98.
1. Reproduce the problem. Dr Watson will produce a dump that might be helpful in further analysis.
I'll take a note of that, but I'm concerned that Dr Watson will only be tripped up after the fact, not when the heap is getting stomped on.

Another try might be using WinDebug as a debugging tool which is quite powerful being at the same time also lightweight.

Got that going at the moment, again: not much help until something goes wrong. I want to catch the vandal in the act.

Maybe these tools will allow you at least to narrow the problem to certain component.

I don't hold much hope, but desperate times call for...

And are you sure that all the components of the project have correct runtime library settings (C/C++ tab, Code Generation category in VS 6.0 project settings)?

No I'm not, and I'll spend a couple of hours tomorrow going through the workspace (58 projects in it) and checking they're all compiling and linking with the appropriate flags.
Update: This took 30 seconds. Select all projects in the Settings dialog, unselect until you find the project(s) that don't have the right settings (they all had the right settings).
Alex S over 15 years

"always use strncpy instead of strcpy" - in Microsoft CRT there is an even better alternative, strcpy_s.
alcor about 11 years

remember to read the full msdn specifications with this kind of functions! strange things could happen if you don't read them fully!
Vinz almost 9 years

This was very helpful! I didnt know why my new was suddenly returning NULL after switching Application Verifier on!