Debugging core files generated on a Customer's box

c++ linux debugging gdb

13,908

Solution 1

> What happens when a core file is generated from a Linux distro other than the one we are running in Dev? Is the stack trace even meaningful?

If the executable is dynamically linked, as yours is, the stack GDB produces will (most likely) not be meaningful.

The reason: GDB knows that your executable crashed by calling something in libc.so.6 at address 0x00454ff1, but it doesn't know what code was at that address. So it looks into your copy of libc.so.6 and discovers that this is in select, so it prints that.

But the chances that 0x00454ff1 is also in select in your customer's copy of libc.so.6 are quite small. Most likely the customer had some other procedure at that address, perhaps abort.

You can use disas select, and observe that 0x00454ff1 is either in the middle of an instruction, or that the previous instruction is not a CALL. If either of these holds, your stack trace is meaningless.

You can however help yourself: you just need to get a copy of all libraries that are listed in (gdb) info shared from the customer system. Have the customer tar them up with e.g.

cd /
tar cvzf to-you.tar.gz lib/libc.so.6 lib/ld-linux.so.2 ...

Then, on your system:

mkdir /tmp/from-customer
tar xzf to-you.tar.gz -C /tmp/from-customer
gdb /path/to/binary
(gdb) set solib-absolute-prefix /tmp/from-customer
(gdb) core core  # Note: very important to set solib-... before loading core
(gdb) where      # Get meaningful stack trace!
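
The collection step on the customer side can be scripted. The following is a minimal sketch rather than a tested tool: list_solibs is a hypothetical helper name, and it assumes the usual info shared layout, in which each loaded library's absolute path is the last field on its line (paths containing spaces would break it). It also assumes GNU tar.

```shell
# list_solibs: read saved `info shared` output on stdin and print the
# absolute path of each shared library once, in order of appearance.
# (Hypothetical helper; assumes the path is the last whitespace-separated field.)
list_solibs() {
    awk '$NF ~ /^\// && !seen[$NF]++ { print $NF }'
}

# Possible use on the customer machine (paths are placeholders):
#   gdb -batch -ex 'info shared' /path/to/binary core > solibs.txt
#   list_solibs < solibs.txt | tar -czhf to-you.tar.gz -T -
```

The -h flag asks tar to archive the files that symlinks point to instead of the links themselves, which should avoid resolving each link by hand. GNU tar also strips the leading / from member names, so the archive unpacks under /tmp/from-customer the same way as the hand-written list.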

> We then advise the Customer to run a -g binary so it becomes easier to debug.

A much better approach is:

1. Build with -g -O2 -o myexe.dbg.
2. Run strip -g myexe.dbg -o myexe.
3. Distribute myexe to customers.
4. When a customer gets a core, use myexe.dbg to debug it.

You'll have full symbolic info (file/line, local variables), without having to ship a special binary to the customer, and without revealing too many details about your sources.
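
Because this workflow depends on myexe.dbg and the shipped myexe coming from the same build, it can help to compare their build IDs before loading a core. This is a rough sketch, assuming the toolchain emits a GNU build ID note (common on current distros, but not guaranteed) and that readelf -n prints it on a line containing "Build ID:". The name build_id_from_notes is a hypothetical helper, not part of any tool.

```shell
# build_id_from_notes: read `readelf -n` output on stdin and print the
# hex build ID from the first "Build ID:" line, if there is one.
# (Hypothetical helper; prints nothing when no build ID note exists.)
build_id_from_notes() {
    awk '/Build ID:/ { print $NF; exit }'
}

# Possible use (file names taken from the steps above):
#   a=$(readelf -n myexe     | build_id_from_notes)
#   b=$(readelf -n myexe.dbg | build_id_from_notes)
#   [ -n "$a" ] && [ "$a" = "$b" ] && echo "same build"
```

Since strip -g removes only debugging sections, the build ID note should survive in the stripped binary, so the two values are expected to match when both files come from one build.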

Solution 2

You can indeed get useful information from a crash dump, even one from an optimized compile (although it's what is called, technically, "a major pain in the ass"). A -g compile is indeed better, and yes, you can do so even when the machine on which the dump happened runs another distribution. Basically, with one caveat, all the important information is contained in the executable and ends up in the dump.

When you match the core file with the executable, the debugger will be able to tell you where the crash occurred and show you the stack. That in itself should help a lot. You should also find out as much as you can about the situation in which it happens -- can they reproduce it reliably? If so, can you reproduce it?

Now, here's the caveat: the place where the notion of "everything is there" breaks down is with shared object files, .so files. If it is failing because of a problem with those, you won't have the symbol tables you need; you may only be able to see what library .so it happens in.

There are a number of books about debugging, but I can't think of one I'd recommend.


Author by

Mohamed Bana

Updated on July 02, 2022

Comments

Mohamed Bana almost 2 years
We get core files from running our software on a Customer's box. Unfortunately, because we've always compiled with -O2 without debugging symbols, this has led to situations where we could not figure out why it was crashing. We've modified the builds so that they now use -g and -O2 together. We then advise the Customer to run a -g binary so it becomes easier to debug.

I have a few questions:
1. What happens when a core file is generated from a Linux distro other than the one we are running in Dev? Is the stack trace even meaningful?
2. Are there any good books for debugging on Linux, or Solaris? Something example oriented would be great. I am looking for real-life examples of figuring out why a routine crashed and how the author arrived at a solution. Something more on the intermediate to advanced level would be good, as I have been doing this for a while now. Some assembly would be good as well.
Here's an example of a crash that requires us to tell the Customer to get a -g ver. of the binary:
```
Program terminated with signal 11, Segmentation fault.
#0  0xffffe410 in __kernel_vsyscall ()
(gdb) where
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x00454ff1 in select () from /lib/libc.so.6
...
<omitted frames>
```
Ideally I'd like to find out exactly why the app crashed - I suspect it's memory corruption but I am not 100% sure.

Remote debugging is strictly not allowed.

Thanks
Mohamed Bana almost 12 years

So in the example above I would need to get the libc shared objects because this is where the crash is occurring. The reason I am asking all this is because I have seen cases where the stack traces I am getting from gdb and what the Customer is getting in gdb are totally different, and it is not because the binaries do not match up as I always double-check that the binary used to load the core file is exactly the same as that used to generate it. There is something else at play here which is what I am trying to work out.
Charlie Martin almost 12 years

Yup, sounds like there's another variable. How much access can you get to this customer machine? What you describe would make me wonder about things like library load paths etc. Any chance this application is being run from different accounts with different environments?
Charlie Martin almost 12 years

Wait, I just looked at the actual crash. It's dying in a select call within the kernel. The rest of the stack, that you've omitted, has the information you need to find the select in your code, but what this kind of suggests is that the state of the socket on which you're doing the select has become inconsistent. Like, say, if you somehow closed the socket elsewhere before this select call.
Employed Russian almost 12 years

Chances that OP's executable is actually dying in select are slim to none. I challenge you to make an executable that will die in select with SIGSEGV on purpose. It's harder than you think.
Charlie Martin almost 12 years

The chances that it's dying in select are 100 percent, it's right there in the stack trace, ecco! Now, if you were to read all the way through the comment, you'd note that all I said was something else was likely making the state inconsistent, which would be actually what you're suggesting too. But thank you for playing.
Employed Russian almost 12 years

@CharlieMartin I believe you are 100% mistaken. And I (in my answer) am not at all suggesting that it died in select. "it's right there in the stack" -- that's an illusion, not the truth.
Charlie Martin almost 12 years

Then you're not interpreting what I'm saying correctly. The fault happened when the program counter was in the kernel doing a select system call. That doesn't mean the code calling the select was directly responsible, but it does mean the segmentation violation happened as it executed those instructions. Some address in that code was pointing to a wrong place.
Mohamed Bana almost 12 years

Employed Russian, what if it crashes in a place other than in a shared library, do I still need to get all the shared libraries?
Mohamed Bana almost 12 years

Just wanted to say thanks again. I've had one successful diagnosis with this technique. There were a few problems, e.g., the paths listed by info shared are often symlinks that have to be resolved manually. This isn't a big problem but it is laborious because one has to do this for a lot of the libraries. I might detail these in a separate Stack Overflow entry. Thank you, :).
Mohamed Bana almost 12 years

Employed Russian, if we statically link libc, pthreads etc. will I then still need to follow these steps? I am thinking no, but I'd just like to hear your thoughts on this. Have you automated the above process?
Mohamed Bana almost 12 years

Employed Russian, would Address Sanitizer (ASan) be any different from gdb? If it generates a stack trace do I still need to gather all the shared libraries loaded, even if they are not the top-most/bottom-most function in the stack trace? This is based on my experience of using the technique you have described --- 1) I ask a Support staff member to first run a stack trace on the machine it cored on. 2) If the stack trace contains any shared library functions, I have them gather these objects. 3) Debug on dev. machine.
Mohamed Bana almost 12 years

My thinking is that because gdb unwinds the stack, it is necessary to have the libraries present. With ASan, since we know exactly the symbol offset into the binary, we can just use addr2line on each frame to determine the function it failed on.