How to track down a SIGFPE/Arithmetic exception

Solution 1

My first suggestion would be to open a memory window looking at the region around your stack pointer, and go digging through it to see if you can find uncorrupted stack frames nearby that might give you a clue as to where the crash was. Usually stack-trashes only burn a couple of the stack frames, so if you look upwards a few hundred bytes, you can get past the damaged area and get a general sense of where the code was. You can even look down the stack, on the assumption that the dead function might have called some other function before it died, and thus there might be an old frame still in memory pointing back at the current IP.

In the comments, I linked some presentation slides that illustrate the technique on a PowerPC — look at around #73-86 for a case study in a similar botched-stack crash. Obviously your ARM's stack frames will be laid out differently, but the general principle holds.
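The manual search described above can be partly automated once the memory around the stack pointer has been dumped. The following is only a minimal sketch: `find_code_pointers` is a hypothetical helper (not part of any debugger or library), and the bounds of the code segment are assumed to be known from the link map or `/proc/<pid>/maps`. It flags every dumped word that falls inside the code range, since such words are candidate return addresses left behind by intact frames.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: given raw words dumped from around the stack pointer,
// return the indices of the words whose value lies inside the code segment
// [text_begin, text_end). Those words are candidate return addresses.
std::vector<std::size_t> find_code_pointers(const std::vector<std::uintptr_t>& words,
                                            std::uintptr_t text_begin,
                                            std::uintptr_t text_end) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (words[i] >= text_begin && words[i] < text_end) {
            hits.push_back(i);  // remember where in the dump the candidate sits
        }
    }
    return hits;
}
```

Each hit still has to be checked by hand (for example by asking the debugger which symbol contains that address), because ordinary data can happen to look like a code address.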

Solution 2

(Using the basic idea from Fedor Skrynnikov, but with compiler help instead)

Compile your code with -pg. This inserts a call to mcount at the entry of every function (GCC's name for the hook depends on the target, for example __gnu_mcount_nc on ARM EABI; if you also want a hook on function exit, -finstrument-functions provides one). Do not link against the GCC profiling library; provide your own hook instead. The only thing the hook needs to do is keep a copy of the current stack, so just copy the top 128 bytes or so of the stack to a fixed buffer. Both the stack and the buffer will be in cache all the time, so it's fairly cheap.
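A hand-written mcount usually needs target-specific assembly, so the sketch below swaps in a different GCC flag, -finstrument-functions, whose entry and exit hooks can be written in plain C++. This is an assumption made for illustration and is not the -pg setup described above. The names `snapshot_stack`, `g_stack_snapshot` and `kSnapshotBytes` are hypothetical. The sketch also assumes a downward-growing stack with at least 128 bytes of live frames above the hook, which should hold for any function called from main.

```cpp
#include <cstddef>
#include <cstring>

constexpr std::size_t kSnapshotBytes = 128;             // size suggested in the answer
static unsigned char g_stack_snapshot[kSnapshotBytes];  // fixed buffer that survives a trashed stack

// Copy kSnapshotBytes starting at stack_top into the fixed buffer.
__attribute__((no_instrument_function))
void snapshot_stack(const void* stack_top) {
    std::memcpy(g_stack_snapshot, stack_top, kSnapshotBytes);
}

// GCC calls this on every function entry when the code is built with
// -finstrument-functions. The frame address of the hook is the lowest live
// frame, so the bytes above it belong to the callers.
extern "C" __attribute__((no_instrument_function))
void __cyg_profile_func_enter(void* /*this_fn*/, void* /*call_site*/) {
    snapshot_stack(__builtin_frame_address(0));
}

// Same snapshot on function exit.
extern "C" __attribute__((no_instrument_function))
void __cyg_profile_func_exit(void* /*this_fn*/, void* /*call_site*/) {
    snapshot_stack(__builtin_frame_address(0));
}
```

After the crash, the contents of `g_stack_snapshot` can be dumped from the debugger and inspected like any other piece of stack memory.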

Solution 3

You can implement special guards in the functions that might cause the exception. A guard is a simple class: in its constructor you push the file name and line (__FILE__, __LINE__) into a file, an array, or whatever storage you like. The key requirement is that this storage is shared by all instances of the class, so that it behaves like a stack. In the destructor you remove that entry. To make this work, create the guard on the first line of each function, and only ever on the stack; when execution leaves the current block, the destructor is called. At the moment of the exception, this improvised call stack tells you which function is causing the problem. Of course, you can compile the guard in only for debug builds.
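A minimal sketch of such a guard, assuming a single-threaded program and a fixed-size array as the shared storage. The names `TraceGuard`, `CallSite`, `kMaxDepth` and `TRACE_GUARD` are made up for this example.

```cpp
#include <cstddef>

struct CallSite {
    const char* file;
    int line;
};

constexpr std::size_t kMaxDepth = 64;  // assumed upper bound on nesting depth

class TraceGuard {
public:
    TraceGuard(const char* file, int line) {
        if (depth_ < kMaxDepth) {
            stack_[depth_] = CallSite{file, line};  // push this call site
        }
        ++depth_;  // still counted when the array is full, so pops stay balanced
    }
    ~TraceGuard() { --depth_; }  // pop when the enclosing block is left

    TraceGuard(const TraceGuard&) = delete;
    TraceGuard& operator=(const TraceGuard&) = delete;

    // Number of recorded entries (capped at kMaxDepth).
    static std::size_t depth() {
        if (depth_ < kMaxDepth) return depth_;
        return kMaxDepth;
    }
    static const CallSite& at(std::size_t i) { return stack_[i]; }

private:
    static CallSite stack_[kMaxDepth];  // storage shared by all guards
    static std::size_t depth_;
};

CallSite TraceGuard::stack_[kMaxDepth];
std::size_t TraceGuard::depth_ = 0;

// Compiled out of release builds, as the answer suggests.
#ifndef NDEBUG
#define TRACE_GUARD() TraceGuard trace_guard_instance(__FILE__, __LINE__)
#else
#define TRACE_GUARD() ((void)0)
#endif
```

Putting `TRACE_GUARD();` on the first line of a function records it. When the debugger stops on the signal, walking `TraceGuard::at(0)` up to `TraceGuard::depth() - 1` gives the improvised call stack even if the real one is unreadable.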

Solution 4

Enable generation of core files, and open the core file with the debugger.

Solution 5

Since it uses raise() to raise the exception, I would expect that signal() should be able to catch it. Is this not the case?

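Alternatively, you can set a conditional breakpoint at __aeabi_uldivmod that breaks when the divisor is 0. Under the ARM procedure call standard the 64-bit dividend is passed in r0:r1 and the 64-bit divisor in r2:r3, so the condition to test is $r2 == 0 && $r3 == 0 rather than a check on r1.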
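Catching the signal in the program itself could look roughly like the sketch below. It uses POSIX `sigaction` instead of `signal()`. The names `on_sigfpe`, `install_sigfpe_handler` and `g_sigfpe_count` are hypothetical. Returning from the handler is only reasonable here because the signal is assumed to be raised in software via raise(), as in the question; after a hardware trap, returning would re-execute the faulting instruction.

```cpp
#include <csignal>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t g_sigfpe_count = 0;  // how many SIGFPEs were seen

// Handler: only async-signal-safe work (write and a sig_atomic_t update).
extern "C" void on_sigfpe(int /*signo*/) {
    static const char msg[] = "SIGFPE caught\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)ignored;
    g_sigfpe_count = g_sigfpe_count + 1;
}

// Install the handler; returns true on success.
bool install_sigfpe_handler() {
    struct sigaction sa = {};
    sa.sa_handler = on_sigfpe;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGFPE, &sa, nullptr) == 0;
}
```

A breakpoint inside `on_sigfpe` stops the program at the same point where the debugger would report the signal, so on its own this does not recover a trashed stack. It is mainly useful for logging extra state at the moment of the failure.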

Author: celavek

Updated on June 05, 2022

Comments

  • celavek
    celavek almost 2 years

    I have a C++ application cross-compiled for Linux running on an ARM Cortex-A9 processor which is crashing with a SIGFPE/Arithmetic exception. Initially I thought it was caused by some optimization introduced by gcc's -O3 flag, but then I built it in debug mode and it still crashes.

    I debugged the application with gdb, which catches the exception, but unfortunately the operation triggering the exception also seems to trash the stack, so I cannot get any detailed information about the place in my code that causes it. The only detail I could finally get was the operation triggering the exception (from the following piece of stack trace):

        3 raise()  0x402720ac   
        2 __aeabi_uldivmod()  0x400bb0b8    
        1 __divsi3()  0x400b9880
    

    The __aeabi_uldivmod() function performs an unsigned long long division and remainder, so I tried the brute-force approach and searched my code for places that might use that operation, but without much success, as it proved to be a daunting task. I also tried to check for potential divisions by zero, but again the code base is pretty large, and checking every division operation is a cumbersome and somewhat dumb approach. So there must be a smarter way to figure out what's happening.

    Are there any techniques to track down the causes of such exceptions when the debugger cannot do much to help?

    UPDATE: After crunching on hex numbers, dumping memory and doing stack forensics (thanks Crashworks), I came across this gem in the ARM Compiler documentation (even though I'm not using the ARM Ltd. compiler):

    Integer division-by-zero errors can be trapped and identified by re-implementing the appropriate C library helper functions. The default behavior when division by zero occurs is that when the signal function is used, or __rt_raise() or __aeabi_idiv0() are re-implemented, __aeabi_idiv0() is called. Otherwise, the division function returns zero. __aeabi_idiv0() raises SIGFPE with an additional argument, DIVBYZERO.

    So I put a breakpoint at __aeabi_idiv0 (and __aeabi_ldiv0) et voilà! I had my complete stack trace before it was completely trashed. Thanks everybody for their very informative answers!
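The re-implementation mentioned in the quoted documentation could be sketched as follows. The signatures are taken from the ARM run-time ABI as best understood here. Whether these definitions actually replace the toolchain's own helpers depends on how libgcc is linked (static with weak symbols, or shared), which this sketch simply assumes works. The counter `g_div0_hits` is a hypothetical name.

```cpp
// Counts how often a division by zero reached the helpers; it mostly serves
// as a convenient line to put a breakpoint on.
static volatile int g_div0_hits = 0;

// Called by the 32-bit integer division helpers when the divisor is zero.
// The argument is the value the division should return.
extern "C" int __aeabi_idiv0(int return_value) {
    g_div0_hits = g_div0_hits + 1;  // break here: the caller's stack is still intact
    return return_value;
}

// Called by the 64-bit integer division helpers when the divisor is zero.
extern "C" long long __aeabi_ldiv0(long long return_value) {
    g_div0_hits = g_div0_hits + 1;  // break here as well
    return return_value;
}
```

Because these versions return instead of raising SIGFPE, the program keeps running after the division. Calling abort() in place of the return would stop at the first offending division instead.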

    Disclaimer: the "winning" answer was chosen solely and subjectively taking into account the weight of its suggestions into my debugging efforts, because more than one was informative and really helpful.

  • celavek
    celavek almost 13 years
    That could be a viable solution, but it doesn't quite work in my case because I have a large code base and no clue which part of my code causes the exception. Modifying every class and function to include what you suggest is not an option. Alternatively, I could use backtraces (gnu.org/software/libc/manual/html_node/Backtraces.html) directly in my code, but that comes with the same downside as your suggested approach.
  • celavek
    celavek almost 13 years
    Did that already. Same outcome. It cannot pinpoint the exact location that triggers the crash (or at least give me some idea of where to look).
  • BЈовић
    BЈовић almost 13 years
    @celavek In that case, write unit tests for your code. Check the values with assert before doing the division. I do not see anything else that you can do.
  • celavek
    celavek almost 13 years
    Sensible advice :) but it would take days or weeks to cover even the potential places with unit tests.
  • celavek
    celavek almost 13 years
    I'm trying this suggestion. Might be what I need. Nice and informative slides also, thanks.
  • Crashworks
    Crashworks almost 13 years
    I wish I could be more specific but I haven't worked with ARM in years. Good luck!
  • celavek
    celavek almost 13 years
    The signal is indeed caught by the debugger. But the stack trace is the one I posted in my question, which is not complete: __divsi3() is the actual implementation of the division "/" operator from libgcc, so it must have been called from somewhere in the code, hence my conclusion that the stack is actually corrupted by the exception. I set a conditional breakpoint at __aeabi_uldivmod() with $r1==0, but it's never hit and I still get the crash. One other thing is that I do not have debug symbols for that library, as the version of the toolchain I'm using (CodeSourcery-Lite) does not provide them.
  • Michal Gonda
    Michal Gonda about 6 years
    Could you elaborate on this technique a little bit more? Thanks ;-)