How hard is it (really) to decompile assembly code?

16,845

Solution 1

Here's the results of decompilation with the Hex-Rays Decompiler after I converted code to x86 (it does not support x64 at the moment), added some data definitions missing in the original post, and assembled it:

//-------------------------------------------------------------------------
// Data declarations

double LCPI1_0 =  1.0; // weak
double LCPI1_1[2] = {  0.0,  0.0 }; // weak
double LCPI1_2 =  1.2; // weak
double LCPI1_3 =  1.3; // weak


//----- (00000000) --------------------------------------------------------
void __usercall mystery(__m128d a1<xmm0>)
{
  __m128d v1; // xmm1@1
  __m128d v2; // xmm1@4
  __int128 v3; // xmm2@4
  __m128d v4; // xmm5@7
  __m128d v5; // xmm1@7

  v1 = (__m128d)*(unsigned __int64 *)&LCPI1_0;
  v1.m128d_f64[0] = LCPI1_0 - a1.m128d_f64[0];
  if ( LCPI1_0 - a1.m128d_f64[0] < 0.0 )
    v1 = _mm_xor_pd(v1, *(__m128d *)LCPI1_1);
  if ( v1.m128d_f64[0] >= LCPI1_2 )
  {
    v2 = (__m128d)*(unsigned __int64 *)&LCPI1_0;
    v3 = *(unsigned __int64 *)&LCPI1_3;
    while ( 1 )
    {
      v4 = a1;
      v4.m128d_f64[0] = (v4.m128d_f64[0] / v2.m128d_f64[0] + v2.m128d_f64[0]) * *(double *)&v3;
      v5 = v4;
      v5.m128d_f64[0] = v5.m128d_f64[0] * v5.m128d_f64[0] - a1.m128d_f64[0];
      if ( v5.m128d_f64[0] < 0.0 )
        v5 = _mm_xor_pd(a1, (__m128d)*(unsigned __int64 *)LCPI1_1);
      if ( v5.m128d_f64[0] < LCPI1_2 )
        break;
      v2 = a1;
    }
  }
}
// 90: using guessed type double LCPI1_0;
// 98: using guessed type double LCPI1_1[2];
// A8: using guessed type double LCPI1_2;
// B0: using guessed type double LCPI1_3;

// ALL OK, 1 function(s) have been successfully decompiled

Clearly, it could use some improvement (XMM support is somewhat basic right now), but I think the basic algorithm is already understandable.

Edit: since it's apparent that only the low double of all XMM registers is used, it seems the function actually works with scalar doubles and not vectors. As for the _mm_xor_pd (xorpd) intrinsic, I think it's just the way the compiler implements sign inversion - by xoring with a predefined constant which has 1s in sign bit positions and 0s everywhere else. With the above in mind, and after some cleanup, I get the following code:

double mystery(double a1)
{
  double v1; // xmm1@1
  double v2; // xmm1@4
  double v3; // xmm2@4
  double v4; // xmm5@7
  double v5; // xmm1@7

  v1 = LCPI1_0 - a1;
  if ( v1 < 0.0 )
    v1 = -v1;
  if ( v1 < LCPI1_2 )
  {
    v4 = LCPI1_0;
  }
  else
  {
    v2 = LCPI1_0;
    v3 = LCPI1_3;
    while ( 1 )
    {
      v4 = a1;
      v4 = (v4 / v2 + v2) * v3;
      v5 = v4;
      v5 = v5 * v5 - a1;
      if ( v5 < 0.0 )
        v5 = -v5;
      if ( v5 < LCPI1_2 )
        break;
      v2 = a1;
    }
  }
  return v4;
}

It produces assembly pretty similar to the original post.

Solution 2

Reverse engineering / decompiling any code is a matter of the time it takes vs the benefit in doing so; not how hard it is to do.

If you have some secret sauce that you absolutely can't allow to get out, then the only thing you can do is have that secret sauce as a web service which gets called upon as necessary. This way the binaries never leave your corporate walls.

Even obfuscation only goes so far as anything can be traced once a hacker has the compiled binaries on a system they control. Heck, the original PC clones were created by reverse engineering the IBM BIOS.

So, back to the point: Again, it's not a question of how hard something is, it's more a question of whether anyone would want to try... which is based on what perceived value they would get out of it. Whether direct dollars (receiving or saving), competitive advantage or simply bragging rights. Compounding this is the availability of the application: wider distribution equals higher potential for finding it's way into a hackers bucket of things to work on.

If those values exist, then you can be assured that someone will try and they will succeed. Which should lead you to the next question: What if they do? What's the worst outcome?

In some cases it's simply a lost sale, that you may not have gotten anyway. In others it could be the loss of the business.

Solution 3

Fundamentally, doing individual machine-instruction "reverse engineering" is pretty easy because machine instructions have extremely well defined semantics. This will give you bad C code, but surely that's not the goal. (Knowing that some binary pattern in a file is a machine instruction is technically Turing-hard, e.g, impossible in some cases; less likely to be so in the case of compiler-generated code).

Beyond that, you are trying infer algorithms and intent. That's extremely hard; where does the knowledge containing all this come from?

You might find my paper on reverse engineering interesting. It suggests a way to encode the necessary knowledge.

There are also commercial tools to do this to some degree. This doesn't go as far as the scheme my paper outlines, but still produces pretty reasonable C code, as I understand it. (I have no specific experience with this tool, but have great respect for the author and his tooling).

Share:
16,845

Related videos on Youtube

lindelof
Author by

lindelof

David's dayjob consists of working as a consultant for an IT company in Geneva. After dark he works on his main interests, which include home and building automation.

Updated on July 09, 2022

Comments

  • lindelof
    lindelof over 1 year

    I'm trying to find hard facts that will help my management understand how hard/easy it is to reverse-engineer compiled C code.

    Similar questions have been asked before on this site (see e.g. Is it possible to “decompile” a Windows .exe? Or at least view the Assembly? or Possible to decompile DLL written in C?), but the gist of these questions is that decompiling compiled C code is "hard, but not entirely impossible".

    In order to facilitate answers that are based in fact, I am including compiled code for a mystery function, and I propose that answers to this question measure the success or failure of the proposed techniques by whether they can determine what this function does. This may be unusual for SO but I think it's the best way to get "good subjective" or factual answers to this engineering question. Therefore, What is your best guess at what this function is doing, and how?

    This is the compiled code, compiled on Mac OSX with gcc:

    _mystery:
    Leh_func_begin1:
        pushq   %rbp
    Ltmp0:
        movq    %rsp, %rbp
    Ltmp1:
        movsd   LCPI1_0(%rip), %xmm1
        subsd   %xmm0, %xmm1
        pxor    %xmm2, %xmm2
        ucomisd %xmm1, %xmm2
        jbe     LBB1_2
        xorpd   LCPI1_1(%rip), %xmm1
    LBB1_2:
        ucomisd LCPI1_2(%rip), %xmm1
        jb      LBB1_8
        movsd   LCPI1_0(%rip), %xmm1
        movsd   LCPI1_3(%rip), %xmm2
        pxor    %xmm3, %xmm3
        movsd   LCPI1_1(%rip), %xmm4
        jmp     LBB1_4
        .align  4, 0x90
    LBB1_5:
        ucomisd LCPI1_2(%rip), %xmm1
        jb      LBB1_9
        movapd  %xmm5, %xmm1
    LBB1_4:
        movapd  %xmm0, %xmm5
        divsd   %xmm1, %xmm5
        addsd   %xmm1, %xmm5
        mulsd   %xmm2, %xmm5
        movapd  %xmm5, %xmm1
        mulsd   %xmm1, %xmm1
        subsd   %xmm0, %xmm1
        ucomisd %xmm1, %xmm3
        jbe     LBB1_5
        xorpd   %xmm4, %xmm1
        jmp     LBB1_5
    LBB1_8:
        movsd   LCPI1_0(%rip), %xmm5
    LBB1_9:
        movapd  %xmm5, %xmm0
        popq    %rbp
        ret 
    Leh_func_end1:
    

    UPDATE

    @Igor Skochinsky is the first to find the right answer: it is indeed a naive implementation of Heron's algorithm for calculating square roots. The original source code is here:

    #include <stdio.h>
    
    #define EPS 1e-7
    
    double mystery(double x){
      double y=1.;
      double diff;
      diff=y*y-x;
      diff=diff<0?-diff:diff;
      while(diff>=EPS){
        y=(y+x/y)/2.;
        diff=y*y-x;
        diff=diff<0?-diff:diff;
      }
      return y;
    }
    
    int main() {
      printf("The square root of 2 is %g\n", mystery(2.));
    }
    
    • Kerrek SB
      Kerrek SB almost 11 years
      You have 7k+ reputation and address "the site moderators"?? Haven't you worked out how this site works?
    • djechlin
      djechlin almost 11 years
      I'm wondering if I should just start a meta.so thread to handle the legitimacy of this question now...
    • CodesInChaos
      CodesInChaos almost 11 years
      Pity I don't have a hexrays decompiler license. I suspect it's simplify that code a lot.
    • Oliver Charlesworth
      Oliver Charlesworth almost 11 years
      @djechlin: How is "guess what my assembler does?" ever a valid question? (or was that sarcasm?)
    • Clifford
      Clifford almost 11 years
      I would suggest when explaining this to "management" using the phrase "impractical". The cases where it is even remotely possible require at least most of the following to be true: a) You know what compiler was used, b) you know what compile options were used, c) The code was not optimised, d) the symbol table was not stripped and e) the code is pretty short.
    • djechlin
      djechlin almost 11 years
      @OliCharlesworth because the OP is struggling with the fact that the question in title is legitimate but the answers are not constructive. Forcing answers to be able to resolve OP's question brings the question to being constructive. Not sarcastic; note the re-open votes.
    • Oliver Charlesworth
      Oliver Charlesworth almost 11 years
      @djechlin: Hmmm, I fail to see how the question title could be answered objectively, let alone this idea of a decompilation challenge!
    • djechlin
      djechlin almost 11 years
      @OliCharlesworth sooooo yes there should be a thread on meta on this topic?
    • Oliver Charlesworth
      Oliver Charlesworth almost 11 years
      @djechlin: I'm not sure how that follows ;)
    • Bo Persson
      Bo Persson almost 11 years
      @lindelof - I'll give you another example here where 10 lines of inlined functions and C++ templates are compiled into 4-5 machine instructions. What are the odds that anyone can reproduce the original source code?
    • lindelof
      lindelof almost 11 years
      Thanks everybody for the comments so far. Just a few notes: @Kerrek SB not everybody has the right on this site to moderate questions. I was simply addressing those of you with close rights. Oli Charlesworth: my thinking was that this site draws probably the top programmers on the planet. I'm simply piggy-backing on this site's popularity to see whether anyone has a tool that can decompile a simple function. Bo Persson: thanks for that example, I find it very instructive.
    • old_timer
      old_timer almost 11 years
      In general it is impossible, the original source is absolutely impossible, in rare cases where no optimizer was used and the code was so trivial that you dont need to bother going back to C, then you could reconstruct something that is functionally the same.
    • old_timer
      old_timer almost 11 years
      Think of this as converting a wav file into an mp3, (an image to jpg, a movie to mpeg, etc) a lossy compression. You cannot get back the original signal. The same thing happens in the compiler, information from the source code being compile is lost, is not visible in the output, you cannot go back to the original. Functionally similar C code where possible is no more readable or maintainable than the assembly language, you are better off if you have to make modifications to do it in asm or write C code by hand from an analysis of the asm.
    • Aki Suihkonen
      Aki Suihkonen almost 11 years
      The constants LCPI1_[x] would be needed. It would help in a) generating the binary compatible code and b) to help recognize a textbook algorithm (such as inverse reciprocal)
  • Ira Baxter
    Ira Baxter almost 11 years
    So, what is your best guess as to what this code is doing? I think you need algorithm recognition on top of low level code recovery. PS: nice job reverse engineering to where you got, +1 in spite of being closed :)
  • Igor Skochinsky
    Igor Skochinsky almost 11 years
    Looks like the Babylonian method of square root calculation. LCPI1_0 is the initial approximation, LCPI1_2 is epsilon, and LCPI1_3 is the constant 0.5.
  • lindelof
    lindelof almost 11 years
    @IgorSkochinsky congratulations, you dit it!