Why would introducing useless MOV instructions speed up a tight loop in x86_64 assembly?


Solution 1

The most likely cause of the speed improvement is that:

  • inserting a MOV shifts the subsequent instructions to different memory addresses
  • one of those moved instructions was an important conditional branch
  • that branch was being incorrectly predicted due to aliasing in the branch prediction table
  • moving the branch eliminated the alias and allowed the branch to be predicted correctly

Your Core2 doesn't keep a separate history record for each conditional jump. Instead it keeps a shared history of all conditional jumps. One disadvantage of global branch prediction is that the history is diluted by irrelevant information if the different conditional jumps are uncorrelated.

This little branch prediction tutorial shows how branch prediction buffers work. The prediction buffer is indexed by the lower portion of the address of the branch instruction. This works well unless two important, uncorrelated branches share the same lower bits. In that case, you end up with aliasing, which causes many mispredicted branches (which stall the instruction pipeline and slow down your program).
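
To make the aliasing concrete, here is a toy simulation of such a table in Free Pascal. The 64-entry size, the 2-bit saturating counters, and the two branch addresses are illustrative assumptions, not the actual Core 2 predictor layout:

    program bpalias;
    { Toy branch prediction table: 2-bit saturating counters indexed by the
      low bits of the branch address. Two uncorrelated branches whose
      addresses collide in those bits end up fighting over a single counter. }
    var
      table: array[0..63] of Integer;  // counter values 0..3; >= 2 predicts "taken"
      i, idx, hits: Integer;
      branchAddr: Cardinal;
      taken, predicted: Boolean;
    begin
      for i := 0 to 63 do table[i] := 1;
      hits := 0;
      for i := 1 to 10000 do
      begin
        // One branch is always taken, the other never taken; their
        // (hypothetical) addresses share the same low bits.
        if Odd(i) then begin branchAddr := $401040; taken := True; end
        else begin branchAddr := $403040; taken := False; end;
        idx := (branchAddr shr 2) and 63;      // same slot for both: aliasing
        predicted := table[idx] >= 2;
        if predicted = taken then Inc(hits);
        if taken and (table[idx] < 3) then Inc(table[idx]);
        if (not taken) and (table[idx] > 0) then Dec(table[idx]);
      end;
      WriteLn('hit rate with aliasing: ', (100 * hits) div 10000, '%');
    end.

With both branches hashing to the same slot, each one drags the counter the wrong way for the other, and the hit rate collapses to roughly 0%. Shift either branch by a few bytes (exactly what an inserted MOV does) and each branch gets its own counter and predicts almost perfectly.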

If you want to understand how branch mispredictions affect performance, take a look at this excellent answer: https://stackoverflow.com/a/11227902/1001643

Compilers typically don't have enough information to know which branches will alias and whether those aliases will be significant. However, that information can be determined at runtime with tools such as Cachegrind and VTune.
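
For example, Cachegrind can simulate a simple branch predictor and report per-line conditional-branch counts and mispredictions (the Bc/Bcm columns). A typical invocation looks like this (the program name is a placeholder):

    valgrind --tool=cachegrind --branch-sim=yes ./myprogram
    cg_annotate cachegrind.out.<pid>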

Solution 2

You may want to read http://research.google.com/pubs/pub37077.html

TL;DR: randomly inserting nop instructions in programs can easily increase performance by 5% or more, and no, compilers cannot easily exploit this. It's usually a combination of branch predictor and cache behaviour, but it can just as well be, e.g., a reservation station stall (even when no dependency chains are broken and there is no obvious resource over-subscription).
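
As an aside on the dependency-chain point: the classic example of a "useless" instruction with a real microarchitectural effect is the xor-zeroing idiom. A minimal sketch in FPC inline assembly (illustrative only; the exact behavior varies by CPU generation):

    {$asmmode intel}
    procedure zeroing_idiom; assembler; nostackframe;
    asm
      // Architecturally both lines below just clear EAX. But XOR reg,reg
      // nominally reads reg, and CPUs special-case the idiom: it is treated
      // as having no input dependency, so it starts a fresh dependency chain
      // instead of waiting on earlier writes to EAX. (On many later cores it
      // is even eliminated at register rename and costs no execution unit.)
      XOR eax, eax
      // MOV eax, 0   // same architectural result, larger encoding
    end;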

Solution 3

I believe that in modern CPUs, assembly instructions, while being the last visible layer a programmer uses to tell a CPU what to execute, are actually several layers removed from what the CPU actually executes.

Modern CPUs are RISC/CISC hybrids that translate CISC x86 instructions into internal instructions that are more RISC-like in behavior. Additionally, there are out-of-order execution analyzers, branch predictors, and Intel's "micro-op fusion", which try to group instructions into larger batches of simultaneous work (kind of like the VLIW/Itanium titanic). There are even cache boundaries that could make code run faster, for god-knows-why, if it's bigger (maybe the cache controller slots it more intelligently, or keeps it around longer).

CISC has always had an assembly-to-microcode translation layer, but the point is that with modern CPUs things are much, much more complicated. With all the extra transistor real estate in modern semiconductor fabrication plants, CPUs can probably apply several optimization approaches in parallel and then select the one at the end that provides the best speedup. The extra instructions may be biasing the CPU to use one optimization path that is better than the others.

The effect of the extra instructions probably depends on the CPU model / generation / manufacturer, and isn't likely to be predictable. Optimizing assembly language this way would require benchmarking against many CPU generations, perhaps with CPU-specific execution paths, and would only be worthwhile for really important code sections, although if you're writing assembly, you probably already know that.


Comments

  • tangentstorm
    tangentstorm about 4 years

    Background:

    While optimizing some Pascal code with embedded assembly language, I noticed an unnecessary MOV instruction, and removed it.

    To my surprise, removing the unnecessary instruction caused my program to slow down.

    I found that adding arbitrary, useless MOV instructions increased performance even further.

    The effect is erratic, and changes based on execution order: the same junk instructions transposed up or down by a single line produce a slowdown.

    I understand that the CPU does all kinds of optimizations and streamlining, but, this seems more like black magic.

    The data:

    A version of my code conditionally compiles three junk operations in the middle of a loop that runs 2**20==1048576 times. (The surrounding program just calculates SHA-256 hashes).

    The results on my rather old machine (Intel(R) Core(TM)2 CPU 6400 @ 2.13 GHz):

    avg time (ms) with -dJUNKOPS: 1822.84 ms
    avg time (ms) without:        1836.44 ms
    

    The programs were run 25 times in a loop, with the run order changing randomly each time.

    Excerpt:

    {$asmmode intel}
    procedure example_junkop_in_sha256;
      var s1, t2 : uint32;
      begin
        // Here are parts of the SHA-256 algorithm, in Pascal:
        // s0 {r10d} := ror(a, 2) xor ror(a, 13) xor ror(a, 22)
        // s1 {r11d} := ror(e, 6) xor ror(e, 11) xor ror(e, 25)
        // Here is how I translated them (side by side to show symmetry):
      asm
        MOV r8d, a                 ; MOV r9d, e
        ROR r8d, 2                 ; ROR r9d, 6
        MOV r10d, r8d              ; MOV r11d, r9d
        ROR r8d, 11    {13 total}  ; ROR r9d, 5     {11 total}
        XOR r10d, r8d              ; XOR r11d, r9d
        ROR r8d, 9     {22 total}  ; ROR r9d, 14    {25 total}
        XOR r10d, r8d              ; XOR r11d, r9d
    
        // Here is the extraneous operation that I removed, causing a speedup
        // s1 is the uint32 variable declared at the start of the Pascal code.
        //
        // I had cleaned up the code, so I no longer needed this variable, and 
        // could just leave the value sitting in the r11d register until I needed
        // it again later.
        //
        // Since copying to RAM seemed like a waste, I removed the instruction, 
        // only to discover that the code ran slower without it.
        {$IFDEF JUNKOPS}
        MOV s1,  r11d
        {$ENDIF}
    
        // The next part of the code just moves on to another part of SHA-256,
        // maj { r12d } := (a and b) xor (a and c) xor (b and c)
        mov r8d,  a
        mov r9d,  b
        mov r13d, r9d // Set aside a copy of b
        and r9d,  r8d
    
        mov r12d, c
        and r8d, r12d  { a and c }
        xor r9d, r8d
    
        and r12d, r13d { c and b }
        xor r12d, r9d
    
        // Copying the calculated value to the same s1 variable is another speedup.
        // As far as I can tell, it doesn't actually matter what register is copied,
        // but moving this line up or down makes a huge difference.
        {$IFDEF JUNKOPS}
        MOV s1,  r9d // after mov r12d, c
        {$ENDIF}
    
        // And here is where the two calculated values above are actually used:
        // T2 {r12d} := S0 {r10d} + Maj {r12d};
        ADD r12d, r10d
        MOV T2, r12d
    
      end
    end;
    

    Try it yourself:

    The code is online at GitHub if you want to try it out yourself.

    My questions:

    • Why would uselessly copying a register's contents to RAM ever increase performance?
    • Why would the same useless instruction provide a speedup on some lines, and a slowdown on others?
    • Is this behavior something that could be exploited predictably by a compiler?
    • Brett Hale
      Brett Hale almost 11 years
      There are all sorts of 'useless' instructions that can actually serve to break dependency chains, mark physical registers as retired, etc. Exploiting these operations requires some knowledge of the microarchitecture. Your question should provide a short sequence of instructions as a minimal example, rather than directing people to GitHub.
    • tangentstorm
      tangentstorm almost 11 years
      @BrettHale good point, thanks. I added a code excerpt with some commentary. Would copying a register's value to RAM mark the register as retired, even if the value in it is used later?
    • johnfound
      johnfound almost 11 years
      Try exchanging the order of "mov r8d, a" and "mov r9d, b" without "mov s1, r11d". r8d is used in the instruction "xor r10d, r8d", and "mov r8d, something" cannot be executed before the end of the "xor".
    • starwed
      starwed almost 11 years
      Can you put the standard deviation on those averages? There's no actual indication in this post that there's a real difference.
    • jakobbotsch
      jakobbotsch almost 11 years
      Can you please try timing the instructions using the rdtscp instruction, and check the clock cycles for both versions? (See the timing sketch after this comment thread.)
    • JustinDanielson
      JustinDanielson almost 11 years
      Since both useless instructions are writing to s1, you have a write-after-write dependency, and the instruction will get thrown out when the processor sees the second write to s1 in the queue. My guess was that it is putting the value in the queue so that it can be forwarded rather than fetched from the register file, but I don't think that's the case here.
    • Josell
      Josell almost 11 years
      I believe there are 'magic' numbers in computing, like 4, 16, 128 and other powers of 2. I think these useless instructions may pad the code out to those numbers, making it more suitable for computing operations. -- However, that's so low level.
    • Eamon Nerbonne
      Eamon Nerbonne almost 11 years
      @starwed: I believe you mean standard error, not standard deviation, though they're closely related. And yes, he should really, really be including some more details about the distribution: not just standard error, but min/max/median/quartiles too. Also, 25 runs just sounds a little low, to me, for such a small difference. (A standard-error helper is sketched after this comment thread.)
    • Mark Ransom
      Mark Ransom almost 11 years
      I had a similar question a few years ago: stackoverflow.com/questions/688325/…. Unfortunately the last time I tried to repeat those results I couldn't do it.
    • Lorenzo Dematté
      Lorenzo Dematté almost 11 years
      Can it also be due to memory alignment? I didn't do the math myself (lazy :P) but adding some dummy instructions can cause your code to be memory aligned...
    • Display Name
      Display Name over 10 years
      Do any real-world JIT compilers exploit such techniques? (for example, mainstream JVMs, Microsoft .NET, maybe something else)
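
    Regarding jakobbotsch's rdtscp suggestion above, a minimal cycle-counter in FPC inline assembly might look like the sketch below. One caveat: RDTSCP is not available on a Core 2 (Intel added it in later generations), so on the OP's machine plain RDTSC would have to be used instead.

        {$asmmode intel}
        function ReadTSC: QWord; assembler; nostackframe;
        asm
          // RDTSCP loads the 64-bit timestamp counter into EDX:EAX (and the
          // core ID into ECX, ignored here); unlike plain RDTSC it waits for
          // preceding instructions to finish, which suits timing measurements.
          RDTSCP
          SHL rdx, 32
          OR  rax, rdx      // the QWord result is returned in RAX
        end;

    Call it before and after the loop and subtract the two values to get a cycle count for each version.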
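
    And on the statistics point raised by starwed and Eamon Nerbonne, a small helper like this one (a sketch; the sample array is whatever timings were collected, and at least two samples are assumed) makes the standard error easy to report:

        { Standard error of the mean for n timing samples: stddev / sqrt(n). }
        function StdErrOfMean(const t: array of Double): Double;
        var
          i, n: Integer;
          mean, ss: Double;
        begin
          n := Length(t);
          mean := 0.0;
          for i := 0 to n - 1 do mean := mean + t[i];
          mean := mean / n;
          ss := 0.0;
          for i := 0 to n - 1 do ss := ss + Sqr(t[i] - mean);
          StdErrOfMean := Sqrt(ss / (n - 1)) / Sqrt(n);
        end;
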
  • tangentstorm
    tangentstorm almost 11 years
    Interesting. But is the processor (or FPC) smart enough to see that writing to RAM is a NOP in this case?
  • Marco van de Voort
    Marco van de Voort almost 11 years
    Assembler is not optimized.
  • alcuadrado
    alcuadrado almost 11 years
    Your answer is kind of confusing. In many places it seems like you are guessing, although most of what you say is correct.
  • AdamIerymenko
    AdamIerymenko almost 11 years
    Compilers could exploit it by doing incredibly expensive optimizations like repeatedly building and profiling and then varying the compiler output with a simulated annealing or genetic algorithm. I've read about some work in that area. But we're talking a minimum of 5-10 minutes of 100% CPU to compile, and the resulting optimizations would probably be CPU core model and even core or microcode revision specific.
  • Jonathan Leonard
    Jonathan Leonard almost 11 years
    @alcuadrado FWIW, I found the answer entirely comprehensible (even though the grammar wasn't perfect and the style awkward at times).
  • alcuadrado
    alcuadrado almost 11 years
    Maybe I should clarify. What I find confusing is the lack of certainty.
  • PuercoPop
    PuercoPop almost 11 years
    I wouldn't call them random NOPs; the paper explains why NOPs can have a positive effect on performance (tl;dr: stackoverflow.com/a/5901856/357198), and the random insertion of NOPs did result in performance degradation. What is interesting about the paper is that GCC's removal of 'strategic' NOPs had no overall effect on performance!
  • Jamon Holmgren
    Jamon Holmgren almost 11 years
    Mentally remove "I believe" from the beginning and there is no lack of certainty anymore. I'd submit an edit to that effect, but I don't have the technical knowledge to know if certainty is warranted.
  • jturolla
    jturolla almost 11 years
    Guessing that makes sense, backed by good argumentation, is completely valid.
  • Alex D
    Alex D almost 11 years
    No-one can really know for sure why the OP is observing this strange behavior, unless it was an engineer at Intel who had access to special diagnostic equipment. So all others can do is guess. That isn't @cowarldlydragon's fault.
  • tangentstorm
    tangentstorm almost 11 years
    Hmm. This sounds promising. The only conditional branches in this sha256 implementation are the checks for the end of the FOR loops. At the time, I had tagged this revision as an oddity in git and continued optimizing. One of my next steps was to rewrite the Pascal FOR loop myself in assembly, at which point these extra instructions no longer had a positive effect. Perhaps Free Pascal's generated code was harder for the processor to predict than the simple counter I replaced it with.
  • Raymond Hettinger
    Raymond Hettinger almost 11 years
    @tangentstorm That sounds like a good summary. The branch prediction table isn't very large, so one table entry might refer to more than one branch. This can render some predictions useless. The problem is easily fixed if one of the conflicting branches moves to another part of the table. Almost any little change can make this happen :-)
  • tangentstorm
    tangentstorm almost 11 years
    I think this is the most reasonable explanation of the specific behavior I observed, so I'm going to mark this as the answer. Thanks. :)
  • leander
    leander almost 11 years
    There is an absolutely excellent discussion of a similar problem one of the contributors to Bochs ran into; you may want to add this to your answer: emulators.com/docs/nx25_nostradamus.htm
  • fuz
    fuz over 9 years
    Downvote; none of what you say explains the behaviour OP is seeing. Your answer is useless.
  • Peter Cordes
    Peter Cordes over 8 years
    Some of these guesses are wrong. "CPUs ... apply several optimization approaches in parallel..." nope. Things in the OOO core are pretty deterministic. "maybe the cache controller slots it more intelligently": nope, there's no way in which that guess matches reality. The only way in which this answer is right is that it's saying "because modern OOO cores are complex, and there are interactions between nearby instructions". See the microarch pdf at agner.org/optimize to find out what those are, specifically. In this case, it's most likely an alignment -> decoder issue, since Core2...
  • Peter Cordes
    Peter Cordes over 8 years
    Insn alignment matters for a lot more than just branch targets. Decode bottlenecks are a huge issue for Core2 and Nehalem: it often has a hard time keeping its execution units busy. Sandybridge's introduction of the uop cache increased frontend throughput a huge amount. Aligning branch targets is done because of this issue, but it affects all code.
  • Cody Gray
    Cody Gray about 7 years
    This is all true, but has nothing to do with the code presented in the question.
  • Maxim Masiutin
    Maxim Masiutin about 7 years
    @CodyGray - thank you for your feedback. I have edited the reply and added a section about this case: a mov to memory surrounded by register operations prepares the cache, and it's free since the store unit is idle anyway, so the subsequent store operation will be faster.