Enforcing statement order in C++

21,006

Solution 1

I'd like to try to provide a somewhat more comprehensive answer after this was discussed with the C++ standards committee. In addition to being a member of the C++ committee, I'm also a developer on the LLVM and Clang compilers.

Fundamentally, there is no way to use a barrier or some operation in the sequence to achieve these transformations. The fundamental problem is that the operational semantics of something like an integer addition are totally known to the implementation. It can simulate them, it knows they cannot be observed by correct programs, and is always free to move them around.

We could try to prevent this, but it would have extremely negative results and would ultimately fail.

First, the only way to prevent this in the compiler is to tell it that all of these basic operations are observable. The problem is that this then would preclude the overwhelming majority of compiler optimizations. Inside the compiler, we have essentially no good mechanisms to model that the timing is observable but nothing else. We don't even have a good model of what operations take time. As an example, does converting a 32-bit unsigned integer to a 64-bit unsigned integer take time? It takes zero time on x86-64, but on other architectures it takes non-zero time. There is no generically correct answer here.

But even if we succeed through some heroics at preventing the compiler from reordering these operations, there is no guarantee this will be enough. Consider a valid and conforming way to execute your C++ program on an x86 machine: DynamoRIO. This is a system that dynamically evaluates the machine code of the program. One thing it can do is online optimizations, and it is even capable of speculatively executing the entire range of basic arithmetic instructions outside of the timing. And this behavior isn't unique to dynamic evaluators, the actual x86 CPU will also speculate (a much smaller number of) instructions and reorder them dynamically.

The essential realization is that the fact that arithmetic isn't observable (even at the timing level) is something that permeates the layers of the computer. It is true for the compiler, the runtime, and often even the hardware. Forcing it to be observable would both dramatically constrain the compiler, but it would also dramatically constrain the hardware.

But all of this should not cause you to lose hope. When you want to time the execution of basic mathematical operations, we have well studied techniques that work reliably. Typically these are used when doing micro-benchmarking. I gave a talk about this at CppCon2015: https://youtu.be/nXaxk27zwlk

The techniques shown there are also provided by various micro-benchmark libraries such as Google's: https://github.com/google/benchmark#preventing-optimization

The key to these techniques is to focus on the data. You make the input to the computation opaque to the optimizer and the result of the computation opaque to the optimizer. Once you've done that, you can time it reliably. Let's look at a realistic version of the example in the original question, but with the definition of foo fully visible to the implementation. I've also extracted a (non-portable) version of DoNotOptimize from the Google Benchmark library which you can find here: https://github.com/google/benchmark/blob/v1.0.0/include/benchmark/benchmark_api.h#L208

#include <chrono>

template <class T>
__attribute__((always_inline)) inline void DoNotOptimize(const T &value) {
  asm volatile("" : "+m"(const_cast<T &>(value)));
}

// The compiler has full knowledge of the implementation.
static int foo(int x) { return x * 2; }

auto time_foo() {
  using Clock = std::chrono::high_resolution_clock;

  auto input = 42;

  auto t1 = Clock::now();         // Statement 1
  DoNotOptimize(input);
  auto output = foo(input);       // Statement 2
  DoNotOptimize(output);
  auto t2 = Clock::now();         // Statement 3

  return t2 - t1;
}

Here we ensure that the input data and the output data are marked as un-optimizable around the computation foo, and only around those markers are the timings computed. Because you are using data to pincer the computation, it is guaranteed to stay between the two timings and yet the computation itself is allowed to be optimized. The resulting x86-64 assembly generated by a recent build of Clang/LLVM is:

% ./bin/clang++ -std=c++14 -c -S -o - so.cpp -O3
        .text
        .file   "so.cpp"
        .globl  _Z8time_foov
        .p2align        4, 0x90
        .type   _Z8time_foov,@function
_Z8time_foov:                           # @_Z8time_foov
        .cfi_startproc
# BB#0:                                 # %entry
        pushq   %rbx
.Ltmp0:
        .cfi_def_cfa_offset 16
        subq    $16, %rsp
.Ltmp1:
        .cfi_def_cfa_offset 32
.Ltmp2:
        .cfi_offset %rbx, -16
        movl    $42, 8(%rsp)
        callq   _ZNSt6chrono3_V212system_clock3nowEv
        movq    %rax, %rbx
        #APP
        #NO_APP
        movl    8(%rsp), %eax
        addl    %eax, %eax              # This is "foo"!
        movl    %eax, 12(%rsp)
        #APP
        #NO_APP
        callq   _ZNSt6chrono3_V212system_clock3nowEv
        subq    %rbx, %rax
        addq    $16, %rsp
        popq    %rbx
        retq
.Lfunc_end0:
        .size   _Z8time_foov, .Lfunc_end0-_Z8time_foov
        .cfi_endproc


        .ident  "clang version 3.9.0 (trunk 273389) (llvm/trunk 273380)"
        .section        ".note.GNU-stack","",@progbits

Here you can see the compiler optimizing the call to foo(input) down to a single instruction, addl %eax, %eax, but without moving it outside of the timing or eliminating it entirely despite the constant input.

Hope this helps, and the C++ standards committee is looking at the possibility of standardizing APIs similar to DoNotOptimize here.

Solution 2

Summary:

There seems to be no guaranteed way to prevent reordering, but as long as link-time/full-program optimisation is not enabled, locating the called function in a separate compilation unit seems a fairly good bet. (At least with GCC, although logic would suggest that this is likely with other compilers too.) This comes at the cost of the function call - inlined code is by definition in the same compilation unit and open to reordering.

Original answer:

GCC reorders the calls under -O2 optimisation:

#include <chrono>
static int foo(int x)    // 'static' or not here doesn't affect ordering.
{
    return x*2;
}
int fred(int x)
{
    auto t1 = std::chrono::high_resolution_clock::now();
    int y = foo(x);
    auto t2 = std::chrono::high_resolution_clock::now();
    return y;
}

GCC 5.3.0:

g++ -S --std=c++11 -O0 fred.cpp :

_ZL3fooi:
        pushq   %rbp
        movq    %rsp, %rbp
        movl    %ecx, 16(%rbp)
        movl    16(%rbp), %eax
        addl    %eax, %eax
        popq    %rbp
        ret
_Z4fredi:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $64, %rsp
        movl    %ecx, 16(%rbp)
        call    _ZNSt6chrono3_V212system_clock3nowEv
        movq    %rax, -16(%rbp)
        movl    16(%rbp), %ecx
        call    _ZL3fooi
        movl    %eax, -4(%rbp)
        call    _ZNSt6chrono3_V212system_clock3nowEv
        movq    %rax, -32(%rbp)
        movl    -4(%rbp), %eax
        addq    $64, %rsp
        popq    %rbp
        ret

But:

g++ -S --std=c++11 -O2 fred.cpp :

_Z4fredi:
        pushq   %rbx
        subq    $32, %rsp
        movl    %ecx, %ebx
        call    _ZNSt6chrono3_V212system_clock3nowEv
        call    _ZNSt6chrono3_V212system_clock3nowEv
        leal    (%rbx,%rbx), %eax
        addq    $32, %rsp
        popq    %rbx
        ret

Now, with foo() as an extern function:

#include <chrono>
int foo(int x);
int fred(int x)
{
    auto t1 = std::chrono::high_resolution_clock::now();
    int y = foo(x);
    auto t2 = std::chrono::high_resolution_clock::now();
    return y;
}

g++ -S --std=c++11 -O2 fred.cpp :

_Z4fredi:
        pushq   %rbx
        subq    $32, %rsp
        movl    %ecx, %ebx
        call    _ZNSt6chrono3_V212system_clock3nowEv
        movl    %ebx, %ecx
        call    _Z3fooi
        movl    %eax, %ebx
        call    _ZNSt6chrono3_V212system_clock3nowEv
        movl    %ebx, %eax
        addq    $32, %rsp
        popq    %rbx
        ret

BUT, if this is linked with -flto (link-time optimisation):

0000000100401710 <main>:
   100401710:   53                      push   %rbx
   100401711:   48 83 ec 20             sub    $0x20,%rsp
   100401715:   89 cb                   mov    %ecx,%ebx
   100401717:   e8 e4 ff ff ff          callq  100401700 <__main>
   10040171c:   e8 bf f9 ff ff          callq  1004010e0 <_ZNSt6chrono3_V212system_clock3nowEv>
   100401721:   e8 ba f9 ff ff          callq  1004010e0 <_ZNSt6chrono3_V212system_clock3nowEv>
   100401726:   8d 04 1b                lea    (%rbx,%rbx,1),%eax
   100401729:   48 83 c4 20             add    $0x20,%rsp
   10040172d:   5b                      pop    %rbx
   10040172e:   c3                      retq

Solution 3

Reordering may be done by the compiler, or by the processor.

Most compilers offer a platform-specific method to prevent reordering of read-write instructions. On gcc, this is

asm volatile("" ::: "memory");

(More information here)

Note that this only indirectly prevents reordering operations, as long as they depend on the reads / writes.

In practice I haven't yet seen a system where the system call in Clock::now() does have the same effect as such a barrier. You could inspect the resulting assembly to be sure.

It is not uncommon, however, that the function under test gets evaluated during compile time. To enforce "realistic" execution, you may need to derive input for foo() from I/O or a volatile read.


Another option would be to disable inlining for foo() - again, this is compiler specific and usually not portable, but would have the same effect.

On gcc, this would be __attribute__ ((noinline))


@Ruslan brings up a fundamental issue: How realistic is this measurement?

Execution time is affected by many factors: one is the actual hardware we are running on, the other is concurrent access to shared resources like cache, memory, disk and CPU cores.

So what we usually do to get comparable timings: make sure they are reproducible with a low error margin. This makes them somewhat artificial.

"hot cache" vs. "cold cache" execution performance can easily differ by an order of magnitude - but in reality, it will be something inbetween ("lukewarm"?)

Solution 4

The C++ language defines what is observable in a number of ways.

If foo() does nothing observable, then it can be eliminated completely. If foo() only does a computation that stores values in "local" state (be it on the stack or in an object somewhere), and the compiler can prove that no safely-derived pointer can get into the Clock::now() code, then there are no observable consequences to moving the Clock::now() calls.

If foo() interacted with a file or the display, and the compiler cannot prove that Clock::now() does not interact with the file or the display, then reordering cannot be done, because interaction with a file or display is observable behavior.

While you can use compiler-specific hacks to force code not to move around (like inline assembly), another approach is to attempt to outsmart your compiler.

Create a dynamically loaded library. Load it prior to the code in question.

That library exposes one thing:

namespace details {
  void execute( void(*)(void*), void *);
}

and wraps it like this:

template<class F>
void execute( F f ) {
  struct bundle_t {
    F f;
  } bundle = {std::forward<F>(f)};

  auto tmp_f = [](void* ptr)->void {
    auto* pb = static_cast<bundle_t*>(ptr);
    (pb->f)();
  };
  details::execute( tmp_f, &bundle );
}

which packs up a nullary lambda and uses the dynamic library to run it in a context that the compiler cannot understand.

Inside the dynamic library, we do:

void details::execute( void(*f)(void*), void *p) {
  f(p);
}

which is pretty simple.

Now to reorder the calls to execute, it must understand the dynamic library, which it cannot while compiling your test code.

It can still eliminate foo()s with zero side effects, but you win some, you lose some.

Solution 5

No it can't. According to the C++ standard [intro.execution]:

14 Every value computation and side effect associated with a full-expression is sequenced before every value computation and side effect associated with the next full-expression to be evaluated.

A full-expression is basically a statement terminated by a semicolon. As you can see the above rule stipulates statements must be executed in order. It is within statements that the compiler is allowed more free rein (i.e. it is under some circumstance allowed to evaluate expressions that make up a statement in orders other than left-to-right or anything else specific).

Note the conditions for the as-if rule to apply are not met here. It is unreasonable to think that any compiler would be able to prove that reordering calls to get the system time would not affect observable program behaviour. If there was a circumstance in which two calls to get the time could be reordered without changing observed behaviour, it would be extremely inefficient to actually produce a compiler that analyses a program with enough understanding to be able to infer this with certainty.

Share:
21,006
S2108887
Author by

S2108887

Updated on July 12, 2022

Comments

  • S2108887
    S2108887 almost 2 years

    Suppose I have a number of statements that I want to execute in a fixed order. I want to use g++ with optimization level 2, so some statements could be reordered. What tools does one have to enforce a certain ordering of statements?

    Consider the following example.

    using Clock = std::chrono::high_resolution_clock;
    
    auto t1 = Clock::now(); // Statement 1
    foo();                  // Statement 2
    auto t2 = Clock::now(); // Statement 3
    
    auto elapsedTime = t2 - t1;
    

    In this example it is important that the statements 1-3 are executed in the given order. However, can't the compiler think statement 2 is independent of 1 and 3 and execute the code as follows?

    using Clock=std::chrono::high_resolution_clock;
    
    foo();                  // Statement 2
    auto t1 = Clock::now(); // Statement 1
    auto t2 = Clock::now(); // Statement 3
    
    auto elapsedTime = t2 - t1;
    
  • M.M
    M.M almost 8 years
    There's still the as-if rule though
  • Revolver_Ocelot
    Revolver_Ocelot almost 8 years
    By as-if rule compiler can do anything to code as long as it does not change observable behavior. Time of execution is not observable. So it can reorder arbutrary lines of code as long as result would be same (most compiler do sensible thing and not reorder time calls, but it is not required)
  • Frédéric Hamidi
    Frédéric Hamidi almost 8 years
    Time of execution is not observable. This is quite strange. From a practical, non-technical point of view, time of execution (a.k.a. "performance") is very observable.
  • Lightness Races in Orbit
    Lightness Races in Orbit almost 8 years
    @FrédéricHamidi: But it's not a metric by which C++ defines program behaviour.
  • Peter
    Peter almost 8 years
    Depends on how you measure time. It is not possible to measure the number of clock cycles taken to execute some body of code in standard C++.
  • Peter
    Peter almost 8 years
    Problem is that the "as if" rule means that this answer doesn't always address the question. The compiler is allowed to reorder statements in order to optimise execution time, as long as the observable output of the program (e.g. to files) doesn't change.
  • Revolver_Ocelot
    Revolver_Ocelot almost 8 years
    I agree that it would be silly, but I wold not call it non-conformant. Compiler can have knowledge what system call on concrete system exactly does and if it has side effects. I would expect compilers to not reorder such call just to cover common use case, allowing for better user experience, not because standard prohibits it.
  • Peter
    Peter almost 8 years
    Might want to note that it is possible to reduce the odds of the compiler reordering statements. For example, calling a function that is defined in another compilation reduces the odds of statements before the function call being shifted so they are executed after - or vice versa. Similarly with inline assembler (although I've seen mention that gcc is smart enough to understand inline assembler in some cases). The only way to be sure, however, is to examine code emitted by the compiler.
  • Admin
    Admin almost 8 years
    I don't buy the reasoning for the "as if' rule -- the same logic would imply that the compiler can't invoke it to do any optimization between two calls to the clock either, or really, to do any optimizations at all.
  • Smeeheey
    Smeeheey almost 8 years
    @Hurkyl No, I would put it differently - any optimisation between clock calls stays between clock calls.
  • Revolver_Ocelot
    Revolver_Ocelot almost 8 years
    Execution time is not observable so "observable behaviour clearly would be affected" is wrong. In addition I would expect, say Microsoft compiler compiling for Win system to have an extensive knowledge about system calls and be able to optimise them, so previous paragraph is questionable too.
  • Smeeheey
    Smeeheey almost 8 years
    @Revolver_Ocelot - fine, I have to agree that I haven't justified the statement "observable behaviour would be affected" enough, so I removed it. However, the rest of the point still holds. I would be very surprised if any compiler allowed itself this optimisation.
  • Smeeheey
    Smeeheey almost 8 years
    Note I'm not claiming it is theoretically impossible, I'm talking about a practical point here.
  • Revolver_Ocelot
    Revolver_Ocelot almost 8 years
    I agree about impracticality of such behavior. I was talking from language lawyer position.
  • Cody Gray
    Cody Gray almost 8 years
    @revolver It does not. Microsoft's compiler assumes nothing about calls to Win32 API functions. It is a bit smarter about library functions, providing intrinsics and inlining code, etc., but GCC and Clang both do exactly the same thing there. It could not possibly do so, otherwise a compiler would be tied to a particular version of the operating system.
  • Ruslan
    Ruslan almost 8 years
    @Peter gcc never analyzes inline assembly code. It can only do some optimizations based on data in extended asm syntax — clobber list, inputs, outputs etc.
  • Ruslan
    Ruslan almost 8 years
    Your hack with asm affects execution time of the statements between timer calls: the code after memory clobber has to reload all variables from memory.
  • Lightness Races in Orbit
    Lightness Races in Orbit almost 8 years
    @Revolver_Ocelot: Optimisations that change the semantics of the program (okay, save for copy elision) are non-compliant to the standard, whether you agree or not.
  • Cody Gray
    Cody Gray almost 8 years
    So does MSVC and ICC. Clang is the only one that seems to preserve the original sequence.
  • dbanet
    dbanet almost 8 years
    @Cody Microsoft compilers are tied to a particular version of operating system. For example, you cannot generate a Win16 application with the latest VS. Worse yet, you probably will have a hard time making the latest VS generate something that would run on NT 4. Hell, I think they already dropped XP, and that is NT 5.
  • Cody Gray
    Cody Gray almost 8 years
    @dba You're mixing a few things together. The linker can no longer generate Win16 applications, that's true enough, but that's because they have removed support for generating that type of binary. WIn16 apps don't use the PE format. That doesn't imply that either the compiler or linker has special knowledge about the API functions. The other issue is related to the runtime library. There is absolutely no problem getting the latest version of MSVC to generate a binary that runs on NT 4. I have done it. The problem comes as soon as you try to link in the CRT, which calls functions not available.
  • peterchen
    peterchen almost 8 years
    @Ruslan: Their hack, not mine. There are different levels of purging, and doing something like that is unavoidable for reproducible results.
  • dbanet
    dbanet almost 8 years
    @Cody doesn't the CRT get linked dynamically before execution on the target platform? Anyway, I was not claiming what I just said necessarily implies MSVC assumes something about the calls to Win32, I am just showing the contrary to the consequent is not implied by what you said, by counterexample.
  • paulm
    paulm almost 8 years
    With asm volatile("" ::: "memory"); does it prevent the ordering?
  • Cody Gray
    Cody Gray almost 8 years
    @dba The CRT can be linked either statically or dynamically, but that is not relevant. The problem is that there are different versions of the CRT, and newer versions of the CRT call Windows API functions that are not available on older versions of the OS. Like the version of the CRT bundled with VS 2010 calls EncodePointer, which doesn't exist on Win 2K. If you linked the CRT statically, then it would fail because EncodePointer doesn't exist. If you linked it dynamically, then it would fail because the CRT DLL is missing. What I'm saying is that you have not provided a counterexample.
  • Cody Gray
    Cody Gray almost 8 years
    If you use a different version of the CRT, one that supports Win NT 4, then it will work fine, no matter which version of the compiler that you compile it with. Or if you stub out EncodePointer, providing your own implementation of it. From the compiler's perspective, the CRT, the Windows API, and your own functions are all black boxes. It has no special knowledge about what they do or what they mean. They're just functions that conform to a particular ABI.
  • dbanet
    dbanet almost 8 years
    @Cody you have now just produced an elaboration on my counterexample to the belief that (especially Microsoft's) compilers are not tied to operating system versions (like, come on, MSVC is limited to one flavour of NT (well, not true anymore since they have Windows Phone and Surface and so on, which are ARM)). You cannot prove MSVC assumes nothing about calls to Win32 API function from a belief for which a valid counterexample exists.
  • David Cary
    David Cary almost 8 years
    +1 for a good example of reordering. But I wish you added another sentence about how to prevent that from happening. Can we chop this up into several small functions, and can we assume one function finishes before the next one starts even if it's an inline function, or do we need "volatile" or "std::atomic_thread_fence" to really be sure statements occur in a particular order?
  • Niall
    Niall almost 8 years
    With the code more as presented in the OP, gcc seems to do the right thing; godbolt.org/g/C93rRK
  • Cody Gray
    Cody Gray almost 8 years
    @dba I guess I don't understand what you are saying. I still feel like you are mixing together multiple concepts. The linker is designed to generate a single type of binary, a binary format supported only by Windows NT. Same for the x86 instruction set. But that has nothing to do with having special knowledge about API functions and optimizing accordingly. It does not do that. The runtime library uses API functions, but so can your code. The compiler doesn't care; it makes no distinction between your code and the CRT's code. The statement of mine that you quoted is still absolutely valid.
  • Jeremy
    Jeremy almost 8 years
    @Niall - As long as foo() is in a separate compilation unit this seems to be the case. If foo() is defined in the same compilation unit the reordering still happens - presumably because in this latter case the compiler can determine that foo() is not affected by any potential side effects of now(), whereas in the former case it can't.
  • S2108887
    S2108887 almost 8 years
    I agree with David Cary that this is a good example and I would also like to see information about ways to avoid reordering.
  • dbanet
    dbanet almost 8 years
    @Cody it's sad you don't understand what am I saying and keep digging into the semantics. I'm not saying what I quoted is wrong, I'm saying its premise is wrong. Additionally I think @​revolver's original expectation of compilers made exclusively for a platform being able to figure stuff about that platform out is reasonable and sound, at least theoretically, if not even practically, as previously shown.
  • phuclv
    phuclv almost 8 years
    you don't use t1 and t2 anywhere so it may think the result can be discarded and reorder the code
  • Niall
    Niall almost 8 years
    Even if the time results are used, the reordering happens in the presence of a definition of the function between the two time calls... godbolt.org/g/EmvZMy
  • Jeremy
    Jeremy almost 8 years
    @Niall - I can't offer anything more concrete, but I think my comment alludes to the underlying reason: The compiler knows that foo() cannot affect now(), nor vice versa, and so does the reordering. Various experiments involving extern scope functions and data seem to confirm this. This includes having static foo() depend on a file-scope variable N - if N is declared as static, reordering occurs, whereas if it's declared non-static (i.e. it's visible to other compilation units, and hence potentially subject to side effects of extern functions such as now()) reordering does not occur.
  • Jeremy
    Jeremy almost 8 years
    @ Lưu Vĩnh Phúc: Except that the calls themselves are not elided. Once again, I suspect this is because the compiler doesn't know what their side effects might be - but it does know that those side effects cannot influence the behaviour of foo().
  • Jeremy
    Jeremy almost 8 years
    And a final note: specifying -flto (link-time optimisation) causes reordering even in otherwise non-reordered cases.
  • Yakk - Adam Nevraumont
    Yakk - Adam Nevraumont almost 8 years
    In the trivial case of int x = 0; clock(); x = y*2; clock(); there are no defined ways for the clock() code to interact with the state of x. Under the C++ standard, it doesn't have to know what clock() does -- it could examine the stack (and notice when the computation occurs), but that isn't C++'s problem.
  • Lightness Races in Orbit
    Lightness Races in Orbit almost 8 years
    A declaration of an integer, and a call to some foo(), are not in any way the same thing.
  • Jeremy
    Jeremy almost 8 years
    @Lightness - Agreed, hence my subsequent comments.
  • Cody Gray
    Cody Gray almost 8 years
    "another approach is to attempt to outsmart your compiler" If that phrase isn't a sign of having gone down the rabbit hole, I don't know what is. :-)
  • Jeremy
    Jeremy almost 8 years
    Modified my example to match the question. Results and conclusions (most of which were based on this modified code anyway) remain valid.
  • Voo
    Voo almost 8 years
    @dbanet I'm not sure either what you're trying to argue. You clearly stated "Microsoft compilers are tied to a particular version of operating system". This is provably wrong, since you can easily generate binaries with VS2015 that will run on five different operating systems (and VS2008 can generate perfectly fine binaries for Win10 - clearly without a time machine that'd be rather complicated to pull off).
  • Voo
    Voo almost 8 years
    If your claim is that MSVC has hardcoded information about API calls in it.. extraordinary claims require extraordinary proof. I mean think about the maintenance hassle that would imply and the little advantage it would bring (the kernel and the compiler are written by two completely different teams at MS, they don't even know each other for the most part). Different versions of MSVC have different intrinsics and behavior and the compiler is generally intimately linked to the runtime (you can get around that though) but that's about it.
  • supercat
    supercat almost 8 years
    I think it might be helpful to note that the time required for a block of code to execute is not considered an "observable" behavior which compilers are required to maintain. If the time to execute a block of code were "observable", then no forms of performance optimization would be permissible. While it would be helpful for C and C++ to define a "causality barrier" which would require a compiler to hold off on executing any code after the barrier until all side-effects from before the barrier had been handled by the generated code [code which wants to ensure that data has fully...
  • supercat
    supercat almost 8 years
    ...propagated through hardware caches would need to use hardware-specific means to do that, but a hardware-specific means of waiting until all posted writes were complete would be useless without a barrier directive to ensure that all pending writes tracked by the compiler must get posted to the hardware before the hardware is asked to ensure that all posted writes are complete.] I know of no way of doing that in either language without using a dummy volatile access or call to outside code.
  • BartekChom
    BartekChom almost 8 years
    auto fred(volatile int x) { auto t1 = std::chrono::high_resolution_clock::now(); volatile int y = foo(x); auto t2 = std::chrono::high_resolution_clock::now(); return t2-t1; } seems not to become reordered.
  • BartekChom
    BartekChom almost 8 years
    Sorry, in C++11 we need decltype(std::chrono::high_resolution_clock::now()-std::chro‌​no::high_resolution_‌​clock::now()) fred(volatile int x) instead of auto fred(volatile int x).
  • Steve Jessop
    Steve Jessop almost 8 years
    To take Yakk's point further: it's true that re-ordering the system calls, so that the result of the first is assigned to t2 and the second to t1, would be non-conforming and silly if those values are used, what this answer misses is that a conforming compiler can sometimes re-order other code across a system call. In this case, provided it knows what foo() does (for example because it has inlined it) and hence that (loosely speaking) it's a pure function then it can move it around.
  • Steve Jessop
    Steve Jessop almost 8 years
    .. again loosely speaking, this is because there's no guarantee that the actual implementation (albeit not the abstract machine) won't speculatively calculate y*y before the system call, just for fun. There is also no guarantee that the actual implementation won't use the result of this speculative calculation later at whatever point x is used, therefore doing nothing between the calls to clock(). The same goes for whatever an inlined function foo does, provided it has no side-effects and cannot depend on state that might be altered by clock().
  • Chris Steele
    Chris Steele almost 8 years
    Reordering of assembly often occurs with the higher optimization settings, and is often done by analyzing memory/register access independant of the original source code. As well, some sets of less-optimal instructions can be replaced by a more optimal set of instructions. These are sometimes called 'peep-hole optimizations', because the optimizer is only looking at that small section of instructions in an effort to find patterns that can be optimized.
  • Kieren Johnstone
    Kieren Johnstone almost 8 years
    How does this answer the question?
  • PJTraill
    PJTraill almost 8 years
    +1 for being the only answer so far to refer to the standard (whether or not the interpretation is correct!).
  • PJTraill
    PJTraill almost 8 years
    Can you back up Moving a system call that obtains the current time obviously does not satisfy that condition (i.e. without changing semantics) . A compiler that … does so is non-compliant … . with a quote? It all depends what the standard means by “semantics” (!) (if it uses that term); I agree that time should be considered observable but the problem is, I think, that foo() means “make the results defined according to the standard by the implementation of foo available to ‘subsequent’ code” — and results defined according to the standard tend not to include elapsing of time.
  • dbanet
    dbanet almost 8 years
    @Voo please re-read my comments starting with the first one which precisely addresses your issue.
  • dbanet
    dbanet almost 8 years
    On your second comment: no, that is not my claim, please, again, read carefully. I am not stating anything I have not stated before, but here it still goes: @Cody's claim was X. He tried to show X by showing it follows from Y. But that does not work, because Y is not true. I'm not claiming anything about X, I'm only claiming you cannot show X using Y, because Y itself doesn't hold.
  • Cody Gray
    Cody Gray almost 8 years
    Not sure why I needed to get pinged back into this discussion, but if Y is the initial claim that you linked to, I'm pretty sure I have already conclusively disproven that. None of Microsoft's compilers are tied to a specific OS version. The linker is tied to a specific executable format/specification, which happens to only be supported by (32-bit and later) Windows NT, but that has nothing to do with the compiler and requires absolutely no internal knowledge of the OS's workings, since this is an open standard. And the runtime library makes OS-specific calls, but so too does user code.
  • SomeWittyUsername
    SomeWittyUsername almost 8 years
    Should be noted though, that no matter what the compiler does or does not, there is also a processor reordering which can also have surprising effects. That is less visible with Intel processors than with ARM processors.
  • Chandler Carruth
    Chandler Carruth almost 8 years
    Note that the hack with 'asm' only helps as a barrier for operations which touch memory, and the OP is interested in more than that. See my answer for more details.
  • S2108887
    S2108887 over 7 years
    Thank you for your answer. I have marked it as the new best answer. I could have done this earlier, but I have not read this stackoverflow page for many months. I am very interested in using the Clang compiler to make C++ programs. Among other things, I like that one can use Unicode characters in variable names in Clang. I think I will ask more questions about Clang on Stackoverflow.
  • MikeMB
    MikeMB about 7 years
    While I understand how this prevents foo being optimized away completely, can you elaborate a bit why this prevents the calls to Clock::now() being reordered relative to foo()? Does the optimzer have to assume that DoNotOptimize and Clock::now() have access to and might modify some common global state which in turn would tie them to the in- and output? Or are you relying on some current limitations of the optimizer's implementation?
  • Chandler Carruth
    Chandler Carruth about 7 years
    DoNotOptimize in this example is a synthetically "observable" event. It is as if it notionally printed visible output to some terminal with the input's representation. Since reading the clock is also observable (you're observing time passing) they cannot be re-ordered without changing the observable behavior of the program.
  • Admin
    Admin over 5 years
    I'm still not quite clear with the concept "observable", if the foo function is doing some operations like reading from a socket which may be blocked for a while, does this count a observable operation? And since the read is not a "totally known" operation (right?), will the code keep in order?
  • Dave
    Dave about 5 years
    "The fundamental problem is that the operational semantics of something like an integer addition are totally known to the implementation." But it seems to me that the issue isn't the semantics of integer addition, it's the semantics of calling function foo(). Unless foo() is in the same compilation unit, how does it know that foo() and clock() don't interact?
  • Ciro Santilli OurBigBook.com
    Ciro Santilli OurBigBook.com almost 5 years
    @MikeMB until someone comes with a precise quote from the C++ standard, I would assert that the first reordering cannot happen by passing t1 as a new input parameter to DoNotOptimize. But I'm not sure how to prevent the reordering of the second one since Clock::now() takes no inputs! I think I would move it into a __attribute__ ((noinline)) function that takes an input. Or maybe call ::now from assembly, but that is hard: stackoverflow.com/questions/37502841/…
  • Ciro Santilli OurBigBook.com
    Ciro Santilli OurBigBook.com almost 5 years
    OK, I went ahead and created a paranoid example with full data dependencies which I'm pretty sure cannot be reordered: stackoverflow.com/questions/37786547/…
  • Peter Cordes
    Peter Cordes over 4 years
    You normally don't need to force a spill/reload with "+m", using "+r" is a much more efficient way to make the compiler materialize a value and then assume the variable has changed.
  • cmaster - reinstate monica
    cmaster - reinstate monica over 4 years
    The only problem that's left with DoNotOptimize() is, that the input needs to be reloaded from memory and the output needs to be written to memory within the critical region. For a simple one-cycle-latency instruction like addl, that's significant overhead. Just like the calling of the timer function is significant overhead. I think, the only way to eliminate these overheads from your measurement is to time a loop containing the operation of interest (with the input and output properly DoNotOptimize()ed).
  • ABaumstumpf
    ABaumstumpf about 4 years
    DoNotOptimize works cause volatile access enforces strict sequentiality.
  • supercat
    supercat about 3 years
    @cmaster-reinstatemonica: Even a call via volatile-qualified function pointer need not always enforce ordering. A compiler could eagerly perform some computations and store results in a private place before reading the volatile-qualified pointer, check whether the pointer identifies a particular function the compiler has optimized for this particular call site, and either call the ordinary version of function identified by the pointer or invoke a version which is optimized for the particular call site, and which exploits the eagerly-performed computations.
  • cmaster - reinstate monica
    cmaster - reinstate monica about 3 years
    @supercat The point about DoNotOptimize() is that it declares the memory behind its argument as being changed by the assembler code. As such, the compiler is forced to a) reload the input value from memory after the call, b) to not let any computation within the critical section rely on a stale version of the input, and c) to write the result to memory before the second DoNotOptimize() call. All computations that rely on input and are relevant to output must be sequenced between these two calls. If you fail to include I/O variables in the DoNotOptimize() calls, that's your mistake.
  • supercat
    supercat over 2 years
    @cmaster-reinstatemonica: Would anything forbid a "clever" compiler from transforming int x = operationThatProduces5(); DoNotOptimize(x); doSomething(x); into int x = operationThatProduces5(); DoNotOptimize(x); if (x == 5) doSomething(5); else doSomething(x);?
  • cmaster - reinstate monica
    cmaster - reinstate monica over 2 years
    @supercat As long as the x in if(x == 5) is reloaded from memory after the DoNotOptimize(x) call, the compiler is free to do any kind of insane stuff to perform the doSomething(x) call. So, yes, it would be possible for the CPU to speculatively execute doSomething(5) before x has been reloaded from memory. However, the compiler has no reason to believe that x == 5 would remain true after the DoNotOptimize(x) call, and will not introduce the danger of a pipeline flush due to misprediction of if(x == 5) if it is any good.
  • supercat
    supercat over 2 years
    @cmaster-reinstatemonica: In many use cases, after a sequence like the above, the value of x would be 5 more than 25% of the time, if not more than 50%. Thus, having a compiler evaluate the cost of a hard-coded x==5 variant and perform the comparison/substitution if it would be 80% or more cheaper than the general-case version could be a very sensible optimization if the compiler didn't use the text of the Standard as an process code in a way contrary to the intention of the Standard's authors as documented in the Rationale. Fundamentally, the Standard was never designed to draw...
  • supercat
    supercat over 2 years
    ...a clear line between constructs that all implementations must support and those that will never be used by non-broken programs, because no single partitioning could be suitable for more than a small fraction of the purposes to which C is put. While some compiler maintainers seem to need bright-line rules, such rules could only be written sensibly if the Standard weakened a couple of long-standing principles, most notably: 1. Replace the idea that all meaningfully-conforming programs should be expected to run meaningfully on all implementations with a requirement that...
  • supercat
    supercat over 2 years
    ...conforming implementations must reject any meaningfully-conforming programs that they cannot otherwise process meaningfully, so constructs that are needed for some tasks but not others could be used in meaningfully-conforming programs to accomplish those tasks, but implementations not intended for those tasks need not support them. 2. Use an abstraction model in which certain optimizations may have observable effects; a program whose actions make such effects observable would be a correct program if all allowable combinations of effects would yield behavior meeting requirements.
  • cmaster - reinstate monica
    cmaster - reinstate monica over 2 years
    @supercat What does the standard have to do with that? Afaik, the contents of an asm statement is not standardized in any way. However, the compilers do describe its syntax and semantics, especially of the "+m" directive: It means that the compiler will consider the memory behind the pointer to be both input and output of the (empty) assembler code, and as such, the compiler must assume that the value of x does change. To the compiler, the chance of x remaining the same is only 2^-64, assuming a 64 bit value. It should definitely no try to optimize that.
  • supercat
    supercat over 2 years
    If e.g. code uses a float* to store a value, and then uses a compiler-defined barrier, and then uses a uint32_t* to read the bits of the storage, the Standard would not recognize any effect the barrier might have in preventing the read from invoking Undefined Behavior. If, however, the Standard were to replace N1570 6.5p7 with a rule that instead said that a read of a uint32_t* may generally be regarded as unsequenced relative to a preceding or following write of type float*, then the effects of the barrier would be clear even if the Standard knew nothing about it.
  • supercat
    supercat over 2 years
    @cmaster-reinstatemonica: Further, although a compiler wouldn't be entitled to ignore the possibility of x changing in your barrier scenario, a compiler that processes x = 5; barrier(); doSomething(x); as x = 5; barrier(); if (x==5) doSomething(5); else doSomething(x); would not be making a correctness-threatening assumption that x won't change, but rather a performance-affecting assumption that x won't always change, possibly generating code that will perform worse than straightforward code if x changes more than 75% of the time, but better if it doesn't.
  • cmaster - reinstate monica
    cmaster - reinstate monica over 2 years
    @supercat Ah, now I see why you see the standard relevant in this. As to the "optimization" that the compiler is allowed to do: Yes, the compiler has every right to emit the if(x == 5) code you gave. Nevertheless, it would be a pretty daft thing to do: The whole point point of the asm statement part after the : is to allow the savy programmer to specify what the compiler can assume, and what it cannot assume. This syntax is not used all over the place (= low optimization necessity), and is used by pretty hard-core low-level programmers who can read and use specs (= best treated as gods).
  • supercat
    supercat over 2 years
    @cmaster-reinstatemonica: Many of the "optimizations" that programmers complain about today would have been universally recognized as downright obtuse when the Standard is written, and thus there was no need to have the Standard forbid them. Compiler writers, however, often seem to care much more about optimizations than their users do, and regard the fact that a construct would block an optimization the they want to perform as a being problem with the construct, rather than being the reason it was used in the first place.