Concurrency: Atomic and volatile in C++11 memory model


Solution 1

Firstly, volatile does not imply atomic access. It is designed for things like memory mapped I/O and signal handling. volatile is completely unnecessary when used with std::atomic, and unless your platform documents otherwise, volatile has no bearing on atomic access or memory ordering between threads.

If you have a global variable which is shared between threads, such as:

std::atomic<int> ai;

then the visibility and ordering constraints depend on the memory ordering parameter you use for operations, and the synchronization effects of locks, threads and accesses to other atomic variables.

In the absence of any additional synchronization, if one thread writes a value to ai then there is nothing that guarantees that another thread will see the value in any given time period. The standard specifies that it should be visible "in a reasonable period of time", but any given access may return a stale value.
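To make that concrete, here is a minimal sketch (the thread structure and the value 42 are invented for illustration): the reader is guaranteed never to see a torn value, but it is not guaranteed to see the writer's 42 at any particular moment.

    #include <atomic>
    #include <thread>

    std::atomic<int> ai{0};

    void writer() {
        ai.store(42);            // memory_order_seq_cst by default
    }

    void reader() {
        // Guaranteed to read either 0 or 42, never a torn value; but
        // 42 need only become visible "in a reasonable period of time".
        int seen = ai.load();
        (void)seen;
    }

    int main() {
        std::thread t1(writer), t2(reader);
        t1.join();
        t2.join();
    }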

The default memory ordering of std::memory_order_seq_cst provides a single global total order for all std::memory_order_seq_cst operations across all variables. This doesn't mean that you can't get stale values, but it does mean that the value you do get determines and is determined by where in this total order your operation lies.

If you have 2 shared variables x and y, initially zero, and have one thread write 1 to x and another write 2 to y, then a third thread that reads both may see any of (0,0), (1,0), (0,2) or (1,2), since there is no ordering constraint between the operations, and thus the operations may appear in any order in the global order.

If both writes are from the same thread, which does x=1 before y=2, and the reading thread reads y before x, then (0,2) is no longer a valid option, since the read of y==2 implies that the earlier write to x is visible. The other 3 pairings (0,0), (1,0) and (1,2) are still possible, depending on how the 2 reads interleave with the 2 writes.
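A sketch of that single-writer case (function names invented for the example):

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};

    void writer() {
        x.store(1);              // first
        y.store(2);              // second
    }

    void reader() {
        int ry = y.load();       // read y first,
        int rx = x.load();       // then x
        // Possible results: (0,0), (1,0), (1,2). Never (0,2):
        // seeing y==2 implies the earlier write to x is visible.
        std::printf("x=%d y=%d\n", rx, ry);
    }

    int main() {
        std::thread t1(writer), t2(reader);
        t1.join();
        t2.join();
    }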

If you use other memory orderings such as std::memory_order_relaxed or std::memory_order_acquire then the constraints are relaxed even further, and the single global ordering no longer applies. Threads don't even necessarily have to agree on the ordering of two stores to separate variables if there is no additional synchronization.
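For instance, with relaxed ordering, two readers scanning the same pair of variables in opposite orders may disagree about which store happened first. A sketch of that scenario, reusing x and y from above (thread structure invented for illustration; on strongly-ordered hardware such as x86 you may never observe the disagreement, but the standard permits it):

    #include <atomic>
    #include <functional>
    #include <thread>

    std::atomic<int> x{0}, y{0};

    void writeX() { x.store(1, std::memory_order_relaxed); }
    void writeY() { y.store(1, std::memory_order_relaxed); }

    void readXY(int& first, int& second) {
        first  = x.load(std::memory_order_relaxed);
        second = y.load(std::memory_order_relaxed);
    }
    void readYX(int& first, int& second) {
        first  = y.load(std::memory_order_relaxed);
        second = x.load(std::memory_order_relaxed);
    }

    int main() {
        int a, b, c, d;
        std::thread t1(writeX), t2(writeY);
        std::thread t3(readXY, std::ref(a), std::ref(b));
        std::thread t4(readYX, std::ref(c), std::ref(d));
        t1.join(); t2.join(); t3.join(); t4.join();
        // The standard allows a==1,b==0 together with c==1,d==0:
        // thread 3 concludes "x was written first", thread 4 concludes
        // "y was written first". With seq_cst on all four operations,
        // this disagreement is forbidden.
    }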

The only way to guarantee you have the "latest" value is to use a read-modify-write operation such as exchange(), compare_exchange_strong() or fetch_add(). Read-modify-write operations have an additional constraint that they always operate on the "latest" value, so a sequence of ai.fetch_add(1) operations by a series of threads will return a sequence of values with no duplicates or gaps. In the absence of additional constraints, there's still no guarantee which threads will see which values though.

In particular, it is important to note that the use of an RMW operation does not force changes from other threads to become visible any quicker; it just means that if the changes are not seen by the RMW, then all threads must agree that they are later in the modification order of that atomic variable than the RMW operation.

Stores from different threads can still be delayed by arbitrary amounts of time, depending on when the CPU actually issues the store to memory (rather than just its own store buffer), physically how far apart the CPUs executing the threads are (in the case of a multi-processor system), and the details of the cache coherency protocol.
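A sketch of the fetch_add guarantee described above (the loop bound and thread structure are invented for illustration):

    #include <atomic>
    #include <functional>
    #include <thread>
    #include <vector>

    std::atomic<int> ai{0};

    void worker(std::vector<int>& seen) {
        for (int i = 0; i != 1000; ++i)
            // Each fetch_add acts on the latest value in ai's
            // modification order, so across both threads the values
            // 0..1999 are each returned exactly once.
            seen.push_back(ai.fetch_add(1));
    }

    int main() {
        std::vector<int> a, b;
        std::thread t1(worker, std::ref(a));
        std::thread t2(worker, std::ref(b));
        t1.join();
        t2.join();
        // a and b together hold 0..1999 with no duplicates or gaps,
        // but which thread obtained which values is unspecified.
    }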

Working with atomic operations is a complex topic. I suggest you read a lot of background material, and examine published code before writing production code with atomics. In most cases it is easier to write code that uses locks, and not noticeably less efficient.
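In that spirit, a minimal sketch of the lock-based alternative (the class is invented for illustration, not drawn from the question):

    #include <mutex>

    // A lock-based shared counter: simpler to reason about than
    // hand-rolled atomics, and usually not noticeably slower.
    class Counter {
        mutable std::mutex m;
        int value = 0;
    public:
        void increment() {
            std::lock_guard<std::mutex> lock(m);
            ++value;
        }
        int get() const {
            std::lock_guard<std::mutex> lock(m);
            return value;
        }
    };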

Solution 2

volatile and the atomic operations have a different background, and were introduced with a different intent.

volatile dates from way back, and is principally designed to prevent compiler optimizations when accessing memory mapped IO. Modern compilers tend to do no more than suppress optimizations for volatile, although on some machines, this isn't sufficient even for memory mapped IO. Except for the special case of signal handlers, and setjmp/longjmp sequences (where the C standard, and in the case of signals, the Posix standard, gives additional guarantees), it must be considered useless on a modern machine, where without special additional instructions (fences or memory barriers), the hardware may reorder or even suppress certain accesses. Since you shouldn't be using setjmp et al. in C++, this more or less leaves signal handlers, and in a multithreaded environment, at least under Unix, there are better solutions for those as well. And possibly memory mapped IO, if you're working on kernel code and can ensure that the compiler generates whatever is needed for the platform in question.

(According to the standard, volatile access is observable behavior, which the compiler must respect. But the compiler gets to define what is meant by “access”, and most seem to define it as “a load or store machine instruction was executed”. Which, on a modern processor, doesn't even mean that there is necessarily a read or write cycle on the bus, much less that it's in the order you expect.)
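As a sketch of the signal-handler case, the one portable use the paragraph above leaves standing (the handler and flag names are invented for the example):

    #include <csignal>

    volatile std::sig_atomic_t got_signal = 0;

    extern "C" void handler(int) {
        got_signal = 1;   // writing a volatile sig_atomic_t is one of
                          // the few things a portable handler may do
    }

    int main() {
        std::signal(SIGINT, handler);
        while (!got_signal) {
            // volatile forces the compiler to re-read got_signal on
            // each iteration instead of hoisting the load out of
            // the loop
        }
    }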

Given this situation, the C++ standard added atomic access, which does provide a certain number of guarantees across threads; in particular, the code generated around an atomic access will contain the necessary additional instructions to prevent the hardware from reordering the accesses, and to ensure that the accesses propagate down to the global memory shared between cores on a multicore machine. (At one point in the standardization effort, Microsoft proposed adding these semantics to volatile, and I think some of their C++ compilers do. After discussion of the issues in the committee, however, the general consensus—including the Microsoft representative—was that it was better to leave volatile with its original meaning, and to define the atomic types.) Or just use the system level primitives, like mutexes, which execute whatever instructions are needed in their code. (They have to. You can't implement a mutex without some guarantees concerning the order of memory accesses.)

Solution 3

Here's a basic synopsis of what the 2 things are:

1) Volatile keyword:
Tells the compiler that this value could change at any moment, and that it therefore must never cache it in a register. Look up the old "register" keyword in C: "volatile" is basically the "-" operator to "register"'s "+". Modern compilers now do by default the optimization that "register" used to explicitly request, so nowadays you only ever see "volatile". Using the volatile qualifier guarantees that every access re-reads the variable rather than reusing a cached register copy, but nothing more (see the sketch after this list).

2) Atomic:
Atomic operations modify data in a single clock tick, so that it is impossible for ANY other thread to access the data in the middle of such an update. They're usually limited to whatever single-clock assembly instructions the hardware supports: things like ++, --, and swapping 2 pointers. Note that this says nothing about the ORDER in which the different threads will RUN the atomic instructions, only that no two of them can run in parallel. That's why you have all those additional options for forcing an ordering.
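A minimal sketch contrasting the two points above (counter names and loop bounds are invented for illustration):

    #include <atomic>
    #include <thread>

    std::atomic<int> atomic_count{0};
    int plain_count = 0;      // shared without atomics: a data race

    void work() {
        for (int i = 0; i != 100000; ++i) {
            ++atomic_count;   // indivisible read-modify-write
            ++plain_count;    // separate load/add/store: two threads
                              // can interleave mid-update and lose
                              // increments
        }
    }

    int main() {
        std::thread t1(work), t2(work);
        t1.join();
        t2.join();
        // atomic_count is always 200000; plain_count may come out
        // lower (and the race on it is undefined behaviour anyway).
    }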

Solution 4

Volatile and Atomic serve different purposes.

Volatile : Tells the compiler not to optimize away accesses to the variable, because its value may change unexpectedly. It can therefore be used to represent hardware status registers, variables modified inside an ISR, and variables shared in a multi-threaded application.

Atomic : It is also used in multi-threaded applications; however, it ensures there is no lock/stall while being used: atomic operations are free of races and indivisible. A few of the key usage scenarios in a multi-threaded application are checking whether a lock is free or taken, and atomically adding to a value and returning the result (a spinlock along those lines is sketched below).
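As a sketch of the "check whether a lock is free or taken" scenario, here is a minimal spinlock built on std::atomic_flag (the class name is invented for the example):

    #include <atomic>

    // test_and_set is an atomic read-modify-write, so exactly one
    // thread can move the flag from clear to set and thus "own" it.
    class Spinlock {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;
    public:
        void lock() {
            while (flag.test_and_set(std::memory_order_acquire)) {
                // spin until the current holder calls unlock()
            }
        }
        void unlock() {
            flag.clear(std::memory_order_release);
        }
    };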


Comments

  • Abhijit-K
    Abhijit-K almost 3 years

A global variable is shared across 2 concurrently running threads on 2 different cores. The threads write to and read from the variable. For the atomic variable, can one thread read a stale value? Each core might have a value of the shared variable in its cache, and when one thread writes to its copy in a cache, the other thread on a different core might read a stale value from its own cache. Or does the compiler enforce strong memory ordering to read the latest value from the other cache? The C++11 standard library has std::atomic support. How is this different from the volatile keyword? How will volatile and atomic types behave differently in the above scenario?

  • Kerrek SB
    Kerrek SB over 12 years
    I think it's worth emphasising that atomics (as well as mutex primitives) cannot possibly be implemented with some sort of hardware support. (On the other hand, volatile is a mere compiler hint.)
  • Björn Pollex
    Björn Pollex over 12 years
    You meant without, didn't you?
  • James Kanze
    James Kanze over 12 years
    @KerrekSB Nothing can be implemented without some sort of hardware support:-). The language defines the semantics (more or less, in the case of volatile); it's up to the compiler to generate whatever is necessary. (Arguably, the intent of volatile would require some extra machine instructions on many machines, since memory access reordering in the CPU will also affect memory mapped IO.)
  • Kerrek SB
    Kerrek SB over 12 years
    @JamesKanze: Hm, yes, that's true of course. I suppose what I should have said is that atomics cannot be implemented as pure library code that only uses the rest of the standard library, if that makes sense. That is, the addition of atomics (whether in C++ or delegated through the C library) requires platform support and isn't just added library code. (You'll need some variant of atomic compare-and-swap at the hardware level.)
  • James Kanze
    James Kanze over 12 years
@KerrekSB Atomics require either special compiler support or inline assembler, yes. volatile also only works because of special compiler support. The difference is that atomics are specified with an interface corresponding to a library function; volatile isn't. (And what about thread_local, which often requires significant OS support, compared to the other storage classes?)
  • ildjarn
    ildjarn over 12 years
    "At one point in the standardization effort, Microsoft proposed adding these semantics to volatile, and I think some of their C++ compilers do." This is correct -- with VC++ 2003 and onward, volatile implies a full memory barrier.
  • AProgrammer
    AProgrammer over 12 years
@JamesKanze, I think the intent is that the cache for the pages is set to a suitable mode. I know that for SPARC the spec mandates some additional instructions that compilers don't use (I've always wondered if there were actual implementations of SPARC where they were needed, if someone knows...). Do you know other processors where setting up the MMU isn't enough?
  • James Kanze
    James Kanze over 12 years
@AProgrammer It's not just an issue with the cache. The CPU itself has a write and a read pipeline, and may reorder things within it. The one case where I know that this caused an actual problem was on an Alpha; I don't know if there are actual Sparcs which require it (the Sparc architecture specifications allow it), and people have just been lucky, or not. (Threading issues are often like that. There's a definite bug in g++'s implementation of std::string, but I've never heard of it causing an actual problem. Yet.)
  • AProgrammer
    AProgrammer over 12 years
@JamesKanze, I know about write and read buffering -- I'd tend to bypass them for nocache pages in the hardware itself, but I'm too much of a software guy. I'm not surprised that Alpha was an architecture where software had to pay attention to it. It generally took a very aggressive position on such kinds of issues, and computer architects have generally taken a step or two back.
  • Bartosz Milewski
    Bartosz Milewski over 12 years
I would add that the question is not well defined. Whether a value is stale or not depends on additional synchronization. If you have one variable and one thread keeps writing a sequence of values to it (e.g., 1, 2...), and another thread reads, say, 3, is the 3 a stale value or not? You need some other confirmation that 4 has indeed already been written when you read 3. But that requires some other observation that is in a happens-before relationship with your read of 3. Somebody must have communicated to you the read of 4 before your read of 3. This won't happen with SC accesses.
  • Bartosz Milewski
    Bartosz Milewski over 12 years
I'm still struggling with the use of volatile with atomics: "volatile is completely unnecessary when used with std::atomic". What about the loop optimization of while(x.load(memory_order_relaxed)); => bool tmp = x.load(memory_order_relaxed); while(tmp); The standard is wishy-washy about this and Hans turns into a diplomat when asked this question directly ;-)
  • Abhijit-K
    Abhijit-K over 12 years
VS2005 will put a hardware-level memory barrier for a volatile variable, with no re-ordering by the processor; prior to VS2005, for a multi-core environment, we needed to use the Interlocked* APIs. Is this in accordance with the ISO C++ memory model? Atomic is to make sure we always get some valid value; it doesn't matter whether it is the latest, but it is never corrupt. We also have some operations with atomic access that can provide a memory barrier to prevent reordering of instructions. What is the difference between the memory barrier put in by atomic vs volatile? Is it that the volatile read is always an outbound call to the main memory (e.g. DRAM)?
  • Abhijit-K
    Abhijit-K over 12 years
I don't mean together; I mean volatile vs. atomic. VS2005 puts a HW memory barrier for a volatile var, so there is no need for the Interlocked* APIs. Atomic with memory_order_acquire would do the same thing. In the case of volatile, are the read/write ops outbound calls into the main memory (DRAM)? About additional synchronization to read values from a shared variable: suppose the write thread does write and notify operations, and the read thread WAITs and on notify resumes. When the read thread resumes, is it still possible that the latest value is still in the cache of the core running the write thread? Continued below..
  • Abhijit-K
    Abhijit-K over 12 years
Continued from above... If volatile calls always go to main memory, then does that at least confirm that this sync may work? Or in the case of notify, is the cache always flushed to main memory, so that when the other core loads the post-notify instructions it has the latest values? Or is a critical section the only way to do this sync? @Bartosz, about Hans: I think he will be speaking about threads and shared variables in C++11 at Microsoft's GoingNative 2012, which might be helpful for me to see on Channel 9.
  • Abhijit-K
    Abhijit-K over 12 years
@Anthony, thanks for the explanation and advice. For production I am not thinking about transactions with atomics. When we do a compare-and-swap-like operation with atomics, what do we compare with? Which location? The main memory (DRAM)? Or is the value force-flushed from the other thread's core's cache and then compared? In the case of modern procs like the Intel i7 & AMD Phenom with an L3 cache, is the value compared from the L3 cache, which is common to all cores, so that things are faster?
  • Anthony Williams
    Anthony Williams over 12 years
    Unless you set specific flags on a region of memory, all operations done by the CPU are against the value in its cache. The CPU will read and write that cache line to main memory as it sees fit in order to ensure that the synchronization constraints are met. If you use a wait and a notify then there should be enough synchronization in the facility you use to ensure that after returning from the wait, the waiting thread will see any value written by the notifying thread prior to the notify.
  • Anthony Williams
    Anthony Williams over 12 years
    VS does NOT issue a memory barrier instruction for volatile accesses. What they do is forward the processor's guarantee that loads are always load-acquire and stores are always store-release up to the C/C++ code level, and inhibit certain optimizations so that this is the case. This is not the same as a std::atomic with std::memory_order_seq_cst, which does require an MFENCE or LOCKed instruction along with stores and/or loads.
  • Anthony Williams
    Anthony Williams over 12 years
    @BartoszMilewski: I don't believe that the loop optimization you ask about is valid. The compiler may change it to a read every N iterations for some large N, but may not remove the read from the loop entirely, as it would violate the "should be visible within a reasonable time period" clause. It's only a "should", not a "must", but I cannot see any reasonable implementation violating it.
  • Anthony Williams
    Anthony Williams over 12 years
@BartoszMilewski: My interpretation of "stale" is that a later value than the one read has been written by some thread. Unless there is some synchronization constraint that makes the later write observable, this isn't a problem.
  • Anthony Williams
    Anthony Williams over 12 years
    On MSVC, volatile is atomic for aligned integer reads and writes, since the CPU guarantees that. But yes: don't use volatile for concurrency. Use std::atomic<>.
  • Abhijit-K
    Abhijit-K over 12 years
Oops, deleted a comment accidentally, hence putting it back (this was before Anthony's comment above): @Anthony, thanks. So this means volatile accesses can go out of order, and volatile doesn't guarantee atomic access either; it only does a load-acquire from memory to read the current value, and similarly for writes. Thus it's not so useful for concurrency. It can be used as a condition variable for simple stuff.
  • southerton
    southerton almost 9 years
    "The only way to guarantee you have the "latest" value is to use a read-modify-write operation such as exchange(), compare_exchange_strong() or fetch_add()." - this is only true when std::memory_order_relaxed or std::memory_order_acquire used, right? Because the default value for atomic::store() is memory_order_seq_cst, and that flag synchronizes all visible side effects. So I think the latest value would be immediately visible to other threads if memory_order_seq_cst is used. There shouldn't be a staled value.
  • underscore_d
    underscore_d almost 9 years
Atomic operations very often abstract to a single hardware clock cycle, but afaik this is neither 100% guaranteed/necessary, nor the only requirement (e.g. many architectures require certain alignments too). One should always declare intent, not depend on 'well, it works the right way on X architecture'.
  • LWimsey
    LWimsey over 7 years
    @southerton The only guarantee seq_cst gives is that a total order exists in which all cores observe modifications in the same order. A single seq_cst load() can still return a stale value though.
  • Alexander Torstling
    Alexander Torstling about 6 years
Except that volatile might force the compiler not to put a variable in a register, so that it doesn't read the variable just once when the coder instructed two reads.
  • Antoine Morrier
    Antoine Morrier almost 5 years
    "The only way to guarantee you have the "latest" value is to use a read-modify-write operation such as exchange(), compare_exchange_strong() or fetch_add()". If the variable is flagged with volatile: volatile atomic, we should not read a stale value, right? Or do we need a read modify write to have the guarantee to read the latest value ?
  • curiousguy
    curiousguy almost 5 years
    @AntoineMorrier "stale value" is an ill defined concept
  • Anthony Williams
    Anthony Williams almost 5 years
    @AntoineMorrier volatile atomic gains you very little. It guarantees that the compiler issues the read, but does not affect ordering guarantees. You need read-modify-write ops to guarantee the "latest" value.
  • Peter Cordes
    Peter Cordes almost 3 years
    @AnthonyWilliams: Can you please rephrase or clarify what exactly using an RMW gives you? This answer's suggestion of needing RMW for "latest" has led to at least one Q&A like relaxed ordering and inter thread visibility where someone thought that using a dummy CAS instead of x.store or x.load would speed up visibility for a single flag with a single writer. (And I seem to recall another case of someone citing this answer for a similar wrong claim).
  • Peter Cordes
    Peter Cordes almost 3 years
    If the RMW runs before the store commits to cache, it doesn't see the value, just like if it was a pure load. As you say, does that mean "stale"? No, it just means that the store hasn't happened yet. (With seq_cst to make other operations in that thread wait for global visibility, "happens" is somewhat definable, moreso than with release or weaker where the store buffer comes into play, letting later loads (and maybe stores) in the writer happen before it becomes globally visible.)
  • Anthony Williams
    Anthony Williams almost 3 years
    I've added some clarification.