Is volatile expensive?

java memory-management concurrency volatile

23,975

Solution 1

On Intel an un-contended volatile read is quite cheap. If we consider the following simple case:

public static long l;

public static void run() {        
    if (l == -1)
        System.exit(-1);

    if (l == -2)
        System.exit(-1);
}

Using Java 7's ability to print assembly code, the run method looks something like:

# {method} 'run2' '()V' in 'Test2'
#           [sp+0x10]  (sp of caller)
0xb396ce80: mov    %eax,-0x3000(%esp)
0xb396ce87: push   %ebp
0xb396ce88: sub    $0x8,%esp          ;*synchronization entry
                                    ; - Test2::run2@-1 (line 33)
0xb396ce8e: mov    $0xffffffff,%ecx
0xb396ce93: mov    $0xffffffff,%ebx
0xb396ce98: mov    $0x6fa2b2f0,%esi   ;   {oop('Test2')}
0xb396ce9d: mov    0x150(%esi),%ebp
0xb396cea3: mov    0x154(%esi),%edi   ;*getstatic l
                                    ; - Test2::run@0 (line 33)
0xb396cea9: cmp    %ecx,%ebp
0xb396ceab: jne    0xb396ceaf
0xb396cead: cmp    %ebx,%edi
0xb396ceaf: je     0xb396cece         ;*getstatic l
                                    ; - Test2::run@14 (line 37)
0xb396ceb1: mov    $0xfffffffe,%ecx
0xb396ceb6: mov    $0xffffffff,%ebx
0xb396cebb: cmp    %ecx,%ebp
0xb396cebd: jne    0xb396cec1
0xb396cebf: cmp    %ebx,%edi
0xb396cec1: je     0xb396ceeb         ;*return
                                    ; - Test2::run@28 (line 40)
0xb396cec3: add    $0x8,%esp
0xb396cec6: pop    %ebp
0xb396cec7: test   %eax,0xb7732000    ;   {poll_return}
;... lines removed

If you look at the 2 references to getstatic, the first involves a load from memory, the second skips the load as the value is reused from the register(s) it is already loaded into (long is 64 bit and on my 32 bit laptop it uses 2 registers).

If we make the l variable volatile the resulting assembly is different.

# {method} 'run2' '()V' in 'Test2'
#           [sp+0x10]  (sp of caller)
0xb3ab9340: mov    %eax,-0x3000(%esp)
0xb3ab9347: push   %ebp
0xb3ab9348: sub    $0x8,%esp          ;*synchronization entry
                                    ; - Test2::run2@-1 (line 32)
0xb3ab934e: mov    $0xffffffff,%ecx
0xb3ab9353: mov    $0xffffffff,%ebx
0xb3ab9358: mov    $0x150,%ebp
0xb3ab935d: movsd  0x6fb7b2f0(%ebp),%xmm0  ;   {oop('Test2')}
0xb3ab9365: movd   %xmm0,%eax
0xb3ab9369: psrlq  $0x20,%xmm0
0xb3ab936e: movd   %xmm0,%edx         ;*getstatic l
                                    ; - Test2::run@0 (line 32)
0xb3ab9372: cmp    %ecx,%eax
0xb3ab9374: jne    0xb3ab9378
0xb3ab9376: cmp    %ebx,%edx
0xb3ab9378: je     0xb3ab93ac
0xb3ab937a: mov    $0xfffffffe,%ecx
0xb3ab937f: mov    $0xffffffff,%ebx
0xb3ab9384: movsd  0x6fb7b2f0(%ebp),%xmm0  ;   {oop('Test2')}
0xb3ab938c: movd   %xmm0,%ebp
0xb3ab9390: psrlq  $0x20,%xmm0
0xb3ab9395: movd   %xmm0,%edi         ;*getstatic l
                                    ; - Test2::run@14 (line 36)
0xb3ab9399: cmp    %ecx,%ebp
0xb3ab939b: jne    0xb3ab939f
0xb3ab939d: cmp    %ebx,%edi
0xb3ab939f: je     0xb3ab93ba         ;*return
;... lines removed

In this case both of the getstatic references to the variable l involve a load from memory, i.e. the value cannot be kept in a register across multiple volatile reads. To ensure that the read is atomic, the value is loaded from memory into an SSE register (movsd 0x6fb7b2f0(%ebp),%xmm0), making the read operation a single instruction (from the previous example we saw that a 64-bit value would normally require two 32-bit reads on a 32-bit system).
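As a rough source-level illustration of the point above, the sketch below mirrors the shape of the run method with the field declared volatile, and adds a variant that copies the field into a local variable first. The class name VolatileReadSketch and both method names are hypothetical, and the methods return a code instead of calling System.exit so they can be exercised directly. Assuming the JIT behaves as in the listing above, the first method performs two loads and the second only one.

```java
public class VolatileReadSketch {
    // Hypothetical counterpart of the field `l` above, declared volatile.
    static volatile long l;

    // Same shape as run(): each comparison is a separate volatile read,
    // so the value may not be reused from a register between them.
    static int classify() {
        if (l == -1) return 1;   // first volatile read
        if (l == -2) return 2;   // second volatile read
        return 0;
    }

    // Reads the volatile field once into a local; both comparisons
    // then work on the same snapshot of the value.
    static int classifySnapshot() {
        long snapshot = l;       // single volatile read
        if (snapshot == -1) return 1;
        if (snapshot == -2) return 2;
        return 0;
    }

    public static void main(String[] args) {
        l = -2;
        System.out.println(classify());          // prints 2
        System.out.println(classifySnapshot());  // prints 2
    }
}
```

Copying to a local is also the usual way to make sure several checks see the same value when another thread may write the field in between.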

So the overall cost of a volatile read is roughly equivalent to a memory load and can be as cheap as an L1 cache access. However, if another core is writing to the volatile variable, the cache line will be invalidated, requiring a main memory or perhaps an L3 cache access. The actual cost will depend heavily on the CPU architecture. Even between Intel and AMD the cache coherency protocols are different.

Solution 2

Generally speaking, on most modern processors a volatile load is comparable to a normal load. A volatile store takes about 1/3 the time of a monitor-enter/monitor-exit. This is seen on systems that are cache coherent.

To answer the OP's question, volatile writes are expensive while the reads usually are not.

Does this mean that volatile read operations can be done without an explicit cache invalidation on x86, and is as fast as a normal variable read (disregarding the reordering constraints of volatile)?

Yes, sometimes when validating a field the CPU may not even hit main memory, instead spy on other thread caches and get the value from there (very general explanation).

However, I second Neil's suggestion that if you have a field accessed by multiple threads you should wrap it in an AtomicReference. An AtomicReference has roughly the same throughput for reads and writes, but it also makes it more obvious that the field will be accessed and modified by multiple threads.
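To make the comparison concrete, here is a minimal sketch that keeps the same value once in a volatile field and once in an AtomicReference. The class SharedConfigSketch and its method names are made up for illustration; only AtomicReference.get, set and compareAndSet are real library calls.

```java
import java.util.concurrent.atomic.AtomicReference;

public class SharedConfigSketch {
    // Variant 1: a plain volatile field.
    private volatile String plain = "v1";

    // Variant 2: the same value wrapped in an AtomicReference.
    private final AtomicReference<String> wrapped = new AtomicReference<>("v1");

    String readPlain()          { return plain; }          // volatile load
    void writePlain(String v)   { plain = v; }             // volatile store

    String readWrapped()        { return wrapped.get(); }  // same visibility as a volatile load
    void writeWrapped(String v) { wrapped.set(v); }        // same visibility as a volatile store

    // Only the wrapper offers an atomic conditional update.
    // Note that compareAndSet compares references (==), not equals().
    boolean swapWrapped(String expected, String next) {
        return wrapped.compareAndSet(expected, next);
    }

    public static void main(String[] args) {
        SharedConfigSketch s = new SharedConfigSketch();
        String current = s.readWrapped();
        System.out.println(s.swapWrapped(current, "v2")); // prints true
        System.out.println(s.readWrapped());              // prints v2
    }
}
```

If the conditional update is never needed, the two variants differ mainly in how loudly they signal concurrent use, which is the documentation argument made above.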

Edit to answer OP's edit:

Cache coherence is a bit of a complicated protocol, but in short: CPUs will share a common cache line that is attached to main memory. If a CPU loads memory and no other CPU had it, that CPU will assume it is the most up-to-date value. If another CPU tries to load the same memory location, the CPU that already loaded it will be aware of this and actually share the cached reference with the requesting CPU. Now the requesting CPU has a copy of that memory in its own cache. (It never had to look in main memory for the reference.)

There is quite a bit more to the protocol, but this gives an idea of what is going on. Also, to answer your other question: in the absence of multiple processors, volatile reads/writes can in fact be faster than with multiple processors. There are some applications that would in fact run faster concurrently with a single CPU than with multiple.

Solution 3

In the words of the Java Memory Model (as defined for Java 5+ in JSR 133), any operation -- read or write -- on a volatile variable creates a happens-before relationship with respect to any other operation on the same variable. This means that the compiler and JIT are forced to avoid certain optimisations such as reordering instructions within the thread or performing operations only within the local cache.

Since some optimisations are not available, the resulting code is necessarily slower than it would have been, though probably not by very much.
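The happens-before guarantee described above can be sketched with the common "publish through a volatile flag" idiom. The class PublishSketch and its fields are hypothetical names; the example relies only on the rule that a write to a volatile variable happens-before a subsequent read of that same variable.

```java
public class PublishSketch {
    static int payload;              // ordinary (non-volatile) field
    static volatile boolean ready;   // volatile flag used to publish payload

    static int readWhenReady() {
        Thread writer = new Thread(() -> {
            payload = 42;   // ordinary write, ordered before the volatile write below
            ready = true;   // volatile write
        });
        writer.start();

        // Volatile read, performed again on every iteration; the JIT may not
        // hoist it out of the loop as it could for a plain field.
        while (!ready) {
            Thread.yield();
        }
        // Having observed ready == true, this thread is guaranteed to see 42.
        return payload;
    }

    public static void main(String[] args) {
        System.out.println(readWhenReady()); // prints 42
    }
}
```

If ready were a plain field, the loop would be allowed to spin forever and the read of payload would carry no visibility guarantee.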

Nevertheless you shouldn't make a variable volatile unless you know that it will be accessed from multiple threads outside of synchronized blocks. Even then you should consider whether volatile is the best choice versus synchronized, AtomicReference and its friends, the explicit Lock classes, etc.
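One case where the alternatives matter: volatile gives visibility but does not make a compound operation such as count++ atomic. The sketch below, with the hypothetical class CounterSketch and helper runThreads, shows two of the alternatives mentioned above, an AtomicLong and a synchronized method, both of which yield an exact count under contention.

```java
import java.util.concurrent.atomic.AtomicLong;

public class CounterSketch {
    private final AtomicLong atomicCount = new AtomicLong();
    private long lockedCount; // guarded by the monitor of "this"

    void incrementAtomic()          { atomicCount.incrementAndGet(); } // atomic read-modify-write
    synchronized void incrementLocked() { lockedCount++; }             // atomic because of the lock

    long atomicValue()              { return atomicCount.get(); }
    synchronized long lockedValue() { return lockedCount; }

    // Runs `threads` workers that each increment both counters `perThread` times.
    static CounterSketch runThreads(int threads, int perThread) {
        CounterSketch c = new CounterSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    c.incrementAtomic();
                    c.incrementLocked();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return c;
    }

    public static void main(String[] args) {
        CounterSketch c = runThreads(4, 10_000);
        System.out.println(c.atomicValue() + " " + c.lockedValue()); // prints 40000 40000
    }
}
```

A plain volatile long incremented with ++ from several threads could lose updates, which is why it is left out of the sketch rather than shown with an expected result.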

Solution 4

Accessing a volatile variable is in many ways similar to wrapping access to an ordinary variable in a synchronized block. For instance, access to a volatile variable prevents the CPU from re-ordering the instructions before and after the access, and this generally slows down execution (though I can't say by how much).

More generally, on a multi-processor system I don't see how access to a volatile variable can be done without penalty -- there must be some way to ensure a write on processor A will be synchronized to a read on processor B.


Author by

Daniel

Developer at IKOffice GmbH, Oldenburg, Germany.

Updated on May 27, 2020

Comments

Daniel almost 4 years
After reading The JSR-133 Cookbook for Compiler Writers about the implementation of volatile, especially the section "Interactions with Atomic Instructions", I assume that reading a volatile variable without updating it needs a LoadLoad or a LoadStore barrier. Further down the page I see that LoadLoad and LoadStore are effectively no-ops on x86 CPUs. Does this mean that volatile read operations can be done without an explicit cache invalidation on x86, and is as fast as a normal variable read (disregarding the reordering constraints of volatile)?

I believe I don't understand this correctly. Could someone care to enlighten me?

EDIT: I wonder if there are differences in multi-processor environments. On single-CPU systems the CPU might look at its own thread caches, as John V. states, but on multi-CPU systems there must be some config option to the CPUs that this is not enough and main memory has to be hit, making volatile slower on multi-CPU systems, right?

PS: On my way to learn more about this I stumbled upon the following great articles, and since this question may be interesting to others, I'll share my links here:
- Java theory and practice: Fixing the Java Memory Model, Part 1 and
- Java theory and practice: Fixing the Java Memory Model, Part 2
Daniel over 13 years

Reading volatile variables has the same penalty as doing a monitor-enter, regarding the reordering possibilities of instructions, while writing a volatile variable equals a monitor-exit. A difference might be which variables (e.g. processor caches) get flushed or invalidated. While synchronized flushes or invalidates everything, access to the volatile variable should always be cache-ignoring.
Daniel over 13 years

An AtomicReference is just a wrapper to a volatile field with added native functions providing additional functionality like getAndSet, compareAndSet etc., so from a performance point of view using it is just useful if you need the added functionality. But I wonder why you refer to the OS here? The functionality is implemented in CPU opcodes directly. And does this imply that on multi processor systems, where one CPU has no knowledge about the cache contents of other CPUs that volatiles are slower because the CPUs always have to hit main memory?
jezg1993 over 13 years

You're right, I misspoke about the OS; I should've written CPU. Fixing that now. And yes, I do know AtomicReference is simply a wrapper for volatile fields, but it also serves as a sort of documentation that the field itself will be accessed by multiple threads.
Michael Barker over 13 years

-1, Accessing a volatile variable is quite a bit different than using a synchronized block. Entering a synchronized block requires an atomic compareAndSet-based write to take out the lock and a volatile write to release it. If the lock is contended then control has to pass from user space to kernel space to arbitrate the lock (this is the expensive bit). Accessing a volatile will always stay in user space.
Daniel over 13 years

@MichaelBarker: Are you sure that all monitors have to be guarded by the kernel and not the app?
Michael Barker over 13 years

@Daniel: If you represent a monitor using a synchronized block or a Lock then yes, but only if the monitor is contended. The only way to do this without kernel arbitration is to use the same logic, but busy spin instead of parking the thread.
Daniel over 13 years

@MichaelBarker: Okay, for contended locks I understand this.
bestsss about 13 years

@John, why would you add another indirection via an AtomicReference? If you need CAS, ok, but AtomicUpdater could be a better option. As far as I recall there are no intrinsics for AtomicReference.
jezg1993 about 13 years

@bestsss For all general purposes, you're right, there is no difference between AtomicReference.set/get and volatile loads and stores. That being said, I had the same feeling (and do to some degree) about when to use which. This response can detail it a bit: stackoverflow.com/questions/3964317/…. Using either is more of a preference; my only argument for using AtomicReference over a simple volatile is for clear documentation. That itself doesn't make the greatest argument either, I understand.
jezg1993 about 13 years

On a side note some argue using a volatile field/AtomicReference (without the need for a CAS) leads to buggy code old.nabble.com/…
bestsss about 13 years

@John, if I declare anything via AtomicReference, I am absolutely sure there will be some CAS involved. I rarely declare anything volatile w/o need to CAS, in most cases the variables are being updated by a single thread but retaining the ability to monitor 'em. The other option is stop boolean that's to be changed once only.
jezg1993 about 13 years

@bestsss You can make plenty of arguments on why you would use a volatile boolean over an AtomicBoolean and so forth. I can imagine someone has come across a need to share an AtomicXXX within objects (as well as multiple threads) in which it would need its mutability.
bestsss over 12 years

side note, java 6 has the same ability to show assembly (it's the hotspot that does it)
ewernli over 12 years

+1 In JDK5 volatile can not be reordered with respect to any read/write (which fixes the double-check locking, for instance). Does that imply that it will also affect how non-volatile fields are manipulated? It would be interesting to mix access to volatile and non-volatile fields.
Michael Barker about 12 years

@ewernli, you need to be careful. I made this statement myself once, but it was found to be incorrect. There is an edge case. The Java Memory Model allows roach motel semantics, where stores can be re-ordered ahead of volatile stores. If you picked this up from the Brian Goetz article on the IBM site, then it's worth mentioning that this article oversimplifies the JMM specification.
curiousguy over 8 years

@Daniel What do you mean with "cache-ignoring"?
curiousguy over 4 years

"This is seen on systems that are cache coherent." Which systems are not?