Performance of synchronize section in Java


Solution 1

There are three types of locking in HotSpot:

  1. Fat: the JVM relies on OS mutexes to acquire the lock.
  2. Thin: the JVM uses a CAS (compare-and-swap) algorithm.
  3. Biased: CAS is a rather expensive operation on some architectures. Biased locking is a special type of locking optimized for the scenario in which only one thread ever works on the object.

By default the JVM uses thin locking. Later, if the JVM determines that there is no contention, thin locking is converted to biased locking. The operation that changes the type of the lock is rather expensive, hence the JVM does not apply this optimization immediately. There is a special JVM option, -XX:BiasedLockingStartupDelay=delay, which tells the JVM when this kind of optimization should be applied.

Once an object is biased toward a thread, that thread can subsequently lock and unlock the object without resorting to expensive atomic instructions.

Answer to the question: it depends. But if the lock is biased, single-threaded code with locking has, on average, the same performance as the same code without locking.
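
As a rough illustration (not part of the original answer; the class and method names are invented for the sketch), here is a minimal single-threaded loop over an uncontended synchronized method. On JVMs that support biased locking, running with -XX:BiasedLockingStartupDelay=0 should make the per-call cost close to that of an unsynchronized method:

```java
// Sketch: uncontended synchronized calls in a single thread.
// With biased locking, the same thread re-acquires the lock cheaply.
public class BiasedLockDemo {
    private int counter = 0;

    // Only ever called from one thread, so the lock is never contended.
    synchronized void inc() {
        counter++;
    }

    public static void main(String[] args) {
        BiasedLockDemo demo = new BiasedLockDemo();
        final int runs = 10_000_000;
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            demo.inc();
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("counter=" + demo.counter);
        System.out.printf("avg %.2f ns per locked increment%n", (double) elapsed / runs);
    }
}
```

The absolute numbers depend heavily on the JVM version and hardware, so treat the timing line as indicative only.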

Solution 2

Single-threaded code will still run slower when using synchronized blocks. Obviously you will not have threads stalled waiting for other threads to finish, but you will have to deal with the other effects of synchronization, namely cache coherency.

Synchronized blocks are not only used for concurrency, but also for visibility. Every synchronized block is a memory barrier: the JVM is free to work on variables in registers, instead of main memory, on the assumption that multiple threads will not access that variable. Without synchronized blocks, this data could be stored in a CPU's cache, and different threads on different CPUs would not see the same data. By using a synchronized block, you force the JVM to write this data to main memory for visibility to other threads.

So even though you're free from lock contention, the JVM will still have to do housekeeping in flushing data to main memory.
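
To make the visibility point concrete, here is a small sketch (not part of the original answer; the names are invented) of the classic pattern in which synchronized blocks on the same lock establish a happens-before edge between a writer thread and a later reader:

```java
// Sketch: synchronized blocks used purely for visibility, not mutual exclusion.
public class VisibilityDemo {
    private final Object lock = new Object();
    private int data = 0;
    private boolean ready = false;

    void writer() {
        synchronized (lock) {
            // Writes inside the block become visible to any thread that
            // subsequently synchronizes on the same lock.
            data = 42;
            ready = true;
        }
    }

    int reader() {
        synchronized (lock) {
            return ready ? data : -1;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        VisibilityDemo demo = new VisibilityDemo();
        Thread writer = new Thread(demo::writer);
        writer.start();
        writer.join();
        System.out.println(demo.reader()); // prints 42
    }
}
```

Without the synchronized blocks (or a volatile flag), the reader thread would have no guarantee of ever seeing the writer's updates.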

In addition, synchronization constrains optimization. The JVM is free to reorder instructions to optimize; consider a simple example:

foo++;
bar++;

versus:

foo++;
synchronized(obj)
{
    bar++;
}

In the first example, the compiler is free to load foo and bar at the same time, then increment them both, then save them both. In the second example, the compiler must perform the load/add/save on foo, then perform the load/add/save on bar. Thus, synchronization may impact the ability of the JRE to optimize instructions.

(An excellent book on the Java Memory Model is Brian Goetz's Java Concurrency In Practice.)

Solution 3

There is some overhead in acquiring a non-contested lock, but on modern JVMs it is very small.

A key run-time optimization that's relevant to this case is called "Biased Locking" and is explained in the Java SE 6 Performance White Paper.

If you wanted to have some performance numbers that are relevant to your JVM and hardware, you could construct a micro-benchmark to try and measure this overhead.
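
A naive micro-benchmark along those lines might look like the sketch below (the class and method names are invented, and hand-rolled timing like this is fragile; a harness such as JMH gives more trustworthy numbers):

```java
// Sketch: compare a plain increment against an uncontended synchronized one.
public class LockOverheadBench {
    private long plain = 0;
    private long locked = 0;

    void plainInc() { plain++; }

    synchronized void lockedInc() { locked++; }

    public static void main(String[] args) {
        LockOverheadBench bench = new LockOverheadBench();
        final int runs = 20_000_000;

        // Warm up both paths so the JIT compiles them before timing.
        for (int i = 0; i < runs; i++) { bench.plainInc(); bench.lockedInc(); }

        long t0 = System.nanoTime();
        for (int i = 0; i < runs; i++) bench.plainInc();
        long t1 = System.nanoTime();
        for (int i = 0; i < runs; i++) bench.lockedInc();
        long t2 = System.nanoTime();

        System.out.printf("plain:  %.2f ns/op%n", (double) (t1 - t0) / runs);
        System.out.printf("locked: %.2f ns/op%n", (double) (t2 - t1) / runs);
    }
}
```

The fields are instance state rather than locals so the increments cannot be trivially eliminated, but the JIT can still distort results like these; the numbers are only meaningful relative to each other on the same JVM and hardware.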

Solution 4

Using locks when you don't need to will slow down your application. The cost could be too small to measure, or it could be surprisingly high.

IMHO, often the best approach is to use lock-free code in a single-threaded program, to make it clear that the code is not intended to be shared across threads. This can be more important for maintenance than any performance issue.

public static void main(String... args) throws IOException {
    for (int i = 0; i < 3; i++) {
        perfTest(new Vector<Integer>());
        perfTest(new ArrayList<Integer>());
    }
}

private static void perfTest(List<Integer> objects) {
    long start = System.nanoTime();
    final int runs = 100000000;
    // Each outer iteration performs 10 adds and 10 removes = 20 operations,
    // so stepping i by 20 makes "runs" count individual operations.
    for (int i = 0; i < runs; i += 20) {
        // add items.
        for (int j = 0; j < 20; j += 2)
            objects.add(i);
        // remove from the end.
        while (!objects.isEmpty())
            objects.remove(objects.size() - 1);
    }
    long time = System.nanoTime() - start;
    System.out.printf("%s each add/remove took an average of %.1f ns%n",
            objects.getClass().getSimpleName(), (double) time / runs);
}

prints

Vector each add/remove took an average of 38.9 ns
ArrayList each add/remove took an average of 6.4 ns
Vector each add/remove took an average of 10.5 ns
ArrayList each add/remove took an average of 6.2 ns
Vector each add/remove took an average of 10.4 ns
ArrayList each add/remove took an average of 5.7 ns

From a performance point of view, if 4 ns is important to you, you have to use the non-synchronized version.

For 99% of use cases, the clarity of the code is more important than performance. Clear, simple code often performs reasonably well too.

BTW: I am using a 4.6 GHz i7 2600 with Oracle Java 7u1.


For comparison, if I do the following, where perfTest1, perfTest2, and perfTest3 are identical copies of perfTest:

    perfTest1(new ArrayList<Integer>());
    perfTest2(new Vector<Integer>());
    perfTest3(Collections.synchronizedList(new ArrayList<Integer>()));

I get

ArrayList each add/remove took an average of 2.6 ns
Vector each add/remove took an average of 7.5 ns
SynchronizedRandomAccessList each add/remove took an average of 8.9 ns

If I use a common perfTest method, the JIT cannot inline the code as optimally and all the results are slower:

ArrayList each add/remove took an average of 9.3 ns
Vector each add/remove took an average of 12.4 ns
SynchronizedRandomAccessList each add/remove took an average of 13.9 ns

Swapping the order of the tests:

ArrayList each add/remove took an average of 3.0 ns
Vector each add/remove took an average of 39.7 ns
ArrayList each add/remove took an average of 2.0 ns
Vector each add/remove took an average of 4.6 ns
ArrayList each add/remove took an average of 2.3 ns
Vector each add/remove took an average of 4.5 ns
ArrayList each add/remove took an average of 2.3 ns
Vector each add/remove took an average of 4.4 ns
ArrayList each add/remove took an average of 2.4 ns
Vector each add/remove took an average of 4.6 ns

Running one at a time:

ArrayList each add/remove took an average of 3.0 ns
ArrayList each add/remove took an average of 3.0 ns
ArrayList each add/remove took an average of 2.3 ns
ArrayList each add/remove took an average of 2.2 ns
ArrayList each add/remove took an average of 2.4 ns

and

Vector each add/remove took an average of 28.4 ns
Vector each add/remove took an average of 37.4 ns
Vector each add/remove took an average of 7.6 ns
Vector each add/remove took an average of 7.6 ns
Vector each add/remove took an average of 7.6 ns
Author: Anton

Updated on July 17, 2022

Comments

  • Anton
    Anton almost 2 years

    I had a small dispute over the performance of synchronized blocks in Java. This is a theoretical question, which does not affect real-life applications.

    Consider a single-threaded application which uses locks and synchronized sections. Does this code run slower than the same code without synchronized sections? If so, why? We are not discussing concurrency, since this is a single-threaded application.

    Update

    Found an interesting benchmark testing it, but it's from 2001. Things could have changed dramatically in the latest versions of the JDK.

  • brady
    brady over 12 years
    Citation, please. I don't think that the JVM can eliminate monitor entries and exits entirely.
  • jarnbjo
    jarnbjo over 12 years
    I've read that somewhere as well. If the Hotspot compiler is sure that the code is only reachable from one thread, it should skip the synchronization completely. I'm not quite sure about the "is sure ..." part though and I've never actually managed to get the VM to do it. Even in a single-thread application, synchronization overhead should not be underestimated.
  • Anton
    Anton over 12 years
    Not sure, that it is possible for JVM to make this optimization
  • AlexR
    AlexR over 12 years
    I tested this. It is so small that you cannot measure the effect at all. They say that the effect was much more significant for older versions of JVM.
  • NPE
    NPE over 12 years
    @AlexR: Nice, thanks for sharing. It does not surprise me that the effect used to be more significant, since Biased Locking optimizations only got added in Java 6.
  • Stefan
    Stefan over 12 years
    I tested it on an IBM JDK and, except for the first run, Vector and ArrayList have about a 10% performance difference on my machine (54 ns vs 48-50 ns). I also tested it with Collections.synchronizedList and was surprised by its bad performance: it was about twice as slow as Vector/ArrayList (110 ns).
  • Vishy
    Vishy over 12 years
    This is another reason to be concerned about micro-tuning. Using a different system, hardware, or JVM can give you a different result.
  • bestsss
    bestsss over 12 years
    btw, the code like that is first optimized for Vector, then deoptimized and optimized again, since the call target (List<Integer>) changes. Since can't be sure about the proper deoptimization (could be just guarded call to vector + trap) ArrayList case would suffer. Can you swap the test, i.e. ArrayList then Vector. Mostly curious. OTOH the case is bloody perfect bias locking test too. Also the CAS is quite cheap on your CPU, on older architectures CAS is quite an expensive call (if biased locking is disabled)
  • Stefan
    Stefan over 12 years
    @bestsss Check ideone.com/8jpoZ. Didn't make a difference. It looks like Vector needs a while to warm up; see the last results.
  • Vishy
    Vishy over 12 years
    @bestsss I think the tests show its best to measure rather than guess, because I wouldn't have thought that running ArrayList first would be faster for both collections, or that running Vector alone would be slower than with ArrayList.
  • Vishy
    Vishy over 12 years
    The results appear to be platform dependent. Are you using Java 7u1?
  • Stefan
    Stefan over 12 years
    @Peter Lawrey Ideone is using sun-jdk-1.6.0.17 at work we use the IBM JDK version 1.6.0
  • Vishy
    Vishy over 12 years
    @Stefan, it is curious that its 2.3 ns on my test vs 50 ns in your test. It suggests that other factors are far more important. ;)
  • Stefan
    Stefan over 12 years
    Your machine is probably better than my 5-year-old (steam powered) computer. Btw I ran the test with more warmup and I consistently get (time factor) AL 1x (~26ns), Vec 2x (~52ns), Sync >3x (~90ns).
  • Vishy
    Vishy over 12 years
    @Stefan can you try the code calling three copies of the perfTest method as I have above?
  • Vishy
    Vishy over 12 years
    It can coalesce synchronized blocks when inlining. This combines the locking of several methods into one to reduce its overhead.
  • Stefan
    Stefan over 12 years
    @Peter Lawrey Ah i forgot i made some changes to it. But the numbers are the same after some warmup(~26ns vs ~52ns).
  • bestsss
    bestsss over 12 years
    another side note for better warm up and microbenchmark, spread the big loop into 2 different methods. something like static void invokeTest (){for i=0;i<100000;i++) perfTest(objects);} and reduce runs to something like 1000, time calculation should be outside the inner loop, ofc. this allows the JIT to avoid direct OSR in the very hot loop. Some nice info: azulsystems.com/blog/cliff/…
  • bestsss
    bestsss over 12 years
    @erickson, lock coarsening is a standard procedure + biased locked, you can look up how it works. although biased locking can be hit and miss depending on the underlying architecture.
  • irreputable
    irreputable over 12 years
    so small that you cannot measure the effect at all such claim cannot be made lightly. when test something in a tight loop, JVM can do great magics. but that doesn't represent "real world" apps. JVM gets stupid really fast when execution becomes complex.
  • arkon
    arkon almost 11 years
    @AlexR I agree with irreputable's comment, in fact, your statement contradicts itself. Besides, unless you claim to be an expert in designing reliable and proper microbenchmarks in Java, then anyone should be taking that bold claim with a grain of salt.
  • Maarten Bodewes
    Maarten Bodewes almost 9 years
    Very informative. Could you however indicate for which version of Java / VM this answer has been written?
  • zakmck
    zakmck about 5 years
    The OP is asking what happens for single-thread applications, ie, if unnecessary synchronisation makes a difference. The differences you obtain in your case are obvious, cause you're synchronising many actual threads upon the same object all the time, forcing them to almost run one at a time.