Java threading optimization at 100% CPU usage


Solution 1

If you have too many simultaneous compute-intensive tasks in parallel threads, you reach the point of diminishing returns very quickly. In fact, if there are N processors (cores), then you don't want more than N such threads. Now, if the tasks occasionally pause for I/O or user interaction, then the right number can be somewhat larger. But in general, if at any one moment there are more threads that want to do computation than there are cores available, then your program is wasting time on context switches -- i.e., the scheduling is costing you.
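As a minimal sketch of that rule of thumb (the class name and task body here are illustrative, not from the question), a compute-only pool would be sized to the core count:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CoreSizedPool {
    public static void main(String[] args) {
        // One thread per core for pure computation
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        for (int i = 0; i < 1_000; i++) {
            pool.submit(() -> {
                long sum = 0;
                for (int j = 0; j < 1_000_000; j++) {
                    sum += j; // compute-only work: no blocking, no I/O
                }
                return sum;
            });
        }
        pool.shutdown();
    }
}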

Solution 2

The fact that your CPU is running at 100% does not tell you much about how much useful work it is doing. In your case, you are using more threads than cores, so that 100% includes some context switching and unnecessary memory use (a small impact for 100 threads), which is sub-optimal.

For CPU-intensive tasks, I generally use this idiom:

private final int NUM_THREADS = Runtime.getRuntime().availableProcessors() + 1;
private final ExecutorService executor = Executors.newFixedThreadPool(NUM_THREADS);

Using more threads, as others have indicated, only introduces unnecessary context switching.

Obviously, if the tasks do some I/O or other blocking operations, this is not applicable and a larger pool would make sense.
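As a rough illustration of that mixed case (the blocking factor of 0.5 is an assumed example value, not something measured in this answer), a common sizing heuristic divides the core count by the fraction of time a task actually computes:

// Hypothetical sizing for tasks that block part of the time.
// blockingFactor is an estimate you have to measure: the fraction of
// each task's wall-clock time spent waiting rather than computing.
int cores = Runtime.getRuntime().availableProcessors();
double blockingFactor = 0.5; // assumed: tasks wait half the time
int poolSize = (int) (cores / (1 - blockingFactor));
ExecutorService executor = Executors.newFixedThreadPool(poolSize);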

EDIT

To reply to @MartinJames' comment, I have run a (simplistic) benchmark. The results show that going from a pool size of NUM_CORES + 1 to 100 degrades performance only slightly (call it 5%), while going to higher figures (1000 and 10000) hits performance significantly.

Results are the average of 10 runs:
Pool size: 9: 238 ms. //(NUM_CORES+1)
Pool size: 100: 245 ms.
Pool size: 1000: 319 ms.
Pool size: 10000: 2482 ms.

Code:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Test {

    private final static int NUM_CORES = Runtime.getRuntime().availableProcessors();
    private static long count;
    private static Runnable r = new Runnable() {

        @Override
        public void run() {
            int count = 0;
            for (int i = 0; i < 100_000; i++) {
                count += i;
            }
            Test.count += count; // unsynchronized shared sink: keeps the JIT from eliminating the loop; lost updates don't affect the timing
        }
    };

    public static void main(String[] args) throws Exception {
        //warmup
        runWith(10);

        //test
        runWith(NUM_CORES + 1);
        runWith(100);
        runWith(1000);
        runWith(10000);
    }

    private static void runWith(int poolSize) throws InterruptedException {
        long average = 0;
        for (int run = 0; run < 10; run++) { //run 10 times and take the average
            Test.count = 0;
            ExecutorService executor = Executors.newFixedThreadPool(poolSize);
            long start = System.nanoTime();
            for (int i = 0; i < 50000; i++) {
                executor.submit(r);
            }
            executor.shutdown();
            executor.awaitTermination(10, TimeUnit.SECONDS);
            long end = System.nanoTime();
            average += ((end - start) / 1000000);
            System.gc();
        }
        System.out.println("Pool size: " + poolSize + ": " + average / 10 + " ms.  ");
    }
}

Solution 3

'To get the most work done the quickest: am I best off to just launch more threads when I need to do more work and let the Java thread scheduler handle distributing the work, or would getting smarter and managing the work load to keep the CPU below 100% get me further faster?'

As you add more and more threads, the overhead incurred in context switching, memory-cache flushing, memory-cache overflow, and kernel and JVM thread management increases. As your threads hog the CPU, their kernel priorities drop to some minimum and they hit the minimum time slice. As more and more threads crowd memory, they overflow the various internal CPU memory caches, and there is a higher chance the CPU will need to page a job in from slower memory. Inside the JVM there is more local mutex contention and probably some (maybe small) incremental per-thread and per-object GC overhead. Depending on how synchronized your user tasks are, more threads also cause increased memory flushing and lock contention.

With any program and any architecture, there is a sweet spot where threads can optimally utilize the available processor and IO resources while limiting kernel and JVM overhead. Finding that sweet spot repeatedly will require a number of iterations and some guesswork.

I would recommend using Executors.newFixedThreadPool(SOME_NUMBER) and submitting your jobs to it. Then you can do multiple runs, varying the number of threads up and down until you find the optimal pool size for the work and the architecture of the box.

Understand, however, that the optimal number of threads will vary based on how many processors you have and other factors that may be non-trivial to determine. More threads may be needed if they block on disk or network I/O; fewer if the work is mostly CPU-bound.
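One low-effort way to run those experiments without recompiling (the poolSize property name is made up for illustration) is to read the pool size from a system property and vary it per run:

// e.g. java -DpoolSize=16 MyApp, falling back to the core count
int poolSize = Integer.getInteger("poolSize",
        Runtime.getRuntime().availableProcessors());
ExecutorService executor = Executors.newFixedThreadPool(poolSize);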

Solution 4

'Would getting smarter and managing the work load to keep the CPU below 100% get me further faster?'

Probably not.

As others have posted, 100 threads is too many for a threadpool if most of the tasks are CPU-intensive. It won't make much difference to performance on typical systems - with that much overload it will be bad with 4 threads and bad with 400.

How did you decide on 100 threads? Why not 16, say?

'The number of threads is not massive, say up to 100' - does it vary? Just create 16 at startup and stop managing them - just pass the queue to them and forget about them.

Horrible thought - you aren't creating a new thread for each task, are you?
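For what it's worth, a minimal sketch of the 'create 16 at startup and pass them the queue' approach (class and method names are illustrative):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class WorkerPool {
    private static final int NUM_WORKERS = 16;
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();

    public void start() {
        // Create the workers once and forget about them; each one blocks
        // on take() until work arrives, so idle workers cost almost nothing.
        for (int i = 0; i < NUM_WORKERS; i++) {
            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        queue.take().run();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // treated as shutdown
                }
            });
            worker.setDaemon(true);
            worker.start();
        }
    }

    public void submit(Runnable task) {
        queue.add(task);
    }
}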


Comments

  • Kong
    Kong about 4 years

    I have an application that accepts work on a queue and then spins that work off to be completed on independent threads. The number of threads is not massive, say up to 100, but these are intensive tasks and can quickly bump the CPU up to 100%.

    To get the most work done the quickest: am I best off to just launch more threads when I need to do more work and let the Java thread scheduler handle distributing the work, or would getting smarter and managing the work load to keep the CPU below 100% get me further faster?

    The machine is dedicated to my java app.

    EDIT:

    Thanks for the fantastic input!

    The tasks are of varying complexity and involve I/O so having a low thread pool of say 4 may well be only running the CPU to 20%. I have no way of knowing how many tasks will actually take the CPU to 100%.

    My thought was should I monitor CPU through RMI and dynamically dial the work up and down, or should I just not care and let the OS handle it.

  • Martin James
    Martin James about 12 years
How much is it costing? If there are a lot of CPU-intensive tasks, the box is overloaded and there are always more ready threads than processors, then it is likely that the OS will resort to changing the set of ready threads at each timer interrupt. So, no matter how many CPU-bound threads there are, the number of context switches will be limited to one every 30ms or whatever. This is not significant. What you say above is true, but not any justification for extra thread micromanagement by the user, (which very often goes wrong).
  • Martin James
    Martin James about 12 years
Eh? Context-switches only happen when the OS is entered on an interrupt. If there is a large set of CPU-intensive ready threads, the OS will swap around the running set, (probably mostly after the timer interrupt). Once the set of ready threads is larger than the number of cores, the context-switch overhead is nearly constant as more threads are added.
  • Martin James
    Martin James about 12 years
    Why? What is wrong with 100 threads, (assuming sufficient RAM to hold all the stacks etc. without continual paging)?
  • Martin James
    Martin James about 12 years
'Using more threads, as others have indicated, only introduces unnecessary context switching' When the number of ready threads equals the number of cores, that's as much context-switching as you are going to get. Adding more threads will not significantly add to the overhead.
  • assylias
    assylias about 12 years
    Why 16? Adjusting the number of threads dynamically to the number of available processors does make sense - why not (num_processor * 1.5) for example?
  • Martin James
    Martin James about 12 years
    num_processor * 1.5 - fine. num_processor * 15, not much different, really. Yes - get the number of processors from the sysinfo, double it and add the number you first thought of <g>
  • Martin James
    Martin James about 12 years
thanks, I should have added 'more threads will not significantly add to the overhead until the stacks etc. exceed available RAM and start getting paged out'. If taken to extremes almost everything is bad :)
  • Martin James
    Martin James about 12 years
    maybe you could publish your benchmark code - I would maybe have a go too! How do you measure performance?
  • assylias
    assylias about 12 years
@MartinJames What do you mean? The code is in my answer. The only caveat is that for higher numbers of threads, the GC kicks in even between the calls to System.gc().
  • Martin James
    Martin James about 12 years
    Hmm.. good point about the GC - forgot about that possible influence. I will translate your benchmark to C++, (no GC), and try that.
  • Ernest Friedman-Hill
    Ernest Friedman-Hill about 12 years
    @MartinJames -- I hardly think that choosing an optimally sized thread pool is micromanagement. In any case: citation, please. Your statements fly in the face of every standard text and reference. Just one example that agrees with me: ibm.com/developerworks/library/j-jtp0730/index.html.
  • Martin James
    Martin James about 12 years
    Quick trial on C++ with similar design. 20 threads, 987 ticks. 2000 threads, 1016 ticks. Using a pool of two thousand threads is 3% slower than using a pool of 20 threads. This is on a high end desktop and so, with less RAM etc. there will be factors that make 2000 threads a lot slower. Conclusion - unless resource are very constrained, it doesn't matter much, within reason, how many threads you have in your pool in either Java or C++.
  • Martin James
    Martin James about 12 years
@assylias has shown that 100 threads is not a significant issue in Java. In C++, 2000 threads is not a significant issue.
  • assylias
    assylias about 12 years
    @MartinJames Fair enough ;) Note that the effect seems to be slightly higher in my test, possibly because of the overhead caused by the executor. But I agree that 20 or 100 doesn't make a significant difference.
  • Martin James
    Martin James about 12 years
    Maybe the executor... don't know. My C++ threadpool is a simple producer-consumer queue that the threads wait on for work, so yes, Java ExecutorService has more features and so is likely to have more overhead. Anyway, I forgot to upvote for your excellent investigation, so now I have :)
  • Martin James
    Martin James about 12 years
I don't care about generalized texts and references from IBM. Assylias and I did some testing. In Java, for CPU-intensive tasks, it doesn't matter much whether you have 20 or 200 threads in your pool. With non-GC C++, it doesn't matter whether there are 20 or 2000. These results are not remotely surprising to me. The OS has a queue, (OK, an array[priority] of queues), of ready threads, probably a linked list. The kernel gets [no. of cores] threads from the front of the queue and puts the old [no. of cores] at the back. Why should it matter how many entries there are in the queue?
  • Martin James
    Martin James about 12 years
    I should add that there are surely thread-number-related factors that should influence a choice of threads for a pool. 200/2000 stacks has a memory cost and, if that cost pushes the working-set into paging, then all those threads are going to generate a bad performance hit. However, that is nothing directly to do with context-switching overhead - add more RAM and the problem goes away.
  • Martin James
    Martin James about 12 years
    More empirical results. C++ CPU-intensive tasks. i7, 3GHz, 4 cores(8 w. hyper), 12GB RAM. ticks/poolThreadCount/taskManagerCPU: 356/8/34, 287/16/29, 280/80/30, 284/800/28. The optimum pool thread count is bigger than [number of cores], and significantly so. It seems, so far, that if you want your CPU-intensive tasks run as fast as possible, use 80 threads. If you want them run as efficiently as possible, use 800. Even I find this unreasonable, so someone prove me wrong...
  • Gray
    Gray about 12 years
To me, context switching is what happens at the CPU level when either the thread yields for IO or the timer fires and there are too many jobs in the run queue. This flushes cache memory (L1), copies running state to memory, and swaps in the next job. I'll agree that there is a JVM/OS overhead floor, but it is more complicated with cache-memory overflow, CPU priority penalties, and JVM overhead to take into account. But I should talk more about the limits in my answer.
  • Gray
    Gray about 12 years
    Hey @Martin. I don't understand your numbers. Can you explain the ticks/poolThreadCount/taskManagerCPU more? thread-count I get but ticks per what unit? What is taskManagerCPU?
  • Martin James
    Martin James about 12 years
In the case of these CPU-bound jobs where the box is overloaded with many more ready threads than cores, there will be overhead on every change to the running set, as you describe. This can only happen on hardware interrupts or system calls. CPU-intensive tasks typically don't make frequent system calls, so that leaves interrupts. If we neglect page faults, the CPU-bound tasks won't do much IO either, so that leaves the timer interrupt. The frequency of that is independent of the number of ready threads, so the overhead it generates is independent of the number of ready threads.
  • Gray
    Gray about 12 years
    I'll agree although the time slice window is not fixed. It does drop to some floor as the kernel penalizes CPU bound jobs. Also, I'm not 100% certain that page-faults can be neglected. I'm not sure if the CPU/kernel have the ability to use memory location in scheduling these days -- probably not.
  • Gray
    Gray about 12 years
    Oh and although CPU bound jobs don't make system calls they may deal with memory barriers or locks which causes JVM interrupts as well obviously.
  • Martin James
    Martin James about 12 years
    Ticks is [GetTickCount() at end of a load of CPU-intensive tasks - GetTickCount() at start of the load of CPU-intensive tasks]. It's approx. milliseconds. poolThreadCount is the number of threads waiting on the task producer-consumer queue. 'taskManagerCPU' is a rough 'by eye' guesstimate of total CPU use from the Task Manager 'Performance' tab. I will work on my benchmark code to get the load up to 100% on all cores - I think that I'm at ~30% because my tasks are too short and threads are contending and blocking on the queue.
  • Martin James
    Martin James about 12 years
@Gray - and so it proved. My tasks are boring and just do the usual - increment an integer member. I added another '0' to the for loop. It now counts up to 100000000 and I queue up 400 of them. My CPU use during the test is now 100% on all 4/HT8 cores. Ticks/threadCount so far: 21922/8, 20424/80, 20191/800. Better with 800 threads! CPU fan is making a lot of noise and it's getting warm in here.
  • Gray
    Gray about 12 years
Thanks for the info. So ticks is an OS time (~millis) for the entire job. With 800 threads do you just divide 100000000 by 800 so each one of them does that many increments? I'm interested in running tests like this in Java considering the OP has the java tag.
  • Martin James
    Martin James about 12 years
    @Gray Assylias has Java code in this reply and has done some tests. When I do my C++ test now, I have a task class whose run method counts up to 100000000. I create 400 of them and submit them all to a producer-consumer queue. There are 800 threads hanging off the queue and so, now that you've made me think about it, 400 of them never get to run because there are only enough tasks for 400.
  • Gray
    Gray about 12 years
Oh, didn't see that. Nice. Yeah, that's what I would have thought. There are inflection points that make it more complicated. Maybe this is just Java.
  • Martin James
    Martin James about 12 years
    This is an interesting issue, though. For years, I've seen posts about only creating [no. of cores] threads for best performance because of the 'overhead of context-switching'. Now I seem to find that a lot more threads than that is best, even with CPU-bound jobs.