Disappointing performance with Parallel.For


Solution 1

Using a global variable can introduce significant synchronization problems, even when you are not using locks. Every time a core assigns a value to the variable it has to get exclusive access to the same place in system memory, or wait for another core to finish before accessing it. You can avoid corruption without locks by using the lighter Interlocked.Add method, which adds a value to the sum atomically, but you will still get delays due to contention.
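To make the contention concrete, here is a minimal sketch of the two shared-variable patterns described above (the nums array mirrors the example further down; this is illustrative only, not the recommended approach):

// Requires: using System.Linq; using System.Threading; using System.Threading.Tasks;
int[] nums = Enumerable.Range(0, 1000000).ToArray();
long total = 0;

// Unsafe: every iteration writes to the same memory location, so updates can be lost.
Parallel.For(0, nums.Length, j => { total += nums[j]; });

// Atomic but still contended: Interlocked.Add prevents corruption without a lock,
// yet every core still has to queue up for the cache line that holds 'total'.
total = 0;
Parallel.For(0, nums.Length, j => Interlocked.Add(ref total, nums[j]));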

The proper way to do this is to accumulate a thread-local partial sum on each thread and add all of the partial sums to a single global total at the end. Parallel.For has an overload that does just this. MSDN even has a summation example at How to: Write a Parallel.For Loop with Thread-Local Variables:

int[] nums = Enumerable.Range(0, 1000000).ToArray();
long total = 0;

// Use the type parameter to make subtotal a long, not an int
Parallel.For<long>(0, nums.Length, () => 0, (j, loop, subtotal) =>
    {
        subtotal += nums[j];
        return subtotal;
    },
    (x) => Interlocked.Add(ref total, x)
);

Each thread accumulates its own subtotal and adds it to the global total with Interlocked.Add only once, when it finishes its share of the work.

Solution 2

Parallel.For and Parallel.ForEach will use a degree of parallelism that they deem appropriate, balancing the cost of setting up and tearing down threads against the work each thread is expected to perform. .NET 4.5 made several improvements to performance (including more intelligent decisions on the number of threads to spin up) compared to previous .NET versions.

Note that, even if it were to spin up one thread per core, context switches, false sharing issues, resource locks, and other issues may prevent you from achieving linear scalability (in general, not necessarily with your specific code example).
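If you want to experiment with the thread count yourself rather than relying on the library's heuristics, you can pass a ParallelOptions instance. A minimal sketch, reusing the loops variable from the question's code (the loop body here is just a placeholder):

// Requires: using System; using System.Threading.Tasks;
var options = new ParallelOptions
{
    // Cap the number of concurrent workers; Environment.ProcessorCount is the usual upper bound.
    MaxDegreeOfParallelism = Environment.ProcessorCount
};

Parallel.For(0, loops, options, k =>
{
    // ... per-iteration work goes here ...
});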

Solution 3

I think the computation gain is so low because the work in each iteration is "too easy": Parallel.For creates a new work item for each iteration, so the time spent servicing all of those items on threads eats into the gains. I would write it like this:

// Requires: using System.Collections.Concurrent; using System.Linq;
//           using System.Threading; using System.Threading.Tasks;
int[] nums = Enumerable.Range(0, 1000000).ToArray();
long total = 0;

Parallel.ForEach(
    Partitioner.Create(0, nums.Length),   // split the index range into chunks
    () => 0L,                             // thread-local seed; long, so the partial sums cannot overflow
    (part, loopState, partSum) =>
    {
        // Sum one whole chunk sequentially.
        for (int i = part.Item1; i < part.Item2; i++)
        {
            partSum += nums[i];
        }
        return partSum;
    },
    (partSum) =>
    {
        // Merge each thread's partial sum into the global total once.
        Interlocked.Add(ref total, partSum);
    }
);

The Partitioner splits the work into sizeable chunks for each task, so far less time is spent handing individual iterations over to threads. If you can, please benchmark this solution and tell us whether it gives a better speedup.
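If you want to experiment further (see also the comments below about the library's built-in chunking), Partitioner.Create also has an overload that takes an explicit range size. The 100000 below is only a value to benchmark against, not a recommendation:

// Same pattern as above, but with an explicit chunk size.
var partitioner = Partitioner.Create(0, nums.Length, 100000);

Parallel.ForEach(
    partitioner,
    () => 0L,
    (part, loopState, partSum) =>
    {
        for (int i = part.Item1; i < part.Item2; i++)
        {
            partSum += nums[i];
        }
        return partSum;
    },
    (partSum) => Interlocked.Add(ref total, partSum)
);

As noted in the comments, the default dynamic partitioning is already quite good, so measure before assuming a fixed chunk size helps your specific workload.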

Author: Anders Gustafsson

I am the owner of Cureos AB, a software development and consulting company located in Uppsala, Sweden. The company's main focus is in developing software for dose-response analysis and optimization of large patient treatment materials, primarily using the .NET framework. In my Ph.D. thesis I outlined a general optimization framework for radiation therapy, and I have developed numerous tools for radiotherapy optimization and dose-response modeling that have been integrated into different treatment planning systems.

Updated on July 09, 2022

Comments

  • Anders Gustafsson
    Anders Gustafsson almost 2 years

    I am trying to speed up my calculation times by using Parallel.For. I have an Intel Core i7 Q840 CPU with 8 cores, but I only manage to get a performance ratio of 4 compared to a sequential for loop. Is this as good as it can get with Parallel.For, or can the method call be fine-tuned to increase performance?

    Here is my test code, sequential:

    var loops = 200;
    var perloop = 10000000;
    
    var sum = 0.0;
    for (var k = 0; k < loops; ++k)
    {
        var sumk = 0.0;
        for (var i = 0; i < perloop; ++i) sumk += (1.0 / i) * i;
        sum += sumk;
    }
    

    and parallel:

    sum = 0.0;
    Parallel.For(0, loops,
                    k =>
                        {
                            var sumk = 0.0;
                            for (var i = 0; i < perloop; ++i) sumk += (1.0 / i) * i;
                            sum += sumk;
                        });
    

    The loop that I am parallelizing involves computation with a "globally" defined variable, sum, but this should only amount to a tiny, tiny fraction of the total time within the parallelized loop.

    In Release build ("optimize code" flag set) the sequential for loop takes 33.7 s on my computer, whereas the Parallel.For loop takes 8.4 s, a performance ratio of only 4.0.

    In the Task Manager, I can see that the CPU utilization is 10-11% during the sequential calculation, whereas it is only 70% during the parallel calculation. I have tried to explicitly set

    ParallelOptions.MaxDegreeOfParallelism = Environment.ProcessorCount
    

    but to no avail. It is not clear to me why all of the CPU power is not devoted to the parallel calculation.

    Sequential vs. parallel CPU utilization

    I have noticed that a similar question has been raised on SO before, with an even more disappointing result. However, that question also involved inferior parallelization in a third-party library. My primary concern is parallelization of basic operations in the core libraries.

    UPDATE

    It was pointed out to me in some of the comments that the CPU I am using only has 4 physical cores, which are presented to the system as 8 cores when hyper-threading is enabled. For the sake of it, I disabled hyper-threading and re-benchmarked.

    With hyper-threading disabled, my calculations are now faster, both the parallel and also the (what I thought was) sequential for loop. CPU utilization during the for loop is up to approx. 45% (!!!) and 100% during the Parallel.For loop.

    Computation time for the for loop is now 15.6 s (more than twice as fast as with hyper-threading enabled) and 6.2 s for Parallel.For (25% better than when hyper-threading is enabled). The performance ratio with Parallel.For is now only 2.5, running on 4 real cores.

    So the performance ratio is still substantially lower than expected, despite hyper-threading being disabled. On the other hand, it is intriguing that CPU utilization is so high during the for loop. Could there be some kind of internal parallelization going on in this loop as well?

  • Anders Gustafsson
    Anders Gustafsson almost 12 years
    Thanks for your answer, Eric. I recompiled the code with .NET 4.5, but for my case the timings are the same. It's good to get the parallelization issues listed. I did not expect exact linear scalability even in my highly parallelized case, but I had expected better than 4 times on an 8-core CPU. Should I expect that manual thread creation would improve performance?
  • High Performance Mark
    High Performance Mark almost 12 years
    I am pleased to report that in this case Gustafson's Law (en.wikipedia.org/wiki/Gustafson's_law) may be of interest. It is worth understanding how parallel performance changes with respect to job size.
  • Anders Gustafsson
    Anders Gustafsson almost 12 years
    Did not know about that one, @HighPerformanceMark, but from now on I will always refer to this law regardless :-)
  • Anders Gustafsson
    Anders Gustafsson almost 12 years
    Many thanks for this valuable answer, Panagiotis! I did not know of this feature, but it is definitely something I will make use of in my production code.
  • Anders Gustafsson
    Anders Gustafsson almost 12 years
    Marking this as the answer. The solution does not appear to speed things up substantially in the test case I have outlined above, but it adds an invaluable level of reliability to the parallelized code.
  • Jacob Sobus
    Jacob Sobus almost 9 years
    Look at my comment in question for more info about hyperthreading.
  • ToolmakerSteve
    ToolmakerSteve about 6 years
    Jacob, according to docs.microsoft.com/en-us/dotnet/standard/parallel-programming/… , the Task Parallel Library (Parallel.ForEach in your example) automatically makes a guess at how to "chunk" the requested task - it does not "create new task in each iteration". [That would be bad performance.] The Partitioner.Create overload you call is presumably the one that is used internally. Can't gain any benefit unless you call an overload that takes more parameters - and have tested to determine a chunk size that is better for your specific task.
  • ToolmakerSteve
    ToolmakerSteve about 6 years
    ... (Since ForEach says it requires dynamic partitioning, I suspect it makes a guess, then uses the actual time used by the first chunk that returns, to make a better guess on next chunking - which would be hard to beat by any static parameters you provide, except in well-defined and thoroughly tested situations.) NOTE: I don't know if this improvement happened after you wrote your answer; just making sure readers are aware of the situation.
  • ToolmakerSteve
    ToolmakerSteve about 6 years
    Your code has one more zero in the non-parallel case. (A classic example of the kind of mistake one can make, if one does not bother to define a constant or variable to hold a "magic" number that will be re-used! - in this case, the number of loop iterations.)
  • Jacob Sobus
    Jacob Sobus over 5 years
    @ToolmakerSteve thanks for this comment, I didn't know all these details at the time :) I hadn't seen the article you linked to; it looks like a great reference on this problem. So, as always, .NET already has good solutions built in, but "hand-tuned" tweaks can be applied if needed.
  • Seabizkit
    Seabizkit over 3 years
    @OP please check this out, I believe this is the reason... it is more expensive to create a task for every loop iteration; what you want to do is batch the big number 10000000 into groups equal to the number of cores, so 8 cores, 8 groups, and each group is done in parallel so each core handles 1,250,000. It is extremely inefficient to try to do each one on its own thread; to be honest, I'm surprised it is faster at all.