Best use of Parallel.ForEach / multithreading


Solution 1

Something worth checking out is the TPL Dataflow library.

See the Dataflow (Task Parallel Library) documentation on MSDN.

See also the related question "Nesting await in Parallel.ForEach", which explains:

The whole idea behind Parallel.ForEach() is that you have a set of threads and each processes part of the collection. As you noticed, this doesn't work with async-await, where you want to release the thread for the duration of the async call.

Also, the walkthrough Creating a Dataflow Pipeline specifically sets up and processes multiple web page downloads. TPL Dataflow really was designed for that scenario.
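As a rough illustration of how Dataflow fits this scenario, here is a minimal sketch that pushes URLs through an ActionBlock with a bounded degree of parallelism. The names DataflowSketch, FetchAllAsync, fetchAsync and maxConcurrentRequests are made up for this example; fetchAsync stands in for whatever download call you actually use (for instance HttpClient.GetStringAsync). On older frameworks the System.Threading.Tasks.Dataflow package may need to be added from NuGet.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class DataflowSketch
{
    // fetchAsync is a hypothetical stand-in for the real download,
    // e.g. url => httpClient.GetStringAsync(url).
    public static async Task<IDictionary<string, string>> FetchAllAsync(
        IEnumerable<string> urls,
        Func<string, Task<string>> fetchAsync,
        int maxConcurrentRequests)
    {
        var results = new ConcurrentDictionary<string, string>();

        // The block processes up to maxConcurrentRequests items at once.
        // While a delegate is awaiting, it does not hold a thread.
        var block = new ActionBlock<string>(
            async url =>
            {
                results[url] = await fetchAsync(url);
            },
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = maxConcurrentRequests
            });

        foreach (var url in urls)
        {
            block.Post(url);
        }

        block.Complete();        // signal that no more input will arrive
        await block.Completion;  // wait until every queued item is processed
        return results;
    }
}
```

Because the waiting is asynchronous, a value such as 100 for maxConcurrentRequests does not require 100 threads. The right value for a given site is something to measure rather than assume.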

Solution 2

You can set the MaxDegreeOfParallelism property of ParallelOptions to cap the number of operations that Parallel.ForEach runs concurrently.

Here's the code snippet:

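// Requires: using System.IO; using System.Threading.Tasks;
// Cap the loop at 5 concurrent operations (an upper bound, not a guarantee).
ParallelOptions opt = new ParallelOptions();
opt.MaxDegreeOfParallelism = 5;

// MyMethod must accept a single string argument (the directory path).
Parallel.ForEach(Directory.GetDirectories(Constants.RootFolder), opt, MyMethod);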
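To see what the limit does in practice, the following sketch runs a loop with MaxDegreeOfParallelism set and records the highest number of body invocations that were in flight at the same time. ParallelLimitSketch and RunAndMeasurePeak are hypothetical names introduced for this illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelLimitSketch
{
    // Runs `work` over `items` with at most `limit` concurrent invocations
    // and returns the highest concurrency that was actually observed.
    public static int RunAndMeasurePeak<T>(IEnumerable<T> items, int limit, Action<T> work)
    {
        int current = 0;
        int peak = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = limit };

        Parallel.ForEach(items, options, item =>
        {
            int now = Interlocked.Increment(ref current);

            // Record a new peak using a compare-and-swap retry loop.
            int seen;
            while (now > (seen = Volatile.Read(ref peak)) &&
                   Interlocked.CompareExchange(ref peak, now, seen) != seen)
            {
            }

            try
            {
                work(item);
            }
            finally
            {
                Interlocked.Decrement(ref current);
            }
        });

        return peak;
    }
}
```

The returned peak should never exceed the limit, but it can be lower, since the scheduler is free to use fewer threads than the maximum allows.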

Solution 3

In general, Parallel.ForEach() is quite good at choosing the number of threads. It starts from the number of cores in the system and adapts as the work runs, depending on what the threads are doing (CPU bound, I/O bound, how long the method runs, etc.).

You can control the maximum degree of parallelism, but Parallel.ForEach() itself offers no option to force more threads to be used.

Make sure your benchmarks are correct and can be compared fairly (e.g. use the same websites, allow for a warm-up period before you start measuring, and do many runs, since response time variance can be quite high when scraping websites). If, after careful measurement, your own threading code is still faster, you can conclude that you have optimized for your particular case better than .NET does and stick with your own code.
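One way to structure such a comparison is sketched below: a few unmeasured warm-up runs, then several timed runs, summarized by the median. BenchmarkSketch, Measure and Median are hypothetical helpers written for this example, not part of .NET, and the action you pass in would be your own scraping routine.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class BenchmarkSketch
{
    // Executes `action` warmupRuns times without timing it, then times
    // `runs` further executions and returns each duration in milliseconds.
    public static double[] Measure(Action action, int warmupRuns, int runs)
    {
        for (int i = 0; i < warmupRuns; i++)
        {
            action();
        }

        var timings = new double[runs];
        for (int i = 0; i < runs; i++)
        {
            var stopwatch = Stopwatch.StartNew();
            action();
            stopwatch.Stop();
            timings[i] = stopwatch.Elapsed.TotalMilliseconds;
        }

        return timings;
    }

    // The median is less sensitive than the mean to an occasional slow response.
    // `values` must contain at least one element.
    public static double Median(double[] values)
    {
        var sorted = values.OrderBy(v => v).ToArray();
        int mid = sorted.Length / 2;
        return sorted.Length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }
}
```

Running both the hand-rolled threading code and the Parallel.ForEach version through the same Measure call, against the same list of links, keeps the comparison like for like.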


Author: Zoinky

Updated on June 04, 2022

Comments

  • Zoinky, almost 2 years

    I need to scrape data from a website. I have over 1,000 links I need to access. Previously I was dividing the links 10 per thread and starting 100 threads, each pulling 10. After a few test cases, 100 threads turned out to be the best count to minimize the time it took to retrieve the content for all the links.

    I realized that .NET 4.0 offers better support for multithreading out of the box, but this is based on how many cores you have, which in my case does not spawn enough threads. I guess what I am asking is: what is the best way to optimize pulling the 1,000 links? Should I be using .ForEach and let the Parallel extension control the number of threads that get spawned, or find a way to tell it how many threads to start and divide the work?

    I have not worked with Parallel before, so my approach may be wrong.

  • GalacticCowboy, about 8 years
    Note that this only controls the maximum number of threads; the system is still able to use fewer threads if it decides to. MaxDegreeOfParallelism is not a guarantee, only an upper bound. And if you do not set a value here, the default is based on the number of cores, system load, etc.