OpenMP recursive tasks


With OMP_NESTED=FALSE, a team of threads is assigned to the top-level parallel region, and no extra threads are created for the nested regions, so at most two threads will be doing useful work.

With OMP_NESTED=TRUE, a team of threads is assigned at each level. On your system there are 8 logical CPUs, so the team size is likely 8. The team includes one thread from outside the region, so only 7 new threads are launched. The recursion tree for fib(n) has about fib(n) nodes. (A nice self-referential property of fib!) Thus the code might create 7*fib(n) threads, which can quickly exhaust resources.

The fix is to use a single parallel region around the entire task tree. Move the omp parallel and omp single logic to main, outside of fib. That way a single thread team will work on the entire task tree.

The general point is to distinguish potential parallelism from actual parallelism. The task directives specify potential parallelism, which might or might not actually be used during an execution. An omp parallel (for all practical purposes) specifies actual parallelism. Usually you want the actual parallelism to match the available hardware, so as not to swamp the machine, but have the potential parallelism be much larger, so that the run-time can balance load.

Author by user3457151

Updated on August 05, 2022

Comments

  • user3457151
    user3457151 almost 2 years

    Consider the following program calculating Fibonacci numbers.
    It uses OpenMP tasks for parallelisation.

    #include <iostream> 
    #include <omp.h>
    
    using namespace std;
    
    int fib(int n)
    {
        if(n == 0 || n == 1)
            return n;
    
        int res, a, b;
        #pragma omp parallel
        {
            #pragma omp single 
            {
                #pragma omp task shared(a)
                a = fib(n-1);
                #pragma omp task shared(b)
                b = fib(n-2);
                #pragma omp taskwait
                res = a+b;
            } 
    
        }
        return res;
    }
    
    int main()
    {  
        cout << fib(40);    
    }
    

    I use gcc version 4.8.2 and Fedora 20.
    When compiling the above program with g++ -fopenmp name_of_program.cpp -Wall and running it, I see in htop that only two (sometimes three) threads are running. The machine I'm running this program on has 8 logical CPUs. My question is: what do I need to do to spread the work across 8 threads? I tried export OMP_NESTED=TRUE, but this leads to the following error while running the program:
    libgomp: Thread creation failed: Resource temporarily unavailable
    The point of my program is not to compute Fibonacci numbers efficiently, but to use tasks or something similar in OpenMP.

  • user3457151
    user3457151 about 10 years
    Thank you for your answer. The fix you described increases CPU usage a lot, but not all CPUs are utilized. I have a very similar Fibonacci implementation using Threading Building Blocks; it seems to be much faster when calculating fib(40). It also has fib(n) nodes in the recursion tree.
  • Z boson
    Z boson about 10 years
    @user3457151, Fibonacci numbers cannot be calculated efficiently in parallel (except in closed form), so it does not matter how fast TBB is: it cannot be faster than properly written serial code. On top of that, for performance, recursion is a silly way to calculate Fibonacci numbers.
  • user3457151
    user3457151 about 10 years
    I am aware of efficient ways to compute Fibonacci numbers. Calculating Fibonacci numbers is not the question here. What I want to do is to create many tasks recursively in OpenMP and have them utilize all CPUs.
  • Arch D. Robison
    Arch D. Robison about 10 years
    I tried the example with the parallel and single constructs moved to main, compiled it with Intel C++, and CPU usage shot up to about 3185% on a system with 32 hardware threads. Compiling with g++ got me the same result. How are you determining how many CPUs are utilized?
  • Arch D. Robison
    Arch D. Robison about 10 years
    By the way, contrary to popular opinion, parallelism can be used to speed up the fast algorithms for computing Fibonacci numbers. A fast algorithm can compute very big numbers, so "big number" arithmetic will come into play, which in turn requires big FFTs, which lend themselves nicely to parallelism. It would be fun to have a contest for computing a big Fibonacci number the fastest.
  • user3457151
    user3457151 about 10 years
    @Arch D. Robison, I run top in the terminal and see that my program has CPU usage <= 598%.
  • Arch D. Robison
    Arch D. Robison about 10 years
    I can replicate this oddity on a 4-core machine with hyperthreading disabled, using gcc 4.7.2. I get 299% CPU utilization according to "top". But if I compile with icc 14.0.2, I get 399% utilization. I'm guessing that it's a problem in the GNU OpenMP run-time.