Number of threads within an OMP section


First of all, the OpenMP standard gives no guarantee that the two sections will be executed by different threads (Section 2.7.2 "sections Construct"):

The method of scheduling the structured blocks among the threads in the team is implementation defined.

The only reliable way to have the two work routines execute concurrently is by using explicit flow control based on the thread ID:

#pragma omp parallel num_threads(2)
{
   if (omp_get_thread_num() == 0)
   {
      // Thread 0 runs Work1(); the call below only affects parallel
      // regions nested inside Work1() (if there are any)
      omp_set_num_threads(1);
      Work1();
   }
   else
   {
      // Thread 1 runs Work2(); the parallel region nested inside it
      // will request three threads
      omp_set_num_threads(3);
      Work2();
   }
}

Further, whether the nested parallel region in Work2() will use more than one thread depends on a combination of factors, among them the values of several internal control variables (ICVs); a small sketch showing how to set and query them follows the list:

  • nest-var controls whether nested parallelism is enabled; it is initialised from the value of OMP_NESTED and can be set at run time by calling omp_set_nested();
  • thread-limit-var (since OpenMP 3.0) caps the total number of OpenMP threads across all active parallel regions; it is initialised from the value of OMP_THREAD_LIMIT and, since OpenMP 4.0, can also be set by the thread_limit clause of the teams construct;
  • max-active-levels-var (since OpenMP 3.0) limits the nesting depth of active parallel regions; it is initialised from the value of OMP_MAX_ACTIVE_LEVELS and can be set at run time by calling omp_set_max_active_levels().
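
As an aside, the same ICVs can be enabled and inspected from within the program through the standard API routines. The following is only an illustrative sketch of mine (not part of the test program further below) and assumes an OpenMP 3.0 or later runtime:

#include <stdio.h>
#include <omp.h>

int main(void)
{
   omp_set_nested(1);              // sets nest-var, same effect as OMP_NESTED=TRUE
   omp_set_max_active_levels(2);   // sets max-active-levels-var

   // thread-limit-var has no setter routine here; it is controlled via
   // OMP_THREAD_LIMIT (or the thread_limit clause in later OpenMP versions)
   // and can only be queried
   printf("nested: %d, max active levels: %d, thread limit: %d\n",
      omp_get_nested(), omp_get_max_active_levels(),
      omp_get_thread_limit());

   return 0;
}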

If nest-var is false, then the values of the other ICVs do not matter - nested parallelism is disabled. False is the default value mandated by the standard, therefore nested parallelism must be enabled explicitly.

If nested parallelism is enabled, it only works at levels up to max-active-levels-var, with the outermost parallel region being at level 1, the first nested parallel region at level 2, and so on. The default value of that ICV is the number of levels of nested parallelism supported by the implementation. Parallel regions nested at deeper levels become inactive, i.e. they execute serially with their master threads only.
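
To make the level numbering concrete, here is a small standalone sketch of mine (not from the original answer); assuming the runtime grants the requested threads, the two outer regions are active, while the third one exceeds the limit of two active levels and therefore runs serially:

#include <stdio.h>
#include <omp.h>

int main(void)
{
   omp_set_nested(1);
   omp_set_max_active_levels(2);               // levels 1 and 2 may be active

   #pragma omp parallel num_threads(2)         // level 1: active
   {
      #pragma omp parallel num_threads(2)      // level 2: active
      {
         #pragma omp parallel num_threads(2)   // level 3: beyond the limit -> inactive
         {
            printf("level %d, active levels %d, team size %d\n",
               omp_get_level(), omp_get_active_level(),
               omp_get_num_threads());
         }
      }
   }
   return 0;
}

Each of the four threads that reach level 3 forms a team of one, so every printed line reads "level 3, active levels 2, team size 1".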

If nested parallelism is enabled and a particular parallel region is nested no deeper than max-active-levels-var, then whether it executes in parallel is determined by the value of thread-limit-var. In your case, any value less than 4 will prevent Work2() from executing with three threads.

The following test program could be used to examine the interplay between those ICVs:

#include <stdio.h>
#include <omp.h>

void Work1(void)
{
   printf("Work1 started by tid %d/%d\n",
      omp_get_thread_num(), omp_get_num_threads());
}

void Work2(void)
{
   printf("Work2 started by tid %d/%d\n",
      omp_get_thread_num(), omp_get_num_threads());

   #pragma omp parallel for schedule(static)
   for (int i = 0; i < 3; i++)
   {
      printf("Work2 nested loop: %d by tid %d/%d\n", i,
         omp_get_thread_num(), omp_get_num_threads());
   }
}

int main(void)
{
   #pragma omp parallel num_threads(2)
   {
      if (omp_get_thread_num() == 0)
      {
         omp_set_num_threads(1);
         Work1();
      }
      else
      {
         omp_set_num_threads(3);
         Work2();
      }
   }
   return 0;
}

Sample outputs:

$ ./nested
Work1: started by tid 0/2
Work2: started by tid 1/2
Work2 nested loop: 0 by tid 0/1
Work2 nested loop: 1 by tid 0/1
Work2 nested loop: 2 by tid 0/1

The outermost parallel region is active. The nested one in Work2() is inactive because nested parallelism is disabled by default.

$ OMP_NESTED=TRUE ./nested
Work1: started by tid 0/2
Work2: started by tid 1/2
Work2 nested loop: 0 by tid 0/3
Work2 nested loop: 1 by tid 1/3
Work2 nested loop: 2 by tid 2/3

All parallel regions are active and execute in parallel.

$ OMP_NESTED=TRUE OMP_MAX_ACTIVE_LEVELS=1 ./nested
Work1: started by tid 0/2
Work2: started by tid 1/2
Work2 nested loop: 0 by tid 0/1
Work2 nested loop: 1 by tid 0/1
Work2 nested loop: 2 by tid 0/1

Despite nested parallelism being enabled, only one level of parallelism could be active, therefore the nested region executes serially. With pre-OpenMP 3.0 compilers, e.g. GCC 4.4, setting OMP_MAX_ACTIVE_LEVELS has no effect.

$ OMP_NESTED=TRUE OMP_THREAD_LIMIT=3 ./nested
Work1: started by tid 0/2
Work2: started by tid 1/2
Work2 nested loop: 0 by tid 0/2
Work2 nested loop: 2 by tid 1/2
Work2 nested loop: 1 by tid 0/2

The nested region is active, but executes with two threads only because of the global thread limit imposed by setting OMP_THREAD_LIMIT.

If you have enabled nested parallelism, there is no limit on the number of active levels, and the thread limit is sufficiently high, there should be no reason for your program not to use four CPU cores at the same time...

... unless process and/or thread binding is in effect. Binding controls the affinity of the OpenMP threads to the available CPUs. With most OpenMP runtimes thread binding is disabled by default and the OS scheduler is free to move the threads between the available cores as it sees fit. Nevertheless, the runtimes usually respect the affinity mask that applies to the process as a whole. If you use something like taskset to pin/bind the process to, e.g., two logical CPUs, then no matter how many threads are spawned, they will all run on those two logical CPUs and timeshare. With GCC, thread binding is controlled by setting GOMP_CPU_AFFINITY and/or OMP_PROC_BIND, and with recent versions that support OpenMP 4.0 - by setting OMP_PLACES.
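
As an illustration only (the CPU numbers are arbitrary and not taken from the question), restricting the whole process to two logical CPUs with taskset, or binding the OpenMP threads through GCC's GOMP_CPU_AFFINITY, would look like this; in both cases all OpenMP threads timeshare CPUs 0 and 1 no matter how many of them are spawned:

$ OMP_NESTED=TRUE taskset -c 0,1 ./nested
$ OMP_NESTED=TRUE GOMP_CPU_AFFINITY="0 1" ./nested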

If you are not binding the executable (verify by checking the value of Cpus_allowed in /proc/$PID/status, where $PID is the PID of the running OpenMP process, as shown below), neither GOMP_CPU_AFFINITY/OMP_PROC_BIND nor OMP_PLACES is set, nested parallelism is enabled, and no limits on the number of active levels or on the thread count are imposed, yet programs like top or htop still show that only two logical CPUs are being used, then the problem is in your program's logic and not in the OpenMP environment.
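
For example, one way to perform that check while the program is running (an illustrative command of mine; pgrep -n picks the most recently started process with the given name):

$ grep Cpus_allowed /proc/$(pgrep -n nested)/status

On an unrestricted four-core machine this should report something like Cpus_allowed_list: 0-3, i.e. all logical CPUs are available to the process.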

Comments

  • Rodrigo Morante (almost 2 years ago):

    My computer has four cores. I'm running Ubuntu 15.10, and compiling using g++ -fopenmp ...

    I have two different types of jobs, which are mutually independent: Work1 and Work2. In particular, Work1 should run on a single processor, but Work2 should be parallelized. I tried using omp_set_num_threads():

    #pragma omp parallel sections
    {
        #pragma omp section
        {
            // Should run on one processor.
            omp_set_num_threads(1);
            Work1();
        }
    
        #pragma omp section
        {
            // Should run on as many processors as possible.
            omp_set_num_threads(3);
            Work2();
        }
    }
    

    Say Work2 is something like this:

    void Work2(...){
        #pragma omp parallel for
        for (...) ...
    
        return;
    }
    

    When the program is run, only two processors are used. Obviously omp_set_num_threads() is not working as I expected. Is there anything that can be done using OpenMP to remedy this situation?

    Thanks to all,

    Rodrigo

  • Rodrigo Morante (over 8 years ago):
    Hristo, your answer is more than I could have expected. I really appreciate all the time you devoted to explaining the details. I tried the modification you suggested again, to no avail: only two cores are used. Analysing /proc/16882/status, I see Cpus_allowed: 0f and Cpus_allowed_list: 0-3. I will check the rest. Thank you again.
  • hexpheus (about 6 years ago):
    I think that the very first code provided by Hristo Iliev is incorrect. According to the OpenMP 2.0 specification, the number of threads cannot be redefined inside a parallel section. Such an attempt will cause run-time errors and program termination of a C++ program. Am I missing something?
  • Hristo Iliev (about 6 years ago):
    @aligholamee, omp_set_num_threads() affects the number of threads for nested parallel regions. In that particular case - the parallel region contained in the Work2 function. The omp_set_num_threads(1) call in the first branch of the conditional is not really necessary if Work1 only contains and calls sequential code.