Parallel for loop in OpenMP


Solution 1

You should make use of the OpenMP reduction clause for x and y:

#pragma omp parallel for reduction(+:x,y)
for (int j=0; j<n; j++) {

    double r=0.0;
    for (int i=0; i < m; i++){

        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));     

        x += rand_g1;
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}

With reduction, each thread accumulates its own partial sums in x and y, and at the end all partial values are combined to obtain the final values.

Serial version:
25.05s user 0.01s system 99% cpu 25.059 total
OpenMP version w/ OMP_NUM_THREADS=16:
24.76s user 0.02s system 1590% cpu 1.559 total

See - superlinear speed-up :)
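
For reference, wall-clock time can also be measured from inside the program with omp_get_wtime(), independently of the shell's time command. A minimal self-contained sketch (loop and constants taken from the question; the output format is my own):

#include <cmath>
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int n = 400000, m = 1000;
    double x = 0, y = 0;
    std::vector<double> shifts(n, 0);

    double t0 = omp_get_wtime();            // wall-clock time before the loop
    #pragma omp parallel for reduction(+:x,y)
    for (int j = 0; j < n; j++) {
        double r = 0.0;
        for (int i = 0; i < m; i++) {
            double rand_g1 = std::cos(i / double(m));
            double rand_g2 = std::sin(i / double(m));
            x += rand_g1;
            y += rand_g2;
            r += std::sqrt(rand_g1 * rand_g1 + rand_g2 * rand_g2);
        }
        shifts[j] = r / m;
    }
    double t1 = omp_get_wtime();            // wall-clock time after the loop

    std::printf("loop took %.3f s on up to %d threads\n", t1 - t0, omp_get_max_threads());
}

Compile with -fopenmp as in the question; the number of threads can be varied with OMP_NUM_THREADS.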

Solution 2

Let's try to understand how to parallelize a simple for loop using OpenMP.

#pragma omp parallel
#pragma omp for
    for(i = 1; i < 13; i++)
    {
       c[i] = a[i] + b[i];
    }

Assume that we have 3 available threads; this is what will happen:

[figure: the 12 iterations of the loop are divided among the 3 threads, each thread executing its own subset]

firstly,

  • Threads are assigned an independent set of iterations,

and finally,

  • Threads must wait at the end of the work-sharing construct (the sketch below illustrates which thread gets which iterations).
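
To make the assignment visible, here is a small self-contained sketch (my own illustration, not part of the original answer) that prints which thread executes which iteration; with the usual static schedule each of the 3 threads should receive a contiguous block of four iterations, although the print order will vary between runs:

#include <cstdio>
#include <omp.h>

int main() {
    int a[13], b[13], c[13];
    for (int i = 0; i < 13; i++) { a[i] = i; b[i] = 2 * i; }

    omp_set_num_threads(3);                 // request 3 threads, as in the example
    #pragma omp parallel for
    for (int i = 1; i < 13; i++) {
        c[i] = a[i] + b[i];
        std::printf("iteration %2d handled by thread %d\n", i, omp_get_thread_num());
    }
}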

Solution 3

Because this question is highly viewed, I decided to add a bit of OpenMP background to help those visiting it.


The #pragma omp parallel creates a parallel region with a team of threads, where each thread executes the entire block of code that the parallel region encloses. From the OpenMP 5.1 standard one can read a more formal description:

When a thread encounters a parallel construct, a team of threads is created to execute the parallel region (..). The thread that encountered the parallel construct becomes the primary thread of the new team, with a thread number of zero for the duration of the new parallel region. All threads in the new team, including the primary thread, execute the region. Once the team is created, the number of threads in the team remains constant for the duration of that parallel region.

The #pragma omp parallel for creates a parallel region (as described before), and the iterations of the loop that it encloses are assigned to the threads of that region, using the default chunk size and the default schedule, which is typically static. Bear in mind, however, that the default schedule might differ among concrete implementations of the OpenMP standard.
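
If you do not want to rely on the implementation default, the schedule can be requested explicitly with the schedule clause. A minimal self-contained sketch (the chunk size of 4 and the loop body are arbitrary choices for illustration):

#include <vector>

int main() {
    const int n = 16;
    std::vector<double> shifts(n, 0);

    // Explicitly request a static schedule with chunks of 4 iterations per thread;
    // without the clause, the schedule is whatever the implementation defaults to.
    #pragma omp parallel for schedule(static, 4)
    for (int j = 0; j < n; j++) {
        shifts[j] = j / double(n);          // arbitrary per-iteration work
    }
}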

From the OpenMP 5.1 standard you can read a more formal description:

The worksharing-loop construct specifies that the iterations of one or more associated loops will be executed in parallel by threads in the team in the context of their implicit tasks. The iterations are distributed across threads that already exist in the team that is executing the parallel region to which the worksharing-loop region binds.

Moreover,

The parallel loop construct is a shortcut for specifying a parallel construct containing a loop construct with one or more associated loops and no other statements.

Or informally, #pragma omp parallel for is a combination of the construct #pragma omp parallel with #pragma omp for. In your case, this would mean that:

#pragma omp parallel for 
for (int j=0; j<n; j++) {

    double r=0.0;
    for (int i=0; i < m; i++){

        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));     

        x += rand_g1;
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}

A team of threads will be created, and to each of those threads will be assigned chunks of the iterations of the outermost loop.
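
Written with the two constructs separated, which is exactly the "shortcut" the quote above refers to, the same loop is:

#pragma omp parallel            // creates the team of threads
{
    #pragma omp for             // shares the iterations of the loop among that team
    for (int j = 0; j < n; j++) {
        // ... same loop body as above ...
    }
}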

To make it more illustrative, with 4 threads the #pragma omp parallel for with a chunk_size=1 and a static schedule would result in something like:

[figure: thread 0 executes iterations 0, 4, 8, ...; thread 1 executes 1, 5, 9, ...; thread 2 executes 2, 6, 10, ...; thread 3 executes 3, 7, 11, ...]

Code-wise the loop would be transformed to something logically similar to:

for(int i=omp_get_thread_num(); i < n; i+=omp_get_num_threads())
{  
    c[i]=a[i]+b[i];
}

where omp_get_thread_num()

The omp_get_thread_num routine returns the thread number, within the current team, of the calling thread.

and omp_get_num_threads()

Returns the number of threads in the current team. In a sequential section of the program omp_get_num_threads returns 1.

or in other words, for(int i = THREAD_ID; i < n; i += TOTAL_THREADS). With THREAD_ID ranging from 0 to TOTAL_THREADS - 1, and TOTAL_THREADS representing the total number of threads of the team created on the parallel region.

Armed with this knowledge, and looking at your code, one can see that you have a race condition on the updates of the variables 'x' and 'y'. Those variables are shared among threads and updated inside the parallel region, namely:

     x += rand_g1;
     y += rand_g2;

To solve this race condition you can use OpenMP's reduction clause:

Specifies that one or more variables that are private to each thread are the subject of a reduction operation at the end of the parallel region.

Informally, the reduction clause will create for each thread a private copy of the variables 'x' and 'y', and at the end of the parallel region perform the summation among all those 'x' and 'y' variables into the original 'x' and 'y' variables of the initial thread.
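
Conceptually, a hand-rolled sketch of what the reduction clause does could look like the following (this is only an illustration of the private-copy-then-combine idea, not what the compiler literally generates):

#pragma omp parallel
{
    double x_local = 0.0, y_local = 0.0;    // per-thread private copies

    #pragma omp for
    for (int j = 0; j < n; j++) {
        double r = 0.0;
        for (int i = 0; i < m; i++) {
            double rand_g1 = cos(i/double(m));
            double rand_g2 = sin(i/double(m));
            x_local += rand_g1;             // accumulate into the private copies
            y_local += rand_g2;
            r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
        }
        shifts[j] = r / m;
    }

    #pragma omp atomic                      // combine the partial sums safely
    x += x_local;
    #pragma omp atomic
    y += y_local;
}

With the reduction clause the compiler and runtime do this for you (typically with a more efficient combination step), and the corrected loop simply becomes: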

#pragma omp parallel for reduction(+:x,y)
for (int j=0; j<n; j++) {

    double r=0.0;
    for (int i=0; i < m; i++){

        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));     

        x += rand_g1;
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}
Author: dsign

Updated on July 15, 2022

Comments

  • dsign (almost 2 years ago)

    I'm trying to parallelize a very simple for-loop, but this is my first attempt at using openMP in a long time. I'm getting baffled by the run times. Here is my code:

    #include <cmath>
    #include <iostream>
    #include <vector>
    #include <algorithm>
    
    using namespace std;
    
    int main () 
    {
        int n=400000,  m=1000;  
        double x=0,y=0;
        double s=0;
        vector< double > shifts(n,0);
    
    
        #pragma omp parallel for 
        for (int j=0; j<n; j++) {
    
            double r=0.0;
            for (int i=0; i < m; i++){
    
                double rand_g1 = cos(i/double(m));
                double rand_g2 = sin(i/double(m));     
    
                x += rand_g1;
                y += rand_g2;
                r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
            }
            shifts[j] = r / m;
        }
    
        cout << *std::max_element( shifts.begin(), shifts.end() ) << endl;
    }
    

    I compile it with

    g++ -O3 testMP.cc -o testMP  -I /opt/boost_1_48_0/include
    

    that is, no "-fopenmp", and I get these timings:

    real    0m18.417s
    user    0m18.357s
    sys     0m0.004s
    

    when I do use "-fopenmp",

    g++ -O3 -fopenmp testMP.cc -o testMP  -I /opt/boost_1_48_0/include
    

    I get these numbers for the times:

    real    0m6.853s
    user    0m52.007s
    sys     0m0.008s
    

    which doesn't make sense to me. How can using eight cores result in only a 3-fold increase in performance? Am I coding the loop correctly?