OpenMP: nowait and reduction clauses on the same pragma


Solution 1

The nowait clauses on the second and last loops are somewhat redundant. The one on the last loop merely removes a barrier that is immediately followed by the implicit barrier at the end of the parallel region, so it is harmless and can stay in.

But the nowait before the second loop and the explicit barrier after it cancel each other out.

Lastly, about the shared and private clauses. In your code, shared has no effect, and private simply shouldn’t be used at all: If you need a thread-private variable, just declare it inside the parallel region. In particular, you should declare loop variables inside the loop, not before.

To make shared useful, you need to tell OpenMP that it shouldn’t share anything by default. You should do this to avoid bugs due to accidentally shared variables. This is done by specifying default(none). This leaves us with:

#pragma omp parallel default(none) shared(n, a, b, c, d, sum)
{
    #pragma omp for nowait
    for (int i = 0; i < n; ++i)
        a[i] += b[i];

    #pragma omp for
    for (int i = 0; i < n; ++i)
        c[i] += d[i];

    #pragma omp for nowait reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += a[i] + c[i];
} // End of parallel region

Solution 2

In some regards this seems like a homework problem, which I hate to do for people. On the other hand, the answers above are not totally accurate and I feel should be corrected.

First, while in this example the shared and private clauses are not needed, I disagree with Konrad that they shouldn't be used. One of the most common problems people have when parallelizing code is that they don't take the time to understand how their variables are being used. Failing to privatize variables that should be private, or to protect shared variables that need it, accounts for the largest number of problems that I see. Going through the exercise of examining how each variable is used and putting it into the appropriate shared, private, etc. clause will greatly reduce the number of problems you have.

As for the question about the barriers, the first loop can have a nowait clause, because there is no use of the value computed (a) in the second loop. The second loop can have a nowait clause only if the value computed (c) is not used before the values are calculated (i.e., there is no dependency). In the original example code there is a nowait on the second loop, but an explicit barrier before the third loop. This is fine, since your professor was trying to show the use of an explicit barrier - though leaving off the nowait on the second loop would make the explicit barrier redundant (since there is an implicit barrier at the end of a loop).

On the other hand, the nowait on the second loop and the explicit barrier may not be needed at all. Prior to the OpenMP V3.0 specification, many people assumed that something was true that was not clarified in the specification. With the OpenMP V3.0 specification the following was added to section 2.5.1 Loop Construct, Table 2-1 schedule clause kind values, static (schedule):

A compliant implementation of static schedule must ensure that the same assignment of logical iteration numbers to threads will be used in two loop regions if the following conditions are satisfied: 1) both loop regions have the same number of loop iterations, 2) both loop regions have the same value of chunk_size specified, or both loop regions have no chunk_size specified, and 3) both loop regions bind to the same parallel region. A data dependence between the same logical iterations in two such loops is guaranteed to be satisfied allowing safe use of the nowait clause (see Section A.9 on page 170 for examples).

Now in your example, no schedule was shown on any of the loops, so this may or may not hold. The reason is that the default schedule is implementation defined, and while most implementations currently define the default schedule to be static, there is no guarantee of that. If your professor had put a schedule type of static without a chunk_size on all three loops, then nowait could be used on the first and second loops and no barrier (either implicit or explicit) would be needed between the second and third loops at all.

Now we get to the third loop and your question about nowait and reduction. As Michy pointed out, the OpenMP specification allows both (reduction and nowait) to be specified. However, it is not true that no synchronization is needed for the reduction to be complete. In the example, the implicit barrier (at the end of the third loop) can be removed with the nowait. This is because the reduction (sum) is not being used before the implicit barrier of the parallel region has been encountered.

If you look at the OpenMP V3.0 specification, section 2.9.3.6 reduction clause, you will find the following:

If nowait is not used, the reduction computation will be complete at the end of the construct; however, if the reduction clause is used on a construct to which nowait is also applied, accesses to the original list item will create a race and, thus, have unspecified effect unless synchronization ensures that they occur after all threads have executed all of their iterations or section constructs, and the reduction computation has completed and stored the computed value of that list item. This can most simply be ensured through a barrier synchronization.

This means that if you wanted to use the sum variable in the parallel region after the third loop, then you would need a barrier (either implicit or explicit) before you used it. As the example stands now, it is correct.

Solution 3

The OpenMP specification says:

The syntax of the loop construct is as follows:

#pragma omp for [clause[[,] clause] ... ] new-line
    for-loops

where clause is one of the following:

 ...
 reduction(operator: list)
 ...
 nowait

So multiple clauses are allowed, and in particular both reduction and nowait can appear on the same directive.

There is no need for explicit synchronization in the reduction clause: the accumulation into the sum variable is synchronized by reduction(+:sum), and the preceding barrier guarantees that a and c hold their final values by the time the reduction loop runs. The nowait means that a thread which finishes its share of the loop does not have to wait for the other threads to finish theirs.

Author: aperez

Updated on June 17, 2022

Comments

  • aperez
    aperez about 2 years

    I am studying OpenMP, and came across the following example:

    #pragma omp parallel shared(n,a,b,c,d,sum) private(i)
    {
        #pragma omp for nowait
        for (i=0; i<n; i++)
            a[i] += b[i];
    
        #pragma omp for nowait
        for (i=0; i<n; i++)
            c[i] += d[i];
        #pragma omp barrier
    
        #pragma omp for nowait reduction(+:sum)
        for (i=0; i<n; i++)
            sum += a[i] + c[i];
    } /*-- End of parallel region --*/
    

    In the last for loop, there is a nowait and a reduction clause. Is this correct? Doesn't the reduction clause need to be synchronized?

  • aperez
    aperez about 13 years
    Thank you for your clarification.
  • aperez
    aperez about 13 years
    Thank you for your very detailed answer ejd! I agree with you that this does, in fact, seem like a homework question, as this is not my code - I am just starting to learn OpenMP by studying examples. If you know a more appropriate place to ask these kinds of questions, or a tag to represent them (similar to the homework tag), please tell me.
  • Konrad Rudolph
    Konrad Rudolph about 13 years
    I have to disagree about the use of shared/private clauses, and I’ve expanded this in my answer: if you use private variables, make them private. Don’t rely on preprocessing clauses, declare them inside the local scope. There is no reason not to do this in C++, and in fact declaring variables close to usage should always be done in C++; due to this, use of private is a clear sign of code smell in C++.
  • ejd
    ejd about 13 years
    Konrad - Since the example doesn't say C or C++ and in C89 you can't do what you have shown, I believe it is valid to bring up. On top of that, as I stated, the most common problem in OpenMP that I have seen over the past 12 years is that users don't take the time to go through and look at how the variables are used in their code. Personally I think that the default should be to require the user to state how every variable is to be used. It would certainly reduce the number of problems that occur. Meanwhile, I will go back to working on tools to help people figure out what they did wrong.
  • Gilad
    Gilad over 10 years
    Hey @Konrad Rudolph, can you look at my code? I don't understand why you say we don't need to use shared and private; I'm working on VS2012
  • Richard
    Richard about 6 years
    The use of default(none) shared(...) is an important safeguard in ensuring parallelism will work correctly.
  • Konrad Rudolph
    Konrad Rudolph about 6 years
    @Richard The original code didn’t use default(none). That said, I agree that it’s better to be explicit about what’s being shared, and prevent accidental sharing. I added that to the answer.
  • tim18
    tim18 about 6 years
    The comments about avoiding private in C++ apply equally to C. You shouldn't be using obsolete (before C99 or std C++) syntax for OpenMP. The example could use an update, maybe to OpenMP 4.0 as well. I too wonder why default(private) is even allowed, although it won't work here anyway. I don't think there are good ways to eliminate firstprivate/lastprivate except that they aren't needed for the loop index (although some workaround may be needed for omp_cancel).