not sure what should be SHARED or PRIVATE in openmp loop


Solution 1

The golden rule of OpenMP is that all variables (with some exclusions) defined in an outer scope are shared by default in the parallel region. Since Fortran before 2008 has no local scopes (i.e. there is no BLOCK ... END BLOCK in earlier versions), all variables (except threadprivate ones) are shared, which is very natural to me (unlike Ian Bush, I am not a big fan of using default(none) and then redeclaring the visibility of all 100+ local variables in various complex scientific codes).
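For completeness, Fortran 2008 does give you local scopes via the BLOCK construct, and with reasonably recent compilers and OpenMP versions unsaved variables declared inside a BLOCK within a parallel region behave like locals of a called subprogram, i.e. each thread gets its own copy. A minimal sketch (not from the original code):

!$OMP PARALLEL
BLOCK
   REAL(8) :: tmp(3)   ! declared inside the BLOCK; effectively private to each thread
   tmp = 0.0D0
   ! ... per-thread work using tmp ...
END BLOCK
!$OMP END PARALLEL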

Here is how to determine the sharing class of each variable:

  • N - shared, because it should be the same in all threads and they only read its value.
  • ii - it is the counter of a loop that is subject to a worksharing directive, so its sharing class is predetermined to be private. It doesn't hurt to declare it explicitly in a PRIVATE clause, but that is not really necessary.
  • jj - the counter of a loop that is not subject to a worksharing directive, hence jj should be private.
  • X - shared, because all threads reference and only read from it.
  • distance_vector - obviously should be private as each thread works on different pairs of particles.
  • distance, distance2, and coff - ditto.
  • M - should be shared for the same reasons as X.
  • PE - acts as an accumulator variable (I guess this is the potential energy of the system) and should be the subject of a reduction operation, i.e. it should be put in a REDUCTION(+:...) clause.
  • A - this one is tricky. It could either be shared, with updates to A(jj,:) protected by synchronising constructs, or you could use a reduction (OpenMP allows reductions over array variables in Fortran, unlike in C/C++). A(ii,:) is never modified by more than one thread, so it does not need special treatment.

With reduction over A in place, each thread would get its own private copy of A, which could be a memory hog, although I doubt you would use this direct O(N²) simulation code to compute systems with a very large number of particles. There is also a certain overhead associated with the reduction implementation. In this case you simply need to add A to the list of the REDUCTION(+:...) clause.
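To be concrete, the reduction-based variant would look roughly like the sketch below (assuming A is an ordinary array; compiler support for arrays in REDUCTION clauses varies a bit between versions):

!$OMP PARALLEL DO PRIVATE(jj,distance_vector,distance2,distance,coff) &
!$OMP& REDUCTION(+:PE,A)
do ii=1,N-1
   do jj=ii+1,N
      distance_vector=X(ii,:)-X(jj,:)
      distance2=sum(distance_vector*distance_vector)
      distance=DSQRT(distance2)
      coff=distance*distance*distance
      PE=PE-M(ii)*M(jj)/distance
      ! no synchronisation needed: each thread updates its own private copy of A,
      ! and the copies are summed into the original A at the end of the loop
      A(jj,:)=A(jj,:)+(M(ii)/coff)*(distance_vector)
      A(ii,:)=A(ii,:)-(M(jj)/coff)*(distance_vector)
   end do
end do
!$OMP END PARALLEL DO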

With synchronising constructs you have two options. You could either use the ATOMIC construct or the CRITICAL construct. As ATOMIC is only applicable to scalar contexts, you would have to "unvectorise" the assignment loop and apply ATOMIC to each statement separately, e.g.:

!$OMP ATOMIC UPDATE
A(jj,1)=A(jj,1)+(M(ii)/coff)*(distance_vector(1))
!$OMP ATOMIC UPDATE
A(jj,2)=A(jj,2)+(M(ii)/coff)*(distance_vector(2))
!$OMP ATOMIC UPDATE
A(jj,3)=A(jj,3)+(M(ii)/coff)*(distance_vector(3))

You may also rewrite this as a loop - do not forget to declare the loop counter private.

With CRITICAL there is no need to unvectorise the loop:

!$OMP CRITICAL (forceloop)
A(jj,:)=A(jj,:)+(M(ii)/coff)*(distance_vector)
!$OMP END CRITICAL (forceloop)

Naming critical regions is optional and a bit unnecessary in this particular case, but in general it allows you to separate unrelated critical regions so that they do not contend for the same lock.

Which is faster - the unrolled loop with ATOMIC, or CRITICAL? It depends on many things. Usually CRITICAL is way slower, since it often involves function calls into the OpenMP runtime, while atomic increments, at least on x86, are implemented with locked addition instructions. As they often say, YMMV.
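If you want to measure it on your own machine, you can time the two variants with omp_get_wtime, e.g. (a sketch; force_atomic and force_critical are hypothetical subroutines containing the ATOMIC and CRITICAL versions of the loop):

use omp_lib
real(8) :: t0, t1

t0 = omp_get_wtime()
call force_atomic(X, M, A, PE, N)     ! hypothetical: loop protected with !$OMP ATOMIC
t1 = omp_get_wtime()
print *, 'ATOMIC:   ', t1-t0, ' s'

t0 = omp_get_wtime()
call force_critical(X, M, A, PE, N)   ! hypothetical: loop protected with !$OMP CRITICAL
t1 = omp_get_wtime()
print *, 'CRITICAL: ', t1-t0, ' s'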

To recapitulate, a working version of your loop should be something like:

!$OMP PARALLEL DO PRIVATE(jj,kk,distance_vector,distance2,distance,coff) &
!$OMP& REDUCTION(+:PE)
do ii=1,N-1
   do jj=ii+1,N
      distance_vector=X(ii,:)-X(jj,:)
      distance2=sum(distance_vector*distance_vector)
      distance=DSQRT(distance2)
      coff=distance*distance*distance
      PE=PE-M(II)*M(JJ)/distance
      do kk=1,3
         !$OMP ATOMIC UPDATE
         A(jj,kk)=A(jj,kk)+(M(ii)/coff)*(distance_vector(kk))
      end do
      A(ii,:)=A(ii,:)-(M(jj)/coff)*(distance_vector)
   end do
end do
!$OMP END PARALLEL DO

I've assumed that your system is 3-dimensional.


With all this said, I second Ian Bush that you need to rethink how the position and acceleration matrices are laid out in memory. Proper cache usage could boost your code and would also allow certain operations, e.g. X(:,ii)-X(:,jj), to be vectorised, i.e. implemented using vector SIMD instructions.
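To make that concrete, if X and A were declared with the particle index last, e.g. X(3,N) and A(3,N), the recapitulated loop above would become something like this sketch:

! assumes X(3,N), A(3,N), M(N) and distance_vector(3)
!$OMP PARALLEL DO PRIVATE(jj,kk,distance_vector,distance2,distance,coff) &
!$OMP& REDUCTION(+:PE)
do ii=1,N-1
   do jj=ii+1,N
      distance_vector=X(:,ii)-X(:,jj)
      distance2=sum(distance_vector*distance_vector)
      distance=DSQRT(distance2)
      coff=distance*distance*distance
      PE=PE-M(ii)*M(jj)/distance
      do kk=1,3
         !$OMP ATOMIC UPDATE
         A(kk,jj)=A(kk,jj)+(M(ii)/coff)*(distance_vector(kk))
      end do
      A(:,ii)=A(:,ii)-(M(jj)/coff)*(distance_vector)
   end do
end do
!$OMP END PARALLEL DO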

Solution 2

As written you will need some synchronisation to avoid a race condition. Consider the 2-thread case. Say thread 0 starts with ii=1, and so considers jj=2,3,4,..., and thread 1 starts with ii=2, and so considers jj=3,4,5,6,.... Thus, as written, it is possible that thread 0 is considering ii=1,jj=3 while thread 1 is looking at ii=2,jj=3 at the same time. This obviously could cause problems at the line

                        A(jj,:)=A(jj,:)+(M(ii)/coff)*(distance_vector) 

as both threads have the same value of jj. So yes, you do need to synchronise the updates to A to avoid a race, though I must admit a good way isn't immediately obvious to me. I'll think on it and edit this answer if something occurs to me.

However I've got 3 other comments:

1) Your memory access pattern is horrible, and correcting this will, I expect, give at least as much speed-up as any OpenMP, with a lot less hassle. In Fortran you want to go down the first index fastest - this makes sure that memory accesses are spatially local and so ensures good use of the memory hierarchy. Given that this is the most important thing for good performance on a modern machine, you should really try to get this right. So the above would be better if you can rearrange the arrays so that it can be written as

        do ii=1,N-1
                do jj=ii+1,N
                        distance_vector=X(:,ii)-X(:,jj)
                        distance2=sum(distance_vector*distance_vector)
                        distance=DSQRT(distance2)
                        coff=distance*distance*distance
                        PE=PE-M(II)*M(JJ)/distance
                        A(:,jj)=A(:,jj)+(M(ii)/coff)*(distance_vector)
                        A(:,ii)=A(:,ii)-(M(jj)/coff)*(distance_vector)
                end do
        end do

Note how this goes down the first index, rather than the second as you have.

2) If you do use OpenMP I strongly suggest you use default(none); it helps avoid nasty bugs. If you were one of my students you'd lose loads of marks for not doing this!
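For the loop in question, a directive with default(none) would look something like the sketch below - every variable now has to be listed explicitly, and with A shared you still need the synchronisation discussed above:

!$OMP PARALLEL DO DEFAULT(NONE) &
!$OMP& SHARED(N,X,M,A) &
!$OMP& PRIVATE(ii,jj,distance_vector,distance2,distance,coff) &
!$OMP& REDUCTION(+:PE)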

3) Dsqrt is archaic - in modern Fortran (i.e. anything after 1967), in all but a few obscure cases, sqrt is plenty good enough and is more flexible.
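For example, the generic intrinsic resolves to the correct kind automatically:

distance = dsqrt(distance2)   ! FORTRAN 66-style specific name, double precision only
distance = sqrt(distance2)    ! generic intrinsic: works for real, double precision, complex, ...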

Comments

  • Griff
    Griff almost 2 years

    I have a loop which updates a matrix A and I want to make it openmp but I'm not sure what variables should be shared and private. I would have thought just ii and jj would work but it doesn't. I think I need an !$OMP ATOMIC UPDATE somewhere too...

    The loop just calculates the distance between N and N-1 particles and updates a matrix A.

                !$OMP PARALLEL DO PRIVATE(ii,jj)
                do ii=1,N-1
                        do jj=ii+1,N
                                distance_vector=X(ii,:)-X(jj,:)
                                distance2=sum(distance_vector*distance_vector)
                                distance=DSQRT(distance2)
                                coff=distance*distance*distance
                                PE=PE-M(II)*M(JJ)/distance
                                A(jj,:)=A(jj,:)+(M(ii)/coff)*(distance_vector)
                                A(ii,:)=A(ii,:)-(M(jj)/coff)*(distance_vector)
                        end do
                end do
                !$OMP END PARALLEL DO
    
  • Griff
    Griff over 11 years
    I appreciate your help. If I do improve the memory layout, I am still left with the problem of what is shared and private. ii and jj, obviously, but simply adding !$OMP ATOMIC UPDATE before the A(:,jj) and A(:,ii) lines doesn't work.
  • Ian Bush
    Ian Bush over 11 years
    Yeah - fair comment. ii and jj should be private as each thread will be working on its own combination of iterations. The simplest way to solve the race is to use omp critical to protect the problem line. I must admit I'm not sure if you can use atomic with lines that involve array syntax - my guess is that you can't.
  • sigma
    sigma over 11 years
    As a side question: is exponentiation faster with a*a*a instead of a**3?
  • Bort
    Bort over 11 years
    Besides Ian Bush's answer, you need to keep in mind that every variable you assign to in a threaded environment must either be accessed atomically or be thread-private. So distance_vector, distance2, distance, coff, and PE must all be thread-private.
  • Griff
    Griff over 11 years
    Thank you Hristo, again. Just a really basic question - why does it have to be that complicated for simpler loops? I mean, you just want the very outer loop to be split nthreads ways and to populate some matrix (at least in the stuff I do). I can see why you need private etc. for complex cases, so it is a case of 'extra engineering', but why not just state !$OMP PARALLEL DO SHARED(MATRIX) and have it split N/nthreads and deliver? Everything in each thread will be done in sequence. I guess I'm thinking - why not make everything private by default, not shared, and then just set the shared ones.
  • Hristo Iliev
    Hristo Iliev over 11 years
    @Griff, in C/C++ these variables can be declared inside the innermost loop and then they are automatically private. That they are shared stems from the desire to have the semantics of OpenMP match that of the serial code. By the way, you can get rid of the atomics if you forget for a second about Newton's third law and run both loops over 1,N with assignment to A(ii,:) only (inefficient, though).
  • Griff
    Griff over 11 years
    Is there any way I can show you a new code for you to check without having to post it as a question? It is somewhat simpler, but longer, and I only want you to check it to ensure I understand it correctly in words. Thanks.
  • Hristo Iliev
    Hristo Iliev over 11 years
    You can upload it to Pastebin and then post a shared link here. But I cannot respond until tomorrow.
  • Hristo Iliev
    Hristo Iliev over 11 years
    By the way, there is the DEFAULT(PRIVATE) clause. It makes all variables private by default and then you have to describe the shared and reduced ones. I will edit my answer tomorrow when I get my hands on a computer.
  • Griff
    Griff over 11 years
    Thanks Hristo. This is the pastebin of something I'm working on. pastebin.com/V5CNcSbt
  • M. S. B.
    M. S. B. over 11 years
    a*a*a versus a**3? Write a**3, as it is clearer to anyone reading the code. It is very likely that the compiler will emit machine instructions for a*a*a if that is significantly faster.
  • Hristo Iliev
    Hristo Iliev over 11 years
    @Griff, Ntot, x0, y0, z0, hsml, hmin, hmax, Xc, Yc, Zc, Lx, Ly, Lz, pixelsizeX, pixelsizeY, pixelsizeZ, num_pixelsX, num_pixelsY, and num_pixelsZ should all be shared. binned_particles should get reduction(+:...).
  • Hristo Iliev
    Hristo Iliev over 11 years
    Also I've noticed that on lines 12 and 14 you've written h instead of hsmltemp. Maybe you should put IMPLICIT NONE at the beginning of the subroutine in order to catch such typos.
  • Griff
    Griff over 11 years
  • Hristo Iliev
    Hristo Iliev over 11 years
    @Griff, I've written something in the chat. I don't know if you'll ever get notified, that's why I'm sending you a ping here.
  • Griff
    Griff over 11 years
    pinging you back - not sure if you'll see it though.
  • Hristo Iliev
    Hristo Iliev over 11 years
    @Griff, this chat thing is a bit useless if we have to ping each other outside it. Whatever... I have added some comments there.