OpenMP nested loop
Solution 1
because the parallel region is only created once and not n-times like the second?
Kind of. The construct
#pragma omp parallel
{
}
also means distributing work items to threads at '{' and returning the threads to the thread pool at '}'. That involves a lot of thread-to-thread communication. Also, by default, waiting threads are put to sleep by the OS, and some time is needed to wake a thread up again.
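To illustrate that cost, here is a minimal sketch (the function names and the summing workload are mine, not from the question): both functions compute the same reduction, but the second one creates a parallel region once per chunk instead of once in total, so it pays the region entry/exit price many times. Compiled without OpenMP the pragmas are ignored and both still give the same result.

```c
/* One parallel region for the whole reduction. */
long sum_one_region(const long *a, int n)
{
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Same reduction, but a parallel region is entered once per chunk,
   paying thread wake-up and work-distribution cost "chunks" times. */
long sum_many_regions(const long *a, int n, int chunks)
{
    long total = 0;
    for (int c = 0; c < chunks; c++) {
        long part = 0;
        int lo = c * n / chunks, hi = (c + 1) * n / chunks;
        #pragma omp parallel for reduction(+:part)
        for (int i = lo; i < hi; i++)
            part += a[i];
        total += part;
    }
    return total;
}
```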
About your middle sample: you can try to limit the outer for loop's parallelism with...
#pragma omp parallel private(i,k)
{
for(i=0;i<n;i++) //won't be parallelized
{
#pragma omp for
for(j=i+1;j<n;j++) //will be parallelized
{
doing sth.
}
#pragma omp for
for(j=i+1;j<n;j++) //will be parallelized
for(k = i+1;k<n;k++)
{
doing sth.
}
// Is there really no sequential part here? If there is, use:
// (won't be parallelized)
#pragma omp single
{ //seq part of outer loop
printf("Progress... %i\n", i); fflush(stdout);
}
// here is the point: every thread runs the outer loop itself, but...
#pragma omp barrier
// ...all loop iterations are synchronized:
// thr0 thr1 thr2
// i 0 0 0
// ---- barrier ----
// i 1 1 1
// ---- barrier ----
// i 2 2 2
// and so on
}
}
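As a side note, each "#pragma omp for" above already ends in an implicit barrier (there is no nowait clause), so the explicit barrier mainly documents the synchronization. Here is a compilable sketch of the same pattern, with a placeholder workload (incrementing out[]) standing in for "doing sth."; the function name and workload are mine. It also runs correctly when compiled without OpenMP, since the pragmas are then ignored.

```c
/* Sketch of the pattern above: one parallel region around the whole
   nest; the outer i-loop runs in every thread (i is private), while
   the inner j-loops are work-shared via "omp for".  Each "omp for"
   and "omp single" ends with an implicit barrier, keeping the outer
   iterations in lockstep across threads. */
void triangular_work(double *out, int n)
{
    int i, j, k;
    #pragma omp parallel private(i, k)
    {
        for (i = 0; i < n; i++) {          /* not work-shared */
            #pragma omp for
            for (j = i + 1; j < n; j++)
                out[j] += 1.0;             /* placeholder work */

            #pragma omp for
            for (j = i + 1; j < n; j++)
                for (k = i + 1; k < n; k++)
                    out[j] += 1.0;         /* placeholder work */

            #pragma omp single
            { /* sequential part of the outer iteration, e.g. progress output */ }
        }
    }
}
```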
In general, placing parallelism at the highest (outermost) possible for loop of the nest is better than placing it on inner loops. If you need sequential execution of some code, use the advanced pragmas (like omp barrier, omp master, or omp single) or OpenMP locks (omp_lock_t) for that code. Any of these will be faster than starting omp parallel many times.
Solution 2
Your full test is measured incorrectly. Your first measurement counted the time of both parts of the code, not just the first section; and the second printf also measured the time of the first printf.
The first run is very slow because of thread startup time, memory initialization, and cache effects. Also, the OpenMP runtime's heuristics may auto-tune themselves after several parallel regions.
My version of your test:
$ cat test.c
#include <stdio.h>
#include <omp.h>

void test( int n, int j)
{
    int i ;
    double t_a = 0.0, t_b = 0.0, t_c = 0.0 ;
    t_a = omp_get_wtime() ;
    #pragma omp parallel
    {
        for(i=0;i<n;i++) { }
    }
    t_b = omp_get_wtime() ;
    for(i=0;i<n;i++) {
        #pragma omp parallel
        { }
    }
    t_c = omp_get_wtime() ;
    printf( "[%i] directive outside for-loop: %lf\n", j, 1000*(t_b-t_a)) ;
    printf( "[%i] directive inside for-loop: %lf \n", j, 1000*(t_c-t_b)) ;
}

int main(void)
{
    int i, n, j=3 ;
    double t_1 = 0.0, t_2 = 0.0, t_3 = 0.0;
    printf( "Input n: " ) ;
    scanf( "%d", &n ) ;
    while( j --> 0 ) {
        t_1 = omp_get_wtime();
        #pragma omp parallel
        {
            for(i=0;i<n;i++) { }
        }
        t_2 = omp_get_wtime();
        for(i=0;i<n;i++) {
            #pragma omp parallel
            { }
        }
        t_3 = omp_get_wtime();
        printf( "[%i] directive outside for-loop: %lf\n", j, 1000*(t_2-t_1)) ;
        printf( "[%i] directive inside for-loop: %lf \n", j, 1000*(t_3-t_2)) ;
        test(n,j) ;
    }
    return 0 ;
}
I did 3 runs for every n inside the program itself.
Results:
$ ./test
Input n: 1000
[2] directive outside for-loop: 5.044824
[2] directive inside for-loop: 48.605116
[2] directive outside for-loop: 0.115031
[2] directive inside for-loop: 1.469195
[1] directive outside for-loop: 0.082415
[1] directive inside for-loop: 1.455855
[1] directive outside for-loop: 0.081297
[1] directive inside for-loop: 1.462352
[0] directive outside for-loop: 0.080528
[0] directive inside for-loop: 1.455786
[0] directive outside for-loop: 0.080807
[0] directive inside for-loop: 1.467101
Only the first run of test() is affected. All subsequent results are the same for test() and main().
Better and more stable results come from a run like this (I used gcc 4.6.1 and a static build):
$ OMP_WAIT_POLICY=active GOMP_CPU_AFFINITY=0-15 OMP_NUM_THREADS=2 ./test
Input n: 5000
[2] directive outside for-loop: 0.079412
[2] directive inside for-loop: 4.266087
[2] directive outside for-loop: 0.031708
[2] directive inside for-loop: 4.319727
[1] directive outside for-loop: 0.047563
[1] directive inside for-loop: 4.290812
[1] directive outside for-loop: 0.033733
[1] directive inside for-loop: 4.324406
[0] directive outside for-loop: 0.047004
[0] directive inside for-loop: 4.273143
[0] directive outside for-loop: 0.092331
[0] directive inside for-loop: 4.279219
I set two OpenMP performance environment variables and limited the thread count to 2.
Also, your "parallelized" loop is wrong (and I reproduced this error in my ^^^ variant). The i variable is shared here:
#pragma omp parallel
{
for(i=0;i<n;i++) { }
}
You should have it as
#pragma omp parallel
{
for(int local_i=0;local_i<n;local_i++) { }
}
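A minimal sketch of the difference (the function name is mine): with omp parallel for, the loop variable declared in the for statement is implicitly private, so no local_i renaming is needed. It also computes the right answer when compiled without OpenMP, since the pragma is then ignored.

```c
/* With "omp parallel for" and the loop variable declared inside the
   for statement, each thread gets its own private i, and the
   reduction clause safely combines the per-thread totals. */
int count_iterations(int n)
{
    int total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)   /* i is private to each thread */
        total += 1;
    return total;
}
```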
UPDATE7: Your result is for n=1000:
[2] directive inside for-loop: 0.001188
[1] directive outside for-loop: 0.021092
[1] directive inside for-loop: 0.001327
[1] directive outside for-loop: 0.005238
[1] directive inside for-loop: 0.001048
[0] directive outside for-loop: 0.020812
[0] directive inside for-loop: 0.001188
[0] directive outside for-loop: 0.005029
[0] directive inside for-loop: 0.001257
The 0.001 or 0.02 output of your code is the time in seconds multiplied by 1000, so it is in milliseconds (ms): the values are around 1 microsecond or 20 microseconds. The granularity of some system clocks (the user time or system time output fields of the time utility) is 1 ms, 3 ms, or 10 ms. One microsecond is 2000-3000 CPU ticks (for a 2-3 GHz CPU), so you can't measure such short time intervals without special setup. You should:
- disable CPU energy saving (Intel SpeedStep and AMD's equivalent), which can put the CPU into a lower-power state by lowering its clock frequency;
- disable dynamic overclocking of the CPU (Intel Turbo Boost);
- measure time without help from the OS, e.g. by reading the TSC (rdtsc asm instruction);
- disable instruction reordering on out-of-order CPUs before and after rdtsc by adding a cpuid instruction (or another serializing instruction); only the Atom is an in-order CPU in the current generation;
- do the run on a completely idle system (0% CPU load on both CPUs before you start the test);
- rewrite the test in a non-interactive way (don't wait for user input with scanf; pass n via argv[1]);
- don't use an X server and a slow terminal to output the results;
- lower the number of interrupts (turn off the network, physically; don't play a film in the background; don't touch the mouse and keyboard);
- do a lot of runs (I mean not restarting the program, but restarting the measured part of the program; j=100 in my program) and add statistical processing of the results;
- don't run printf so often (between measurements); it pollutes the caches and TLB. Store results internally and output them after all measurements are done.
UPDATE8: By statistical processing I mean: take several values, 7 or more. Discard the first value (or even the first 2-3 values if you have a high number of measurements). Sort them. Discard roughly 10-20% of the maximum and minimum results. Calculate the mean. Literally:
double results[100], sum = 0.0, mean = 0.0;
int it, count = 0;
// sort results[0]..results[99] here (e.g. with qsort)
for(it = 20; it < 85; it++) {  // drop the 20 lowest and 15 highest values
    count++; sum += results[it];
}
mean = sum/count;
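Packed into a self-contained helper (the names and the trim fraction parameter are mine, an illustration of the recipe rather than the author's exact code), this looks like:

```c
#include <stdlib.h>

/* qsort comparator for doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Trimmed mean of n samples: sort, drop the lowest and highest
   "trim" fraction of values, and average the rest. */
double trimmed_mean(double *v, int n, double trim)
{
    qsort(v, n, sizeof(double), cmp_double);
    int lo = (int)(n * trim), hi = n - lo;
    double sum = 0.0;
    for (int i = lo; i < hi; i++)
        sum += v[i];
    return sum / (hi - lo);
}
```

This discards outliers caused by interrupts, cache warm-up, and scheduler noise before averaging.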
Biler
Updated on June 04, 2022

Comments
-
Biler almost 2 years
just playing around with openmp. Look at this code fragments:
#pragma omp parallel
{
    for( i =0;i<n;i++)
    {
        doing something
    }
}
and
for( i =0;i<n;i++)
{
    #pragma omp parallel
    {
        doing something
    }
}
Why is the first one a lot slower (around a factor of 5) than the second one? From theory I thought the first one should be faster, because the parallel region is only created once and not n times like in the second. Can someone explain this to me?
The code I want to parallelize has the following structure:
for(i=0;i<n;i++) //won't be parallelizable
{
    for(j=i+1;j<n;j++) //will be parallelized
    {
        doing sth.
    }
    for(j=i+1;j<n;j++) //will be parallelized
        for(k = i+1;k<n;k++)
        {
            doing sth.
        }
}
I made a simple program to measure the time and reproduce my results.
#include <stdio.h>
#include <omp.h>

void test( int n )
{
    int i ;
    double t_a = 0.0, t_b = 0.0 ;
    t_a = omp_get_wtime() ;
    #pragma omp parallel
    {
        for(i=0;i<n;i++) { }
    }
    t_b = omp_get_wtime() ;
    for(i=0;i<n;i++) {
        #pragma omp parallel
        { }
    }
    printf( "directive outside for-loop: %lf\n", 1000*(omp_get_wtime()-t_a)) ;
    printf( "directive inside for-loop: %lf \n", 1000*(omp_get_wtime()-t_b)) ;
}

int main(void)
{
    int i, n ;
    double t_1 = 0.0, t_2 = 0.0 ;
    printf( "n: " ) ;
    scanf( "%d", &n ) ;
    t_1 = omp_get_wtime() ;
    #pragma omp parallel
    {
        for(i=0;i<n;i++) { }
    }
    t_2 = omp_get_wtime() ;
    for(i=0;i<n;i++) {
        #pragma omp parallel
        { }
    }
    printf( "directive outside for-loop: %lf\n", 1000*(omp_get_wtime()-t_1)) ;
    printf( "directive inside for-loop: %lf \n", 1000*(omp_get_wtime()-t_2)) ;
    test(n) ;
    return 0 ;
}
If I start it with different n's I always get different results.
n: 30000
directive outside for-loop: 0.881884
directive inside for-loop: 0.073054
directive outside for-loop: 0.049098
directive inside for-loop: 0.011663
n: 30000
directive outside for-loop: 0.402774
directive inside for-loop: 0.071588
directive outside for-loop: 0.049168
directive inside for-loop: 0.012013
n: 30000
directive outside for-loop: 2.198740
directive inside for-loop: 0.065301
directive outside for-loop: 0.047911
directive inside for-loop: 0.012152
n: 1000
directive outside for-loop: 0.355841
directive inside for-loop: 0.079480
directive outside for-loop: 0.013549
directive inside for-loop: 0.012362
n: 10000
directive outside for-loop: 0.926234
directive inside for-loop: 0.071098
directive outside for-loop: 0.023536
directive inside for-loop: 0.012222
n: 10000
directive outside for-loop: 0.354025
directive inside for-loop: 0.073542
directive outside for-loop: 0.023607
directive inside for-loop: 0.012292
How can you explain this difference to me?!
Results with your version:
Input n: 1000
[2] directive outside for-loop: 0.331396
[2] directive inside for-loop: 0.002864
[2] directive outside for-loop: 0.011663
[2] directive inside for-loop: 0.001188
[1] directive outside for-loop: 0.021092
[1] directive inside for-loop: 0.001327
[1] directive outside for-loop: 0.005238
[1] directive inside for-loop: 0.001048
[0] directive outside for-loop: 0.020812
[0] directive inside for-loop: 0.001188
[0] directive outside for-loop: 0.005029
[0] directive inside for-loop: 0.001257
-
osgx over 12 years: You did the measurement the wrong way: not 1000*(omp_get_wtime()-t_1) but 1000*(t_2-t_1).
-
osgx over 12 years: My test runs so fast on your PC that you can't measure it with such coarse timing. Check the update of my answer.
-
Biler over 12 years: Thanks for your response. Why do you explicitly declare the variable i as shared? Isn't that the case by default? Given my whole example, would you set the OpenMP directives exactly the way you did in your last post? I want to know the common and, in general, optimal way.
-
osgx over 12 years: Please re-read the answer; it was wrong in its first versions. No: only the iteration variable after an omp for pragma changes visibility. You can also use a C++-style for: for(int i=0; i<n;i++);
-
osgx over 12 years: #pragma omp barrier inside of #pragma omp parallel is more optimal than #pragma omp parallel inside the loop. I have no idea about a common method. -
Biler over 12 years: Thanks, that's exactly what I remembered from OpenMP books, but testing this without the inner for loops, it is still 5 times slower than the other attempt. The other is nearly as fast as the serial version. Why? I don't get it.
-
osgx over 12 years: Biler, what is the "other attempt"? The first code? What n did you test? What are the running times? Is the thread number equal to the CPU number? What are they? What is your compiler and its version? -
Biler over 12 yearsthe other attempt is #pragma omp parallel inside the outer for loop. The thread number is equal
-
Biler over 12 years: Please look at the program code I posted in my question. In my original code the for loops are in a function, and I still get the opposite results.
-
Biler over 12 years: Wow, thanks for the fast and detailed answer! Oh yeah, the time measurement was crap... should've noticed that. I'm using gcc 4.4.5. So, when you initialize the i variable in the for loop, it's not shared? And even with your version I am getting the opposite results, and I don't know why. Look at the main question for the results. I do have a dual-core processor in this machine!
-
osgx over 12 years: Try to make j bigger and rerun. Yes, if you write int i; \n #pragma omp parallel \n {for(i...)} the i will be shared. If you write int i; \n #pragma omp parallel for \n {for(i...)} (note the for in the pragma), i will be private. Your time measurement is bad because you try to measure very short (fast) parts of code with a not-so-exact system call. Also, you did not apply any statistical processing to the noisy results. -
Biler over 12 years: Misunderstanding... it's clear that with for in the pragma the loop variable will be private. I meant your correction for(int local_i=0;local_i<n;local_i++). I don't see the difference there, and isn't it bad C programming style to initialize variables where you need them? I made j=20 and it converged... still no change. Please explain the possible statistical manipulations to me, and what are noisy results?
-
osgx over 12 years: OMP FOR or OMP PARALLEL FOR will change the iterator to private; OMP PARALLEL will not. Citing OpenMP 3.0, "2.5.1 Loop Construct", page 39, lines 29-30: "If this variable would otherwise be shared, it is implicitly made private in the loop construct." The loop construct is (page 38, line 11) "#pragma omp for". The pragma omp parallel will not change any variable to private (section 2.4 of the standard).