OpenMP nested loop


Solution 1

"...because the parallel region is only created once and not n times like in the second?"

Kind of. The construction

#pragma omp parallel
{
}

also means distributing work to the threads at '{' and returning the threads to the thread pool at '}'. That involves a lot of thread-to-thread communication. Also, by default waiting threads are put to sleep by the OS, and some time is needed to wake a thread up again.

About your middle sample: you can restrict the work-sharing to the inner loops, with the outer for executed by every thread, like this...

#pragma omp parallel private(i,k)
{
for(i=0;i<n;i++) //won't be parallelized
{
  #pragma omp for
  for(j=i+1;j<n;j++)  //will be parallelized
  {
    // doing sth.
  }
  #pragma omp for    
  for(j=i+1;j<n;j++)  //will be parallelized
    for(k = i+1;k<n;k++)
    {
      // doing sth.
    }
  // Is there really no sequential code in the outer loop? If there is, use:
  // (won't be parallelized)
  #pragma omp single
  { //seq part of outer loop
      printf("Progress... %i\n", i); fflush(stdout);
  }

  // here is the point. Every thread did parallel run of outer loop, but...
  #pragma omp barrier

  //  all loop iterations are synchronized:
  //       thr0   thr1  thr2
  // i      0      0     0
  //     ----   barrier ----
  // i      1      1     1
  //     ----   barrier ----
  // i      2      2     2
  // and so on
}
}

In general, placing the parallelism at the highest (outermost) possible level of a loop nest is better than placing it on inner loops. If you need sequential execution of some code, use the more advanced pragmas (omp barrier, omp master or omp single) or omp locks for that code. Any of these will be faster than starting an omp parallel region many times.
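
For completeness, here is a minimal sketch of the omp_lock variant mentioned above (my own illustration, not from the original answer): the lock only serializes a small piece of work inside an otherwise parallel loop, here a placeholder progress printout.

#include <stdio.h>
#include <omp.h>

int main(void)
{
  int n = 1000;                         /* placeholder problem size */
  omp_lock_t lock;
  omp_init_lock(&lock);

  #pragma omp parallel for
  for (int j = 0; j < n; j++)
  {
    /* parallel work for iteration j would go here */

    omp_set_lock(&lock);                /* only one thread at a time enters this part */
    printf("finished iteration %d\n", j);
    omp_unset_lock(&lock);
  }

  omp_destroy_lock(&lock);
  return 0;
}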

Solution 2

Your timing is wrong. The first printf counts the time of both parts of the code, not only the first section, and the second printf also includes the time taken by the first printf.

The first run is very slow because of thread startup time, memory initialization and cache effects. Also, the OpenMP runtime's heuristics may auto-tune themselves after several parallel regions.
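
If you want to keep that one-time startup cost out of the measurements, one option (a sketch of the idea, not part of the test below) is a throw-away warm-up parallel region before the first omp_get_wtime() call:

#include <stdio.h>
#include <omp.h>

int main(void)
{
  /* warm-up: make the runtime create its thread team once,
     so the timed region below does not pay the startup cost */
  #pragma omp parallel
  { }

  double t_a = omp_get_wtime();
  #pragma omp parallel
  { }
  double t_b = omp_get_wtime();

  printf("empty parallel region after warm-up: %lf ms\n", 1000*(t_b - t_a));
  return 0;
}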

My version of your test:

$ cat test.c
#include <stdio.h>
#include <omp.h>

void test( int n, int j)
{
  int i ;
  double t_a = 0.0, t_b = 0.0, t_c = 0.0 ;
  t_a = omp_get_wtime() ;
  #pragma omp parallel
  {
    for(i=0;i<n;i++) { }
  }
  t_b = omp_get_wtime() ;
  for(i=0;i<n;i++) {
    #pragma omp parallel
    { }
  }
  t_c = omp_get_wtime() ;
  printf( "[%i] directive outside for-loop: %lf\n", j, 1000*(t_b-t_a)) ;
  printf( "[%i] directive inside for-loop: %lf \n", j, 1000*(t_c-t_b)) ;
}

int main(void)
{
  int i, n, j=3  ;
  double t_1 = 0.0, t_2 = 0.0, t_3 = 0.0;
  printf( "Input n: " ) ;
  scanf( "%d", &n ) ;
  while( j --> 0 ) {
      t_1 = omp_get_wtime();
      #pragma omp parallel
      {
        for(i=0;i<n;i++) { }
      }

      t_2 = omp_get_wtime();

      for(i=0;i<n;i++) {
        #pragma omp parallel
        { }
      }
      t_3 = omp_get_wtime();
      printf( "[%i] directive outside for-loop: %lf\n", j, 1000*(t_2-t_1)) ;
      printf( "[%i] directive inside for-loop: %lf \n", j, 1000*(t_3-t_2)) ;
      test(n,j) ;
  }
  return 0 ;
}

I did 3 runs for every n inside the program itself.

Results:

$ ./test
Input n: 1000
[2] directive outside for-loop: 5.044824
[2] directive inside for-loop: 48.605116
[2] directive outside for-loop: 0.115031
[2] directive inside for-loop: 1.469195
[1] directive outside for-loop: 0.082415
[1] directive inside for-loop: 1.455855
[1] directive outside for-loop: 0.081297
[1] directive inside for-loop: 1.462352
[0] directive outside for-loop: 0.080528
[0] directive inside for-loop: 1.455786
[0] directive outside for-loop: 0.080807
[0] directive inside for-loop: 1.467101

Only the very first run is affected. All subsequent results are the same for test() and main().

Better and more stable results come from a run like this (I used gcc-4.6.1 and a static build):

$ OMP_WAIT_POLICY=active GOMP_CPU_AFFINITY=0-15 OMP_NUM_THREADS=2  ./test
Input n: 5000
[2] directive outside for-loop: 0.079412
[2] directive inside for-loop: 4.266087
[2] directive outside for-loop: 0.031708
[2] directive inside for-loop: 4.319727
[1] directive outside for-loop: 0.047563
[1] directive inside for-loop: 4.290812
[1] directive outside for-loop: 0.033733
[1] directive inside for-loop: 4.324406
[0] directive outside for-loop: 0.047004
[0] directive inside for-loop: 4.273143
[0] directive outside for-loop: 0.092331
[0] directive inside for-loop: 4.279219

I set two OpenMP performance environment variables and limited the thread count to 2.
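
If you prefer to fix the thread count in the source rather than via the environment, a sketch (equivalent to OMP_NUM_THREADS=2; the wait policy and CPU affinity are typically still controlled only through environment variables in this OpenMP version):

#include <stdio.h>
#include <omp.h>

int main(void)
{
  omp_set_num_threads(2);               /* same effect as OMP_NUM_THREADS=2 */

  #pragma omp parallel
  {
    #pragma omp single
    printf("team size: %d\n", omp_get_num_threads());
  }
  return 0;
}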

Also. You "paralleled" loop is wrong. (and I reproduced this error in my ^^^ variant) The i variable is shared here:

      #pragma omp parallel
      {
        for(i=0;i<n;i++) { }
      }

You should have it as

      #pragma omp parallel
      {
        for(int local_i=0;local_i<n;local_i++) { }
      }
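
Or, equivalently, keep the outer declaration and put i into a private clause (a sketch of the same fix):

      #pragma omp parallel private(i)
      {
        for(i=0;i<n;i++) { }   /* each thread now loops over its own copy of i */
      }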

UPDATE7: Your results for n=1000:

[2] directive inside for-loop: 0.001188 
[1] directive outside for-loop: 0.021092
[1] directive inside for-loop: 0.001327 
[1] directive outside for-loop: 0.005238
[1] directive inside for-loop: 0.001048 
[0] directive outside for-loop: 0.020812
[0] directive inside for-loop: 0.001188 
[0] directive outside for-loop: 0.005029
[0] directive inside for-loop: 0.001257 

The 0.001 or 0.02 printed by your code is seconds multiplied by 1000, i.e. milliseconds (ms). So these intervals are around 1 microsecond and 20 microseconds. The granularity of some system clocks (the user-time and system-time fields of the time utility) is 1 ms, 3 ms or 10 ms. One microsecond is only 2000-3000 CPU ticks (on a 2-3 GHz CPU), so you can't measure such a short interval without a special setup. You should:

  1. Disable the CPU's power saving (Intel SpeedStep and AMD's equivalent), which can put the CPU into a lower-power state by lowering its clock frequency;
  2. Disable dynamic overclocking (Intel Turbo Boost);
  3. Measure time without help from the OS, e.g. by reading the TSC (the rdtsc asm instruction); see the sketch after this list;
  4. Disable instruction reordering on out-of-order CPUs (of the current generation, only Atom is not an out-of-order CPU) by adding a cpuid instruction (or another serializing instruction) before and after rdtsc;
  5. Do the run on a completely idle system (0% CPU load on both CPUs before you start the test);
  6. Rewrite the test in a non-interactive way (don't wait for user input with scanf; pass n via argv[1]);
  7. Don't use an X server and a slow terminal to output the results;
  8. Keep the interrupt count low (turn off the network, physically; don't play a film in the background; don't touch the mouse and keyboard);
  9. Do a lot of runs (I mean not restarting the program, but re-running the measured part of the program; j=100 in my program) and add statistical processing over the results;
  10. Don't run printf so often (between measurements); it will pollute the caches and the TLB. Store results internally and output them after all measurements are done.
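
A minimal sketch of points 3 and 4, assuming x86-64 and GCC-style inline assembly (cpuid is used here only as a serializing instruction; converting ticks to time requires knowing the TSC frequency):

#include <stdint.h>

/* read the time-stamp counter; the preceding cpuid keeps the
   out-of-order core from moving instructions across the read */
static inline uint64_t rdtsc_serialized(void)
{
  uint32_t lo, hi;
  __asm__ __volatile__(
      "xorl %%eax, %%eax\n\t"   /* cpuid leaf 0 */
      "cpuid\n\t"               /* serializing instruction */
      "rdtsc\n\t"               /* low 32 bits -> eax, high 32 bits -> edx */
      : "=a"(lo), "=d"(hi)
      :
      : "%rbx", "%rcx", "memory");
  return ((uint64_t)hi << 32) | lo;
}

Take one reading before and one after the measured region and subtract the two values.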

UPDATE8: By statistics I mean: take several values, 7 or more. Discard the first value (or even the first 2-3 values if you have a large number of measurements). Sort them. Discard the top and bottom 10-20% of the results. Calculate the mean of the rest. Literally:

double results[100], sum = 0.0, mean = 0.0;
int it, count = 0;
// sort results[5]..results[99] here (the first 5 are the discarded warm-up runs)
for(it = 20; it < 85; it++) {
  count++; sum += results[it];
}
mean = sum/count;
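
A slightly more complete, compilable sketch of the same idea, with the sort done via qsort (the random values only stand in for real measurements; the 20/85 bounds are the example numbers from above):

#include <stdio.h>
#include <stdlib.h>

#define NRUNS 100

static int cmp_double(const void *a, const void *b)
{
  double x = *(const double *)a, y = *(const double *)b;
  return (x > y) - (x < y);
}

int main(void)
{
  double results[NRUNS], sum = 0.0, mean = 0.0;
  int it, count = 0;

  for (it = 0; it < NRUNS; it++)
    results[it] = (double)rand() / RAND_MAX;   /* stand-in for measured times */

  qsort(results, NRUNS, sizeof(double), cmp_double);

  /* keep the middle of the sorted values: drop the 20 smallest and 15 largest */
  for (it = 20; it < 85; it++) {
    count++;
    sum += results[it];
  }
  mean = sum / count;
  printf("trimmed mean over %d kept runs: %f\n", count, mean);
  return 0;
}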

Comments

  • Biler almost 2 years

    Just playing around with OpenMP. Look at these code fragments:

    #pragma omp parallel
    {
        for( i =0;i<n;i++)
        {
            doing something
        }
    }
    

    and

    for( i =0;i<n;i++)
    {
      #pragma omp parallel
      {
         doing something
      }
    }
    

    Why is the first one a lot slower (around a factor of 5) than the second one? From theory I thought the first one should be faster, because the parallel region is only created once and not n times like in the second. Can someone explain this to me?

    The code I want to parallelize has the following structure:

    for(i=0;i<n;i++) //won't be parallelizable
    {
      for(j=i+1;j<n;j++)  //will be parallelized
      {
        doing sth.
      }
    
      for(j=i+1;j<n;j++)  //will be parallelized
        for(k = i+1;k<n;k++)
        {
          doing sth.
        }
    
    }
    

    I made a simple program to measure the time and reproduce my results.

    #include <stdio.h>
    #include <omp.h>
    
    void test( int n)
    {
      int i ;
      double t_a = 0.0, t_b = 0.0 ;
    
    
      t_a = omp_get_wtime() ;
    
      #pragma omp parallel
      {
        for(i=0;i<n;i++)
        {
    
        }
      }
    
      t_b = omp_get_wtime() ;
    
      for(i=0;i<n;i++)
      {
        #pragma omp parallel
        {
        }
      }
    
      printf( "directive outside for-loop: %lf\n", 1000*(omp_get_wtime()-t_a)) ;
      printf( "directive inside for-loop: %lf \n", 1000*(omp_get_wtime()-t_b)) ;
    }
    
    int main(void)
    {
      int i, n   ;
      double t_1 = 0.0, t_2 = 0.0 ;
    
      printf( "n: " ) ;
      scanf( "%d", &n ) ;
    
      t_1 = omp_get_wtime() ;
    
      #pragma omp parallel
      {
        for(i=0;i<n;i++)
        {
    
        }
      }
    
      t_2 = omp_get_wtime() ;
    
      for(i=0;i<n;i++)
      {
        #pragma omp parallel
        {
        }
      }
    
      printf( "directive outside for-loop: %lf\n", 1000*(omp_get_wtime()-t_1)) ;
      printf( "directive inside for-loop: %lf \n", 1000*(omp_get_wtime()-t_2)) ;
    
      test(n) ;
    
      return 0 ;
    }
    

    If I start it with different n's I always get different results.

    n: 30000
    directive outside for-loop: 0.881884
    directive inside for-loop: 0.073054 
    directive outside for-loop: 0.049098
    directive inside for-loop: 0.011663 
    
    n: 30000
    directive outside for-loop: 0.402774
    directive inside for-loop: 0.071588 
    directive outside for-loop: 0.049168
    directive inside for-loop: 0.012013 
    
    n: 30000
    directive outside for-loop: 2.198740
    directive inside for-loop: 0.065301 
    directive outside for-loop: 0.047911
    directive inside for-loop: 0.012152 
    
    
    
    n: 1000
    directive outside for-loop: 0.355841
    directive inside for-loop: 0.079480 
    directive outside for-loop: 0.013549
    directive inside for-loop: 0.012362 
    
    n: 10000
    directive outside for-loop: 0.926234
    directive inside for-loop: 0.071098 
    directive outside for-loop: 0.023536
    directive inside for-loop: 0.012222 
    
    n: 10000
    directive outside for-loop: 0.354025
    directive inside for-loop: 0.073542 
    directive outside for-loop: 0.023607
    directive inside for-loop: 0.012292 
    

    How can you explain this difference?

    Results with your version:

    Input n: 1000
    [2] directive outside for-loop: 0.331396
    [2] directive inside for-loop: 0.002864 
    [2] directive outside for-loop: 0.011663
    [2] directive inside for-loop: 0.001188 
    [1] directive outside for-loop: 0.021092
    [1] directive inside for-loop: 0.001327 
    [1] directive outside for-loop: 0.005238
    [1] directive inside for-loop: 0.001048 
    [0] directive outside for-loop: 0.020812
    [0] directive inside for-loop: 0.001188 
    [0] directive outside for-loop: 0.005029
    [0] directive inside for-loop: 0.001257 
    
    • osgx over 12 years
      You measured it the wrong way: not 1000*(omp_get_wtime()-t_1) but 1000*(t_2-t_1).
    • osgx over 12 years
      My test runs so fast on your PC that you can't measure it with such a coarse clock. Check the update of my answer.
  • Biler over 12 years
    Thanks for your response. Why do you explicitly declare the variable i as shared? Isn't that the default anyway? Given my whole example, would you set the OpenMP directives exactly the way you did in your last post? I want to know the common and, in general, optimal way.
  • osgx over 12 years
    Please reread the answer; it was wrong in its first versions. No: only the iteration variable of the loop following an omp for pragma has its visibility changed. You can also use a C++-style for: for(int i=0; i<n; i++);
  • osgx over 12 years
    #pragma omp barrier inside the #pragma omp parallel is more optimal than #pragma omp parallel inside the loop. I have no idea about a "common" method.
  • Biler over 12 years
    Thanks, that's exactly what I remembered from OpenMP books, but testing this without the inner for loops it is still 5 times slower than the other attempt. The other one is nearly as fast as the serial version. Why? I don't get it.
  • osgx over 12 years
    Biler, what is the other attempt? The first code? What n did you test? What are the running times? Is the thread count equal to the CPU count? What are they? What are your compiler and its version?
  • Biler over 12 years
    The other attempt is #pragma omp parallel inside the outer for loop. The thread count is equal.
  • Biler over 12 years
    Please look at the program code I posted in my question. In my original code the for loops are in a function and I still get the same contradictory results.
  • Biler over 12 years
    Wow, thanks for the fast and detailed answer! Oh yeah, the time measurement was crap... should've noticed that. I'm using gcc 4.4.5. So, when you initialize the i variable in the for loop, it's not shared? And even with your version I'm getting contradictory results and I don't know why. Look at the main question for the results. I do have a dual-core processor in this machine!
  • osgx over 12 years
    Try to make j bigger and rerun. Yes, if you write int i; \n #pragma omp parallel \n {for(i...)} the i will be shared. If you write int i; \n #pragma omp parallel for \n {for(i...)} (note the for in the pragma), i will be private. Your time measurement is bad because you are trying to measure very short (fast) parts of code with a not-so-exact system call. Also, you did not do any statistical processing of the noisy results.
  • Biler over 12 years
    Misunderstanding... it's clear that with for in the pragma the loop variable will be private. I meant your correction for(int local_i=0;local_i<n;local_i++). I don't see the difference there, and isn't it bad C programming style to initialize variables where you need them? I made j=20 and it converged... still no change. Please explain the possible statistical manipulations, and what are noisy results?
  • osgx over 12 years
    OMP FOR or OMP PARALLEL FOR will change the iterator to private; OMP PARALLEL will not. Citing OpenMP 3.0, "2.5.1 Loop Construct", page 39, lines 29-30: "If this variable would otherwise be shared, it is implicitly made private in the loop construct." The loop construct is (page 38, line 11) "#pragma omp for". The omp parallel pragma will not change any variable to private (section 2.4 of the standard).
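
To illustrate the rule cited above, a minimal contrast (my own illustration, not from the original discussion):

#include <omp.h>

void iterator_visibility_demo(void)
{
  int i;

  #pragma omp parallel            /* i stays shared: all threads race on the same i */
  {
    for (i = 0; i < 100; i++) { } /* data race on i */
  }

  #pragma omp parallel for        /* the loop construct makes i implicitly private */
  for (i = 0; i < 100; i++) { }   /* each thread gets its own i and a chunk of iterations */
}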