OpenMP nested loop


Solution 1

"...because the parallel region is only created once and not n times like in the second?"

Kind of. The construction

#pragma omp parallel
{
}

also means distributing work to the threads at '{' and returning the threads to the thread pool at '}'. That involves a lot of thread-to-thread communication. Also, by default waiting threads are put to sleep by the OS, and some time is needed to wake a thread up again.

About your middle sample: you can restrict the work-sharing to the inner loops, with the outer for executed by every thread, like this...

#pragma omp parallel private(i,k)
{
for(i=0;i<n;i++) //won't be parallelized
{
  #pragma omp for
  for(j=i+1;j<n;j++)  //will be parallelized
  {
    // doing sth.
  }
  #pragma omp for    
  for(j=i+1;j<n;j++)  //will be parallelized
    for(k = i+1;k<n;k++)
    {
      // doing sth.
    }
  // Is there really no sequential code in the outer loop? If there is, use:
  // (won't be parallelized)
  #pragma omp single
  { //seq part of outer loop
      printf("Progress... %i\n", i); fflush(stdout);
  }

  // here is the point. Every thread did parallel run of outer loop, but...
  #pragma omp barrier

  //  all loop iterations are synchronized:
  //       thr0   thr1  thr2
  // i      0      0     0
  //     ----   barrier ----
  // i      1      1     1
  //     ----   barrier ----
  // i      2      2     2
  // and so on
}
}

In general, placing the parallelism at the highest (outermost) possible level of a loop nest is better than placing it on inner loops. If you need sequential execution of some code, use the more advanced pragmas (omp barrier, omp master or omp single) or omp locks for that code. Any of these will be faster than starting an omp parallel region many times.
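
For completeness, here is a minimal sketch of the omp_lock variant mentioned above (my own illustration, not from the original answer): the lock only serializes a small piece of work inside an otherwise parallel loop, here a placeholder progress printout.

#include <stdio.h>
#include <omp.h>

int main(void)
{
  int n = 1000;                         /* placeholder problem size */
  omp_lock_t lock;
  omp_init_lock(&lock);

  #pragma omp parallel for
  for (int j = 0; j < n; j++)
  {
    /* parallel work for iteration j would go here */

    omp_set_lock(&lock);                /* only one thread at a time enters this part */
    printf("finished iteration %d\n", j);
    omp_unset_lock(&lock);
  }

  omp_destroy_lock(&lock);
  return 0;
}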

Solution 2

Your timing is wrong. The first printf counts the time of both parts of the code, not only the first section, and the second printf also includes the time taken by the first printf.

The first run is very slow because of thread startup time, memory initialization and cache effects. Also, the OpenMP runtime's heuristics may auto-tune themselves after several parallel regions.
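
If you want to keep that one-time startup cost out of the measurements, one option (a sketch of the idea, not part of the test below) is a throw-away warm-up parallel region before the first omp_get_wtime() call:

#include <stdio.h>
#include <omp.h>

int main(void)
{
  /* warm-up: make the runtime create its thread team once,
     so the timed region below does not pay the startup cost */
  #pragma omp parallel
  { }

  double t_a = omp_get_wtime();
  #pragma omp parallel
  { }
  double t_b = omp_get_wtime();

  printf("empty parallel region after warm-up: %lf ms\n", 1000*(t_b - t_a));
  return 0;
}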

My version of your test:

$ cat test.c
#include <stdio.h>
#include <omp.h>

void test( int n, int j)
{
  int i ;
  double t_a = 0.0, t_b = 0.0, t_c = 0.0 ;
  t_a = omp_get_wtime() ;
  #pragma omp parallel
  {
    for(i=0;i<n;i++) { }
  }
  t_b = omp_get_wtime() ;
  for(i=0;i<n;i++) {
    #pragma omp parallel
    { }
  }
  t_c = omp_get_wtime() ;
  printf( "[%i] directive outside for-loop: %lf\n", j, 1000*(t_b-t_a)) ;
  printf( "[%i] directive inside for-loop: %lf \n", j, 1000*(t_c-t_b)) ;
}

int main(void)
{
  int i, n, j=3  ;
  double t_1 = 0.0, t_2 = 0.0, t_3 = 0.0;
  printf( "Input n: " ) ;
  scanf( "%d", &n ) ;
  while( j --> 0 ) {
      t_1 = omp_get_wtime();
      #pragma omp parallel
      {
        for(i=0;i<n;i++) { }
      }

      t_2 = omp_get_wtime();

      for(i=0;i<n;i++) {
        #pragma omp parallel
        { }
      }
      t_3 = omp_get_wtime();
      printf( "[%i] directive outside for-loop: %lf\n", j, 1000*(t_2-t_1)) ;
      printf( "[%i] directive inside for-loop: %lf \n", j, 1000*(t_3-t_2)) ;
      test(n,j) ;
  }
  return 0 ;
}

I did 3 runs for every n inside the program itself.

Results:

$ ./test
Input n: 1000
[2] directive outside for-loop: 5.044824
[2] directive inside for-loop: 48.605116
[2] directive outside for-loop: 0.115031
[2] directive inside for-loop: 1.469195
[1] directive outside for-loop: 0.082415
[1] directive inside for-loop: 1.455855
[1] directive outside for-loop: 0.081297
[1] directive inside for-loop: 1.462352
[0] directive outside for-loop: 0.080528
[0] directive inside for-loop: 1.455786
[0] directive outside for-loop: 0.080807
[0] directive inside for-loop: 1.467101

Only the very first run is affected. All subsequent results are the same for test() and main().

Better and more stable results come from a run like this (I used gcc-4.6.1 and a static build):

$ OMP_WAIT_POLICY=active GOMP_CPU_AFFINITY=0-15 OMP_NUM_THREADS=2  ./test
Input n: 5000
[2] directive outside for-loop: 0.079412
[2] directive inside for-loop: 4.266087
[2] directive outside for-loop: 0.031708
[2] directive inside for-loop: 4.319727
[1] directive outside for-loop: 0.047563
[1] directive inside for-loop: 4.290812
[1] directive outside for-loop: 0.033733
[1] directive inside for-loop: 4.324406
[0] directive outside for-loop: 0.047004
[0] directive inside for-loop: 4.273143
[0] directive outside for-loop: 0.092331
[0] directive inside for-loop: 4.279219

I set two OpenMP performance environment variables and limited the thread count to 2.
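
If you prefer to fix the thread count in the source rather than via the environment, a sketch (equivalent to OMP_NUM_THREADS=2; the wait policy and CPU affinity are typically still controlled only through environment variables in this OpenMP version):

#include <stdio.h>
#include <omp.h>

int main(void)
{
  omp_set_num_threads(2);               /* same effect as OMP_NUM_THREADS=2 */

  #pragma omp parallel
  {
    #pragma omp single
    printf("team size: %d\n", omp_get_num_threads());
  }
  return 0;
}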

Also. You "paralleled" loop is wrong. (and I reproduced this error in my ^^^ variant) The i variable is shared here:

      #pragma omp parallel
      {
        for(i=0;i<n;i++) { }
      }

You should have it as

      #pragma omp parallel
      {
        for(int local_i=0;local_i<n;local_i++) { }
      }
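
Or, equivalently, keep the outer declaration and put i into a private clause (a sketch of the same fix):

      #pragma omp parallel private(i)
      {
        for(i=0;i<n;i++) { }   /* each thread now loops over its own copy of i */
      }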

UPDATE7: Your results for n=1000:

[2] directive inside for-loop: 0.001188 
[1] directive outside for-loop: 0.021092
[1] directive inside for-loop: 0.001327 
[1] directive outside for-loop: 0.005238
[1] directive inside for-loop: 0.001048 
[0] directive outside for-loop: 0.020812
[0] directive inside for-loop: 0.001188 
[0] directive outside for-loop: 0.005029
[0] directive inside for-loop: 0.001257 

The 0.001 or 0.02 printed by your code is seconds multiplied by 1000, i.e. milliseconds (ms). So these intervals are around 1 microsecond and 20 microseconds. The granularity of some system clocks (the user-time and system-time fields of the time utility) is 1 ms, 3 ms or 10 ms. One microsecond is only 2000-3000 CPU ticks (on a 2-3 GHz CPU), so you can't measure such a short interval without a special setup. You should:

  1. Disable the CPU's power saving (Intel SpeedStep and AMD's equivalent), which can put the CPU into a lower-power state by lowering its clock frequency;
  2. Disable dynamic overclocking (Intel Turbo Boost);
  3. Measure time without help from the OS, e.g. by reading the TSC (the rdtsc asm instruction); see the sketch after this list;
  4. Disable instruction reordering on out-of-order CPUs (of the current generation, only Atom is not an out-of-order CPU) by adding a cpuid instruction (or another serializing instruction) before and after rdtsc;
  5. Do the run on a completely idle system (0% CPU load on both CPUs before you start the test);
  6. Rewrite the test in a non-interactive way (don't wait for user input with scanf; pass n via argv[1]);
  7. Don't use an X server and a slow terminal to output the results;
  8. Keep the interrupt count low (turn off the network, physically; don't play a film in the background; don't touch the mouse and keyboard);
  9. Do a lot of runs (I mean not restarting the program, but re-running the measured part of the program; j=100 in my program) and add statistical processing over the results;
  10. Don't run printf so often (between measurements); it will pollute the caches and the TLB. Store results internally and output them after all measurements are done.
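
A minimal sketch of points 3 and 4, assuming x86-64 and GCC-style inline assembly (cpuid is used here only as a serializing instruction; converting ticks to time requires knowing the TSC frequency):

#include <stdint.h>

/* read the time-stamp counter; the preceding cpuid keeps the
   out-of-order core from moving instructions across the read */
static inline uint64_t rdtsc_serialized(void)
{
  uint32_t lo, hi;
  __asm__ __volatile__(
      "xorl %%eax, %%eax\n\t"   /* cpuid leaf 0 */
      "cpuid\n\t"               /* serializing instruction */
      "rdtsc\n\t"               /* low 32 bits -> eax, high 32 bits -> edx */
      : "=a"(lo), "=d"(hi)
      :
      : "%rbx", "%rcx", "memory");
  return ((uint64_t)hi << 32) | lo;
}

Take one reading before and one after the measured region and subtract the two values.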

UPDATE8: By statistics I mean: take several values, 7 or more. Discard the first value (or even the first 2-3 values if you have a large number of measurements). Sort them. Discard the top and bottom 10-20% of the results. Calculate the mean of the rest. Literally:

double results[100], sum = 0.0, mean = 0.0;
int it, count = 0;
// sort results[5]..results[99] here (the first 5 are the discarded warm-up runs)
for(it = 20; it < 85; it++) {
  count++; sum += results[it];
}
mean = sum/count;
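
A slightly more complete, compilable sketch of the same idea, with the sort done via qsort (the random values only stand in for real measurements; the 20/85 bounds are the example numbers from above):

#include <stdio.h>
#include <stdlib.h>

#define NRUNS 100

static int cmp_double(const void *a, const void *b)
{
  double x = *(const double *)a, y = *(const double *)b;
  return (x > y) - (x < y);
}

int main(void)
{
  double results[NRUNS], sum = 0.0, mean = 0.0;
  int it, count = 0;

  for (it = 0; it < NRUNS; it++)
    results[it] = (double)rand() / RAND_MAX;   /* stand-in for measured times */

  qsort(results, NRUNS, sizeof(double), cmp_double);

  /* keep the middle of the sorted values: drop the 20 smallest and 15 largest */
  for (it = 20; it < 85; it++) {
    count++;
    sum += results[it];
  }
  mean = sum / count;
  printf("trimmed mean over %d kept runs: %f\n", count, mean);
  return 0;
}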

Comments

  • Biler almost 2 years

    Just playing around with OpenMP. Look at these code fragments:

    #pragma omp parallel
    {
        for( i =0;i<n;i++)
        {
            doing something
        }
    }
    

    and

    for( i =0;i<n;i++)
    {
      #pragma omp parallel
      {
         doing something
      }
    }
    

    Why is the first one a lot slower (around a factor of 5) than the second one? From theory I thought the first one should be faster, because the parallel region is only created once and not n times like in the second. Can someone explain this to me?

    The code I want to parallelize has the following structure:

    for(i=0;i<n;i++) //won't be parallelizable
    {
      for(j=i+1;j<n;j++)  //will be parallelized
      {
        doing sth.
      }
    
      for(j=i+1;j<n;j++)  //will be parallelized
        for(k = i+1;k<n;k++)
        {
          doing sth.
        }
    
    }
    

    I made a simple program to measure the time and reproduce my results.

    #include <stdio.h>
    #include <omp.h>
    
    void test( int n)
    {
      int i ;
      double t_a = 0.0, t_b = 0.0 ;
    
    
      t_a = omp_get_wtime() ;
    
      #pragma omp parallel
      {
        for(i=0;i<n;i++)
        {
    
        }
      }
    
      t_b = omp_get_wtime() ;
    
      for(i=0;i<n;i++)
      {
        #pragma omp parallel
        {
        }
      }
    
      printf( "directive outside for-loop: %lf\n", 1000*(omp_get_wtime()-t_a)) ;
      printf( "directive inside for-loop: %lf \n", 1000*(omp_get_wtime()-t_b)) ;
    }
    
    int main(void)
    {
      int i, n   ;
      double t_1 = 0.0, t_2 = 0.0 ;
    
      printf( "n: " ) ;
      scanf( "%d", &n ) ;
    
      t_1 = omp_get_wtime() ;
    
      #pragma omp parallel
      {
        for(i=0;i<n;i++)
        {
    
        }
      }
    
      t_2 = omp_get_wtime() ;
    
      for(i=0;i<n;i++)
      {
        #pragma omp parallel
        {
        }
      }
    
      printf( "directive outside for-loop: %lf\n", 1000*(omp_get_wtime()-t_1)) ;
      printf( "directive inside for-loop: %lf \n", 1000*(omp_get_wtime()-t_2)) ;
    
      test(n) ;
    
      return 0 ;
    }
    

    If I start it with different n's I always get different results.

    n: 30000
    directive outside for-loop: 0.881884
    directive inside for-loop: 0.073054 
    directive outside for-loop: 0.049098
    directive inside for-loop: 0.011663 
    
    n: 30000
    directive outside for-loop: 0.402774
    directive inside for-loop: 0.071588 
    directive outside for-loop: 0.049168
    directive inside for-loop: 0.012013 
    
    n: 30000
    directive outside for-loop: 2.198740
    directive inside for-loop: 0.065301 
    directive outside for-loop: 0.047911
    directive inside for-loop: 0.012152 
    
    
    
    n: 1000
    directive outside for-loop: 0.355841
    directive inside for-loop: 0.079480 
    directive outside for-loop: 0.013549
    directive inside for-loop: 0.012362 
    
    n: 10000
    directive outside for-loop: 0.926234
    directive inside for-loop: 0.071098 
    directive outside for-loop: 0.023536
    directive inside for-loop: 0.012222 
    
    n: 10000
    directive outside for-loop: 0.354025
    directive inside for-loop: 0.073542 
    directive outside for-loop: 0.023607
    directive inside for-loop: 0.012292 
    

    How can you explain this difference?

    Results with your version:

    Input n: 1000
    [2] directive outside for-loop: 0.331396
    [2] directive inside for-loop: 0.002864 
    [2] directive outside for-loop: 0.011663
    [2] directive inside for-loop: 0.001188 
    [1] directive outside for-loop: 0.021092
    [1] directive inside for-loop: 0.001327 
    [1] directive outside for-loop: 0.005238
    [1] directive inside for-loop: 0.001048 
    [0] directive outside for-loop: 0.020812
    [0] directive inside for-loop: 0.001188 
    [0] directive outside for-loop: 0.005029
    [0] directive inside for-loop: 0.001257 
    
    • osgx over 12 years
      You measured it the wrong way: not 1000*(omp_get_wtime()-t_1) but 1000*(t_2-t_1).
    • osgx over 12 years
      My test runs so fast on your PC that you can't measure it with such a coarse clock. Check the update of my answer.
  • Biler over 12 years
    Thanks for your response. Why do you explicitly declare the variable i as shared? Isn't that the default anyway? Given my whole example, would you set the OpenMP directives exactly the way you did in your last post? I want to know the common and, in general, optimal way.
  • osgx over 12 years
    Please reread the answer; it was wrong in its first versions. No: only the iteration variable of the loop following an omp for pragma has its visibility changed. You can also use a C++-style for: for(int i=0; i<n; i++);
  • osgx over 12 years
    #pragma omp barrier inside the #pragma omp parallel is more optimal than #pragma omp parallel inside the loop. I have no idea about a "common" method.
  • Biler over 12 years
    Thanks, that's exactly what I remembered from OpenMP books, but testing this without the inner for loops it is still 5 times slower than the other attempt. The other one is nearly as fast as the serial version. Why? I don't get it.
  • osgx over 12 years
    Biler, what is the other attempt? The first code? What n did you test? What are the running times? Is the thread count equal to the CPU count? What are they? What are your compiler and its version?
  • Biler over 12 years
    The other attempt is #pragma omp parallel inside the outer for loop. The thread count is equal.
  • Biler over 12 years
    Please look at the program code I posted in my question. In my original code the for loops are in a function and I still get the same contradictory results.
  • Biler over 12 years
    Wow, thanks for the fast and detailed answer! Oh yeah, the time measurement was crap... should've noticed that. I'm using gcc 4.4.5. So, when you initialize the i variable in the for loop, it's not shared? And even with your version I'm getting contradictory results and I don't know why. Look at the main question for the results. I do have a dual-core processor in this machine!
  • osgx over 12 years
    Try to make j bigger and rerun. Yes, if you write int i; \n #pragma omp parallel \n {for(i...)} the i will be shared. If you write int i; \n #pragma omp parallel for \n {for(i...)} (note the for in the pragma), i will be private. Your time measurement is bad because you are trying to measure very short (fast) parts of code with a not-so-exact system call. Also, you did not do any statistical processing of the noisy results.
  • Biler over 12 years
    Misunderstanding... it's clear that with for in the pragma the loop variable will be private. I meant your correction for(int local_i=0;local_i<n;local_i++). I don't see the difference there, and isn't it bad C programming style to initialize variables where you need them? I made j=20 and it converged... still no change. Please explain the possible statistical manipulations, and what are noisy results?
  • osgx over 12 years
    OMP FOR or OMP PARALLEL FOR will change the iterator to private; OMP PARALLEL will not. Citing OpenMP 3.0, "2.5.1 Loop Construct", page 39, lines 29-30: "If this variable would otherwise be shared, it is implicitly made private in the loop construct." The loop construct is (page 38, line 11) "#pragma omp for". The omp parallel pragma will not change any variable to private (section 2.4 of the standard).
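
To illustrate the rule cited above, a minimal contrast (my own illustration, not from the original discussion):

#include <omp.h>

void iterator_visibility_demo(void)
{
  int i;

  #pragma omp parallel            /* i stays shared: all threads race on the same i */
  {
    for (i = 0; i < 100; i++) { } /* data race on i */
  }

  #pragma omp parallel for        /* the loop construct makes i implicitly private */
  for (i = 0; i < 100; i++) { }   /* each thread gets its own i and a chunk of iterations */
}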