C OpenMP parallel bubble sort
You should profile it and check where threads spend time.
One possible reason is that parallel regions are constantly created and destroyed; depending on OpenMP implementation, it could lead to re-creation of the thread pool, though good implementations should probably handle this case.
Some small things to shave off:
- ok
seems completely unnecessary, you can just change the loop exit condition to i<n-1
;
- explicit barrier is unnecessary - first, you put it out of parallel regions so it makes no sense; and second, OpenMP parallel regions and loops have implicit barriers at the end;
- combine at least the two consequent parallel regions inside the while loop:
#pragma omp parallel private(tmp)
{
#pragma omp for bla-bla
for (i=0; i<n-1; i+=2 ) {...}
#pragma omp for bla-bla
for (i=1; i<n-1; i+=2 ) {...}
}
Comments
-
Dan Lincan about 2 years
I have an implementation of parallel bubble sort algorithm(Odd-Even transposition sort) in C, using OpenMP. However, after I tested it it's slower than the serial version(by about 10%) although I have a 4 cores processor ( 2 real x 2 because of Intel hyperthreading). I have checked to see if the cores are actually used and I can see them at 100% each when running the program. Therefore I think I did a mistake in the implementation the algorithm.
I am using linux with kernel 2.6.38-8-generic.
This is how I compile:
gcc -o bubble-sort bubble-sort.c -Wall -fopenmp
orgcc -o bubble-sort bubble-sort.c -Wall -fopenmp
for the serial versionThis is how i run:
./bubble-sort < in_10000 > out_10000
#include <omp.h> #include <stdio.h> #include <time.h> #include <stdlib.h> int main() { int i, n, tmp, *x, changes; int chunk; scanf("%d ", &n); chunk = n / 4; x = (int*) malloc(n * sizeof(int)); for(i = 0; i < n; ++i) scanf("%d ", &x[i]); changes = 1; int nr = 0; while(changes) { #pragma omp parallel private(tmp) { nr++; changes = 0; #pragma omp for \ reduction(+:changes) for(i = 0; i < n - 1; i = i + 2) { if(x[i] > x[i+1] ) { tmp = x[i]; x[i] = x[i+1]; x[i+1] = tmp; ++changes; } } #pragma omp for \ reduction(+:changes) for(i = 1; i < n - 1; i = i + 2) { if( x[i] > x[i+1] ) { tmp = x[i]; x[i] = x[i+1]; x[i+1] = tmp; ++changes; } } } } return 0; }
Later edit:
It seems to work well now after I made the changes you suggested. It also scales pretty good(I tested on 8 physical cores too -> took 21s for a set of 150k numbers which is far less than on one core). However if I set the OMP_SCHEDULE environment variable myself the performance decreases...