C OpenMP parallel bubble sort

15,965

You should profile it and check where threads spend time.

One possible reason is that parallel regions are constantly created and destroyed; depending on OpenMP implementation, it could lead to re-creation of the thread pool, though good implementations should probably handle this case.

Some small things to shave off:
- ok seems completely unnecessary, you can just change the loop exit condition to i<n-1;
- explicit barrier is unnecessary - first, you put it out of parallel regions so it makes no sense; and second, OpenMP parallel regions and loops have implicit barriers at the end;
- combine at least the two consequent parallel regions inside the while loop:

#pragma omp parallel private(tmp)
{
    #pragma omp for bla-bla
    for (i=0; i<n-1; i+=2 ) {...}

    #pragma omp for bla-bla
    for (i=1; i<n-1; i+=2 ) {...}
}
Share:
15,965
Dan Lincan
Author by

Dan Lincan

"It works, but I have no idea why..."

Updated on June 04, 2022

Comments

  • Dan Lincan
    Dan Lincan about 2 years

    I have an implementation of parallel bubble sort algorithm(Odd-Even transposition sort) in C, using OpenMP. However, after I tested it it's slower than the serial version(by about 10%) although I have a 4 cores processor ( 2 real x 2 because of Intel hyperthreading). I have checked to see if the cores are actually used and I can see them at 100% each when running the program. Therefore I think I did a mistake in the implementation the algorithm.

    I am using linux with kernel 2.6.38-8-generic.

    This is how I compile:

    gcc -o bubble-sort bubble-sort.c -Wall -fopenmp or

    gcc -o bubble-sort bubble-sort.c -Wall -fopenmp for the serial version

    This is how i run:

    ./bubble-sort < in_10000 > out_10000

    #include <omp.h>
    #include <stdio.h>
    #include <time.h>
    #include <stdlib.h>
    
    int main()
    {
            int i, n, tmp, *x, changes;
            int chunk;
            scanf("%d ", &n);
            chunk = n / 4;
            x = (int*) malloc(n * sizeof(int));
            for(i = 0; i < n; ++i)
                scanf("%d ", &x[i]);
        changes = 1;
        int nr = 0;
        while(changes)
        {
        #pragma omp parallel private(tmp)
        {
                nr++;
                changes = 0;
                #pragma omp for \
                        reduction(+:changes)
                for(i = 0; i < n - 1; i = i + 2)
                {
                        if(x[i] > x[i+1] )
                        {
                                tmp = x[i];
                                x[i] = x[i+1];
                                x[i+1] = tmp;
                                ++changes;
                        }
                }
                #pragma omp for \
                        reduction(+:changes)
                for(i = 1; i < n - 1; i = i + 2)
                {
                        if( x[i] > x[i+1] )
                        {
                                tmp = x[i];
                                x[i] = x[i+1];
                                x[i+1] = tmp;
                                ++changes;
                        }
                }
        }
        }
    
        return 0;
    }
    

    Later edit:

    It seems to work well now after I made the changes you suggested. It also scales pretty good(I tested on 8 physical cores too -> took 21s for a set of 150k numbers which is far less than on one core). However if I set the OMP_SCHEDULE environment variable myself the performance decreases...