Improving the Quick sort

c performance sorting micro-optimization quicksort

11,407

Solution 1

The quicksort algorithm is easily parallelized. If you have multiple cores to work with, you could see quite a bit of speed up. Depending on how large your data set is, it could easily provide you with more speed up than any other optimization. However, if you have only one processor or a relatively small data set, there won't be much of a speed up.

I could elaborate more if this is a possibility.

Solution 2

Choose a better pivot: eg. in median-of-three you pick 3 (random) elements and choose the pivot as the median element
When length(a[]) < M (in practice choose M = 9) stop sorting. After qsort() finished apply insert sort which would take roughly M * N = O(N). This avoids many function calls close to leaf of the divide-et-impera recursion tree.

Solution 3

The first suggestion would be: replace one of the recursive calls with iteration. And I mean real iteration, not a manually implemented stack for recursion. I.e. instead of making two "new" calls to quick from quick, "recycle" your current call to quick to process one branch of recursion, and call quick recursively to process another branch.

Now, if you make sure that you always make real recursive call for the shorter partition (and use iteration for the longer one), it will guarantee that the depth of recursion will never exceed log N even in the worst case, i.e. regardless of how well you choose your median.

This all is implemented in qsort algorithm that comes with GCC standard library. Take a look at the source, it should be useful.

Roughly, a more practical implementation that follows the above suggestion might look as follows

void quick(int a[], int lower, int upper)
{
  while (lower < upper)
  {
    int loc = partition(a, lower, upper);
    if (loc - lower < upper - loc)
    { /* Lower part is shorter... */
      quick(a, lower, loc - 1); /* ...process it recursively... */
      lower = loc + 1; /* ...and process the upper part on the next iteration */
    }
    else
    { /* Upper part is shorter... */
      quick(a, loc + 1, upper); /* ...process it recursively... */
      upper = loc - 1; /* ...and process the lower part on the next iteration */
    }
  }
}

This is just a sketch of the idea, of course. Not tested. Again, take a look at GCC implementation for the same idea. They also replace the remaining recursive call with "manual" recursion, but it is not really necessary.

Solution 4

Sorting small blocks (<5 elements) with a loopless algorithm may improve performance. I found only one example how to write a loopless sorting algorithm for 5 elements: http://wiki.tcl.tk/4118

The idea can be used to generate loopless sorting algorithms for 2,3,4,5 elements in C.

EDIT: i tried it on one set of data, and i measured 87% running time compared to the code in the question. Using insertion sort on <20 blocks resulted 92% on the same data set. This measurement is not representative but may show that this is a way how can You improve Your quicksort code.

EDIT: this example code write out loopless sorting functions for 2-6 elements:

#include <stdio.h>

#define OUT(i,fmt,...)  printf("%*.s"fmt"\n",i,"",##__VA_ARGS__);

#define IF( a,b, i, block1, block2 ) \
  OUT(i,"if(t[%i]>t[%i]){",a,b) block1 OUT(i,"}else{") block2 OUT(i,"}")

#define AB2(i,a,b)         IF(a,b,i,P2(i+2,b,a),P2(i+2,a,b))
#define  P2(i,a,b)         print_perm(i,2,(int[2]){a,b});

#define AB3(i,a,b,c)       IF(a,b,i,BC3(i+2,b,a,c),BC3(i+2,a,b,c))
#define AC3(i,a,b,c)       IF(a,c,i, P3(i+2,c,a,b), P3(i+2,a,c,b))
#define BC3(i,a,b,c)       IF(b,c,i,AC3(i+2,a,b,c), P3(i+2,a,b,c))
#define  P3(i,a,b,c)       print_perm(i,3,(int[3]){a,b,c});

#define AB4(i,a,b,c,d)     IF(a,b,i,CD4(i+2,b,a,c,d),CD4(i+2,a,b,c,d))
#define AC4(i,a,b,c,d)     IF(a,c,i, P4(i+2,c,a,b,d), P4(i+2,a,c,b,d))
#define BC4(i,a,b,c,d)     IF(b,c,i,AC4(i+2,a,b,c,d), P4(i+2,a,b,c,d))
#define BD4(i,a,b,c,d)     IF(b,d,i,BC4(i+2,c,d,a,b),BC4(i+2,a,b,c,d))
#define CD4(i,a,b,c,d)     IF(c,d,i,BD4(i+2,a,b,d,c),BD4(i+2,a,b,c,d))
#define  P4(i,a,b,c,d)     print_perm(i,4,(int[4]){a,b,c,d});

#define AB5(i,a,b,c,d,e)   IF(a,b,i,CD5(i+2,b,a,c,d,e),CD5(i+2,a,b,c,d,e))
#define AC5(i,a,b,c,d,e)   IF(a,c,i, P5(i+2,c,a,b,d,e), P5(i+2,a,c,b,d,e))
#define AE5(i,a,b,c,d,e)   IF(a,e,i,CB5(i+2,e,a,c,b,d),CB5(i+2,a,e,c,b,d))
#define BE5(i,a,b,c,d,e)   IF(b,e,i,AE5(i+2,a,b,c,d,e),DE5(i+2,a,b,c,d,e))
#define BD5(i,a,b,c,d,e)   IF(b,d,i,BE5(i+2,c,d,a,b,e),BE5(i+2,a,b,c,d,e))
#define CB5(i,a,b,c,d,e)   IF(c,b,i,DC5(i+2,a,b,c,d,e),AC5(i+2,a,b,c,d,e))
#define CD5(i,a,b,c,d,e)   IF(c,d,i,BD5(i+2,a,b,d,c,e),BD5(i+2,a,b,c,d,e))
#define DC5(i,a,b,c,d,e)   IF(d,c,i, P5(i+2,a,b,c,d,e), P5(i+2,a,b,d,c,e))
#define DE5(i,a,b,c,d,e)   IF(d,e,i,CB5(i+2,a,b,c,e,d),CB5(i+2,a,b,c,d,e))
#define  P5(i,a,b,c,d,e)   print_perm(i,5,(int[5]){a,b,c,d,e});

#define AB6(i,a,b,c,d,e,f) IF(a,b,i,CD6(i+2,b,a,c,d,e,f),CD6(i+2,a,b,c,d,e,f))
#define AC6(i,a,b,c,d,e,f) IF(a,c,i, P6(i+2,c,a,b,d,e,f), P6(i+2,a,c,b,d,e,f))
#define AE6(i,a,b,c,d,e,f) IF(a,e,i,CB6(i+2,e,a,c,b,d,f),CB6(i+2,a,e,c,b,d,f))
#define BD6(i,a,b,c,d,e,f) IF(b,d,i,DF6(i+2,c,d,a,b,e,f),DF6(i+2,a,b,c,d,e,f))
#define BE6(i,a,b,c,d,e,f) IF(b,e,i,AE6(i+2,a,b,c,d,e,f),DE6(i+2,a,b,c,d,e,f))
#define CB6(i,a,b,c,d,e,f) IF(c,b,i,DC6(i+2,a,b,c,d,e,f),AC6(i+2,a,b,c,d,e,f))
#define CD6(i,a,b,c,d,e,f) IF(c,d,i,EF6(i+2,a,b,d,c,e,f),EF6(i+2,a,b,c,d,e,f))
#define DB6(i,a,b,c,d,e,f) IF(d,b,i,BE6(i+2,a,b,c,d,e,f),BE6(i+2,c,d,a,b,e,f))
#define DC6(i,a,b,c,d,e,f) IF(d,c,i, P6(i+2,a,b,c,d,e,f), P6(i+2,a,b,d,c,e,f))
#define DE6(i,a,b,c,d,e,f) IF(d,e,i,CB6(i+2,a,b,c,e,d,f),CB6(i+2,a,b,c,d,e,f))
#define DF6(i,a,b,c,d,e,f) IF(d,f,i,DB6(i+2,a,b,e,f,c,d),BE6(i+2,a,b,c,d,e,f))
#define EF6(i,a,b,c,d,e,f) IF(e,f,i,BD6(i+2,a,b,c,d,f,e),BD6(i+2,a,b,c,d,e,f))
#define  P6(i,a,b,c,d,e,f) print_perm(i,6,(int[6]){a,b,c,d,e,f});

#define SORT(ABn,n,...) \
  OUT( 0, ""); \
  OUT( 0, "inline void sort" #n "( int t[" #n "] ){" ) \
  OUT( 2, "int tmp;" ) \
  ABn( 2, __VA_ARGS__ ) \
  OUT( 0, "}" )

void print_perm( int ind, int n, int t[n] ){
  printf("%*.s", ind-1, "");
  for( int i=0; i<n; i++ )
    if( i != t[i] ){
      printf(" tmp=t[%i]; t[%i]=",i,i);
      for( int j=t[i],k; j!=i; k=j,j=t[j],t[k]=k)
        printf("t[%i]; t[%i]=",j,j);
      printf("tmp;");
    }
  printf("\n");
}

int main( void ){
  SORT( AB2, 2, 0,1 )
  SORT( AB3, 3, 0,1,2 )
  SORT( AB4, 4, 0,1,2,3 )
  SORT( AB5, 5, 0,1,2,3,4 )
  SORT( AB6, 6, 0,1,2,3,4,5 )
}

The generated code 3718 lines long:

sort2():    8
sort3():   24
sort4():   96
sort5():  512
sort6(): 3072

Solution 5

Try another sort algorithm.

Depending on your data, you may be able to trade memory for speed.

According to Wikipedia

Quick sort has a best case O(n log n) performance and O(1) space
Merge sort has a fixed O(n log n) performance and O(n) space
Radix sort has a fixed O(n . <number of digits>) perfomance and O(n . <number of digits>) space

Edit

Apparently your data is integers. With 2.5M integers in the range [0, 0x0fffffff], my implementation of radix-sort is about 4 times as fast as your implementation of quick-sort.

$ ./a.out
qsort time: 0.39 secs
radix time: 0.09 secs
good: 2000; evil: 0

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_SIZE 2560000
#define RANDOM_NUMBER (((rand() << 16) + rand()) & 0x0fffffff)

int partition(int a[], int lower, int upper) {
  int pivot, i, j, temp;
  pivot = a[lower];
  i = lower + 1;
  j = upper;
  while (i < j) {
    while((i < upper) && (a[i] <= pivot)) i++;
    while (a[j] > pivot) j--;
    if (i < j) {
      temp = a[i];
      a[i] = a[j];
      a[j] = temp;
    }
  }
  if (pivot > a[j]) {
    temp = a[j];
    a[j] = a[lower];
    a[lower] = temp;
  }
  return j;
}

void quick(int a[], int lower, int upper) {
  int loc;
  if (lower < upper) {
    loc = partition(a, lower, upper);
    quick(a, lower, loc-1);
    quick(a, loc+1, upper);
  }
}

#define NBUCKETS 256
#define BUCKET_SIZE (48 * (1 + ARRAY_SIZE / NBUCKETS))

/* "waste" memory */
int bucket[NBUCKETS][BUCKET_SIZE];

void radix(int *a, size_t siz) {
  unsigned shift = 0;
  for (int dummy=0; dummy<4; dummy++) {
    int bcount[NBUCKETS] = {0};
    int *aa = a;
    size_t s = siz;
    while (s--) {
      unsigned v = ((unsigned)*aa >> shift) & 0xff;
      if (bcount[v] == BUCKET_SIZE) {
        fprintf(stderr, "not enough memory.\n");
        fprintf(stderr, "v == %u; bcount[v] = %d.\n", v, bcount[v]);
        exit(EXIT_FAILURE);
      }
      bucket[v][bcount[v]++] = *aa++;
    }
    aa = a;
    for (int k=0; k<NBUCKETS; k++) {
      for (int j=0; j<bcount[k]; j++) *aa++ = bucket[k][j];
    }
    shift += 8;
  }
}

int ar1[ARRAY_SIZE];
int ar2[ARRAY_SIZE];

int main(void) {
  clock_t t0;

  srand(time(0));
  for (int k=0; k<ARRAY_SIZE; k++) ar1[k] = ar2[k] = RANDOM_NUMBER;

  t0 = clock(); while (clock() == t0) /* void */; t0 = clock();
  do {
    quick(ar1, 0, ARRAY_SIZE - 1);
  } while (0);
  double qsort_time = (double)(clock() - t0) / CLOCKS_PER_SEC;

  t0 = clock(); while (clock() == t0) /* void */; t0 = clock();
  do {
    radix(ar2, ARRAY_SIZE);
  } while (0);
  double radix_time = (double)(clock() - t0) / CLOCKS_PER_SEC;

  printf("qsort time: %.2f secs\n", qsort_time);
  printf("radix time: %.2f secs\n", radix_time);

  /* compare sorted arrays by sampling */
  int evil = 0;
  int good = 0;
  for (int chk=0; chk<2000; chk++) {
    size_t index = RANDOM_NUMBER % ARRAY_SIZE;
    if (ar1[index] == ar2[index]) good++;
    else evil++;
  }
  printf("good: %d; evil: %d\n", good, evil);

  return 0;
}

View more solutions

11,407

Author by

Ravindra S

Here to learn. Display picture inspired by Carl Sagan.

Updated on June 04, 2022

Comments

Ravindra S almost 2 years

If possible, how can I improve the following quick sort(performance wise). Any suggestions?

void main()
    {
      quick(a,0,n-1);
    }

    void quick(int a[],int lower,int upper)
    {
       int loc;
       if(lower<upper)
       {
        loc=partition(a,lower,upper);
        quick(a,lower,loc-1);
        quick(a,loc+1,upper);

       }
    }

    /* Return type: int
      Parameters passed: Unsorted array and its lower and upper bounds */

    int partition(int a[],int lower,int upper)
    {
      int pivot,i,j,temp;
      pivot=a[lower];
      i=lower+1;
      j=upper;
      while(i<j)
        {
            while((i<upper)&&(a[i]<=pivot))
            i++;
            while((a[j]>pivot))
            j--;
            if(i<j)
                {
                    temp=a[i];
                    a[i]=a[j];
                    a[j]=temp;
                }

        }//end while

        if(pivot>a[j])
        {
             temp=a[j];
             a[j]=a[lower];
             a[lower]=temp;
        }

         return(j);

}//end partition

AnT stands with Russia over 14 years

...a bunch of generic theoretical ideas, but unfortunately virtually no practical ones (which is understandable, since the article has to be implementation/language independent).
AnT stands with Russia over 14 years

The comment in source code of GCC's qsort states that tricks like "median-of-three" produce no real improvement in practice :) I tend to agree with this. Unless your input is somehow specific so that it workd better with "median-of-three" selection method, that is...
Alexandru over 14 years

I think median-of-tree is just a defense against sorted arrays which are much more common that the pattern required to defeat median-of-three. On average (random ordered strings?) it doesn't provide any benefit.
Isak Savo over 14 years

+1 for suggesting something that may actually significantly improve performance. Great!
rpj over 14 years

+1 for the sheer size of your cojones. Love those nested #defines.
AnT stands with Russia over 14 years

Again, this approach unconditionally pushes the lower partition on the stack, while processing the upper partition iteratively. In general case, the depth of stack will be O(n) (unless you guarantee the almost-perfect median), which is not good. If you add another if that makes sure to always push the shorter partition on the stack, the stack depth will never exceed O(log n) even with the worst choice of median.
Goz over 14 years

I'd be interested to see the assembler output from the compiler. A loop-less sort can be done REALLY quickly using the conditional move instructions available on most processors.
sambowry over 14 years

@Goz: see the generated code, i think it can not be compiled with conditional move instructions
beermann over 14 years

I have printed this and showed to my (c#_is_the_best) friend. He is still searching for his jaw. Thank You sambowry :D
DarthVader over 14 years

Actually it has been proven that, median of 5 is better than median of 3.
DarthVader over 14 years

Great answer, but I dont think that s what he s looking for. and most of the divide and conquer algorithms + dynamic programming algorithms can run in parallel.
DarthVader over 14 years

Sorry but i disagree with you, converting recursion to iteration doesnt provide any benefit at all. GCC and most of the compilers optimize recursion using symbol table and cache at the lower levels, dynamic programming. Using iteration over recursion doesnt have any benefit at all. Forgive me for this; Computer Science is based on Recursion Theory and recursive functons.
DarthVader over 14 years

Recursion doesnt have over head, actually. See en.wikipedia.org/wiki/Dynamic_programming
Swiss over 14 years

Honestly, I am a bit disappointed that he never did come back to clarify what he was looking for.
atv over 14 years

@ AndreyT , Thanks for suggestion :) @ unknown(google) , How DP is applicable hear? Recursion creates large number of stack frames(containing return address ,frame pointer etc...) on program stack which is obviously an overhead.Use of Explicit stack will eliminate this overhead.
Alexandru over 14 years

Nowadays the answer to everything is "parallelize" or "cloud computing".
AnT stands with Russia over 14 years

@unknown: Absolutely incorrect. Apparently, you missed the point entirely. The main point of this modification is not performance (which might not improve), but the hard limit on the stack depth. The original implementation had a O(N) recursion depth in worst case. Which meant that on a bad input it would crash because of stack overflow. The implementation I suggested, again, has a guaranteed limit of log2(N) on the recursion depth. I.e. it is guaranteed, for example, that on a 32-bit platform there will never be more than 32 levels of recursion and the stack will never overflow.
AnT stands with Russia over 14 years

@unknown: Note, that the log2(N) guarantee applies regardless of how bad the input is and how bad the median choice method is. Obviously, this is very valuable and very practical benefit. Moreover, this is the most important thing everyone should think about first. The performance comes second. Your verdict about this "providing no benefit at all" is laughable at best. As for the Computer Science reference - I have no idea about the point of that completely irrelevant remark.
Ravindra S over 14 years

I am sorry but my account got suspended for seven days.So in the mean time I was not able to do a single activity on my account.Hence the delay in reply. Because of that your answer got selected as "best" answer.
Admin over 14 years

@beerman: I am too - the code may well be extremely clever, but it is in terms of readability and comprehension extremely awful.
Thorsten79 over 14 years

Actually application developers should better use framework functions for sorting. In the early days you rolled your own, but these times are over for about 99% of all use cases. Additionally, as frameworks get better and better, all users of the sort function in application code get the benefit. (Same goes for C++ code with STL instead of creating your own b-tree class etc.)
dsimcha about 14 years

+1, but if you're not targeting DS9K-quality compilers it may be more readable to use tail recursion for the longer subarray instead of iteration, and let the compiler optimize the tail recursion.