How big is the performance gap between std::sort and std::stable_sort in practice?

c++ sorting stl

22,155

Solution 1

There are good answers that compared the algorithms theoretically. I benchmarked std::sort and std::stable_sort with google/benchmark for curiosity's sake.

It is useful to point out ahead of time that;

Benchmark machine has 1 X 2500 MHz CPU and 1 GB RAM
Benchmark OS Arch Linux 2015.08 x86-64
Benchmark compiled with g++ 5.3.0 and clang++ 3.7.0 (-std=c++11, -O3 and -pthread)
BM_Base* benchmark tries to measure the time populating std::vector<>. That time should be subtracted from the sorting results for better comparison.

First benchmark sorts std::vector<int> with 512k size.

[ g++ ]# benchmark_sorts --benchmark_repetitions=10
Run on (1 X 2500 MHz CPU )
2016-01-08 01:37:43
Benchmark                         Time(ns)    CPU(ns) Iterations
----------------------------------------------------------------
...
BM_BaseInt/512k_mean              24730499   24726189         28
BM_BaseInt/512k_stddev              293107     310668          0
...
BM_SortInt/512k_mean              70967679   70799990         10
BM_SortInt/512k_stddev             1300811    1301295          0
...
BM_StableSortInt/512k_mean        73487904   73481467          9
BM_StableSortInt/512k_stddev        979966     925172          0

[ clang++ ]# benchmark_sorts --benchmark_repetitions=10
Run on (1 X 2500 MHz CPU )
2016-01-08 01:39:07
Benchmark                         Time(ns)    CPU(ns) Iterations
----------------------------------------------------------------
...
BM_BaseInt/512k_mean              26198558   26197526         27
BM_BaseInt/512k_stddev              320971     348314          0
...
BM_SortInt/512k_mean              70648019   70666660         10
BM_SortInt/512k_stddev             2030727    2033062          0
...
BM_StableSortInt/512k_mean        82004375   81999989          9
BM_StableSortInt/512k_stddev        197309     181453          0

Second benchmark sorts std::vector<S> with 512k size (sizeof(Struct S) = 20).

[ g++ ]# benchmark_sorts --benchmark_repetitions=10
Run on (1 X 2500 MHz CPU )
2016-01-08 01:49:32
Benchmark                         Time(ns)    CPU(ns) Iterations
----------------------------------------------------------------
...
BM_BaseStruct/512k_mean           26485063   26410254         26
BM_BaseStruct/512k_stddev           270355     128200          0
...
BM_SortStruct/512k_mean           81844178   81833325          8
BM_SortStruct/512k_stddev           240868     204088          0
...
BM_StableSortStruct/512k_mean    106945879  106857114          7
BM_StableSortStruct/512k_stddev   10446119   10341548          0

[ clang++ ]# benchmark_sorts --benchmark_repetitions=10
Run on (1 X 2500 MHz CPU )
2016-01-08 01:53:01
Benchmark                         Time(ns)    CPU(ns) Iterations
----------------------------------------------------------------
...
BM_BaseStruct/512k_mean           27327329   27280000         25
BM_BaseStruct/512k_stddev           488318     333059          0 
...
BM_SortStruct/512k_mean           78611207   78407400          9
BM_SortStruct/512k_stddev           690207     372230          0 
...
BM_StableSortStruct/512k_mean    109477231  109333325          8
BM_StableSortStruct/512k_stddev   11697084   11506626          0

Anyone who likes to run the benchmark, here is the code,

#include <vector>
#include <random>
#include <algorithm>

#include "benchmark/benchmark_api.h"

#define SIZE 1024 << 9

static void BM_BaseInt(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<int> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back(dist(mt));
    }
  }
}
BENCHMARK(BM_BaseInt)->Arg(SIZE);

static void BM_SortInt(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<int> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back(dist(mt));
    }

    std::sort(v.begin(), v.end());
  }
}
BENCHMARK(BM_SortInt)->Arg(SIZE);

static void BM_StableSortInt(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<int> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back(dist(mt));
    }

    std::stable_sort(v.begin(), v.end());
  }
}
BENCHMARK(BM_StableSortInt)->Arg(SIZE);


struct S {
  int key;
  int arr[4];
};

static void BM_BaseStruct(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<S> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back({dist(mt)});
    }
  }
}
BENCHMARK(BM_BaseStruct)->Arg(SIZE);

static void BM_SortStruct(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<S> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back({dist(mt)});
    }

    std::sort(v.begin(), v.end(),
              [](const S &a, const S &b) { return a.key < b.key; });
  }
}
BENCHMARK(BM_SortStruct)->Arg(SIZE);

static void BM_StableSortStruct(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<S> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back({dist(mt)});
    }

    std::stable_sort(v.begin(), v.end(),
                     [](const S &a, const S &b) { return a.key < b.key; });
  }
}
BENCHMARK(BM_StableSortStruct)->Arg(SIZE);


BENCHMARK_MAIN();

Solution 2

std::stable_sort performs NlogN comparisons when sufficient memory is available. When insufficient memory is available, it degrades to N((logN)^2) comparisons. Therefore it is roughly of the same efficiency as std::sort (which performs O(NlogN) comparisons in both average and worst case) when memory is available.

For those interested, sort() uses an introsort (quicksort which switches to heapsort when the recursion reaches a certain depth) and stable_sort() uses a merge sort.

Solution 3

Big enough to warrant a separate function that does stable sort and not have std::sort() do it transparently.

Solution 4

Sometimes std::stable_sort() is needed because it maintains order of elements that are equal.

And the conventional advice is that, if maintaining order is not important, you should use std::sort() instead.

However, its context dependant. There is plenty of data that is best sorted with stable sort even if you don't need to maintain order:

Quicksort quickly becomes worst-case performance if the data has consistently poor pivot points.

The Burrows-Wheeler Transform is an algorithm used as part of data compression, e.g. bzip2. It requires sorting all the rotations of the text. For the majority of text data, merge sort (as often used by std::stable_sort()) is substantially faster than quicksort (as usually used by std::sort()).

bbb is a BWT implementation that notes the advantages of std::stable_sort() over sort() for this application.

Solution 5

Was looking for something similar - but was surprised no one talked about Auxiliary space.

As I believe - the implementation of both stable_sort and sort is supposed to guarantee O(NlogN) for all (Best, Average & Worst) cases.

However, the difference exists in the Auxiliary space used. stable_sort needs an Auxiliary space of O(N).

May be the difference in performance lies in acquiring that space. :)
Otherwise, theoretically - they are supposed to be same w.r.t performance.

sort should do what you need unless you need this -> stable_sort preserves the relative order of the elements with equivalent values.

View more solutions

22,155

SebastianK

Interested in algorithms, operations research, scheduling, C++

Updated on July 09, 2022

Comments

SebastianK almost 2 years

Both should run in O(n log n), but in general sort is faster than stable_sort. How big is the performance gap in practice? Do you have some experience about that?

I want to sort a very large number of structs that have a size of about 20 bytes. The stability of the result would be nice in my case, but it is not a must. At the moment the underlying container is a plain array, perhaps it could be changed to a std::deque later on.
- Justin Meiners almost 3 years
  
  Note that std::sort is typically implemented as quicksort, and std::stable_sort is a sophisticated form of merge sort.
- wcochran about 2 years
  
  @JustinMeiners quicksort is O(n^2) worst case thus std::sort uses a hybrid that incorporates heapsort when quicksort recurses too deep.
- Justin Meiners about 2 years
  
  @wcochran good point, I am greatly oversimplifying. The STL merge sort has some different cases like that as well: justinmeiners.github.io/efficient-programming-with-component‌s/…
MSalters about 15 years

LogN^2 ususually does not mean log(N^2) but (log N)^2, and especially so if it occurs in big-O notation. This commonly happens with algorithms that take O(N log N) steps with O(log N) work per step and vice versa. So, no, it's not a constant 2.
andrewmu about 15 years

Basically what I am trying to say is try both ways.
Peter M almost 11 years

HAHAH. Okay, so maybe not the most technically detailed answer, but certainly my favorite.
Admin over 10 years

You say std::sort (which performs O(NlogN) comparisons in both average and worst case). But the link to std::sort says: O(N·log(N)) in all cases only since C++11. Before C++11 there was no requiment O(Nlog(N)) for the worst case.
Clearer about 9 years

While this is probably too small on modern computers, when you're sorting very large data sets, you're better off using an external sorting algorithm which makes good use of disk to store intermediate results; such as external merge sort.
Erik Aronesty over 8 years

(and sometimes sorting pointers is a lot faster than sorting structs - and vice versa, depending on the size of the structs, the available RAM, etc)
5gon12eder over 8 years

+1 for actually going ahead and doing the benchmark. Your work looks sound. The results would be more accessible, however, if you'd summarize them in an easier to digest form like drawing a small diagram.