Fast sort algorithms for arrays with mostly duplicated elements?

Solution 1

In practice, you can first iterate through the array once and use a hash table to count the number of occurrences of the individual elements (this is O(n), where n = size of the list). Then take all the unique elements and sort them (this is O(k log k), where k = number of unique elements), and finally expand this back to a list of n elements in O(n) steps, recovering the counts from the hash table. If k << n you save time.
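
A minimal sketch of this idea in C++, assuming int elements (the function name and identifiers are illustrative, not from the answer):

#include <algorithm>
#include <unordered_map>
#include <vector>

// Sketch: count occurrences, sort only the unique keys, then expand.
std::vector<int> SortByCounting(const std::vector<int>& in) {
  // 1. Count occurrences with a hash table: O(n)
  std::unordered_map<int, std::size_t> counts;
  for (int x : in) ++counts[x];

  // 2. Sort only the k unique elements: O(k log k)
  std::vector<int> unique_keys;
  unique_keys.reserve(counts.size());
  for (const auto& kv : counts) unique_keys.push_back(kv.first);
  std::sort(unique_keys.begin(), unique_keys.end());

  // 3. Expand back to n elements, recovering counts from the hash table: O(n)
  std::vector<int> out;
  out.reserve(in.size());
  for (int key : unique_keys)
    out.insert(out.end(), counts[key], key);
  return out;
}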

Solution 2

I would try counting sort with some mapping function. I.e., you won't use a frequencies array whose size equals the range of the elements; instead, you would iterate over the array, write down the distinct elements and use them in a mapping function into the array of frequencies.

This way the algorithm has only one extra iteration and a mapping function, which should work in constant time (using some kind of hash table). The complexity of this approach would be O(n), which should be optimal.
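
A possible reading of this in C++ (an assumption, since the answer leaves the mapping and output steps open): a hash table maps each distinct value to a compact slot, so the frequency array has one entry per distinct value rather than one per possible value in the range; producing the sorted output then still requires ordering the k distinct values, which is cheap when k is small. All names are illustrative:

#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Sketch: counting sort with a hash-based mapping function into a
// dense frequency array.
std::vector<int> CountingSortWithMapping(const std::vector<int>& in) {
  std::unordered_map<int, std::size_t> slot_of;  // mapping function: value -> slot
  std::vector<std::size_t> freq;                 // dense frequency array
  std::vector<int> distinct;                     // distinct values, indexed by slot

  for (int x : in) {
    auto it = slot_of.find(x);
    if (it == slot_of.end()) {        // first occurrence of x
      slot_of.emplace(x, freq.size());
      freq.push_back(1);
      distinct.push_back(x);
    } else {
      ++freq[it->second];
    }
  }

  // The answer does not specify how the distinct values end up in order;
  // here they are simply sorted, which is cheap when there are few of them.
  std::vector<int> keys = distinct;
  std::sort(keys.begin(), keys.end());

  std::vector<int> out;
  out.reserve(in.size());
  for (int key : keys)
    out.insert(out.end(), freq[slot_of[key]], key);
  return out;
}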

Solution 3

Not the best algorithm, but simple:
You can put everything in a trie and have the leaves be counters. That should take O(n*m), where n is the number of elements and m is the size of the biggest element (typically that would be a constant, but not necessarily). Then pre-order traverse the trie, outputting 'counter' copies of the current key when you hit a leaf. That should take only O(n+p), where p is the size of the trie, which should be tiny compared to n.
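
A minimal sketch of the trie idea, assuming 32-bit unsigned integer keys (an assumption; the answer does not fix an element type): a binary trie over the bits, most significant bit first, with a counter at each leaf. A pre-order walk that visits the 0-child before the 1-child emits the keys in ascending order. All names are illustrative:

#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Binary trie over the 32 bits of each key; leaves carry occurrence counts.
struct TrieNode {
  std::unique_ptr<TrieNode> child[2];
  std::size_t count = 0;  // used only at leaves (depth 32)
};

void TrieInsert(TrieNode* root, std::uint32_t key) {
  TrieNode* node = root;
  for (int bit = 31; bit >= 0; --bit) {
    int b = (key >> bit) & 1u;
    if (!node->child[b]) node->child[b] = std::make_unique<TrieNode>();
    node = node->child[b].get();
  }
  ++node->count;  // leaf reached: one more occurrence of 'key'
}

void TrieEmit(const TrieNode* node, std::uint32_t prefix, int depth,
              std::vector<std::uint32_t>& out) {
  if (depth == 32) {                  // leaf: output 'count' copies of the key
    out.insert(out.end(), node->count, prefix);
    return;
  }
  for (int b = 0; b < 2; ++b)         // 0-child before 1-child => ascending order
    if (node->child[b])
      TrieEmit(node->child[b].get(), (prefix << 1) | b, depth + 1, out);
}

std::vector<std::uint32_t> TrieSort(const std::vector<std::uint32_t>& in) {
  TrieNode root;
  for (auto key : in) TrieInsert(&root, key);
  std::vector<std::uint32_t> out;
  out.reserve(in.size());
  TrieEmit(&root, 0, 0, out);
  return out;
}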

Solution 4

Implementation in C++ based on the algorithm suggested by @Antti Huima:

  • Count frequencies and store them in a hashtable.
  • Sort the unique elements from the hashtable.
  • Overwrite the input array with the sorted elements, each repeated according to its frequency.
#include <algorithm>
#include <map>
#include <unordered_map>
#include <vector>
// Sorts the input array in place
// Complexity: O(n + k*log(k)) where 'k' = number of unique elements in the input array
template <typename Datatype>
void SortArrayWithDuplicates(std::vector<Datatype>& in_seq) {
  std::unordered_map<Datatype, int> key_counts_map;
  // Count freqs O(n)
  for (const auto& itr: in_seq)
      key_counts_map[itr] += 1;

  // Sort the unique elements by copying them into an ordered map: O(k*log(k))
  std::map<Datatype, int> key_counts_sorted_map(key_counts_map.begin(),
                                                key_counts_map.end());

  // Overwrite the input sequence with each key repeated by its frequency
  auto seq_itr = std::begin(in_seq);
  for (auto const& itr: key_counts_sorted_map) {
      std::fill(seq_itr, seq_itr + itr.second, itr.first);
      seq_itr += itr.second;
  }
}
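
A small usage example with the array from the question (assuming the function above compiles as shown):

#include <iostream>
#include <vector>

int main() {
  std::vector<int> data = {10, 10, 55, 10, 999, 8851243, 10, 55, 55, 55, 10, 999, 8851243, 10};
  SortArrayWithDuplicates(data);
  for (int x : data) std::cout << x << ' ';
  // Prints: 10 10 10 10 10 10 55 55 55 55 999 999 8851243 8851243
  std::cout << '\n';
}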

Comments

  • donnyton

    What are efficient ways to sort arrays that have mostly a small set of duplicated elements? That is, a list like:

    { 10, 10, 55, 10, 999, 8851243, 10, 55, 55, 55, 10, 999, 8851243, 10 }

    Assuming that the order of equal elements doesn't matter, what are good worst-case/average-case algorithms?