How external merge sort algorithm works?

algorithm sorting mergesort external-sorting

47,809

Solution 1

I guess after such a long time you must have got an answer. But I am still providing some example links to help someone else who hits this question.

NOTE: Before looking into this link you should have an idea about Heap data structure Take a look at Example of Two-Way Sorting and Example of multiway external sorting and you will get a complete idea of the implementation of a external sorting algorithm

Solution 2

First of all, by sorting the numbers in parts of 4 numbers, you should get 3 chunks.

A:[1,2,3,5]  
B:[4,6,8,9]
C:[7]

Then you will read half of each file (ignore C since it won't fit) and merge them. So, you will load into memory {[1, 2], [4, 6]}. You will do a casual merge and write the result in a new chunk D:

Compare 1 and 4 -> D:[1]
Compare 2 and 4 -> D:[1, 2]

Now the part of A that was in RAM finished merging, so now you will have to bring the second half of it in memory. Now your memory will have {[3, 5], [4, 6]}.

Compare 3 and 4 -> D:[1, 2, 3]
Compare 5 and 4 -> D:[1, 2, 3, 4]
Compare 5 and 6 -> D:[1, 2, 3, 4, 5]

All of chunk A got merged, so now just append the rest of B into D

D:[1,2,3,4,5,6,8,9]

Now you would have to do the same process with chunks C and D. Remember that C could have more than one number in another example. By merging C and D you will get a new chunk E that will be the final sorted file.

Also, note that in a bigger example you might need more merge phases. For example, if you had 20 numbers to sort, You would create 5 chunks of 4 numbers, and then you would combine and merge two of them each time, resulting in 2 chunks of 8 numbers (plus one extra of 4 numbers), and then merge the newer chunks into one of 16 numbers and so on.

Solution 3

You'll iterate through the files at the same time.

Just start from the beginning of each file and keep picking whichever file's element is not greater (i.e. smaller or equal) than the other, output that element to the new file and increase the iterator.

From your last statement, it's unclear whether or not you already know to do this, but this is all you need to do, because:

You'd only need to have one number in memory for each of the files, and of course any indices and other variables that are presumably ignored for the purpose of this exercise.
You only need to read each file once, as you can keep the files open at the correct position during this process so you don't need to read the whole file again to get to the correct position.

So, for:

A:[1,2,3,5]
B:[4,6,8,9]

You'd start off with the first element from each file - 1 and 4.

The 1 is smaller, so you output that to the new file and move on to 2.

2 is smaller than 4, so you output that and move on to 3.

3 is smaller than 4, so you output that and move on to 5.

4 is smaller than 5, so you output that and move on to 6.

5 is smaller than 6, so you output that and then you've reached the end of A.

Now just output the rest of B: 6, 8, 9.

This gives you [1,2,3,4,5,6,8,9].

Solution 4

External sorting is usually used when you need to sort files that are too large to fit into memory.

The trick is to break the larger input file into k sorted smaller chunks and then merge the chunks into a larger sorted file. For the merge use a min heap. k will depend on your memory threshold.

Read a certain number of records (depending on your memory threshold) from each chunk and put it in a queue per chunk.

Pop the leftmost item (This will be the smallest item as the items in the queue will be sorted) from each queue and push it to the heap

Pop the min item from the heap. Note what queue it came from

Replenish the queue with the next item from it's corresponding chunk that is not in the queue

Pop the left most item from the queue and push it to the heap

Write the min item to the output file

Continue the above 4 steps till the heap is empty

Sample python code (Does not merge in place)

import os
import heapq
import itertools
import linecache
from collections import deque
import sys


def external_sort(input_directory, input_file_name, output_file_name):
    with open(os.path.expanduser(input_directory + '/' + output_file_name), 'w+') as f:
        heap = []
        pages = {}
        next_line_numbers = {}
        has_more_items = {}
        chunk_file_paths, max_chunk_size = create_sorted_chunks(input_directory, input_file_name)
        max_page_size = max_chunk_size // 10
        for chunk_file_path in chunk_file_paths:
            pages[chunk_file_path] = populate_page(chunk_file_path, max_page_size)
            next_line_numbers[chunk_file_path] = len(pages[chunk_file_path])
            has_more_items[chunk_file_path] = True
        for chunk_file_path in chunk_file_paths:
            heapq.heappush(heap, pages[chunk_file_path].popleft())
        while heap:
            item, chunk_file_path = heapq.heappop(heap)
            f.write(str(item)+'\n')
            if has_more_items[chunk_file_path]:
                has_more_items[chunk_file_path] = append_next(pages, chunk_file_path, next_line_numbers[chunk_file_path])
                next_line_numbers[chunk_file_path] += 1
            if pages[chunk_file_path]:
                heapq.heappush(heap, pages[chunk_file_path].popleft())
    for chunk_file_path in chunk_file_paths:
        os.remove(chunk_file_path)


def populate_page(chunk_file_path, max_page_size):
    chunk = deque()
    with open(chunk_file_path, 'r') as f:
        for line in itertools.islice(f, 0, max_page_size):
            chunk.append((int(line), chunk_file_path))
    return chunk


def append_next(chunks, chunk_file_path, line_number):
    chunk = chunks[chunk_file_path]
    item = linecache.getline(chunk_file_path, line_number)
    if item and len(item) > 0:
        chunk.append((int(item), chunk_file_path))
        has_more = True
    else:
        has_more = False
    return has_more


def create_sorted_chunks(input_file_directory, input_file_name):
    input_file_path = os.path.expanduser(input_file_directory + '/' + input_file_name)
    suffix = 1
    begin, end, tot = 0, 0, 0
    chunk_file_paths = []
    with open(input_file_path, 'r') as f:
        for line in f.readlines():
            tot += 1
    end = tot//10
    while suffix <= 10:
        buffer = []
        chunk_file_name = 'temp' + str(suffix) + '.txt'
        chunk_file_path = os.path.expanduser(input_file_directory + '/' + chunk_file_name)
        if not os.path.isfile(chunk_file_path):
            with open(os.path.expanduser(input_file_path), 'r') as f:
                for line in itertools.islice(f, begin, end):
                    buffer.append(int(line))
                create_chunk(chunk_file_path, buffer)
        suffix += 1
        begin = end
        end += tot//10
        chunk_file_paths.append(chunk_file_path)
    return chunk_file_paths, tot//10


def create_chunk(chunk_file_path, buffer):
    buffer.sort()
    with open(chunk_file_path, 'w+') as f:
        for i in buffer:
            f.write(str(i) + '\n')


if __name__ == '__main__':
    external_sort(sys.argv[1], sys.argv[2], sys.argv[3])

Solution 5

Please read the README file to properly understand external merge sort.

There is step by step implementation defined

https://github.com/melvilgit/external-Merge-Sort/blob/master/README.md

View more solutions

47,809

KolKir

Updated on January 11, 2021

Comments

KolKir over 3 years
I'm trying to understand how external merge sort algorithm works (I saw some answers for same question, but didn't find what I need). I'm reading the book "Analysis Of Algorithms" by Jeffrey McConnell and I'm trying to implement the algorithm described there.

For example, I have input data: 3,5,1,2,4,6,9,8,7, and I can load only 4 numbers into memory.

My first step is read the input file in 4-number chunks, sort them in memory and write one to file A and next to file B.

I got:
```
A:[1,2,3,5][7]  
B:[4,6,8,9]
```
Now my question how can I merge chunks from these files to the bigger ones if they will not fit to the memory? Jeffrey McConnell wrote that I need to read half chunks and merge them to next files C and D.

But I got wrong sequence:
```
C:[1,2,4,6,3,8,5,9]
D:[7]
```
Can anyone provide an example with step by step instruction, please?

PS: I understand how to merge number by number by reading from file, but how do I do it with in-memory buffers to reduce I/O operations?
- Ilmari Karonen over 10 years
  
  It seems to me you're saying you already do understand everything important. Your last question sounds like you're asking how to use / implement buffered I/O, which really has nothing to do specifically with merge sort.
- KolKir over 10 years
  
  my question is mostly about algorithm from book i wrote above, this book tells that i need to go with reads of halves of 4 numbers runs from two files and merge them to next file. As I understand it moment is about buffered operations.
- Bernhard Barker over 10 years
  
  What's going on here - A:[1,2,3,5][7]? Isn't that 7 supposed to be separate - C:[7]?
nonbeing over 9 years

Those links are excellent - finally understood external sort with those examples. Thanks.
gabereal about 8 years

This should be the accepted answer. Fits guidelines much better!
Anatolii Stepaniuk almost 7 years

Can't we just create N temp files (for each sorted run) and them merge their elements one by one in a sorted order (by creating another tmp file and deleting 2 previous tmp files)? That looks easier for me than two-way sorting and multiway sorting proposed in the links above.
mcfroob almost 6 years

@AnatoliiStepaniuk Theoretically yes, but keep in mind that reading/writing files is a lot slower than reading/writing from memory. If you had, say, 20GB of strings, and you wrote every string into its own file, that would be a lot slower than splitting the file up into, say, 100MB chunks.
stwykd almost 4 years

If A, B, or C don't fit into memory, then the outcome D would not fit into memory. I assume it will be saved to disk before completing the merge. When is D saved to disk?
Savvas Parastatidis almost 4 years

@stwykd The same way you manipulate A, B, and C. D is a file on disk and you keep track of its"cursor", and any time you want to put data into it, you append it in the end of the file, which is referenced by "cursor".