How can I sort an iterator without putting it all in a vector?

15,023

It is literally impossible to sort a set of values without having all of the data. For example, if the iterator had 1 billion instances of 1 followed by a single 0, you simply wouldn't know that the zero needed to go first until you got there. You may wish to re-acquaint yourself with the concept of on- and offline algorithms.

without putting it all in a vector

That's easy: don't use a vector, use any type that implements FromIterator. For example, you could collect into a BinaryHeap:

use std::{collections::BinaryHeap, iter};

fn main() {
    let a_lot_of_numbers = iter::repeat(1).take(100).chain(iter::once(0));
    let data: BinaryHeap<_> = a_lot_of_numbers.collect();
}

Whether this is a good idea or not depends entirely on your case.

If you just don't want to see the vector or just wish to preserve the chaining, then I'd suggest using Itertools::sorted. This uses a Vec internally, meaning that all of the data is stored in memory before the first value is returned:

use itertools::Itertools; // 0.8.0
use std::iter;

fn main() {
    let a_lot_of_numbers = iter::repeat(1).take(100).chain(iter::once(0));

    for v in a_lot_of_numbers.sorted() {
        println!("{}", v);
    }
}

This is a common problem with databases, where is not wise to load all the data then sort

Databases are amazingly complex pieces of software with years of effort put into them considering carefully weighed tradeoffs. You won't find that grade of algorithm lying about in a package manager. Even if you could, databases don't always get it right, requiring skilled programmers to tweak queries to perform better. All you need to know about sorting in Postgres covers a good set of what Postgres can do.

It should be theoretically possible to write an iterator adapter that writes all of the data to disk, performs sorting there, then re-reads the data from the disk. This is called external sorting.

Share:
15,023
mamcx
Author by

mamcx

CEO &amp; Developer of "El malabarista", maker of the POS for the iPhone BestSeller. 12+ years of experience creating software in use for more than +2000 users in my country. Experience in Python, Django, Web apps, APIs, .NET, Objective-C, iPhone/iPad development, RemObjects &amp; more. CV: http://careers.stackoverflow.com/cv/employer/20796

Updated on June 13, 2022

Comments

  • mamcx
    mamcx almost 2 years

    I'm building a generic interface similar to generators that stream data from a stream to another, to eventually do things like:

    file |> toCsv |> filter |> sort |> filter...
    

    I know how to sort a vector/slice, but how can I sort from an incoming stream/iterator without putting it all in a vector?

    stream.iter().collect_sorted()
    

    I need to fuse vectors, trees, files, databases, etc., so sometimes I don't know how large the incoming data is without consuming it all.

    I'm not against storing the results. The problem is that sorting is tied to slices/vector. I need to be able to do:

    datasource |> Algo.sort |> next...
    

    instead of:

    let data = datasource |> into_vec
    data.sort()
    data |> next...
    

    Different sorting algorithms exist for different use cases, so eventually I wish to apply the best for the data at hand:

    datasource |> Algo.MergeSort |> next...
    datasource |> Algo.BubbleSort |> next...
    
  • Admin
    Admin about 5 years
    I think the key to OP's problem will be FromIterator and IntoIterator - then a module in the chain can simply accept iterators and output iterators. Hard without seeing how OP has implemented things so far though.
  • Shepmaster
    Shepmaster about 5 years
    @swalladge Indeed. If OP is just using method chaining, then something closer to Itertools::sorted will be relevant.
  • mamcx
    mamcx about 5 years
    @ Shepmaster Looking how itertools do it is enough for me. I need to adapt to a streaming interface I'm doing. With relation to databases, I'm building a relational lang (thing like python but everything is a relation) so everything is in-memory, only need to worry for loading from external sources and implement a few core collections.