Can I build a list, and sort it at the same time?

12,221

Solution 1

The short answer is: it's not worth it.

Have a look at insertion sort. The worst-case running time is O(n^2) (average case is also quadratic). On the other hand, Python's sort (also known as Timsort) will take O(n log n) in the worst case.

Yes, it does "seem" cleaner to keep the list sorted as you're inserting, but that's a fallacy. There is no real benefit to it. The only time you'd consider using insertion sort is when you need to show the sorted list after every insertion.

Solution 2

The two approaches are asmptotically equivalent.

Sorting is O(n lg n) (Python uses Timsort by default, except for very small arrays), and inserting in a sorted list is O(lg n) (using binary search), which you would have to do n times.

In practice, one method or the other may be slightly faster, depending on how much of your data is already sorted.

EDIT: I assumed that inserting in the middle of a sorted list after you've found the insertion point would be constant time (i.e. the list behaved like a linked list, which is the data structure you would use for such an algorithm). This probably isn't the case with Python lists, as pointed out by Sven. This would make the "keep the list sorted" approach O(n^2), i.e. insertion sort.

I say "probably" because some list implementations switch from array to linked list as the list grows, the most notable example being CFArray/NSArray in CoreFoundation/Cocoa. This may or may not be the case with Python.

Solution 3

Take a look at the bisect module. It gives you various tools for maintaining a list order. In your case, you probably want to use bisect.insort.

for item in query_for_stuff():
    bisect.insort( my_list, "query for %s data" % item )
Share:
12,221

Related videos on Youtube

Simon Lundberg
Author by

Simon Lundberg

Updated on June 04, 2022

Comments

  • Simon Lundberg
    Simon Lundberg almost 2 years

    I'm working on a script for a piece of software, and it doesn't really give me direct access to the data I need. Instead, I need to ask for each piece of information I need, and build a list of the data I'm getting. For various reasons, I need the list to be sorted. It's very easy to just build the list once, and then sort it, followed by doing stuff with it. However, I assume it would be faster to run through everything once, rather than build the list and then sort it.

    So, at the moment I've basically got this:

    my_list = []
    
    for item in "query for stuff":
        my_list.append("query for %s data" % item)
    
    my_list.sort()
    
    do_stuff(my_list)
    

    The "query for stuff" bit is the query interface with the software, which will give me an iterable. my_list needs to contain a list of data from the contents of said iterable. By doing it like this, I'm querying for the first list, then looping over it to extract the data and put it into my_list. Then I'm sorting it. Lastly, I'm doing stuff to it with the do_stuff() method, which will loop over it and do stuff to each item.

    The problem is that I can't do_stuff() to it before it's sorted, as the list order is important for various reasons. I don't think I can get away from having to loop over lists twice — once to build the list and once to do stuff to each item in it, as we won't know in advance if a recently added item at position N will stay at position N after we've added the next item — but it seems cleaner to insert each item in a sorted fashion, rather than just appending them at the end. Kind of like this:

    for item in "query for stuff":
        my_list.append_sorted(item)
    

    Is it worth bothering trying to do it like this, or should I just stick to building the list, and then sorting it?

    Thanks!

  • Sven Marnach
    Sven Marnach over 12 years
    Inserting in a sorted (Python) list is O(n), not O(log n). Python lists are stored as arrays.
  • Sven Marnach
    Sven Marnach over 12 years
    How would you binary search a linked list? The data structure you would use is some kind of balanced tree.
  • Admin
    Admin over 12 years
    There is no such optimization in any Python implementation I'm aware of. CPython's list implementation is always a C array, i.e. a tightly-packed piece of continuous piece of memory. And the (surprisingly numerous) clever tricks of PyPy's data structure implementations are usually documented (not guaranteed, but at least mentioned) and nothing like this is among them. I can't speka for Jython and IronPython, but I figure far less optimization effort went into them, and for them it's much easier to just wrap the underlying platform's dynamic array class (ArrayList or whatever).
  • Simon Lundberg
    Simon Lundberg over 12 years
    So the problem with keeping it sorted, is basically that each time you insert something at position N, any entries in the list at position>N will have to be shifted +1 — right? Whereas if you build the list first, you will only need to move entries around during the one .sort() operation at the end.
  • Can Berk Güder
    Can Berk Güder over 12 years
    @SvenMarnach it doesn't have to be a linked list (I said "behaved like a linked list"), it just has to support insertions in at most logarithmic time. it could be a binary tree, possibly skip list, etc.
  • Can Berk Güder
    Can Berk Güder over 12 years
    @SimonLundberg if the list is implemented as an array, yes, the problem is that everything needs to be shifted. with a plain linked list, finding the insertion point takes linear time, as Sven pointed out. other data structures are available, but the gist is that no solution can be faster than O(n lg n), which is what you get with sorting afterwards.
  • patapouf_ai
    patapouf_ai over 6 years
    This is false. Inserting an element into a sorted list is O(log(n)) if you do it properly. And if you need a sorted list between each insert, than it is much more efficient to keep a sorted list.
  • mpenkov
    mpenkov over 6 years
    You're thinking of an abstract list. Python lists are implemented as arrays. This means insertion costs O(n) in the average case, regardless of where you insert. See wiki.python.org/moin/TimeComplexity.