how to optimally count elements in a python list

10,318

Solution 1

Change where you sort for a savings of about 20%.

Change this:

def dupli(the_list):
    return [(item, the_list.count(item)) for item in sorted(set(the_list))]

To this:

def dupli(the_list):
    count = the_list.count # this optimization added courtesy of Sven's comment
    result = [(item, count(item)) for item in set(the_list)]
    result.sort()
    return result

The reason this is faster is that the sorted iterator must create a temporary list, whereas sorting the result sorts in place.

edit: Here's another approach that is 35% faster than your original:

def dupli(the_list):
    counts = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    for n in the_list:
        counts[n] += 1
    return [(i, counts[i]) for i in (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) if counts[i]]

Note: You may want to randomize the values for the_list. My final version of dupli tests even faster with other random data sets (import random; the_list=[random.randint(0,12) for i in xrange(10)])

Solution 2

I would try:

from collections import defaultdict
output = defaultdict(lambda: 0)
for item in the_list: output[item] += 1
return sorted(output.items())

Solution 3

Taking advantage of the qualification "between 0 and 12":

>>> the_list = [5, 7, 6, 5, 5, 4, 4, 7, 5, 4]
>>> answer1 = [0] * 13
>>> for i in the_list:
...    answer1[i] += 1
...
>>> answer1
[0, 0, 0, 0, 3, 4, 1, 2, 0, 0, 0, 0, 0]
>>> # You might be able to use that as-is:
...
>>> for i, v in enumerate(answer1):
...     if v: print i, v
...
4 3
5 4
6 1
7 2
>>> # Otherwise you can build the list that you specified:
...
>>> answer2 = [(i, v) for i, v in enumerate(answer1) if v]
>>> answer2
[(4, 3), (5, 4), (6, 1), (7, 2)]
>>>
Share:
10,318
Christophe
Author by

Christophe

Updated on June 18, 2022

Comments

  • Christophe
    Christophe almost 2 years

    This is almost the same question than here, except that I am asking about the most efficient solution for a sorted result.

    I have a list (about 10 integers randomly between 0 and 12), for example:

    the_list = [5, 7, 6, 5, 5, 4, 4, 7, 5, 4]
    

    I want to create a function that returns a list of tuples (item, counts) ordered by the first element, for example

    output = [(4, 3), (5, 4), (6, 1), (7, 2)]
    

    So far I have used:

    def dupli(the_list):
        return [(item, the_list.count(item)) for item in sorted(set(the_list))]
    

    But I call this function almost a millon time and I need to make it as fast as I (python) can. Therefore my question: How to make this function less time comsuming? (what about memory?)

    I have played around a bit, but nothing obvious came up:

    from timeit import Timer as T
    number=10000
    setup = "the_list=[5, 7, 6, 5, 5, 4, 4, 7, 5, 4]"
    
    stmt = "[(item, the_list.count(item)) for item in sorted(set(the_list))]"
    T(stmt=stmt, setup=setup).timeit(number=number)
    
    Out[230]: 0.058799982070922852
    
    stmt = "L = []; \nfor item in sorted(set(the_list)): \n    L.append((item, the_list.count(item)))"
    T(stmt=stmt, setup=setup).timeit(number=number)
    
    Out[233]: 0.065041065216064453
    
    stmt = "[(item, the_list.count(item)) for item in set(sorted(the_list))]"
    T(stmt=stmt, setup=setup).timeit(number=number)
    
    Out[236]: 0.098351955413818359
    

    Thanks
    Christophe

  • Sven Marnach
    Sven Marnach over 13 years
    On my machine, this takes about twice the time of the dupli() function in the OP.
  • Sven Marnach
    Sven Marnach over 13 years
    Both these functions are significantly slower on my machine than the function in the OP. Additionally, the second function returns a wrong result.
  • Sven Marnach
    Sven Marnach over 13 years
    This is slower than the function in the OP on my machine, though by a smaller margin than all other suggestions so far.
  • Sven Marnach
    Sven Marnach over 13 years
    This is the first thing I tried. On my machine, it is on average the same speed as the original dupli() -- at least if the output is converted to the requested format (answer2).
  • Wolph
    Wolph over 13 years
    @Sven Marnach: that depends, if you use psyco than the dupli2 method is faster over here. You're right about the dupli3 method, I made a stupid mistake there and accidently posted an earlier version here. I'll update it :)
  • Sven Marnach
    Sven Marnach over 13 years
    OK, I used CPython 2.6.6. It gets slightly faster if you move the attribute lookup for .get() out of the loop.
  • Sven Marnach
    Sven Marnach over 13 years
    When timing dupli() and dupli2() you copied the returned list using list(). This seems a bit pointless.
  • John La Rooy
    John La Rooy over 13 years
    you should use setdefault instead of the shenanigans with get
  • Sven Marnach
    Sven Marnach over 13 years
    This is the fastest approach I have seen so far (using plain CPython 2.6.6). It can be slightly improved by looking up .count() outside of the list comprehension (i.e. count = the_list.count before result = ..., and then use (item, count(item)) inside the list comprehension).
  • Steven Rumbalski
    Steven Rumbalski over 13 years
    @Sven Marnach: Nice optimization. I've updated my answer to include it. I've also added another approach which was based on John Machin's answer, but it tests much faster due to the elimination of enumerate and because it expands [0] * 13 to its result.
  • Wolph
    Wolph over 13 years
    @gnibbler: what is the point of using setdefault with an integer? You can't increment it anyway so you'll be doing pretty much the same. Only a defaultdict would make it shorter/better.
  • Wolph
    Wolph over 13 years
    @Sven Marnach: ofcourse, it was the only way to make somewhat of a fair comparison of the 3 methods.
  • John La Rooy
    John La Rooy over 13 years
    it will be faster to use defaultdict(int) instead of using the lambda
  • Steven Rumbalski
    Steven Rumbalski over 13 years
    I like your use of iterators. As far as optimization goes, your solution takes 2.5 times longer than OP's best effort.