Python sum, why not strings?

59,810

Solution 1

Python tries to discourage you from "summing" strings. You're supposed to join them:

"".join(list_of_strings)

It's a lot faster, and uses much less memory.

A quick benchmark:

$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = reduce(operator.add, strings)'
100 loops, best of 3: 8.46 msec per loop
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = "".join(strings)'
1000 loops, best of 3: 296 usec per loop

Edit (to answer OP's edit): As to why strings were apparently "singled out", I believe it's simply a matter of optimizing for a common case, as well as of enforcing best practice: you can join strings much faster with ''.join, so explicitly forbidding strings on sum will point this out to newbies.

BTW, this restriction has been in place "forever", i.e., since the sum was added as a built-in function (rev. 32347)

Solution 2

You can in fact use sum(..) to concatenate strings, if you use the appropriate starting object! Of course, if you go this far you have already understood enough to use "".join(..) anyway..

>>> class ZeroObject(object):
...  def __add__(self, other):
...   return other
...
>>> sum(["hi", "there"], ZeroObject())
'hithere'

Solution 3

Here's the source: http://svn.python.org/view/python/trunk/Python/bltinmodule.c?revision=81029&view=markup

In the builtin_sum function we have this bit of code:

     /* reject string values for 'start' parameter */
        if (PyObject_TypeCheck(result, &PyBaseString_Type)) {
            PyErr_SetString(PyExc_TypeError,
                "sum() can't sum strings [use ''.join(seq) instead]");
            Py_DECREF(iter);
            return NULL;
        }
        Py_INCREF(result);
    }

So.. that's your answer.

It's explicitly checked in the code and rejected.

Solution 4

From the docs:

The preferred, fast way to concatenate a sequence of strings is by calling ''.join(sequence).

By making sum refuse to operate on strings, Python has encouraged you to use the correct method.

Solution 5

Short answer: Efficiency.

Long answer: The sum function has to create an object for each partial sum.

Assume that the amount of time required to create an object is directly proportional to the size of its data. Let N denote the number of elements in the sequence to sum.

doubles are always the same size, which makes sum's running time O(1)×N = O(N).

int (formerly known as long) is arbitary-length. Let M denote the absolute value of the largest sequence element. Then sum's worst-case running time is lg(M) + lg(2M) + lg(3M) + ... + lg(NM) = N×lg(M) + lg(N!) = O(N log N).

For str (where M = the length of the longest string), the worst-case running time is M + 2M + 3M + ... + NM = M×(1 + 2 + ... + N) = O(N²).

Thus, summing strings would be much slower than summing numbers.

str.join does not allocate any intermediate objects. It preallocates a buffer large enough to hold the joined strings, and copies the string data. It runs in O(N) time, much faster than sum.

Share:
59,810
Muhammad Alkarouri
Author by

Muhammad Alkarouri

-

Updated on July 08, 2022

Comments

  • Muhammad Alkarouri
    Muhammad Alkarouri almost 2 years

    Python has a built in function sum, which is effectively equivalent to:

    def sum2(iterable, start=0):
        return start + reduce(operator.add, iterable)
    

    for all types of parameters except strings. It works for numbers and lists, for example:

     sum([1,2,3], 0) = sum2([1,2,3],0) = 6    #Note: 0 is the default value for start, but I include it for clarity
     sum({888:1}, 0) = sum2({888:1},0) = 888
    

    Why were strings specially left out?

     sum( ['foo','bar'], '') # TypeError: sum() can't sum strings [use ''.join(seq) instead]
     sum2(['foo','bar'], '') = 'foobar'
    

    I seem to remember discussions in the Python list for the reason, so an explanation or a link to a thread explaining it would be fine.

    Edit: I am aware that the standard way is to do "".join. My question is why the option of using sum for strings was banned, and no banning was there for, say, lists.

    Edit 2: Although I believe this is not needed given all the good answers I got, the question is: Why does sum work on an iterable containing numbers or an iterable containing lists but not an iterable containing strings?