What's the best way to sum all values in a Pandas dataframe?

Solution 1

Updated for Pandas 0.24+

df.to_numpy().sum()

Prior to Pandas 0.24

df.values

is the underlying NumPy array, so

df.values.sum()

calls NumPy's sum method directly, which is faster than df.sum().sum().
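As a quick check, here is a minimal sketch that runs the approaches side by side on the small frame from the question's comment. The variable names are just for illustration, and it assumes Pandas 0.24 or later so that to_numpy() is available.

```python
import pandas as pd

# Small illustrative frame (same data as in the question's comment).
df = pd.DataFrame({'A': [5, 6, 7], 'B': [7, 8, 9]})

# Pandas 0.24+: get the underlying NumPy array, then sum every element.
total_new = df.to_numpy().sum()

# Older Pandas: .values exposes the same array.
total_old = df.values.sum()

# Pure-pandas alternative: sum each column, then sum the column totals.
total_pandas = df.sum().sum()

print(total_new, total_old, total_pandas)
```

All three give the same total here because the frame holds only integers with no missing values.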

Solution 2

Adding some numbers to support this:

import numpy as np, pandas as pd
import timeit
df = pd.DataFrame(np.arange(int(1e6)).reshape(500000, 2), columns=list("ab"))

def pandas_test():
    return df['a'].sum()

def numpy_test():
    return df['a'].to_numpy().sum()

timeit.timeit(numpy_test, number=1000)  # 0.5032469799989485
timeit.timeit(pandas_test, number=1000)  # 0.6035906639990571

So NumPy is roughly 20% faster on my machine, even for a single Series summation!
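The timing above covers a single column. For the whole DataFrame, which is what the question asks about, a similar comparison could be sketched as follows. The function names and the repeat count are my own choices, the dtype is pinned to int64 only to avoid overflow on platforms with a 32-bit default integer, and the timings will vary by machine and library version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape of data as the benchmark above; int64 is pinned explicitly.
df = pd.DataFrame(
    np.arange(int(1e6), dtype=np.int64).reshape(500000, 2),
    columns=list("ab"),
)

def pandas_total():
    # Two pandas reductions: one per column, then one over the column totals.
    return df.sum().sum()

def numpy_total():
    # One NumPy reduction over every element of the underlying array.
    return df.to_numpy().sum()

pandas_time = timeit.timeit(pandas_total, number=100)
numpy_time = timeit.timeit(numpy_total, number=100)
print(f"pandas: {pandas_time:.4f}s, numpy: {numpy_time:.4f}s")
```

Both functions return the same total; only the elapsed times differ.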

Author by

Bill

Updated on July 09, 2022

Comments

  • Bill
    Bill almost 2 years

    I figured out these two methods. Is there a better one?

    >>> import pandas as pd
    >>> df = pd.DataFrame({'A': [5, 6, 7], 'B': [7, 8, 9]})
    >>> print(df.sum().sum())
    42
    >>> print(df.values.sum())
    42
    

    Just want to make sure I'm not missing something more obvious.

    • Ramon Crehuet
      Ramon Crehuet over 5 years
      Be careful: if there are NaN values, df.sum().sum() ignores them and returns a float, whereas df.values.sum() returns nan. So the two methods are not equivalent.
  • Bill
    Bill almost 8 years
    Thanks. That's what I thought!
  • kuanb
    kuanb about 7 years
    Is it faster purely because one function calls the other or is there some more fundamental difference?
  • piRSquared
    piRSquared about 7 years
    @kuanb Two reasons. One, df.values.sum() is a NumPy operation, and most of the time NumPy is more performant. Two, NumPy sums over all elements in an array regardless of dimensionality, whereas pandas requires two separate calls to sum, one for each dimension.
  • Bill
    Bill almost 4 years
    But is df['a'].sum() the same as df['a'].to_numpy().sum()? I think df['a'].sum() only sums that one column, doesn't it?
  • Raven
    Raven almost 4 years
    Yeah, this is just a comparison for a single Series summation; I wasn't summing the whole df.
  • Bill
    Bill almost 4 years
    Oh I see. But this question is about summing the whole dataframe, not one series.
  • Bill
    Bill almost 4 years
    Can you report your pandas and numpy versions? I get a much bigger speed difference on your tests with Pandas 0.24.2 and Numpy 1.16.2.
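To illustrate the NaN caveat raised in the comments, here is a minimal sketch with a made-up frame containing one missing value. np.nansum is not mentioned in the thread; it is included only as one possible way to get NaN-skipping behaviour on the NumPy side.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value.
df_nan = pd.DataFrame({'A': [5.0, np.nan, 7.0], 'B': [7.0, 8.0, 9.0]})

# pandas skips NaN by default (skipna=True).
skipping = df_nan.sum().sum()

# A plain NumPy sum propagates NaN, so the result is nan.
propagating = df_nan.to_numpy().sum()

# np.nansum treats NaN as zero, matching the pandas default here.
nan_safe = np.nansum(df_nan.to_numpy())

print(skipping, propagating, nan_safe)
```

So if the data may contain missing values, the choice between the two approaches affects the result, not just the speed.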