What's the best way to sum all values in a Pandas dataframe?
Solution 1
Updated for Pandas 0.24+
df.to_numpy().sum()
Prior to Pandas 0.24
df.values
Is the underlying numpy array
df.values.sum()
Is the numpy sum method and is faster
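As a minimal sketch of the approaches above, here they are applied to the small frame from the question. The variable names (total_numpy, total_values, total_pandas) are illustrative only and do not come from the original answer.

```python
import pandas as pd

# Same small frame as in the question.
df = pd.DataFrame({"A": [5, 6, 7], "B": [7, 8, 9]})

# Pandas 0.24+: convert to a NumPy array, then reduce over every element at once.
total_numpy = df.to_numpy().sum()

# Older Pandas: .values exposes the same underlying array.
total_values = df.values.sum()

# Pure pandas: column sums first, then the sum of those sums.
total_pandas = df.sum().sum()

print(total_numpy, total_values, total_pandas)
```

All three give the same total here because the frame is numeric and contains no missing values.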
Solution 2
Adding some numbers to support this:
import numpy as np, pandas as pd
import timeit
df = pd.DataFrame(np.arange(int(1e6)).reshape(500000, 2), columns=list("ab"))
def pandas_test():
    return df['a'].sum()
def numpy_test():
    return df['a'].to_numpy().sum()
timeit.timeit(numpy_test, number=1000) # 0.5032469799989485
timeit.timeit(pandas_test, number=1000) # 0.6035906639990571
So NumPy is roughly 20% faster on my machine, and that is just for a single Series summation!
Author by
Bill
Updated on July 09, 2022
Comments
-
Bill almost 2 years
I figured out these two methods. Is there a better one?
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [5, 6, 7], 'B': [7, 8, 9]})
>>> print(df.sum().sum())
42
>>> print(df.values.sum())
42
Just want to make sure I'm not missing something more obvious.
-
Ramon Crehuet over 5 years
Be careful, because if there are nan values, df.sum().sum() ignores the nan and returns a float, whereas df.values.sum() returns nan. So the 2 methods are not equivalent.
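The NaN difference described in this comment can be checked with a short sketch. The frame df_nan and the variable names are made up for illustration. np.nansum is shown as one possible way to get NaN-skipping behaviour on the NumPy side; it is not something the original comment proposed.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value.
df_nan = pd.DataFrame({"A": [5.0, np.nan, 7.0], "B": [7.0, 8.0, 9.0]})

# pandas skips NaN by default (skipna=True), so the missing value is ignored.
skipna_total = df_nan.sum().sum()

# The plain NumPy reduction propagates NaN.
raw_total = df_nan.to_numpy().sum()

# np.nansum treats NaN as zero, matching the pandas result here.
nansum_total = np.nansum(df_nan.to_numpy())

print(skipna_total, raw_total, nansum_total)
```

Which behaviour is correct depends on whether a missing value should invalidate the total or simply be skipped.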
-
-
Bill almost 8 years
Thanks. That's what I thought!
-
kuanb about 7 years
Is it faster purely because one function calls the other, or is there some more fundamental difference?
-
piRSquared about 7 years
@kuanb two reasons. One, df.values.sum() is a numpy operation and, most of the time, numpy is more performant. Two, numpy sums over all elements in an array regardless of dimensionality. pandas requires two separate calls to sum, one for each dimension.
-
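One way to check this on a whole DataFrame, rather than a single Series, is a timing sketch along the following lines. The function names pandas_total and numpy_total are hypothetical, the explicit int64 dtype is an added assumption to avoid overflow on platforms with a 32-bit default integer, and the timings will vary by machine and library version, so no numbers are quoted.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape as the benchmark frame used in Solution 2.
df = pd.DataFrame(
    np.arange(int(1e6), dtype=np.int64).reshape(500000, 2),
    columns=list("ab"),
)

def pandas_total():
    # Two reductions: one per column, then one over the column sums.
    return df.sum().sum()

def numpy_total():
    # One reduction over every element of the 2-D array.
    return df.to_numpy().sum()

t_pandas = timeit.timeit(pandas_total, number=20)
t_numpy = timeit.timeit(numpy_total, number=20)
print(f"pandas: {t_pandas:.4f}s  numpy: {t_numpy:.4f}s")
```

Both functions return the same total for this frame, so any difference in the printed times reflects the reduction path and not the result.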
Bill almost 4 years
But is df['a'].sum() the same as df['a'].to_numpy().sum()? I think df['a'].sum() only sums the one column, doesn't it?
-
Raven almost 4 years
Yeah, this is just a comparison for a single Series summation; I wasn't summing the whole df.
-
Bill almost 4 years
Oh I see. But this question is about summing the whole dataframe, not one series.
-
Bill almost 4 years
Can you report your pandas and numpy versions? I get a much bigger speed difference on your tests with Pandas 0.24.2 and Numpy 1.16.2.