Pandas dataframe groupby to calculate population standard deviation

Solution 1

You can pass additional keyword arguments such as ddof to agg, which forwards them to np.std:

In [202]:

df.groupby('A').agg(np.std, ddof=0)

Out[202]:
     B  values
A             
1  0.5     2.5
2  0.5     2.5

In [203]:

df.groupby('A').agg(np.std, ddof=1)

Out[203]:
          B    values
A                    
1  0.707107  3.535534
2  0.707107  3.535534
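
As a hedged alternative to passing np.std through agg, the groupby object's own std method accepts ddof directly. The sketch below rebuilds the sample frame from the question; the variable names pop_std and sample_std are illustrative only. Recent pandas releases may also emit a warning when a NumPy function is passed to agg, so calling std directly can be the quieter option.

```python
import numpy as np
import pandas as pd

# Same sample frame as in the question
df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [1, 2, 1, 2],
                   'values': np.arange(10, 30, 5)})

# GroupBy.std takes ddof directly; ddof=0 gives the population standard deviation
pop_std = df.groupby('A').std(ddof=0)

# ddof=1 is the pandas default and gives the sample standard deviation
sample_std = df.groupby('A').std()

print(pop_std)
print(sample_std)
```

Both calls return one row per value of A, matching the shape of the agg output above.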

Solution 2

For degrees of freedom = 0 (np.std defaults to ddof=0, the population standard deviation)

(This means that bins with one number will end up with std=0 instead of NaN)

import numpy as np


def std(x):
    # ddof=0 is np.std's default; spelled out here to make the population std explicit
    return np.std(x, ddof=0)


df.groupby('A').agg(['mean', 'max', std])
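
The remark about single-element bins can be checked with a small sketch. The frame below is hypothetical (it is not the question's data) and gives group 3 only one row; the column names pop_std and sample_std are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: group 3 contains a single row
df = pd.DataFrame({'A': [1, 1, 3], 'values': [10, 15, 20]})

grouped = df.groupby('A')['values']
result = pd.DataFrame({
    'pop_std': grouped.std(ddof=0),     # single-row group -> 0.0
    'sample_std': grouped.std(ddof=1),  # single-row group -> NaN
})
print(result)
```

With ddof=1 the divisor for a one-row group is zero, so pandas reports NaN; with ddof=0 the divisor is one and the result is 0.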
Author by

neelshiv

Updated on May 02, 2020

Comments

  • neelshiv
    neelshiv about 4 years

    I am trying to use groupby and np.std to calculate a standard deviation, but it seems to be calculating a sample standard deviation (with delta degrees of freedom, ddof, equal to 1).

    Here is a sample.

    #create dataframe
    >>> df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'values':np.arange(10,30,5)})
    >>> df
       A  B  values
    0  1  1      10
    1  1  2      15
    2  2  1      20
    3  2  2      25
    
    #calculate standard deviation using groupby
    >>> df.groupby('A').agg(np.std)
          B    values
    A                    
    1  0.707107  3.535534
    2  0.707107  3.535534
    
    #Calculate using numpy (np.std)
    >>> np.std([10,15],ddof=0)
    2.5
    >>> np.std([10,15],ddof=1)
    3.5355339059327378
    

    Is there a way to use the population std calculation (ddof=0) with the groupby statement? The records I am using (not the example table above) are not samples, so I am only interested in population standard deviations.

  • neelshiv
    neelshiv over 9 years
    Thank you! I had tried "df.groupby('A').agg(np.std(ddof=0))", but I did not try adding ddof inside the agg parentheses. I'll mark your reply as the answer once I can in 8 minutes (you responded really quickly).