Confidence Interval in Python dataframe
Update on 25-Oct-2021: as @a-donda pointed out, a 95% interval should use a multiplier of 1.96 (not 1.95), i.e. the mean plus or minus 1.96 standard errors, where the standard error is std / sqrt(count).
import pandas as pd
import numpy as np
import math

df = pd.DataFrame({'Class': ['A1','A1','A1','A2','A3','A3'],
                   'Force': [50,150,100,120,140,160] },
                   columns=['Class', 'Force'])
print(df)
print('-'*30)

# One row per Class with the mean, sample size and sample standard deviation
stats = df.groupby(['Class'])['Force'].agg(['mean', 'count', 'std'])
print(stats)
print('-'*30)

ci95_hi = []
ci95_lo = []

# Interval bounds: mean +/- 1.96 * std / sqrt(count)
for i in stats.index:
    m, c, s = stats.loc[i]
    ci95_hi.append(m + 1.96*s/math.sqrt(c))
    ci95_lo.append(m - 1.96*s/math.sqrt(c))

stats['ci95_hi'] = ci95_hi
stats['ci95_lo'] = ci95_lo
print(stats)
The output is:

  Class  Force
0    A1     50
1    A1    150
2    A1    100
3    A2    120
4    A3    140
5    A3    160
------------------------------
       mean  count        std
Class
A1      100      3  50.000000
A2      120      1        NaN
A3      150      2  14.142136
------------------------------
       mean  count        std     ci95_hi     ci95_lo
Class
A1      100      3  50.000000  156.580326   43.419674
A2      120      1        NaN         NaN         NaN
A3      150      2  14.142136  169.600000  130.400000
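
The explicit loop is not strictly needed. As a minimal sketch, assuming the same sample data and the same 1.96 multiplier, the bounds can also be computed with column arithmetic on the aggregated frame. The name `margin` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Class': ['A1', 'A1', 'A1', 'A2', 'A3', 'A3'],
                   'Force': [50, 150, 100, 120, 140, 160]})

stats = df.groupby('Class')['Force'].agg(['mean', 'count', 'std'])

# Half-width of the interval: 1.96 * standard error of the mean.
# 'margin' is an illustrative name, not part of the original answer.
margin = 1.96 * stats['std'] / np.sqrt(stats['count'])

stats['ci95_hi'] = stats['mean'] + margin
stats['ci95_lo'] = stats['mean'] - margin
print(stats)
```

Groups with a single observation (A2 here) have an undefined sample standard deviation, so their bounds stay NaN, just as in the loop version.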
Author: MasterShifu
Updated on October 25, 2021
Comments
-
MasterShifu over 2 years
I am trying to calculate the mean and the 95% confidence interval of a column "Force" in a large dataset. I need the result per group, using the groupby function on the "Class" column.
When I calculate the mean and put it in a new dataframe, it gives me NaN values for all rows. I'm not sure if I'm going about this the correct way. Is there an easier way to do this?
This is the sample dataframe:
df=pd.DataFrame({ 'Class': ['A1','A1','A1','A2','A3','A3'], 'Force': [50,150,100,120,140,160] }, columns=['Class', 'Force'])
To calculate the confidence interval, the first step I did was to calculate the mean. This is what I used:
F1_Mean = df.groupby(['Class'])['Force'].mean()
This gave me NaN values for all rows.
-
autonopy about 3 years
Such an outstanding answer. I wish I could award it. Returning a df with all the stats in it is such a great practice. Well done.
-
autonopy about 3 years
So here's my addition: to get it to return a formatted string column, add the following:
stats['95p_ci'] = "(" + stats['ci95_lo'].round(1).astype(str) + ', ' + stats['ci95_hi'].round(1).astype(str) + ')'
-
A. Donda over 2 years
The correct multiplier for a 95% CI is 1.96, not 1.95. Also, be aware that this is based on the normal distribution approximation to the binomial distribution, and only works well for large samples.
-
yoonghm over 2 years
@A.Donda, you are correct. Let me update.
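
On the small-sample caveat raised in the comments: a common alternative for small groups is to replace the fixed 1.96 with a critical value from Student's t distribution with count - 1 degrees of freedom. This is a sketch of that swapped-in technique, not the method used in the answer. It assumes SciPy is installed and that the data in each group are roughly normal; the names `summary`, `dof`, `t_crit`, `t95_hi` and `t95_lo` are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import stats as st

df = pd.DataFrame({'Class': ['A1', 'A1', 'A1', 'A2', 'A3', 'A3'],
                   'Force': [50, 150, 100, 120, 140, 160]})

summary = df.groupby('Class')['Force'].agg(['mean', 'count', 'std'])

# Degrees of freedom per group (assumption: one-sample t interval for the mean).
dof = (summary['count'] - 1).to_numpy()

# Two-sided 95% critical value; a group with one observation has 0 degrees
# of freedom, for which the critical value is not defined.
t_crit = st.t.ppf(0.975, dof)

margin = t_crit * summary['std'] / np.sqrt(summary['count'])
summary['t95_hi'] = summary['mean'] + margin
summary['t95_lo'] = summary['mean'] - margin
print(summary)
```

With only two or three observations per group the t critical values are much larger than 1.96, so these intervals come out considerably wider than the normal-approximation ones; the two approaches converge as the group size grows.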