Python - Pandas subtotals on groupby
12,711
Use:
#if necessary convert TYPE column to string
df['TYPE'] = df['TYPE'].astype(str)
df = df.groupby(["SCENARIO", "STATUS", "TYPE"])['TYPE'].count()
#aggregate sum by first 2 levels
df1 = df.groupby(["SCENARIO", "STATUS"]).sum()
#add 3 level of MultiIndex
df1.index = [df1.index.get_level_values(0),
df1.index.get_level_values(1),
['Total'] * len(df1)]
#thanks MaxU for improving
#df1 = df1.set_index(np.array(['Total'] * len(df1)), append=True)
print (df1)
SCENARIO STATUS
AAA 51 Total 3
53 Total 1
BBB 51 Total 1
CCC 51 Total 1
Name: TYPE, dtype: int64
#join together and sorts
df = pd.concat([df, df1]).sort_index(level=[0,1])
print (df)
SCENARIO STATUS TYPE
AAA 51 1 2
9 1
Total 3
53 228 1
Total 1
BBB 51 43 1
Total 1
CCC 51 187 1
Total 1
Name: TYPE, dtype: int64
Author by
onlyf
Updated on June 04, 2022Comments
-
onlyf over 1 year
here's a sample of the data i m using :
SCENARIO DATE POD AREA IDOC STATUS TYPE AAA 02.06.2015 JKJKJKJKJKK 4210 713375 51 1 AAA 02.06.2015 JWERWERE 4210 713375 51 1 AAA 02.06.2015 JAFDFDFDFD 4210 713375 51 9 BBB 02.06.2015 AAAAAAAA 5400 713504 51 43 CCC 05.06.2015 BBBBBBBBBB 4100 756443 51 187 AAA 05.06.2015 EEEEEEEE 4100 756457 53 228
I have written the following code in pandas to groupby:
import pandas as pd import numpy as np xl = pd.ExcelFile("MRD.xlsx") df = xl.parse("Sheet3") #print (df.column.values) # The following gave ValueError: Cannot label index with a null key # dfi = df.pivot('SCENARIO) # Here i do not actually need it to count every column, just a specific one table = df.groupby(["SCENARIO", "STATUS", "TYPE"]).agg(['count']) writer = pd.ExcelWriter('pandas.out.xlsx', engine='xlsxwriter') table.to_excel(writer, sheet_name='Sheet1') writer.save() table2 = pd.DataFrame(df.groupby(["SCENARIO", "STATUS", "TYPE"])['TYPE'].count()) print (table2) writer2 = pd.ExcelWriter('pandas2.out.xlsx', engine='xlsxwriter') table2.to_excel(writer2, sheet_name='Sheet1') writer2.save()
this yields a result :
SCENARIO STATUS TYPE TYPE AAA 51 1 2 9 1 53 228 1 BBB 51 43 1 CCC 51 187 1 Name: TYPE, dtype: int64
How could i add subtotals per group? Ideally i would want to achieve something like:
SCENARIO STATUS TYPE TYPE AAA 51 1 2 9 1 Total 3 53 228 1 Total 1 BBB 51 43 1 Total 1 CCC 51 187 1 Total 1 Name: TYPE, dtype: int64
Is this possible?
-
jezrael almost 6 yearsIs problem if
Total
is inTYPE
level?
-
-
onlyf almost 6 yearsTraceback (most recent call last): File "C:\Anaconda3\lib\site-packages\pandas\indexes\multi.py", line 707, in get_value return _index.get_value_at(s, k) File "pandas\index.pyx", line 65, in pandas.index.get_value_at (pandas\index.c:2759) File "pandas\src\util.pxd", line 63, in util.get_value_at (pandas\index.c:16180) TypeError: 'str' object cannot be interpreted as an integer
-
onlyf almost 6 yearsThere's another error in pandas, eventually i get: KeyError: 'SCENARIO'
-
jezrael almost 6 yearsfor first error need
df['TYPE'] = df['TYPE'].astype(str)
and for second usedf1 = df.groupby(level=["SCENARIO", "STATUS"]).sum()
-
onlyf almost 6 yearslast fix, and sorry for the trouble, but in my output the third column, 'TYPE' once more comes out as floating points ie 155.0, is there a way to cast this into integers?
-
jezrael almost 6 yearsIs possible use
df['TYPE'] = df['TYPE'].astype(int).astype(str)
? Or there are someNaN
s ? -
onlyf almost 6 yearsWe have nans, and i get the "Cannot convert NaN to integer" with it, is there a way around that?
-
jezrael almost 6 yearsIs possible remove rows with nans by
df = df.dropna(subset=['TYPE'])
and thendf['TYPE'] = df['TYPE'].astype(int).astype(str)
? -
onlyf almost 6 yearsThis worked, sadly, i d like to keep count of the NaN's as well, not necessarily on the same df, perhaps on a separate one. Thanks for all the help!
-
onlyf almost 6 yearsperhaps if we could replace NaN's with a default value?
-
jezrael almost 6 yearsyes, it ispossible. like
df['TYPE'] = df['TYPE'].fillna(-1).astype(int).astype(str)
-
MaxU - stop genocide of UA almost 6 yearsVery nice solution! I think
df1 = df1.set_index(np.array(['Total'] * len(df1)), append=True)
would be a bit more idiomatic, compared todf1.index = [...]
-
Bowen Liu almost 5 yearsThis is a great answer. Not OP but thanks. Would you care to explain what
sort_index(level=[0,1])
does please? -
jezrael almost 5 years@Bowen Liu Sure, it sorting index by first and second levels, idea is add totals to end of each level after concat.
-
Bowen Liu almost 5 yearsThanks a lot. I was learning about multiindex in pandas and your post was very helpful. If you just sort the lower level index (level = 1 is lower than level = 0? Please correct me if I'm wrong), would the higher level index be sorted altogether in the process? In other words, would
sort_index(level=1)
give the same result? -
jezrael almost 5 years@BowenLiu - no, it sorting by first level (first 'column' in MultiIndex), then by second. Maybe the best is sees in sample data
df = pd.DataFrame({ 'A':list('abcaba'), 'B':[4,3,4,3,5,4], 'C':[7,8,9,4,2,3], 'D':[1,3,5,7,1,0] }).set_index(['A','B','C'])
-
Bowen Liu almost 5 yearsAfter playing around with the example you provided, I have fully understood the function of
sort_index
. Thanks a lot. In what kind of situations can functions likeset_index()
andsort_index()
be really helpful? -
ant1j over 1 yearadding to @MaxU-stopWARagainstUA comment:
np.reapeat('Total', len(df1))
seems to be even faster (~100x)