Python - Pandas subtotals on groupby

12,711

Use:

#if necessary convert TYPE column to string
df['TYPE'] = df['TYPE'].astype(str)
df = df.groupby(["SCENARIO", "STATUS", "TYPE"])['TYPE'].count()

#aggregate sum by first 2 levels
df1 = df.groupby(["SCENARIO", "STATUS"]).sum()
#add 3 level of MultiIndex 
df1.index = [df1.index.get_level_values(0),
            df1.index.get_level_values(1),
            ['Total'] * len(df1)]

#thanks MaxU for improving
#df1 = df1.set_index(np.array(['Total'] * len(df1)), append=True) 

print (df1)
SCENARIO  STATUS       
AAA       51      Total    3
          53      Total    1
BBB       51      Total    1
CCC       51      Total    1
Name: TYPE, dtype: int64

#join together and sorts
df = pd.concat([df, df1]).sort_index(level=[0,1])
print (df)
SCENARIO  STATUS  TYPE 
AAA       51      1        2
                  9        1
                  Total    3
          53      228      1
                  Total    1
BBB       51      43       1
                  Total    1
CCC       51      187      1
                  Total    1
Name: TYPE, dtype: int64
Share:
12,711
onlyf
Author by

onlyf

Updated on June 04, 2022

Comments

  • onlyf
    onlyf over 1 year

    here's a sample of the data i m using :

    SCENARIO    DATE    POD         AREA    IDOC    STATUS  TYPE
    AAA   02.06.2015    JKJKJKJKJKK 4210    713375  51         1
    AAA   02.06.2015    JWERWERE    4210    713375  51         1
    AAA   02.06.2015    JAFDFDFDFD  4210    713375  51         9
    BBB   02.06.2015    AAAAAAAA    5400    713504  51        43
    CCC   05.06.2015    BBBBBBBBBB  4100    756443  51       187
    AAA   05.06.2015    EEEEEEEE    4100    756457  53       228
    

    I have written the following code in pandas to groupby:

    import pandas as pd
    import numpy as np
    
    xl = pd.ExcelFile("MRD.xlsx")
    df = xl.parse("Sheet3") 
    #print (df.column.values)
    
    # The following gave ValueError: Cannot label index with a null key
    # dfi = df.pivot('SCENARIO)
    
    # Here i do not actually need it to count every column, just a specific one
    table = df.groupby(["SCENARIO", "STATUS", "TYPE"]).agg(['count'])
    writer = pd.ExcelWriter('pandas.out.xlsx', engine='xlsxwriter')
    table.to_excel(writer, sheet_name='Sheet1')
    writer.save()
    
    
    table2 = pd.DataFrame(df.groupby(["SCENARIO", "STATUS", "TYPE"])['TYPE'].count())
    print (table2)
    writer2 = pd.ExcelWriter('pandas2.out.xlsx', engine='xlsxwriter')
    table2.to_excel(writer2, sheet_name='Sheet1')
    writer2.save()
    

    this yields a result :

    SCENARIO  STATUS  TYPE    TYPE
    AAA       51      1       2
                      9       1
              53      228     1
    BBB       51      43      1
    CCC       51      187     1
    Name: TYPE, dtype: int64   
    

    How could i add subtotals per group? Ideally i would want to achieve something like:

    SCENARIO  STATUS  TYPE    TYPE
    AAA       51      1       2
                      9       1
              Total           3
              53      228     1
              Total           1
    BBB       51      43      1
              Total           1
    CCC       51      187     1
              Total           1
    Name: TYPE, dtype: int64   
    

    Is this possible?

    • jezrael
      jezrael almost 6 years
      Is problem if Total is in TYPE level?
  • onlyf
    onlyf almost 6 years
    Traceback (most recent call last): File "C:\Anaconda3\lib\site-packages\pandas\indexes\multi.py", line 707, in get_value return _index.get_value_at(s, k) File "pandas\index.pyx", line 65, in pandas.index.get_value_at (pandas\index.c:2759) File "pandas\src\util.pxd", line 63, in util.get_value_at (pandas\index.c:16180) TypeError: 'str' object cannot be interpreted as an integer
  • onlyf
    onlyf almost 6 years
    There's another error in pandas, eventually i get: KeyError: 'SCENARIO'
  • jezrael
    jezrael almost 6 years
    for first error need df['TYPE'] = df['TYPE'].astype(str) and for second use df1 = df.groupby(level=["SCENARIO", "STATUS"]).sum()
  • onlyf
    onlyf almost 6 years
    last fix, and sorry for the trouble, but in my output the third column, 'TYPE' once more comes out as floating points ie 155.0, is there a way to cast this into integers?
  • jezrael
    jezrael almost 6 years
    Is possible use df['TYPE'] = df['TYPE'].astype(int).astype(str) ? Or there are some NaNs ?
  • onlyf
    onlyf almost 6 years
    We have nans, and i get the "Cannot convert NaN to integer" with it, is there a way around that?
  • jezrael
    jezrael almost 6 years
    Is possible remove rows with nans by df = df.dropna(subset=['TYPE']) and then df['TYPE'] = df['TYPE'].astype(int).astype(str) ?
  • onlyf
    onlyf almost 6 years
    This worked, sadly, i d like to keep count of the NaN's as well, not necessarily on the same df, perhaps on a separate one. Thanks for all the help!
  • onlyf
    onlyf almost 6 years
    perhaps if we could replace NaN's with a default value?
  • jezrael
    jezrael almost 6 years
    yes, it ispossible. like df['TYPE'] = df['TYPE'].fillna(-1).astype(int).astype(str)
  • MaxU - stop genocide of UA
    MaxU - stop genocide of UA almost 6 years
    Very nice solution! I think df1 = df1.set_index(np.array(['Total'] * len(df1)), append=True) would be a bit more idiomatic, compared to df1.index = [...]
  • Bowen Liu
    Bowen Liu almost 5 years
    This is a great answer. Not OP but thanks. Would you care to explain what sort_index(level=[0,1]) does please?
  • jezrael
    jezrael almost 5 years
    @Bowen Liu Sure, it sorting index by first and second levels, idea is add totals to end of each level after concat.
  • Bowen Liu
    Bowen Liu almost 5 years
    Thanks a lot. I was learning about multiindex in pandas and your post was very helpful. If you just sort the lower level index (level = 1 is lower than level = 0? Please correct me if I'm wrong), would the higher level index be sorted altogether in the process? In other words, would sort_index(level=1) give the same result?
  • jezrael
    jezrael almost 5 years
    @BowenLiu - no, it sorting by first level (first 'column' in MultiIndex), then by second. Maybe the best is sees in sample data df = pd.DataFrame({ 'A':list('abcaba'), 'B':[4,3,4,3,5,4], 'C':[7,8,9,4,2,3], 'D':[1,3,5,7,1,0] }).set_index(['A','B','C'])
  • Bowen Liu
    Bowen Liu almost 5 years
    After playing around with the example you provided, I have fully understood the function of sort_index. Thanks a lot. In what kind of situations can functions like set_index() and sort_index() be really helpful?
  • ant1j
    ant1j over 1 year
    adding to @MaxU-stopWARagainstUA comment: np.reapeat('Total', len(df1)) seems to be even faster (~100x)