Pandas groupby and aggregation output should include all the original columns (including the ones not aggregated on)

Solution 1

agg with a dict of functions

Create a dict mapping column names to aggregation functions and pass it to agg. You'll also need as_index=False to prevent the group columns from becoming the index in your output.

f = {'NET_AMT': 'sum', 'QTY_SOLD': 'sum', 'UPC_DSC': 'first'}
df.groupby(['month', 'UPC_ID'], as_index=False).agg(f)

     month  UPC_ID UPC_DSC  NET_AMT  QTY_SOLD
0  2017.02     111   desc1       10         2
1  2017.02     222   desc2       15         3
2  2017.02     333   desc3        4         1
3  2017.03     111   desc1       25         5
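
If you're on pandas 0.25 or later, the same thing can be written with named aggregation, which also makes it easy to rename the output columns. A minimal sketch, assuming the same df as above:

df.groupby(['month', 'UPC_ID'], as_index=False).agg(
    NET_AMT=('NET_AMT', 'sum'),      # total sales amount per month/UPC
    QTY_SOLD=('QTY_SOLD', 'sum'),    # total quantity per month/UPC
    UPC_DSC=('UPC_DSC', 'first'),    # keep the (constant) description
)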

Blanket sum

Just call sum on the groupby without selecting any columns; this aggregates the numeric columns. UPC_DSC is not numeric, so you'll need to handle it separately with first.

g = df.groupby(['month', 'UPC_ID'])
i = g.sum()                    # sums all numeric columns
j = g[['UPC_DSC']].first()     # takes the first description in each group

pd.concat([i, j], axis=1).reset_index()

     month  UPC_ID  QTY_SOLD  NET_AMT UPC_DSC
0  2017.02     111         2       10   desc1
1  2017.02     222         3       15   desc2
2  2017.02     333         1        4   desc3
3  2017.03     111         5       25   desc1
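
Note that on recent pandas versions sum no longer silently drops the non-numeric UPC_DSC column (it may try to "sum" the strings or raise instead), so it's safer to be explicit. A hedged variant of the same approach:

g = df.groupby(['month', 'UPC_ID'])
i = g.sum(numeric_only=True)   # restrict the sum to numeric columns
j = g[['UPC_DSC']].first()     # first description per group

pd.concat([i, j], axis=1).reset_index()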

Solution 2

I have been thinking about this for a long time; thanks, your question pushed me to work it out. Use agg with an if...else:

df.groupby(['month', 'UPC_ID'], as_index=False).agg(lambda x: x.sum() if x.dtype == 'int64' else x.head(1))
Out[1221]: 
   month  UPC_ID UPC_DSC     D_DATE  QTY_SOLD  NET_AMT
0      2     111   desc1 2017-02-26         2       10
1      2     222   desc2 2017-02-26         3       15
2      2     333   desc3 2017-02-26         1        4
3      3     111   desc1 2017-03-01         5       25
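
One caveat: the dtype == 'int64' test only matches integer columns, so it would silently skip the sum if QTY_SOLD or NET_AMT were floats. A slightly more defensive sketch, assuming the same frame, checks for any numeric dtype and takes the first value otherwise:

df.groupby(['month', 'UPC_ID'], as_index=False).agg(
    lambda x: x.sum() if pd.api.types.is_numeric_dtype(x) else x.iloc[0]
)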

Comments

  • user3871 over 3 years

    I have the following data frame and want to:

    • Group records by month
    • Sum QTY_SOLD and NET_AMT for each unique UPC_ID (per month)
    • Include the rest of the columns as well in the resulting dataframe

    The way I thought I could do this was to first create a month column to aggregate the D_DATEs, then sum QTY_SOLD by UPC_ID.

    Script:

    import datetime as dt
    import pandas as pd

    # Convert date to datetime object
    df['D_DATE'] = pd.to_datetime(df['D_DATE'])
    
    # Create aggregated months column
    df['month'] = df['D_DATE'].apply(dt.date.strftime, args=('%Y.%m',))
    
    # Group by month and sum up quantity sold by UPC_ID
    df = df.groupby(['month', 'UPC_ID'])['QTY_SOLD'].sum()
    

    Current data frame:

    UPC_ID | UPC_DSC | D_DATE | QTY_SOLD | NET_AMT
    ----------------------------------------------
    111      desc1    2/26/2017   2         10 (2 x $5)
    222      desc2    2/26/2017   3         15
    333      desc3    2/26/2017   1         4
    111      desc1    3/1/2017    1         5
    111      desc1    3/3/2017    4         20
    

    Desired Output:

    MONTH | UPC_ID | QTY_SOLD | NET_AMT | UPC_DSC
    ----------------------------------------------
    2017-2      111     2         10       etc...
    2017-2      222     3         15
    2017-2      333     1         4
    2017-3      111     5         25
    

    Actual Output:

    MONTH | UPC_ID  
    ----------------------------------------------
    2017-2      111     2
                222     3
                333     1
    2017-3      111     5
    ...  
    

    Questions:

    • How do I include the month for each row?
    • How do I include the rest of the columns of the dataframe?
    • How do I also sum NET_AMT in addition to QTY_SOLD?