Pandas groupby and aggregation output should include all the original columns (including the ones not aggregated on)
Solution 1
agg with a dict of functions
Create a dict of functions and pass it to agg. You'll also need as_index=False to prevent the group columns from becoming the index in your output.
f = {'NET_AMT': 'sum', 'QTY_SOLD': 'sum', 'UPC_DSC': 'first'}
df.groupby(['month', 'UPC_ID'], as_index=False).agg(f)
month UPC_ID UPC_DSC NET_AMT QTY_SOLD
0 2017.02 111 desc1 10 2
1 2017.02 222 desc2 15 3
2 2017.02 333 desc3 4 1
3 2017.03 111 desc1 25 5
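To try this end to end, here is a minimal sketch that rebuilds the sample frame from the question and applies the same dict. The frame construction and the use of dt.strftime to build the month column are assumptions for illustration; the answer itself only shows the groupby call.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'UPC_ID': [111, 222, 333, 111, 111],
    'UPC_DSC': ['desc1', 'desc2', 'desc3', 'desc1', 'desc1'],
    'D_DATE': ['2/26/2017', '2/26/2017', '2/26/2017', '3/1/2017', '3/3/2017'],
    'QTY_SOLD': [2, 3, 1, 1, 4],
    'NET_AMT': [10, 15, 4, 5, 20],
})

# Parse the dates and derive a 'YYYY.MM' month key (assumed approach).
df['D_DATE'] = pd.to_datetime(df['D_DATE'], format='%m/%d/%Y')
df['month'] = df['D_DATE'].dt.strftime('%Y.%m')

# Sum the numeric columns; keep the first description seen in each group.
f = {'NET_AMT': 'sum', 'QTY_SOLD': 'sum', 'UPC_DSC': 'first'}
result = df.groupby(['month', 'UPC_ID'], as_index=False).agg(f)
print(result)
```

Columns that are not listed in the dict (D_DATE here) are dropped from the result, so every column you want to keep needs an entry.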
Blanket sum
Just call sum without any column names. This handles the numeric columns. For UPC_DSC, you'll need to handle it separately.
g = df.groupby(['month', 'UPC_ID'])
i = g.sum(numeric_only=True)   # sum only the numeric columns
j = g[['UPC_DSC']].first()     # first description in each group
pd.concat([i, j], axis=1).reset_index()
month UPC_ID QTY_SOLD NET_AMT UPC_DSC
0 2017.02 111 2 10 desc1
1 2017.02 222 3 15 desc2
2 2017.02 333 1 4 desc3
3 2017.03 111 5 25 desc1
Solution 2
I had been thinking about this for a long time; thanks for the question, which pushed me to work it out. Use agg with an if...else:
df.groupby(['month', 'UPC_ID'],as_index=False).agg(lambda x : x.sum() if x.dtype=='int64' else x.head(1))
Out[1221]:
month UPC_ID UPC_DSC D_DATE QTY_SOLD NET_AMT
0 2 111 desc1 2017-02-26 2 10
1 2 222 desc2 2017-02-26 3 15
2 2 333 desc3 2017-02-26 1 4
3 3 111 desc1 2017-03-01 5 25
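The dtype test can also be moved out of the lambda. The sketch below is a variation, not the code from this answer: it builds the dict from Solution 1 automatically, using pandas.api.types.is_numeric_dtype so that any numeric dtype (not only int64) is summed. The names keys and spec are made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Sample data copied from the question.
df = pd.DataFrame({
    'UPC_ID': [111, 222, 333, 111, 111],
    'UPC_DSC': ['desc1', 'desc2', 'desc3', 'desc1', 'desc1'],
    'D_DATE': ['2/26/2017', '2/26/2017', '2/26/2017', '3/1/2017', '3/3/2017'],
    'QTY_SOLD': [2, 3, 1, 1, 4],
    'NET_AMT': [10, 15, 4, 5, 20],
})
df['D_DATE'] = pd.to_datetime(df['D_DATE'], format='%m/%d/%Y')
df['month'] = df['D_DATE'].dt.strftime('%Y.%m')

keys = ['month', 'UPC_ID']

# Numeric columns are summed; everything else keeps its first value.
spec = {
    col: 'sum' if is_numeric_dtype(df[col]) else 'first'
    for col in df.columns
    if col not in keys
}

result = df.groupby(keys, as_index=False).agg(spec)
print(result)
```

Passing the built-in names 'sum' and 'first' lets pandas use its optimized group operations instead of calling a Python lambda once per group and column.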
Author: user3871
Updated on December 31, 2020
Comments
- user3871, over 3 years ago:
I have the following data frame and want to:
- Group records by month
- Sum QTY_SOLD and NET_AMT of each unique UPC_ID (per month)
- Include the rest of the columns as well in the resulting dataframe
The way I thought I could do this is to first create a month column to aggregate the D_DATE values, then sum QTY_SOLD by UPC_ID.
Script:
import datetime as dt
import pandas as pd

# Convert date to date time object
df['D_DATE'] = pd.to_datetime(df['D_DATE'])
# Create aggregated months column
df['month'] = df['D_DATE'].apply(dt.date.strftime, args=('%Y.%m',))
# Group by month and sum up quantity sold by UPC_ID
df = df.groupby(['month', 'UPC_ID'])['QTY_SOLD'].sum()
Current data frame:
UPC_ID | UPC_DSC | D_DATE    | QTY_SOLD | NET_AMT
-------------------------------------------------
111      desc1     2/26/2017   2          10 (2 x $5)
222      desc2     2/26/2017   3          15
333      desc3     2/26/2017   1          4
111      desc1     3/1/2017    1          5
111      desc1     3/3/2017    4          20
Desired Output:
MONTH  | UPC_ID | QTY_SOLD | NET_AMT | UPC_DSC
----------------------------------------------
2017-2   111      2          10        etc...
2017-2   222      3          15
2017-2   333      1          4
2017-3   111      5          25
Actual Output:
MONTH  | UPC_ID
----------------------------------------------
2017-2   111      2
         222      3
         333      1
2017-3   111      5
...
Questions:
- How do I include the month for each row?
- How do I include the rest of the columns of the dataframe?
- How do I also sum NET_AMT in addition to QTY_SOLD?