How to sum in pandas by unique index in several columns?
Solution 1
IIUC you can use groupby
, sum
and reset_index
:
print df
User_ID Registration Session clicks
0 2349876 2012-02-22 2014-04-24 2
1 1987293 2011-02-01 2013-05-03 1
2 2234214 2012-07-22 2014-01-22 7
3 9874452 2010-12-22 2014-08-22 2
print df.groupby('User_ID')['clicks'].sum().reset_index()
User_ID clicks
0 1987293 1
1 2234214 7
2 2349876 2
3 9874452 2
If first column User_ID
is index
:
print df
Registration Session clicks
User_ID
2349876 2012-02-22 2014-04-24 2
1987293 2011-02-01 2013-05-03 1
2234214 2012-07-22 2014-01-22 7
9874452 2010-12-22 2014-08-22 2
print df.groupby(level=0)['clicks'].sum().reset_index()
User_ID clicks
0 1987293 1
1 2234214 7
2 2349876 2
3 9874452 2
Or:
print df.groupby(df.index)['clicks'].sum().reset_index()
User_ID clicks
0 1987293 1
1 2234214 7
2 2349876 2
3 9874452 2
EDIT:
As Alexander pointed, you need filter data before groupby
, if Session
dates is less as Registration
dates per User_ID
:
print df
User_ID Registration Session clicks
0 2349876 2012-02-22 2014-04-24 2
1 1987293 2011-02-01 2013-05-03 1
2 2234214 2012-07-22 2014-01-22 7
3 9874452 2010-12-22 2014-08-22 2
print df[df.Session >= df.Registration].groupby('User_ID')['clicks'].sum().reset_index()
User_ID clicks
0 1987293 1
1 2234214 7
2 2349876 2
3 9874452 2
I change 3. row of data for better sample:
print df
Registration Session clicks
User_ID
2349876 2012-02-22 2014-04-24 2
1987293 2011-02-01 2013-05-03 1
2234214 2012-07-22 2012-01-22 7
9874452 2010-12-22 2014-08-22 2
print df.Session >= df.Registration
User_ID
2349876 True
1987293 True
2234214 False
9874452 True
dtype: bool
print df[df.Session >= df.Registration]
Registration Session clicks
User_ID
2349876 2012-02-22 2014-04-24 2
1987293 2011-02-01 2013-05-03 1
9874452 2010-12-22 2014-08-22 2
df1 = df[df.Session >= df.Registration]
print df1.groupby(df1.index)['clicks'].sum().reset_index()
User_ID clicks
0 1987293 1
1 2349876 2
2 9874452 2
Solution 2
The first thing to do is filter registrations dates that precede the registration date, then group on the User_ID and sum.
gb = (df[df.Session >= df.Registration]
.groupby('User_ID')
.clicks.agg({'Total_Clicks': np.sum}))
>>> gb
Total_Clicks
User_ID
1987293 1
2234214 7
2349876 2
9874452 2
For the use case you mentioned, I believe this is scalable. It always depends, of course, on your available memory.
Related videos on Youtube
ShanZhengYang
Updated on August 01, 2022Comments
-
ShanZhengYang almost 2 years
I have a pandas DataFrame which details online activities in terms of "clicks" during an user session. There are as many as 50,000 unique users, and the dataframe has around 1.5 million samples. Obviously most users have multiple records.
The four columns are a unique user id, the date when the user began the service "Registration", the date the user used the service "Session", the total number of clicks.
The organization of the dataframe is as follows:
User_ID Registration Session clicks 2349876 2012-02-22 2014-04-24 2 1987293 2011-02-01 2013-05-03 1 2234214 2012-07-22 2014-01-22 7 9874452 2010-12-22 2014-08-22 2 ...
(There is also an index above beginning with 0, but one could set
User_ID
as the index.)I would like to aggregate the total number of clicks by the user since Registration date. The dataframe (or pandas Series object) would list User_ID and "Total_Number_Clicks".
User_ID Total_Clicks 2349876 722 1987293 341 2234214 220 9874452 1405 ...
How does one do this in pandas? Is this done by
.agg()
? EachUser_ID
needs to be summed individually.As there are 1.5 million records, does this scale?
-
jezrael about 8 yearsHmmm. Do you think is possible
Session
date is less asRegistration
date? I think logical not. But maybe I am wrong... -
Alexander about 8 yearsDepends on the setup. If not, there is no need to filter.
-
Alexander about 8 yearsIn the four rows of sample data, the session always exceeds the Registration Date. I assume both dates were given for a reason. If the the Session Date always equals or exceeds the registration date, then both dates are completely irrelevant to this problem as one could simply sum the clicks by user ID.
-
jezrael about 8 yearsYes, you are right. Btw, you can remove your first bracket
(
-
Alexander about 8 yearsThanks for the comment, but I was actually missing a final bracket. They are needed for multi-line statements.
-
ShanZhengYang about 8 years@ jezrael I believe that logically, no. The
Registration
date has to after theSession
. However, it is possible (I think) that a user with the same unique ID may register twice....So, a filter and some labeling is needed somehow.