How to sum in pandas by unique index in several columns?

40,395

Solution 1

IIUC you can use groupby, sum and reset_index:

print df
   User_ID Registration    Session  clicks
0  2349876   2012-02-22 2014-04-24       2
1  1987293   2011-02-01 2013-05-03       1
2  2234214   2012-07-22 2014-01-22       7
3  9874452   2010-12-22 2014-08-22       2

print df.groupby('User_ID')['clicks'].sum().reset_index()
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2

If first column User_ID is index:

print df
        Registration    Session  clicks
User_ID                                
2349876   2012-02-22 2014-04-24       2
1987293   2011-02-01 2013-05-03       1
2234214   2012-07-22 2014-01-22       7
9874452   2010-12-22 2014-08-22       2

print df.groupby(level=0)['clicks'].sum().reset_index()
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2

Or:

print df.groupby(df.index)['clicks'].sum().reset_index()
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2

EDIT:

As Alexander pointed, you need filter data before groupby, if Session dates is less as Registration dates per User_ID:

print df
   User_ID Registration    Session  clicks
0  2349876   2012-02-22 2014-04-24       2
1  1987293   2011-02-01 2013-05-03       1
2  2234214   2012-07-22 2014-01-22       7
3  9874452   2010-12-22 2014-08-22       2

print df[df.Session >= df.Registration].groupby('User_ID')['clicks'].sum().reset_index()
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2

I change 3. row of data for better sample:

print df
        Registration    Session  clicks
User_ID                                
2349876   2012-02-22 2014-04-24       2
1987293   2011-02-01 2013-05-03       1
2234214   2012-07-22 2012-01-22       7
9874452   2010-12-22 2014-08-22       2

print df.Session >= df.Registration
User_ID
2349876     True
1987293     True
2234214    False
9874452     True
dtype: bool

print df[df.Session >= df.Registration]
        Registration    Session  clicks
User_ID                                
2349876   2012-02-22 2014-04-24       2
1987293   2011-02-01 2013-05-03       1
9874452   2010-12-22 2014-08-22       2

df1 = df[df.Session >= df.Registration]
print df1.groupby(df1.index)['clicks'].sum().reset_index()
   User_ID  clicks
0  1987293       1
1  2349876       2
2  9874452       2

Solution 2

The first thing to do is filter registrations dates that precede the registration date, then group on the User_ID and sum.

gb = (df[df.Session >= df.Registration]
      .groupby('User_ID')
      .clicks.agg({'Total_Clicks': np.sum}))

>>> gb
         Total_Clicks
User_ID              
1987293             1
2234214             7
2349876             2
9874452             2

For the use case you mentioned, I believe this is scalable. It always depends, of course, on your available memory.

Share:
40,395

Related videos on Youtube

ShanZhengYang
Author by

ShanZhengYang

Updated on August 01, 2022

Comments

  • ShanZhengYang
    ShanZhengYang almost 2 years

    I have a pandas DataFrame which details online activities in terms of "clicks" during an user session. There are as many as 50,000 unique users, and the dataframe has around 1.5 million samples. Obviously most users have multiple records.

    The four columns are a unique user id, the date when the user began the service "Registration", the date the user used the service "Session", the total number of clicks.

    The organization of the dataframe is as follows:

    User_ID    Registration  Session      clicks
    2349876    2012-02-22    2014-04-24   2 
    1987293    2011-02-01    2013-05-03   1 
    2234214    2012-07-22    2014-01-22   7 
    9874452    2010-12-22    2014-08-22   2 
    ...
    

    (There is also an index above beginning with 0, but one could set User_ID as the index.)

    I would like to aggregate the total number of clicks by the user since Registration date. The dataframe (or pandas Series object) would list User_ID and "Total_Number_Clicks".

    User_ID    Total_Clicks
    2349876    722 
    1987293    341
    2234214    220 
    9874452    1405 
    ...
    

    How does one do this in pandas? Is this done by .agg()? Each User_ID needs to be summed individually.

    As there are 1.5 million records, does this scale?

  • jezrael
    jezrael about 8 years
    Hmmm. Do you think is possible Session date is less as Registration date? I think logical not. But maybe I am wrong...
  • Alexander
    Alexander about 8 years
    Depends on the setup. If not, there is no need to filter.
  • Alexander
    Alexander about 8 years
    In the four rows of sample data, the session always exceeds the Registration Date. I assume both dates were given for a reason. If the the Session Date always equals or exceeds the registration date, then both dates are completely irrelevant to this problem as one could simply sum the clicks by user ID.
  • jezrael
    jezrael about 8 years
    Yes, you are right. Btw, you can remove your first bracket (
  • Alexander
    Alexander about 8 years
    Thanks for the comment, but I was actually missing a final bracket. They are needed for multi-line statements.
  • ShanZhengYang
    ShanZhengYang about 8 years
    @ jezrael I believe that logically, no. The Registration date has to after the Session. However, it is possible (I think) that a user with the same unique ID may register twice....So, a filter and some labeling is needed somehow.