Grouping by date range with pandas

14,945

Solution 1

I'd convert this to a datetime column and then use pd.TimeGrouper:

dates =  pd.to_datetime(df.date, format='%m-%d-%y')
print(dates)
0   2017-01-01
1   2017-01-01
2   2017-01-01
3   2017-01-01
4   2017-01-02
5   2017-01-02
6   2017-01-10
7   2017-02-01
Name: date, dtype: datetime64[ns]

df = (df.assign(date=dates).set_index('date')
        .groupby(['user_id', pd.TimeGrouper('3D')])
        .sum()
        .reset_index())    
print(df)
   user_id       date  val
0        1 2017-01-01    3
1        2 2017-01-01    2
2        2 2017-01-10    1
3        3 2017-01-01    1
4        3 2017-01-31    1

Similar solution using pd.Grouper:

df = (df.assign(date=dates)
        .groupby(['user_id', pd.Grouper(key='date', freq='3D')])
        .sum()
        .reset_index())
print(df)
   user_id       date  val
0        1 2017-01-01    3
1        2 2017-01-01    2
2        2 2017-01-10    1
3        3 2017-01-01    1
4        3 2017-01-31    1

Update: TimeGrouper will be deprecated in future versions of pandas, so Grouper would be preferred in this scenario (thanks for the heads up, Vaishali!).

Solution 2

I come with a very ugly solution but still work...

df=df.sort_values(['user_id','date'])
df['Key']=df.sort_values(['user_id','date']).groupby('user_id')['date'].diff().dt.days.lt(3).ne(True).cumsum()
df.groupby(['user_id','Key'],as_index=False).agg({'val':'sum','date':'first'})

Out[586]: 
   user_id  Key  val       date
0        1    1    3 2017-01-01
1        2    2    2 2017-01-01
2        2    3    1 2017-01-10
3        3    4    1 2017-01-01
4        3    5    1 2017-02-01
Share:
14,945
eljusticiero67
Author by

eljusticiero67

Updated on June 11, 2022

Comments

  • eljusticiero67
    eljusticiero67 almost 2 years

    I am looking to group by two columns: user_id and date; however, if the dates are close enough, I want to be able to consider the two entries part of the same group and group accordingly. Date is m-d-y

    user_id     date       val
    1           1-1-17     1
    2           1-1-17     1
    3           1-1-17     1
    1           1-1-17     1
    1           1-2-17     1
    2           1-2-17     1
    2           1-10-17    1
    3           2-1-17     1
    

    The grouping would group by user_id and dates +/- 3 days from each other. so the group by summing val would look like:

    user_id     date       sum(val)
    1           1-2-17     3
    2           1-2-17     2
    2           1-10-17    1
    3           1-1-17     1
    3           2-1-17     1
    

    Any way someone could think of that this could be done (somewhat) easily? I know there are some problematic aspects of this. for example, what to do if the dates string together endlessly with three days apart. but the exact data im using only has 2 values per person..

    Thanks!

  • BENY
    BENY over 6 years
    I always afraid to touch any time related question ... LOL btw +1
  • Vaishali
    Vaishali over 6 years
    Amazing, never used grouper somehow
  • BENY
    BENY over 6 years
    Grouper is TimeGrouper
  • cs95
    cs95 over 6 years
    Thanks both :) @Wen, yeah I used to run away from date problems as well. Also, yeah, you're right, the only difference being TimeGrouper needs the index to be a datetime index.
  • cs95
    cs95 over 6 years
    @Wen It was my first choice, but the datetime column seems to disappear... uff... I didn't like reset_index either but no choice..
  • BENY
    BENY over 6 years
    @cᴏʟᴅsᴘᴇᴇᴅ Yeah ,i noticed that too :-)
  • Vaishali
    Vaishali over 6 years
    Apparently TimeGrouper is being deprecated so Grouper would be the best answer
  • eljusticiero67
    eljusticiero67 over 6 years
    @cᴏʟᴅsᴘᴇᴇᴅ sir, may i eat your brain so i may steal your powers? this is awesome!!