How to calculate time difference by group using pandas?

20,433

You can use sort_values with groupby and aggregating diff:

df['diff'] = df.sort_values(['id','time']).groupby('id')['time'].diff()
print (df)
  id                time     diff
0  A 2016-11-25 16:32:17      NaT
1  A 2016-11-25 16:36:04 00:00:35
2  A 2016-11-25 16:35:29 00:03:12
3  B 2016-11-25 16:35:24      NaT
4  B 2016-11-25 16:35:46 00:00:22

If need remove rows with NaT in column diff use dropna:

df = df.dropna(subset=['diff'])
print (df)
  id                time     diff
2  A 2016-11-25 16:35:29 00:03:12
1  A 2016-11-25 16:36:04 00:00:35
4  B 2016-11-25 16:35:46 00:00:22

You can also overwrite column:

df.time = df.sort_values(['id','time']).groupby('id')['time'].diff()
print (df)
  id     time
0  A      NaT
1  A 00:00:35
2  A 00:03:12
3  B      NaT
4  B 00:00:22

df.time = df.sort_values(['id','time']).groupby('id')['time'].diff()
df = df.dropna(subset=['time'])
print (df)
  id     time
1  A 00:00:35
2  A 00:03:12
4  B 00:00:22
Share:
20,433

Related videos on Youtube

Jack
Author by

Jack

A student

Updated on July 09, 2022

Comments

  • Jack
    Jack almost 2 years

    Problem

    I want to calculate diff by group. And I don’t know how to sort the time column so that each group results are sorted and positive.

    The original data :

    In [37]: df 
    Out[37]:
      id                time
    0  A 2016-11-25 16:32:17
    1  A 2016-11-25 16:36:04
    2  A 2016-11-25 16:35:29
    3  B 2016-11-25 16:35:24
    4  B 2016-11-25 16:35:46
    

    The result I want

    Out[40]:
       id   time
    0  A   00:35
    1  A   03:12
    2  B   00:22
    

    notice: the type of time col is timedelta64[ns]

    Trying

    In [38]: df['time'].diff(1)
    Out[38]:
    0                 NaT
    1            00:03:47
    2   -1 days +23:59:25
    3   -1 days +23:59:55
    4            00:00:22
    Name: time, dtype: timedelta64[ns]
    

    Don't get desired result.

    Hope

    Not only solve the problem but the code can run fast because there are 50 million rows.

    • EdChum
      EdChum over 7 years
      You need to sort the data first df = df.sort_values(['id','time']), then you can do df.groupby('id')['time'].diff()
  • LeMarque
    LeMarque almost 2 years
    thanks for this trick. but If i needed something in only seconds? your answer will return something like 0 days 02:20:25
  • LeMarque
    LeMarque almost 2 years
    i used and it worked... thanks Jezrael ... df.time = df.sort_values(['id','time']).groupby('id')['time'].diff().d‌​t.total_seconds()