pandas rolling sum of last five minutes

14,004

In general, if the dates are completely arbitrary, I think you would be forced to use a Python for-loop over the rows or use df.apply, (which under the hood, also uses a Python loop.)

However, if your Dates share a common frequency, as is the case above, then there is a trick which should be much quicker than using df.apply: Expand the timeseries according to the common frequency -- in this case, 1 minute -- fill in the NaNs with zeros, and then call rolling_sum:

In [279]: pd.rolling_sum(df.set_index(['Date']).asfreq('1T').fillna(0), window=5, min_periods=1).reindex(df['Date'])
Out[279]: 
                      A
Date                   
2014-11-21 11:00:00   1
2014-11-21 11:03:00   5
2014-11-21 11:04:00   6
2014-11-21 11:05:00   7
2014-11-21 11:07:00  11
2014-11-21 11:08:00   8
2014-11-21 11:12:00   2
2014-11-21 11:13:00   3

Of course, any time series has a common frequency if you are willing to accept a small enough granularity, but the required size of df.asfreq(...) may make this trick impractical.


Here is an example of the more general approach using df.apply. Note that calling searchsorted relies on df['Date'] being in sorted order.

import numpy as np
import pandas as pd
df = pd.read_csv('data', parse_dates=[0], sep=',\s*')
start_dates = df['Date'] - pd.Timedelta(minutes=5)
df['start_index'] = df['Date'].values.searchsorted(start_dates, side='right')
df['end_index'] = np.arange(len(df))

def sum_window(row):
    return df['A'].iloc[row['start_index']:row['end_index']+1].sum()
df['rolling_sum'] = df.apply(sum_window, axis=1)

print(df[['Date', 'A', 'rolling_sum']])

yields

                 Date  A  rolling_sum
0 2014-11-21 11:00:00  1            1
1 2014-11-21 11:03:00  4            5
2 2014-11-21 11:04:00  1            6
3 2014-11-21 11:05:00  2            7
4 2014-11-21 11:07:00  4           11
5 2014-11-21 11:08:00  1            8
6 2014-11-21 11:12:00  1            2
7 2014-11-21 11:13:00  2            3

Here is a benchmark comparing the df.asfreq trick versus calling df.apply:

import numpy as np
import pandas as pd
df = pd.read_csv('data', parse_dates=[0], sep=',\s*')

def big_df(df):
    df = df.copy()
    for i in range(7):
        dates = df['Date'] + pd.Timedelta(df.iloc[-1]['Date']-df.iloc[0]['Date']) + pd.Timedelta('1 minute')
        df2 = pd.DataFrame({'Date': dates, 'A': df['A']})
        df = pd.concat([df, df2])
    df = df.reset_index(drop=True)
    return df

def using_apply():
    start_dates = df['Date'] - pd.Timedelta(minutes=5)
    df['start_index'] = df['Date'].values.searchsorted(start_dates, side='right')
    df['end_index'] = np.arange(len(df))

    def sum_window(row):
        return df['A'].iloc[row['start_index']:row['end_index']+1].sum()

    df['rolling_sum'] = df.apply(sum_window, axis=1)
    return df[['Date', 'rolling_sum']]

def using_asfreq():
    result = (pd.rolling_sum(
        df.set_index(['Date']).asfreq('1T').fillna(0), 
        window=5, min_periods=1).reindex(df['Date']))
    return result

In [364]: df = big_df(df)

In [367]: %timeit using_asfreq()
1000 loops, best of 3: 1.21 ms per loop

In [368]: %timeit using_apply()
1 loops, best of 3: 208 ms per loop
Share:
14,004
Sajith Edirisinghe
Author by

Sajith Edirisinghe

Back-end Java developer who loves ReactJS.

Updated on June 09, 2022

Comments

  • Sajith Edirisinghe
    Sajith Edirisinghe about 2 years

    Assume I have below data frame

    Date, A
    2014-11-21 11:00:00, 1
    2014-11-21 11:03:00, 4
    2014-11-21 11:04:00, 1
    2014-11-21 11:05:00, 2
    2014-11-21 11:07:00, 4
    2014-11-21 11:08:00, 1
    2014-11-21 11:12:00, 1
    2014-11-21 11:13:00, 2
    

    First column is datetime object and second column is an integer. What I want is to calculate the sum of column 'A' for the last five minute for each row.

    As an example for the row 2014-11-21 11:12:00, 1, the sum of column 'A' would be 2(1+1) and the sum of column 'A' for the row 2014-11-21 11:05:00, 2 would be 7(2+1+4). Important thing is that the number of past rows for the time window(5 minutes) is not the same for each row (Because time series is irregular).

    How can I get the sum of last five minutes for column 'A' using the rolling_sum method in pandas? Thanks in advance.

  • Sajith Edirisinghe
    Sajith Edirisinghe over 9 years
    Thank you very much for the quick answer. I cannot use df.asfreq(...) method since the smallest granularity in my dataset is seconds and I have millions of rows. But df.apply method does the trick.
  • nilesh
    nilesh about 7 years
    The general approach works, only one thing to be kept in mind if its being used in long code: the function sum_window does not uses input df as explicitly so need to be careful.