Group DataFrame in 5-minute intervals

python python-2.7 datetime pandas dataframe

19,488

Solution 1

You can use df.resample to do aggregation based on a date/time variable. You'll need a datetime index and you can specify that while reading the csv file:

df = pd.read_csv("filename.csv", parse_dates = [["DATE", "TIME"]], index_col=0)

This will result in a dataframe with an index where date and time are combined (source):

df.head()
Out[7]: 
                       OPEN    HIGH     LOW   CLOSE  VOLUME 
DATE_TIME                                                   
1997-02-03 09:04:00  3046.0  3048.5  3046.0  3047.5      505
1997-02-03 09:05:00  3047.0  3048.0  3046.0  3047.0      162
1997-02-03 09:06:00  3047.5  3048.0  3047.0  3047.5       98
1997-02-03 09:07:00  3047.5  3047.5  3047.0  3047.5      228
1997-02-03 09:08:00  3048.0  3048.0  3047.5  3048.0      136

After that you can use resample to get the sum, mean, etc. of those five minute intervals.

df.resample("5T").mean()
Out[8]: 
                       OPEN    HIGH     LOW   CLOSE  VOLUME 
DATE_TIME                                                   
1997-02-03 09:00:00  3046.0  3048.5  3046.0  3047.5    505.0
1997-02-03 09:05:00  3047.6  3047.9  3046.8  3047.3    159.6
1997-02-03 09:10:00  3045.6  3045.9  3044.8  3045.0    110.2
1997-02-03 09:15:00  3043.6  3044.0  3042.8  3043.2     69.2
1997-02-03 09:20:00  3044.7  3045.2  3044.5  3045.0     65.8
1997-02-03 09:25:00  3043.8  3044.0  3043.5  3043.7     59.0
1997-02-03 09:30:00  3044.6  3045.0  3044.3  3044.6     56.0
1997-02-03 09:35:00  3044.5  3044.5  3043.5  3044.5     44.0

(T is used for minute frequency. Here is a list of other units.)

Solution 2

Another way using pandas is to use its TimeGrouper-function. Its purpose is just meant for use cases like yours.

import pandas as pd

df = pd.DataFrame("Your data provided above")
df["DATE"] = pd.to_datetime(df["DATE"])
df.set_index("DATE", inplace=True)

df = df.groupby(pd.TimeGrouper('5Min')).agg({
                                        "OPEN":  "first",
                                        "HIGH":  "max",
                                        "LOW":   "min",
                                        "CLOSE": "last",
                                        "VOLUME": "sum"
                                    })

The provided script uses an aggregation you might have in mind since you're dealing with stock-data. It aggregates in a way that you will end up with the 5-min candles resulting from your 1-min candles.

Solution 3

slight modification to Markus answer. It groups and assign it to last index

df_close_left = data_set.groupby(pd.Grouper(freq='5Min',closed='right',label='right')).agg({
                                        "open":  "first",
                                        "high":  "max",
                                        "low":   "min",
                                        "close": "last",
                                        "volume": "sum"

                                    })

19,488

Author by

Sam c21

Updated on July 28, 2022

Comments

Sam c21 almost 2 years

How do I get just the 5 minute data using Python/pandas out of this csv? For every 5 minute interval I'm trying to get the DATE, TIME,OPEN, HIGH, LOW, CLOSE, VOLUME for that 5 minute interval.

DATE       TIME     OPEN    HIGH    LOW     CLOSE   VOLUME
02/03/1997 09:04:00 3046.00 3048.50 3046.00 3047.50 505          
02/03/1997 09:05:00 3047.00 3048.00 3046.00 3047.00 162          
02/03/1997 09:06:00 3047.50 3048.00 3047.00 3047.50 98           
02/03/1997 09:07:00 3047.50 3047.50 3047.00 3047.50 228          
02/03/1997 09:08:00 3048.00 3048.00 3047.50 3048.00 136          
02/03/1997 09:09:00 3048.00 3048.00 3046.50 3046.50 174          
02/03/1997 09:10:00 3046.50 3046.50 3045.00 3045.00 134          
02/03/1997 09:11:00 3045.50 3046.00 3044.00 3045.00 43           
02/03/1997 09:12:00 3045.00 3045.50 3045.00 3045.00 214          
02/03/1997 09:13:00 3045.50 3045.50 3045.50 3045.50 8            
02/03/1997 09:14:00 3045.50 3046.00 3044.50 3044.50 152

Sam c21 about 8 years

I can't get this to work when I tried ("5T") or even ("H") and i get this DatetimeIndexResampler [freq=<5 * Minutes>, axis=0, closed=left, label=left, convention=start, base=0]. But when I run just df.resample it works fine just without the interval adjustment.
ayhan about 8 years

Can you explain what type of adjustment are you trying to do?
ayhan about 8 years

DatetimeIndexResampler [freq=<5 * Minutes>, axis=0, closed=left, label=left, convention=start, base=0] is the object as a result of calling resample. You can apply methods on that object (like taking the mean, standard deviation etc.). If you don't call any method it will just return the grouping. Can you add the example output you expect to the question?
Sam c21 about 8 years

I'm just trying to get just the 5 minute intervals without any kind of application methods like mean, standard deviation. Pretty much exactly what you did without mean just the 5 minute intervals
Sam c21 about 8 years

probably to get the answer I'm looking for is (highest high for high) and (lowest low for low) and the (first interval open for open) and (last interval close for close) and (sum the volume) all that inside 5 minute interval. I'm looking it up now to see if that's possible.
Sam c21 about 8 years

I think I got I looked up resample thanks for teaching me that. And I'm just just gonna do it by each column and then combine all my columns. Really appreciate it. The hardest thing is finding all these things and learning how to use them correctly.
defender almost 4 years

Good solution. But It groups 5 min data and add to the beginning of the time frame. Ex: it groups 9:00,9:01,9:022,9:03,9:04, and assign time value as 9:00 .But it is expected to group 9:01,9:022,9:03,9:04,9:05 and set it as 9:05
Markus almost 4 years

@defender: You are correct in describing its behaviour :) However, for financial-data this is an often used aggregation and in general very common (this is also why the pandas-devs implemented it like this), so I would not describe it as unexpected or undesired result. But of course it important is to point out the way it works so that anybody knows what he/she is getting out of it - Thx for that! :)
Abhilash Awasthi over 2 years

In recent versions of pandas, use pd.Grouper
Markus over 2 years

@AbhilashAwasthi: Not entirely true, see the discussion between defender and me; the result varies.
Abhilash Awasthi over 2 years

@Markus I was just referring the point that pd.TimeGrouper is deprecated in new versions of pandas and one should use pd.Grouper. See this - stackoverflow.com/questions/45156289/… . I was not commenting on result variation.
Je Je about 2 years

link to pandas, as few parameters changed: pandas.pydata.org/docs/reference/api/…