Group DataFrame in 5-minute intervals
Solution 1
You can use df.resample
to do aggregation based on a date/time variable. You'll need a datetime index and you can specify that while reading the csv file:
df = pd.read_csv("filename.csv", parse_dates = [["DATE", "TIME"]], index_col=0)
This will result in a dataframe with an index where date and time are combined (source):
df.head()
Out[7]:
OPEN HIGH LOW CLOSE VOLUME
DATE_TIME
1997-02-03 09:04:00 3046.0 3048.5 3046.0 3047.5 505
1997-02-03 09:05:00 3047.0 3048.0 3046.0 3047.0 162
1997-02-03 09:06:00 3047.5 3048.0 3047.0 3047.5 98
1997-02-03 09:07:00 3047.5 3047.5 3047.0 3047.5 228
1997-02-03 09:08:00 3048.0 3048.0 3047.5 3048.0 136
After that you can use resample to get the sum, mean, etc. of those five minute intervals.
df.resample("5T").mean()
Out[8]:
OPEN HIGH LOW CLOSE VOLUME
DATE_TIME
1997-02-03 09:00:00 3046.0 3048.5 3046.0 3047.5 505.0
1997-02-03 09:05:00 3047.6 3047.9 3046.8 3047.3 159.6
1997-02-03 09:10:00 3045.6 3045.9 3044.8 3045.0 110.2
1997-02-03 09:15:00 3043.6 3044.0 3042.8 3043.2 69.2
1997-02-03 09:20:00 3044.7 3045.2 3044.5 3045.0 65.8
1997-02-03 09:25:00 3043.8 3044.0 3043.5 3043.7 59.0
1997-02-03 09:30:00 3044.6 3045.0 3044.3 3044.6 56.0
1997-02-03 09:35:00 3044.5 3044.5 3043.5 3044.5 44.0
(T is used for minute frequency. Here is a list of other units.)
Solution 2
Another way using pandas
is to use its TimeGrouper
-function.
Its purpose is just meant for use cases like yours.
import pandas as pd
df = pd.DataFrame("Your data provided above")
df["DATE"] = pd.to_datetime(df["DATE"])
df.set_index("DATE", inplace=True)
df = df.groupby(pd.TimeGrouper('5Min')).agg({
"OPEN": "first",
"HIGH": "max",
"LOW": "min",
"CLOSE": "last",
"VOLUME": "sum"
})
The provided script uses an aggregation you might have in mind since you're dealing with stock-data. It aggregates in a way that you will end up with the 5-min candles resulting from your 1-min candles.
Solution 3
slight modification to Markus answer. It groups and assign it to last index
df_close_left = data_set.groupby(pd.Grouper(freq='5Min',closed='right',label='right')).agg({
"open": "first",
"high": "max",
"low": "min",
"close": "last",
"volume": "sum"
})
Sam c21
Updated on July 28, 2022Comments
-
Sam c21 almost 2 years
How do I get just the 5 minute data using Python/pandas out of this csv? For every 5 minute interval I'm trying to get the DATE, TIME,OPEN, HIGH, LOW, CLOSE, VOLUME for that 5 minute interval.
DATE TIME OPEN HIGH LOW CLOSE VOLUME 02/03/1997 09:04:00 3046.00 3048.50 3046.00 3047.50 505 02/03/1997 09:05:00 3047.00 3048.00 3046.00 3047.00 162 02/03/1997 09:06:00 3047.50 3048.00 3047.00 3047.50 98 02/03/1997 09:07:00 3047.50 3047.50 3047.00 3047.50 228 02/03/1997 09:08:00 3048.00 3048.00 3047.50 3048.00 136 02/03/1997 09:09:00 3048.00 3048.00 3046.50 3046.50 174 02/03/1997 09:10:00 3046.50 3046.50 3045.00 3045.00 134 02/03/1997 09:11:00 3045.50 3046.00 3044.00 3045.00 43 02/03/1997 09:12:00 3045.00 3045.50 3045.00 3045.00 214 02/03/1997 09:13:00 3045.50 3045.50 3045.50 3045.50 8 02/03/1997 09:14:00 3045.50 3046.00 3044.50 3044.50 152
-
Sam c21 about 8 yearsI can't get this to work when I tried ("5T") or even ("H") and i get this DatetimeIndexResampler [freq=<5 * Minutes>, axis=0, closed=left, label=left, convention=start, base=0]. But when I run just df.resample it works fine just without the interval adjustment.
-
ayhan about 8 yearsCan you explain what type of adjustment are you trying to do?
-
ayhan about 8 years
DatetimeIndexResampler [freq=<5 * Minutes>, axis=0, closed=left, label=left, convention=start, base=0]
is the object as a result of calling resample. You can apply methods on that object (like taking the mean, standard deviation etc.). If you don't call any method it will just return the grouping. Can you add the example output you expect to the question? -
Sam c21 about 8 yearsI'm just trying to get just the 5 minute intervals without any kind of application methods like mean, standard deviation. Pretty much exactly what you did without mean just the 5 minute intervals
-
Sam c21 about 8 yearsprobably to get the answer I'm looking for is (highest high for high) and (lowest low for low) and the (first interval open for open) and (last interval close for close) and (sum the volume) all that inside 5 minute interval. I'm looking it up now to see if that's possible.
-
Sam c21 about 8 yearsI think I got I looked up resample thanks for teaching me that. And I'm just just gonna do it by each column and then combine all my columns. Really appreciate it. The hardest thing is finding all these things and learning how to use them correctly.
-
defender almost 4 yearsGood solution. But It groups 5 min data and add to the beginning of the time frame. Ex: it groups 9:00,9:01,9:022,9:03,9:04, and assign time value as 9:00 .But it is expected to group 9:01,9:022,9:03,9:04,9:05 and set it as 9:05
-
Markus almost 4 years@defender: You are correct in describing its behaviour :) However, for financial-data this is an often used aggregation and in general very common (this is also why the pandas-devs implemented it like this), so I would not describe it as unexpected or undesired result. But of course it important is to point out the way it works so that anybody knows what he/she is getting out of it - Thx for that! :)
-
Abhilash Awasthi over 2 yearsIn recent versions of pandas, use
pd.Grouper
-
Markus over 2 years@AbhilashAwasthi: Not entirely true, see the discussion between defender and me; the result varies.
-
Abhilash Awasthi over 2 years@Markus I was just referring the point that
pd.TimeGrouper
is deprecated in new versions of pandas and one should usepd.Grouper
. See this - stackoverflow.com/questions/45156289/… . I was not commenting on result variation. -
Je Je about 2 yearslink to pandas, as few parameters changed: pandas.pydata.org/docs/reference/api/…