Convert daily pandas stock data to monthly data using first trade day of the month

14,949

Solution 1

Instead of M you can pass MS as the resample rule:

df =pd.DataFrame( range(72), index = pd.date_range('1/1/2011', periods=72, freq='D'))

#df.resample('MS', how = 'mean')    # pandas <0.18
df.resample('MS').mean()  # pandas >= 0.18

Updated to use the first business day of the month respecting US Federal Holidays:

df =pd.DataFrame( range(200), index = pd.date_range('12/1/2012', periods=200, freq='D'))

from pandas.tseries.offsets import CustomBusinessMonthBegin
from pandas.tseries.holiday import USFederalHolidayCalendar
bmth_us = CustomBusinessMonthBegin(calendar=USFederalHolidayCalendar())

df.resample(bmth_us).mean()

if you want custom starts of the month using the min month found in the data try this. (It isn't pretty, but it should work).

month_index =df.index.to_period('M')

min_day_in_month_index = pd.to_datetime(df.set_index(new_index, append=True).reset_index(level=0).groupby(level=0)['level_0'].min())

custom_month_starts =CustomBusinessMonthBegin(calendar = min_day_in_month_index)

Pass custom_start_months to the fist parameter of resample

Solution 2

Thank you J Bradley, your solution worked perfectly. I did have to upgrade my version of pandas from their official website though as the version installed via pip did not have CustomBusinessMonthBegin in pandas.tseries.offsets. My final code was:

#----- imports -----
import pandas as pd
from pandas.tseries.offsets import CustomBusinessMonthBegin
import pandas.io.data as web
#----- get sample data -----
df = web.get_data_yahoo('SPY', '2012-12-01', '2013-12-31')
#----- build custom calendar -----
month_index =df.index.to_period('M')
min_day_in_month_index = pd.to_datetime(df.set_index(month_index, append=True).reset_index(level=0).groupby(level=0)['Open'].min())
custom_month_starts = CustomBusinessMonthBegin(calendar = min_day_in_month_index)
#----- convert daily data to monthly data -----
ohlc_dict = {'Open':'first','High':'max','Low':'min','Close': 'last','Volume': 'sum','Adj Close': 'last'}
mthly_ohlcva = df.resample(custom_month_starts, how=ohlc_dict)

This yielded the following:

>>> mthly_ohlcva
                Volume  Adj Close    High     Low   Close    Open
Date                                                             
2012-12-03  2889875900     136.92  145.58  139.54  142.41  142.80
2013-01-01  2587140200     143.92  150.94  144.73  149.70  145.11
2013-02-01  2581459300     145.76  153.28  148.73  151.61  150.65
2013-03-01  2330972300     151.30  156.85  150.41  156.67  151.09
2013-04-01  2907035000     154.20  159.72  153.55  159.68  156.59
2013-05-01  2781596000     157.84  169.07  158.10  163.45  159.33
2013-06-03  3533321800     155.74  165.99  155.73  160.42  163.83
2013-07-01  2330904500     163.78  169.86  160.22  168.71  161.26
2013-08-01  2283131700     158.87  170.97  163.05  163.65  169.99
2013-09-02  2226749600     163.90  173.60  163.70  168.01  165.23
2013-10-01  2901739000     171.49  177.51  164.53  175.79  168.14
2013-11-01  1930952900     176.57  181.75  174.76  181.00  176.02
2013-12-02  2232775900     181.15  184.69  177.32  184.69  181.09
Share:
14,949
user2766344
Author by

user2766344

Updated on June 06, 2022

Comments

  • user2766344
    user2766344 almost 2 years

    I have a set of calculated OHLCVA daily securities data in a pandas dataframe like this:

    >>> type(data_dy)
    <class 'pandas.core.frame.DataFrame'>
    >>> data_dy
                  Open    High     Low   Close     Volume  Adj Close
    Date                                                            
    2012-12-28  140.64  141.42  139.87  140.03  148806700     134.63
    2012-12-31  139.66  142.56  139.54  142.41  243935200     136.92
    2013-01-02  145.11  146.15  144.73  146.06  192059000     140.43
    2013-01-03  145.99  146.37  145.34  145.73  144761800     140.11
    2013-01-04  145.97  146.61  145.67  146.37  116817700     140.72
    
    [5 rows x 6 columns]
    

    I'm using the following dictionary and the pandas resample function to convert the dataframe to monthly data:

    >>> ohlc_dict = {'Open':'first','High':'max','Low':'min','Close': 'last','Volume': 'sum','Adj Close': 'last'}
    
    >>> data_dy.resample('M', how=ohlc_dict, closed='right', label='right')
                   Volume  Adj Close    High     Low   Close    Open
    Date                                                            
    2012-12-31  392741900     136.92  142.56  139.54  142.41  140.64
    2013-01-31  453638500     140.72  146.61  144.73  146.37  145.11
    
    [2 rows x 6 columns]
    

    This does the calculations correctly, but I'd like to use the Yahoo! date convention for monthly data of using the first trading day of the period rather than the last calendar day of the period that pandas uses.

    So I'd like the answer set to be:

                   Volume  Adj Close    High     Low   Close    Open
    Date                                                            
    2012-12-28  392741900     136.92  142.56  139.54  142.41  140.64
    2013-01-02  453638500     140.72  146.61  144.73  146.37  145.11
    

    I could do this by converting the daily data to a python list, process the data and return the data to a dataframe, but how do can this be done with pandas?

  • user2766344
    user2766344 over 9 years
    Thank you, I wasn't aware of this option. However, this results in the month start date, not the oldest date of the month in the data. See here: >>> data_dy.resample('MS', how=ohlc_dict) Volume Adj Close High Low Close Open Date 2012-12-01 392741900 136.92 142.56 139.54 142.41 140.64 2013-01-01 453638500 140.72 146.61 144.73 146.37 145.11 [2 rows x 6 columns]
  • user2766344
    user2766344 over 9 years
    I also found this list of example resample options here: stackoverflow.com/questions/17001389/….
  • user2766344
    user2766344 over 9 years
    Trying the 'BMS' option yields: >>> data_dy.resample('BMS', how=ohlc_dict) Volume Adj Close High Low Close Open Date 2012-12-03 392741900 136.92 142.56 139.54 142.41 140.64 2013-01-01 453638500 140.72 146.61 144.73 146.37 145.11 Still not quite what I was looking for.
  • JAB
    JAB over 9 years
    I am not clear on what 2012-12-28 represents? The first day of which period?
  • JAB
    JAB over 9 years
    The original question seems to ask for the first trading day of the month. Is this correct? Your comment makes me think otherwise.
  • user2766344
    user2766344 over 9 years
    I'm sorry if that confused things. In my example I only listed truncated data, but in the truncated data set it means the first trading day on December 2012 (obviously not a real situation). I have calculated data for non-us markets, so the first trading day of the month varies between markets, depending on local holidays. It would be most flexible if the date returned is the date from the first data row in each month and use the dictionary rules for the other columns.
  • JAB
    JAB over 9 years
    If you create a timeseries index of the first dates in the data by month and pass them to thecalendar parameter in CustomBusinessMonthBegin this should work.
  • user2766344
    user2766344 over 9 years
    @J_Bradley Thank you for your patience and expertise. I will try this solution.
  • JAB
    JAB over 9 years
    @user2766344 you are welcome. Update to create an timeseries index using the minimum date in the index. There is likely a more elegant way to do this, but I am not aware of it.
  • TravelTrader
    TravelTrader over 4 years
    AttributeError: 'Index' object has no attribute 'to_period' - any ideas how to solve that?
  • TravelTrader
    TravelTrader over 4 years
    Sounded great but even with df.index = pd.to_datetime(df.index, format='%Y-%m-%d', errors='ignore') there's an error: TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'. Maybe someone has a solution?