Pandas rolling regression: alternatives to looping

12,709

Solution 1

I created an ols module designed to mimic pandas' deprecated MovingOLS; it is here.

It has three core classes:

  • OLS : static (single-window) ordinary least-squares regression. The output are NumPy arrays
  • RollingOLS : rolling (multi-window) ordinary least-squares regression. The output are higher-dimension NumPy arrays.
  • PandasRollingOLS : wraps the results of RollingOLS in pandas Series & DataFrames. Designed to mimic the look of the deprecated pandas module.

Note that the module is part of a package (which I'm currently in the process of uploading to PyPi) and it requires one inter-package import.

The first two classes above are implemented entirely in NumPy and primarily use matrix algebra. RollingOLS takes advantage of broadcasting extensively also. Attributes largely mimic statsmodels' OLS RegressionResultsWrapper.

An example:

import urllib.parse
import pandas as pd
from pyfinance.ols import PandasRollingOLS

# You can also do this with pandas-datareader; here's the hard way
url = "https://fred.stlouisfed.org/graph/fredgraph.csv"

syms = {
    "TWEXBMTH" : "usd", 
    "T10Y2YM" : "term_spread", 
    "GOLDAMGBD228NLBM" : "gold",
}

params = {
    "fq": "Monthly,Monthly,Monthly",
    "id": ",".join(syms.keys()),
    "cosd": "2000-01-01",
    "coed": "2019-02-01",
}

data = pd.read_csv(
    url + "?" + urllib.parse.urlencode(params, safe=","),
    na_values={"."},
    parse_dates=["DATE"],
    index_col=0
).pct_change().dropna().rename(columns=syms)
print(data.head())
#                  usd  term_spread      gold
# DATE                                       
# 2000-02-01  0.012580    -1.409091  0.057152
# 2000-03-01 -0.000113     2.000000 -0.047034
# 2000-04-01  0.005634     0.518519 -0.023520
# 2000-05-01  0.022017    -0.097561 -0.016675
# 2000-06-01 -0.010116     0.027027  0.036599

y = data.usd
x = data.drop('usd', axis=1)

window = 12  # months
model = PandasRollingOLS(y=y, x=x, window=window)

print(model.beta.head())  # Coefficients excluding the intercept
#             term_spread      gold
# DATE                             
# 2001-01-01     0.000033 -0.054261
# 2001-02-01     0.000277 -0.188556
# 2001-03-01     0.002432 -0.294865
# 2001-04-01     0.002796 -0.334880
# 2001-05-01     0.002448 -0.241902

print(model.fstat.head())
# DATE
# 2001-01-01    0.136991
# 2001-02-01    1.233794
# 2001-03-01    3.053000
# 2001-04-01    3.997486
# 2001-05-01    3.855118
# Name: fstat, dtype: float64

print(model.rsq.head())  # R-squared
# DATE
# 2001-01-01    0.029543
# 2001-02-01    0.215179
# 2001-03-01    0.404210
# 2001-04-01    0.470432
# 2001-05-01    0.461408
# Name: rsq, dtype: float64

Solution 2

Use a custom rolling apply function.

import numpy as np

df['slope'] = df.values.rolling(window=125).apply(lambda x: np.polyfit(np.array(range(0,125)), x, 1)[0], raw=True)
Share:
12,709
Brad Solomon
Author by

Brad Solomon

"There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies." Stack that I'm most comfortable with: python javascript bash git postgresql redis Where I'm spending my free time: go cython c++ scala

Updated on June 07, 2022

Comments

  • Brad Solomon
    Brad Solomon almost 2 years

    I got good use out of pandas' MovingOLS class (source here) within the deprecated stats/ols module. Unfortunately, it was gutted completely with pandas 0.20.

    The question of how to run rolling OLS regression in an efficient manner has been asked several times (here, for instance), but phrased a little broadly and left without a great answer, in my view.

    Here are my questions:

    1. How can I best mimic the basic framework of pandas' MovingOLS? The most attractive feature of this class was the ability to view multiple methods/attributes as separate time series--i.e. coefficients, r-squared, t-statistics, etc without needing to re-run regression. For example, you could create something like model = pd.MovingOLS(y, x) and then call .t_stat, .rmse, .std_err, and the like. In the example below, conversely, I don't see a way around being forced to compute each statistic separately. Is there a method that doesn't involve creating sliding/rolling "blocks" (strides) and running regressions/using linear algebra to get model parameters for each?

    2. More broadly, what's going on under the hood in pandas that makes rolling.apply not able to take more complex functions?* When you create a .rolling object, in layman's terms, what's going on internally--is it fundamentally different from looping over each window and creating a higher-dimensional array as I'm doing below?

    *Namely, func passed to .apply:

    Must produce a single value from an ndarray input *args and **kwargs are passed to the function

    Here's where I'm currently at with some sample data, regressing percentage changes in the trade weighted dollar on interest rate spreads and the price of copper. (This doesn't make a ton of sense; just picked these randomly.) I've taken it out of a class-based implementation and tried to strip it down to a simpler script.

    from datetime import date
    from pandas_datareader.data import DataReader
    import statsmodels.formula.api as smf
    
    syms = {'TWEXBMTH' : 'usd', 
            'T10Y2YM' : 'term_spread', 
            'PCOPPUSDM' : 'copper'
           }
    
    start = date(2000, 1, 1)
    data = (DataReader(syms.keys(), 'fred', start)
            .pct_change()
            .dropna())
    data = data.rename(columns = syms)
    data = data.assign(intercept = 1.) # required by statsmodels OLS
    
    def sliding_windows(x, window):
        """Create rolling/sliding windows of length ~window~.
    
        Given an array of shape (y, z), it will return "blocks" of shape
        (x - window + 1, window, z)."""
    
        return np.array([x[i:i + window] for i 
                        in range(0, x.shape[0] - window + 1)])
    
    data.head(3)
    Out[33]: 
                     usd  term_spread    copper  intercept
    DATE                                                  
    2000-02-01  0.012573    -1.409091 -0.019972        1.0
    2000-03-01 -0.000079     2.000000 -0.037202        1.0
    2000-04-01  0.005642     0.518519 -0.033275        1.0
    
    window = 36
    wins = sliding_windows(data.values, window=window)
    y, x = wins[:, :, 0], wins[:, :, 1:]
    
    coefs = []
    
    for endog, exog in zip(y, x):
        model = smf.OLS(endog, exog).fit()
            # The full set of model attributes gets lost with each loop
        coefs.append(model.params)
    
    df = pd.DataFrame(coefs, columns=data.iloc[:, 1:].columns,
                      index=data.index[window - 1:])
    
    df.head(3) # rolling 36m coefficients
    Out[70]: 
                term_spread    copper  intercept
    DATE                                        
    2003-01-01    -0.000122 -0.018426   0.001937
    2003-02-01     0.000391 -0.015740   0.001597
    2003-03-01     0.000655 -0.016811   0.001546
    
  • Josef
    Josef almost 7 years
    The main problem with this approach is that keeping all OLS instances around requires a lot of memory.
  • Brad Solomon
    Brad Solomon over 6 years
    @user333700 I've made some major changes that should be much kinder to memory if you're interested in taking a look.