Fixed effect in Pandas or Statsmodels
Solution 1
As noted in the comments, PanelOLS has been removed from Pandas as of version 0.20.0. So you really have three options:
1. If you use Python 3, you can use linearmodels, as specified in the more recent answer: https://stackoverflow.com/a/44836199/3435183
2. Just specify various dummies in your statsmodels specification, e.g. using pd.get_dummies. This may not be feasible if the number of fixed effects is large.
3. Or do some groupby-based demeaning and then use statsmodels (this would work if you're estimating lots of fixed effects). Here is a barebones version of what you could do for one-way fixed effects:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy

def areg(formula, data=None, absorb=None, cluster=None):
    y, X = patsy.dmatrices(formula, data, return_type='dataframe')
    # Demean y and X within each absorbed group, adding back the overall mean
    ybar = y.mean()
    y = y - y.groupby(data[absorb]).transform('mean') + ybar
    Xbar = X.mean()
    X = X - X.groupby(data[absorb]).transform('mean') + Xbar
    reg = sm.OLS(y, X)
    # Account for df loss from FE transform
    reg.df_resid -= (data[absorb].nunique() - 1)
    return reg.fit(cov_type='cluster', cov_kwds={'groups': data[cluster].values})
For example, suppose you have a panel of stock data (stock returns and other stock data for all stocks, every month over a number of months), and you want to regress returns on lagged returns with calendar-month fixed effects (where the calendar month variable is called caldt), and you also want to cluster the standard errors by calendar month. You can estimate such a fixed effect model with the following:
reg0 = areg('ret~retlag',data=df,absorb='caldt',cluster='caldt')
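Options 2 and 3 above should give the same slope estimate, since demeaning within groups is equivalent to including one dummy per group. The following is a minimal sketch of that equivalence, not the answer's own code: the toy panel, the column names g, x and y, and all the numbers are made up for illustration, and plain numpy least squares is used in place of statsmodels to keep the block self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: 3 groups with 4 observations each (made-up numbers).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "g": np.repeat(["a", "b", "c"], 4),
    "x": rng.normal(size=12),
})
# Group-specific intercepts play the role of the fixed effects.
alpha = df["g"].map({"a": 1.0, "b": -2.0, "c": 0.5})
df["y"] = alpha + 0.7 * df["x"] + rng.normal(scale=0.1, size=12)

# Option 2: one dummy column per group (no separate constant), then least squares.
dummies = pd.get_dummies(df["g"], dtype=float)
X_lsdv = np.column_stack([df["x"].to_numpy(), dummies.to_numpy()])
beta_lsdv = np.linalg.lstsq(X_lsdv, df["y"].to_numpy(), rcond=None)[0][0]

# Option 3: subtract group means from y and x, then regress without dummies.
y_dm = df["y"] - df.groupby("g")["y"].transform("mean")
x_dm = df["x"] - df.groupby("g")["x"].transform("mean")
beta_within = float((x_dm * y_dm).sum() / (x_dm ** 2).sum())

print(beta_lsdv, beta_within)  # the two slope estimates agree up to rounding
```

The dummy approach builds a design matrix with one column per group, which is why it becomes impractical with many fixed effects, while the demeaning approach only needs a groupby.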
And here is what you can do if using an older version of Pandas:
An example with time fixed effects using pandas' PanelOLS (which is in the plm module). Notice the import of PanelOLS:
>>> from pandas.stats.plm import PanelOLS
>>> df
                 y    x
date       id
2012-01-01 1   0.1  0.2
           2   0.3  0.5
           3   0.4  0.8
           4   0.0  0.2
2012-02-01 1   0.2  0.7
           2   0.4  0.5
           3   0.2  0.3
           4   0.1  0.1
2012-03-01 1   0.6  0.9
           2   0.7  0.5
           3   0.9  0.6
           4   0.4  0.5
Note that the dataframe must have a MultiIndex set; PanelOLS determines the time and entity effects based on the index:
>>> reg = PanelOLS(y=df['y'],x=df[['x']],time_effects=True)
>>> reg
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x>
Number of Observations: 12
Number of Degrees of Freedom: 4
R-squared: 0.2729
Adj R-squared: 0.0002
Rmse: 0.1588
F-stat (1, 8): 1.0007, p-value: 0.3464
Degrees of Freedom: model 3, resid 8
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 0.3694 0.2132 1.73 0.1214 -0.0485 0.7872
---------------------------------End of Summary---------------------------------
Docstring:
PanelOLS(self, y, x, weights = None, intercept = True, nw_lags = None,
entity_effects = False, time_effects = False, x_effects = None,
cluster = None, dropped_dummies = None, verbose = False,
nw_overlap = False)
Implements panel OLS.
See ols function docs
This is another function (like fama_macbeth) where I believe the plan is to move this functionality to statsmodels.
Solution 2
There is a package called linearmodels (https://pypi.org/project/linearmodels/) that has a fairly complete fixed effects and random effects implementation, including clustered standard errors. It does not use high-dimensional OLS to eliminate effects and so can be used with large data sets.
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS

# Outer is entity, inner is time
entity = list(map(chr,range(65,91)))
time = list(pd.date_range('1-1-2014',freq='A', periods=4))
index = pd.MultiIndex.from_product([entity, time])
df = pd.DataFrame(np.random.randn(26*4, 2),index=index, columns=['y','x'])
mod = PanelOLS(df.y, df.x, entity_effects=True)
res = mod.fit(cov_type='clustered', cluster_entity=True)
print(res)
This produces the following output:
PanelOLS Estimation Summary
================================================================================
Dep. Variable: y R-squared: 0.0029
Estimator: PanelOLS R-squared (Between): -0.0109
No. Observations: 104 R-squared (Within): 0.0029
Date: Thu, Jun 29 2017 R-squared (Overall): -0.0007
Time: 23:52:28 Log-likelihood -125.69
Cov. Estimator: Clustered
F-statistic: 0.2256
Entities: 26 P-value 0.6362
Avg Obs: 4.0000 Distribution: F(1,77)
Min Obs: 4.0000
Max Obs: 4.0000 F-statistic (robust): 0.1784
P-value 0.6739
Time periods: 4 Distribution: F(1,77)
Avg Obs: 26.000
Min Obs: 26.000
Max Obs: 26.000
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
x 0.0573 0.1356 0.4224 0.6739 -0.2127 0.3273
==============================================================================
F-test for Poolability: 1.0903
P-value: 0.3739
Distribution: F(25,77)
Included effects: Entity
It also has a formula interface which is similar to that of statsmodels:
mod = PanelOLS.from_formula('y ~ x + EntityEffects', df)
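The example above builds its index with MultiIndex.from_product, but real data often starts as a flat table with the entity and time stored in ordinary columns. A minimal sketch of getting from that shape to the "outer is entity, inner is time" layout follows; the column names firm and year and the values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical flat ("long") table; column names and values are made up.
flat = pd.DataFrame({
    "firm": ["A", "A", "B", "B"],
    "year": [2014, 2015, 2014, 2015],
    "y": [0.1, 0.3, 0.4, 0.0],
    "x": [0.2, 0.5, 0.8, 0.2],
})

# Entity goes in the outer index level, time in the inner level.
panel = flat.set_index(["firm", "year"]).sort_index()
print(panel.index.names)
```

A frame indexed this way has the same shape as the df used above, so it could presumably be passed to PanelOLS or PanelOLS.from_formula in the same manner.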
user3576212
Updated on October 07, 2020
Comments
-
user3576212 over 3 years
Is there an existing function to estimate fixed effects (one-way or two-way) in Pandas or Statsmodels?
There used to be a function in Statsmodels, but it seems to be discontinued. And in Pandas, there is something called plm, but I can't import it or run it using pd.plm().
-
ely almost 10 years
Since fixed effects is fully equivalent to OLS with properly demeaned target variables, why don't you just do the demeaning first and then run OLS, like this set of examples? I hope this is for some assignment or something though, because as a Bayesian it makes me sad, since every time someone uses fixed effects an angel loses its wings.
-
ely almost 10 years
@user3576212 That is unfortunate. It is very common in certain segments of social science, especially psychology and economics, that students are told to use techniques like fixed effects, but they never learn the real theory behind it. These methods are deeply flawed when used in real world settings and should never be used blindly as part of a software package, at least not until you have mastered the real theory behind it. You may find more help asking over at Cross-Validated.
-
ely almost 10 years
You're free to use whatever tools you want. I'm just saying that working in finance doing quant research has made me appreciate the criticisms of these methods more. They are not good for solving precisely the problems they are claimed to solve (such as cross-sectional correlation). It's similar with other very bad methods, like Fama-Macbeth regression. I'm not talking about anything academic, just applied econ research.
-
Josef almost 10 years
If you use the time index or group index id as a categorical variable in a formula for statsmodels ols, then it creates the fixed effects dummies for you. However, removing the fixed effects by demeaning is not yet supported.
-
user3576212 almost 10 years
@Karl D. Thanks a lot, your answers are always very useful!
-
petobens about 9 years
Can I use random effects with pandas? I'm looking for something similar to Stata's xtreg, re. Thanks!
-
user3820991 over 6 years
The correct answer should be changed to this, because PanelOLS has been dropped from pandas in 0.20 and I also cannot find it in statsmodels. bashtage.github.io/linearmodels/doc/panel/pandas.html
-
istewart over 6 years
Buyer beware: linearmodels requires Python 3.
-
cadama over 5 years
Also, it does not make out-of-sample predictions. You have to code that yourself.
-
Karl D. over 5 years
I haven't looked at their code but I imagine that linearmodels is approaching fixed effects like I do in option #3 in my outline above. That's going to be pretty efficient because it avoids doing matrix decomposition with very large matrices filled with dummy variables.
-
TiTo almost 4 years
I'm not sure I understand the function from the 3rd option correctly. I understand that for data I'd include a df with the DV, all IVs/controls and the clusterID. For cluster I'd include cluster = 'clusterID'. But what do the formula and absorb parts do? How do I make use of them?
-
Karl D. almost 4 years
absorb refers to the variable that contains the fixed effects: for example, a datetime column if you're estimating time fixed effects. The parameter naming comes from the areg function in Stata, and formula just refers to using patsy formula notation for a regression (statsmodels uses that too).
-
Max Ghenis almost 4 years
-
Jason Goal over 3 years
How do you declare entity and time? I.e., how could this function know which variable is the entity and which is time? For those who run into this, check this: bashtage.github.io/linearmodels/panel/examples/…