Pandas Dataframe AttributeError: 'DataFrame' object has no attribute 'design_info'


Pickling and unpickling of a pandas DataFrame doesn't save and restore attributes that have been attached by a user, as far as I know.

Since the formula information is currently stored together with the DataFrame of the original design matrix, this information is lost after unpickling a Results and Model instance.

If you don't use categorical variables or transformations, then the correct design matrix can be built with patsy.dmatrix. I think the following should work:

import patsy

x = patsy.dmatrix("B + C", data=df)  # df is data for prediction
test2 = model.predict(x, transform=False)  # transform=False avoids calling patsy again

Constructing the design matrix for the prediction directly should also work. Note that we need to explicitly add the constant that the formula adds by default.

from statsmodels.api import add_constant
test2 = model.predict(add_constant(df[["B", "C"]]), transform=False)

If the formula and design matrix contain (stateful) transformations and categorical variables, then it's not possible to conveniently construct the design matrix without the original formula information. Constructing it by hand and doing all the calculations explicitly is difficult in this case, and loses all the advantages of using formulas.
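To illustrate why stateful transformations need the original formula information, here is a small sketch using patsy's built-in center() transform. The data frames train and new are made up for this example. Rebuilding the matrix from the formula on new data silently centers on the new mean instead of the training mean.

```python
import numpy as np
import pandas as pd
import patsy

# Made-up data for illustration only
train = pd.DataFrame({"B": [1.0, 2.0, 3.0]})
new = pd.DataFrame({"B": [10.0, 20.0]})

# center(B) is a stateful transform: it memorizes the training mean (2.0)
x_train = patsy.dmatrix("center(B)", data=train)
info = x_train.design_info

# Reusing the stored design_info applies the training mean to the new data
(x_ok,) = patsy.build_design_matrices([info], new)

# Re-running the formula on the new data centers on the new mean (15.0) instead
x_bad = patsy.dmatrix("center(B)", data=new)

print(np.asarray(x_ok)[:, 1])   # new values minus the training mean
print(np.asarray(x_bad)[:, 1])  # new values minus their own mean
```

The second matrix would give wrong predictions, because the model parameters were estimated with the training-data centering.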

The only real solution is to pickle the formula information design_info independently of the dataframe orig_exog.
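One way this could look in practice is sketched below. This is an assumption-laden variant, not a statsmodels feature. Depending on the patsy version, the design_info object itself may not be picklable, so the sketch stores the right-hand side of the formula and the training predictors next to the results and rebuilds design_info after loading. The bundle dictionary and its keys are hypothetical names chosen for this example.

```python
import pickle

import numpy as np
import pandas as pd
import patsy
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "A": [10, 20, 30, 324, 2353],
    "B": [20, 30, 10, 1, 2332],
    "C": [0, -30, 120, 11, 2],
})

rhs = "B + C"  # right-hand side of the model formula
result = smf.ols(formula="A ~ " + rhs, data=df).fit()
expected = np.asarray(result.predict(df))  # reference predictions before pickling

# Hypothetical bundle: the fitted results plus what is needed to rebuild design_info
bundle = {"result": result, "rhs": rhs, "train_exog": df[["B", "C"]]}
blob = pickle.dumps(bundle)

# ... later, possibly in another process ...
loaded = pickle.loads(blob)

# Rebuild the design information from the stored formula and training data
info = patsy.dmatrix(loaded["rhs"], data=loaded["train_exog"]).design_info

# Build the design matrix for the prediction data from the rebuilt design_info
(x_new,) = patsy.build_design_matrices([info], df)

# transform=False so that predict does not try to call patsy itself
predicted = np.asarray(loaded["result"].predict(np.asarray(x_new), transform=False))
print(np.allclose(predicted, expected))
```

Because design_info is rebuilt from the training data, stateful transformations and categorical levels are reconstructed the same way as at fit time, at the cost of storing the training predictors alongside the model.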

Author by

Michael

Updated on December 22, 2020

Comments

  • Michael
    Michael over 3 years

    I am trying to use the predict() function of the statsmodels.formula.api OLS implementation. When I pass a new data frame to the function to get predicted values for an out-of-sample dataset, result.predict(newdf) raises the following error: 'DataFrame' object has no attribute 'design_info'. What does this mean and how do I fix it? The full traceback is:

        p = result.predict(newdf)
      File "C:\Python27\lib\site-packages\statsmodels\base\model.py", line 878, in predict
        exog = dmatrix(self.model.data.orig_exog.design_info.builder,
      File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2088, in __getattr__
        (type(self).__name__, name))
    AttributeError: 'DataFrame' object has no attribute 'design_info'
    

    EDIT: Here is a reproducible example. The error appears to occur when I pickle and then unpickle the result object (which I need to do in my actual project):

    import cPickle  # Python 2 module; on Python 3 use `import pickle` instead
    import pandas as pd
    import numpy as np
    import statsmodels.formula.api as sm
    
    df = pd.DataFrame({"A": [10, 20, 30, 324, 2353], "B": [20, 30, 10, 1, 2332], "C": [0, -30, 120, 11, 2]})
    
    result = sm.ols(formula="A ~ B + C", data=df).fit()
    print(result.summary())
    
    test1 = result.predict(df)  # works
    
    # Save the fitted result object with pickle protocol 2
    with open('resultobject', "wb") as f_myfile:
        cPickle.dump(result, f_myfile, 2)
    print("Result Object Saved")
    
    # Load the result object back from disk
    with open('resultobject', "rb") as f_myfile:
        model = cPickle.load(f_myfile)
    
    test2 = model.predict(df)  # produces error
    
  • Josef
    Josef over 10 years
    I opened an issue with statsmodels: github.com/statsmodels/statsmodels/issues/1263
  • Michael
    Michael over 10 years
    Solution 1 produces the same error with the sample code. Solution 2 gives "ValueError: matrices are not aligned", again with the sample code.
  • Josef
    Josef over 10 years
    I fixed both examples. In the first, I forgot to add transform=False to avoid calling patsy; in the second, I forgot to add the constant that patsy adds automatically.