Python 2.7 - statsmodels - formatting and writing summary output

21,810

Solution 1

There is no premade table of parameters and their result statistics currently available.

Essentially, you need to stack all the results yourself. Whether you use a list, a numpy array or a pandas DataFrame depends on what's most convenient for you.

For example, if I want one numpy array per model that holds llf together with results from the summary parameter table, then I could use

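import numpy

res_all = []
for res in results:
    # conf_int() returns one row per parameter with lower/upper columns;
    # asarray + transpose lets us unpack the two columns
    low, upp = numpy.asarray(res.conf_int()).T
    res_all.append(numpy.concatenate(([res.llf], res.params, res.tvalues,
                                      res.pvalues, low, upp)))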

But it might be better to align with pandas, depending on what structure you have across models.

You could write a helper function that takes all the results from the results instance and concatenates them in a row.

(I'm not sure what's the most convenient for writing to csv by rows)
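A helper along those lines could be sketched as follows. This is only a sketch under assumptions: flatten_result and write_rows are hypothetical names, the column labels follow the layout requested in the question, every result is assumed to expose llf, params, bse, tvalues, pvalues and model.exog_names, and all models are assumed to share the same parameter names (csv.DictWriter raises an error for a row with unexpected columns).

```python
import csv
import math
from collections import OrderedDict


def flatten_result(res):
    """Flatten one fitted result into a single ordered row (hypothetical helper)."""
    row = OrderedDict()
    row["Log-Likelihood"] = res.llf
    names = list(res.model.exog_names)  # parameter names, in model order
    stats = zip(names, res.params, res.bse, res.tvalues, res.pvalues)
    for name, coef, std_err, z, pval in stats:
        row[name + "_coef"] = coef
        row[name + "_std_err"] = std_err
        row[name + "_z"] = z
        row[name + "_p>|z|"] = pval
    # Odds ratios go at the end of the row, as in the requested layout.
    for name, coef in zip(names, res.params):
        row[name + "_odds_ratio"] = math.exp(coef)
    return row


def write_rows(rows, fileobj):
    """Write flattened rows to an already open file object, header first."""
    rows = list(rows)
    writer = csv.DictWriter(fileobj, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
```

With results being a list of fitted models and f a file opened for writing, write_rows((flatten_result(r) for r in results), f) would produce one header line and one long row per model.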

edit:

Here is an example storing the regression results in a dataframe

https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/multilinear.py#L21

The loop is on line 159.

summary() and similar code outside of statsmodels (for example http://johnbeieler.org/py_apsrtable/ for combining several results) are oriented towards printing, not towards storing variables.

Solution 2

write_path = '/my/path/here/output.csv'
with open(write_path, 'w') as f:
    f.write(result.summary().as_csv())

Solution 3

  • results.params : for coefficients
  • results.pvalues : for p-values

BTW, you can use dir(results) to find all the attributes of an object.
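Since dir() also lists many private and special names, a small filter makes the output easier to scan. A minimal sketch (public_attributes is a hypothetical helper name, not part of statsmodels):

```python
def public_attributes(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]
```

Calling public_attributes(results) on a fitted model should then list names such as params and pvalues alongside the other public attributes and methods.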

Solution 4

I found this formulation to be a little more straightforward. You can add/subtract columns by following the syntax from the examples (pvals,coeff,conf_lower,conf_higher).

import pandas as pd     #This can be left out if already present...

def results_summary_to_dataframe(results):
    '''Take a statsmodels results instance and return its parameter table as a DataFrame.'''
    pvals = results.pvalues
    coeff = results.params
    conf_lower = results.conf_int()[0]
    conf_higher = results.conf_int()[1]

    results_df = pd.DataFrame({"pvals":pvals,
                               "coeff":coeff,
                               "conf_lower":conf_lower,
                               "conf_higher":conf_higher
                                })

    #Reordering...
    results_df = results_df[["coeff","pvals","conf_lower","conf_higher"]]
    return results_df
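When there are many models, per-model tables like the one returned above can be stacked into a single DataFrame and written out in one go. A rough sketch, assuming the tables are kept in a dict keyed by a model label (stack_model_tables and the level names "model" and "parameter" are made up for illustration):

```python
import pandas as pd


def stack_model_tables(tables):
    """Stack per-model tables (dict of label -> DataFrame) into one DataFrame.

    The result has a two-level row index: (model, parameter).
    """
    labels = sorted(tables)
    return pd.concat([tables[label] for label in labels],
                     keys=labels, names=["model", "parameter"])
```

The stacked frame can then be written with its to_csv method, which puts the model label and the parameter name in the first two columns.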

Solution 5

There is actually a built-in method for this, as_csv(), described in the statsmodels documentation:

with open('csvfile.csv', 'w') as f:
    f.write(result.summary().as_csv())

I believe this is a much easier (and cleaner) way to output the summaries to csv files.

Author by

DMML

Updated on December 31, 2020

Comments

  • DMML
    DMML over 3 years

    I'm doing logistic regression using pandas 0.11.0 (data handling) and statsmodels 0.4.3 (the actual regression), on Mac OS X Lion.

    I'm going to be running ~2,900 different logistic regression models and need the results output to csv file and formatted in a particular way.

    Currently, I'm only aware of doing print result.summary() which prints the results (as follows) to the shell:

     Logit Regression Results                           
     ==============================================================================
     Dep. Variable:            death_death   No. Observations:                 9752
     Model:                          Logit   Df Residuals:                     9747
     Method:                           MLE   Df Model:                            4
     Date:                Wed, 22 May 2013   Pseudo R-squ.:                -0.02672
     Time:                        22:15:05   Log-Likelihood:                -5806.9
     converged:                       True   LL-Null:                       -5655.8
                                             LLR p-value:                     1.000
     ===============================================================================
                       coef    std err          z      P>|z|      [95.0% Conf. Int.]
     -------------------------------------------------------------------------------
     age_age5064    -0.1999      0.055     -3.619      0.000        -0.308    -0.092
     age_age6574    -0.2553      0.053     -4.847      0.000        -0.359    -0.152
     sex_female     -0.2515      0.044     -5.765      0.000        -0.337    -0.166
     stage_early    -0.1838      0.041     -4.528      0.000        -0.263    -0.104
     access         -0.0102      0.001    -16.381      0.000        -0.011    -0.009
     ===============================================================================
    

    I will also need the odds ratio, which is computed by print np.exp(result.params), and is printed in the shell as such:

    age_age5064    0.818842
    age_age6574    0.774648
    sex_female     0.777667
    stage_early    0.832098
    access         0.989859
    dtype: float64
    

    What I need is for each of these to be written to a csv file in the form of a very long row like the following (I am not sure, at this point, whether I will need things like Log-Likelihood, but have included it for the sake of thoroughness):

    `Log-Likelihood, age_age5064_coef, age_age5064_std_err, age_age5064_z, age_age5064_p>|z|,...age_age6574_coef, age_age6574_std_err, ......access_coef, access_std_err, ....age_age5064_odds_ratio, age_age6574_odds_ratio, ...sex_female_odds_ratio,.....access_odds_ratio`
    

    I think you get the picture - a very long row, with all of these actual values, and a header with all the column designations in a similar format.

    I am familiar with the csv module in Python, and am becoming more familiar with pandas. I'm not sure whether this info could be formatted and stored in a pandas DataFrame and then written to a file with to_csv once all ~2,900 logistic regression models have completed; that would certainly be fine. Writing the results as each model completes (using the csv module) is also fine.

    UPDATE:

    So, I was looking more at the statsmodels site, specifically trying to figure out how the results of a model are stored within classes. It looks like there is a class called 'Results', which will need to be used. I think inheriting from this class to create another class, where some of the methods/operators get changed, might be the way to go in order to get the formatting I require. I have very little experience with this, and will need to spend quite a bit of time figuring it out (which is fine). If anybody can help or has more experience, that would be awesome!

    Here is the site where the classes are laid out: statsmodels results class