scikit-learn: output metrics.classification_report in CSV/tab-delimited format

Solution 1

As of scikit-learn v0.20, the easiest way to convert a classification report to a pandas DataFrame is to have the report returned as a dict:

report = classification_report(y_test, y_pred, output_dict=True)

and then construct a DataFrame and transpose it:

df = pandas.DataFrame(report).transpose()

From here on, you are free to use the standard pandas methods to generate your desired output formats (CSV, HTML, LaTeX, ...).

See the classification_report documentation.
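
Putting the two steps together, here is a minimal end-to-end sketch. The toy labels are made up for illustration, and the file names in the trailing comments are placeholders, not anything from the original answer:

```python
import pandas as pd
from sklearn.metrics import classification_report

# Toy labels, assumed only for illustration
y_test = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 3, 3, 1]

# Ask for a dict instead of the formatted text report
report = classification_report(y_test, y_pred, output_dict=True)
df = pd.DataFrame(report).transpose()

# With no path argument, to_csv returns the text instead of writing a file
csv_text = df.to_csv()
tsv_text = df.to_csv(sep="\t")
print(tsv_text)

# To write files instead (placeholder file names):
# df.to_csv("classification_report.csv")
# df.to_csv("classification_report.tsv", sep="\t")
```

Passing sep="\t" is what produces the tab-delimited variant asked about in the question.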

Solution 2

If you want the individual per-class scores, this should do the job just fine.

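import pandas as pd
from sklearn.metrics import classification_report

def classification_report_csv(report):
    report_data = []
    lines = report.split('\n')
    # lines[0] is the header and lines[1] is blank; per-class rows start at lines[2]
    for line in lines[2:]:
        # split on any run of whitespace, since the column spacing varies
        # (this assumes class labels contain no whitespace)
        row_data = line.split()
        if not row_data:
            break  # a blank line ends the per-class rows; summary rows follow
        row = {}
        row['class'] = row_data[0]
        row['precision'] = float(row_data[1])
        row['recall'] = float(row_data[2])
        row['f1_score'] = float(row_data[3])
        row['support'] = float(row_data[4])
        report_data.append(row)
    dataframe = pd.DataFrame(report_data)
    dataframe.to_csv('classification_report.csv', index=False)

report = classification_report(y_true, y_pred)
classification_report_csv(report)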

Solution 3

We can get the actual values from the precision_recall_fscore_support function and put them into a DataFrame. The code below gives the same result as the text report, but in a pandas DataFrame:

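import pandas as pd
from sklearn import metrics

# true, pred: label arrays; nb: the fitted classifier from the question
clf_rep = metrics.precision_recall_fscore_support(true, pred)
out_dict = {
    "precision": clf_rep[0].round(2),
    "recall": clf_rep[1].round(2),
    "f1-score": clf_rep[2].round(2),
    "support": clf_rep[3],
}
out_df = pd.DataFrame(out_dict, index=nb.classes_)
# classification_report weights its averages by support, so request the weighted average
avg = metrics.precision_recall_fscore_support(true, pred, average="weighted")
avg_tot = pd.DataFrame(
    {"precision": round(avg[0], 2), "recall": round(avg[1], 2),
     "f1-score": round(avg[2], 2), "support": clf_rep[3].sum()},
    index=["avg/total"],
)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
out_df = pd.concat([out_df, avg_tot])
print(out_df)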
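
As the comments on this answer point out, the summary row printed by classification_report is weighted by support rather than being a plain mean of the per-class scores. The following is a small sketch, with made-up imbalanced labels, that checks this against the average="weighted" option of precision_recall_fscore_support:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Made-up, deliberately imbalanced labels (four samples of class 1, two of class 2)
y_true = [1, 1, 1, 1, 2, 2]
y_pred = [1, 1, 1, 2, 2, 1]

# Per-class arrays: one entry per class
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)

# Single support-weighted figures, computed by scikit-learn
weighted = precision_recall_fscore_support(y_true, y_pred, average="weighted")

# The same weighted precision, computed by hand from the per-class values
manual_weighted_precision = np.average(precision, weights=support)
plain_mean_precision = precision.mean()

print(manual_weighted_precision, weighted[0], plain_mean_precision)
```

With balanced classes the two averages coincide, which is why the difference is easy to miss on small test data.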

Solution 4

Import pandas as pd and set the output_dict parameter, which defaults to False, to True when computing the classification_report. This returns the report as a dictionary, which you can pass to the pandas DataFrame constructor. You may want to transpose the resulting DataFrame to fit the output format that you want. The resulting DataFrame can then be written to a CSV file.

import pandas as pd
from sklearn.metrics import classification_report
clsf_report = pd.DataFrame(classification_report(y_true=your_y_true, y_pred=your_y_preds5, output_dict=True)).transpose()
clsf_report.to_csv('Your Classification Report Name.csv', index=True)
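
One detail noted in the comments: in scikit-learn versions whose dict output contains an "accuracy" key, that entry is a single number, so it gets repeated across every column of the accuracy row. A hedged sketch of the workaround suggested there, using made-up labels, is to expand that entry into a dict before building the DataFrame:

```python
import pandas as pd
from sklearn.metrics import classification_report

# Made-up labels, assumed only for illustration
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 3, 3, 1]

report = classification_report(y_true, y_pred, output_dict=True)

# "accuracy" is a bare float in the dict; replace it with a row that
# fills only f1-score and support, mirroring the printed text report
if "accuracy" in report:
    report["accuracy"] = {
        "precision": None,
        "recall": None,
        "f1-score": report["accuracy"],
        "support": report["macro avg"]["support"],
    }

clsf_report = pd.DataFrame(report).transpose()
print(clsf_report)
```

The guard on the "accuracy" key is there because not every version or label configuration produces that entry.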

Solution 5

While the previous answers probably all work, I found them a bit verbose. The following stores the individual class results as well as the summary line in a single DataFrame. It is not very robust to changes in the report format, but it did the trick for me.

# init snippet and fake data
from io import StringIO
import re
import pandas as pd
from sklearn import metrics
true_label = [1, 1, 2, 2, 3, 3]
pred_label = [1, 2, 2, 3, 3, 1]

def report_to_df(report):
    # collapse runs of spaces, join "avg / total" into one token, strip leading spaces
    report = re.sub(r" +", " ", report).replace("avg / total", "avg/total").replace("\n ", "\n")
    # prepend a name for the class-label column, then parse as space-separated text
    report_df = pd.read_csv(StringIO("Classes" + report), sep=' ', index_col=0)
    return report_df

# txt report to df
report = metrics.classification_report(true_label, pred_label)
report_df = report_to_df(report)

# store, print, copy...
print(report_df)

Which gives the desired output:

Classes    precision  recall  f1-score  support
1          0.5        0.5     0.5       2
2          0.5        0.5     0.5       2
3          0.5        0.5     0.5       2
avg/total  0.5        0.5     0.5       6
Author by

Seun AJAO

Updated on July 09, 2022

Comments

  • Seun AJAO
    Seun AJAO almost 2 years

    I'm doing multiclass text classification in scikit-learn. I'm training a Multinomial Naive Bayes classifier on a dataset with hundreds of labels. Here's an extract from the scikit-learn script that fits the MNB model:

    from __future__ import print_function
    
    # read file.csv into a pandas DataFrame
    
    import pandas as pd
    path = 'data/file.csv'
    # on_bad_lines='skip' replaces error_bad_lines=False in pandas >= 1.3
    merged = pd.read_csv(path, on_bad_lines='skip', low_memory=False)
    
    # define X and y using the original DataFrame
    X = merged.text
    y = merged.grid
    
    # split X and y into training and testing sets
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    
    # import and instantiate CountVectorizer
    from sklearn.feature_extraction.text import CountVectorizer
    vect = CountVectorizer()
    
    # create document-term matrices using CountVectorizer
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)
    
    # import and instantiate MultinomialNB
    from sklearn.naive_bayes import MultinomialNB
    nb = MultinomialNB()
    
    # fit a Multinomial Naive Bayes model
    nb.fit(X_train_dtm, y_train)
    
    # make class predictions
    y_pred_class = nb.predict(X_test_dtm)
    
    # generate classification report
    from sklearn import metrics
    print(metrics.classification_report(y_test, y_pred_class))
    

    And a simplified output of the metrics.classification_report on command line screen looks like this:

              precision    recall  f1-score   support
           12      0.84      0.48      0.61      2843
           13      0.00      0.00      0.00        69
           15      1.00      0.19      0.32       232
           16      0.75      0.02      0.05       965
           33      1.00      0.04      0.07       155
            4      0.59      0.34      0.43      5600
           41      0.63      0.49      0.55      6218
           42      0.00      0.00      0.00       102
           49      0.00      0.00      0.00        11
            5      0.90      0.06      0.12      2010
           50      0.00      0.00      0.00         5
           51      0.96      0.07      0.13      1267
           58      1.00      0.01      0.02       180
           59      0.37      0.80      0.51      8127
            7      0.91      0.05      0.10       579
            8      0.50      0.56      0.53      7555
    avg/total      0.59      0.48      0.45     35919
    

    I was wondering if there is any way to get the report output into a standard CSV file with regular column headers.

    When I redirect the command-line output into a CSV file, or copy and paste the screen output into a spreadsheet (OpenOffice Calc or Excel), it lumps the results into one column.

  • Seun AJAO
    Seun AJAO over 7 years
    Thanks. I tried to use a data frame: Result = metrics.classification_report(y_test, y_pred_class); df = pd.DataFrame(Result); df.to_csv(results.csv, sep='\t'), but got an error: pandas.core.common.PandasError: DataFrame constructor not properly called!
  • user3806649
    user3806649 over 6 years
    row['precision'] = float(row_data[1]) ValueError: could not convert string to float:
  • Flynamic
    Flynamic almost 6 years
    The averages calculated by classification_report are weighted with the support values.
  • Flynamic
    Flynamic almost 6 years
    So it should be avg = (class_report_df.loc[metrics_sum_index[:-1]] * class_report_df.loc[metrics_sum_index[-1]]).sum(axis=1) / total
  • Raul
    Raul almost 6 years
    Nice catch @Flynamic! I figured out that precision_recall_fscore_support has an average parameter, which does just what you suggest!
  • RomaneG
    RomaneG almost 6 years
    Replace the line row_data = line.split('      ') with row_data = line.split(' ') followed by row_data = list(filter(None, row_data))
  • Jack Fleeting
    Jack Fleeting about 5 years
    The line row['support'] = int(row_data[5]) raises IndexError: list index out of range
  • Jack Fleeting
    Jack Fleeting about 5 years
    This works, but trying to use the labels parameter of precision_recall_fscore_support raises, for some reason, ValueError: y contains previously unseen labels
  • Ting Jia
    Ting Jia about 5 years
    Really cool, and thanks! One comment on the split statement: row_data = line.split('      ') would be better written as row_data = line.split(), because the number of spaces between columns in the report string is not always the same.
  • Satheesh K
    Satheesh K over 4 years
    Better to replace row_data = line.split('      ') with row_data = ' '.join(line.split()) followed by row_data = row_data.split(' ') to account for irregular spaces.
  • Prashant Saraswat
    Prashant Saraswat over 3 years
    df.to_csv('file_name.csv') for the lazy :)
  • piedpiper
    piedpiper about 2 years
    Perfect answer. Minor note: since the output dict accuracy has only one value, it will be repeated in the accuracy row of your dataframe. If you want your export to mirror the sklearn output exactly, you can use the snippet below. report.update({"accuracy": {"precision": None, "recall": None, "f1-score": report["accuracy"], "support": report['macro avg']['support']}})