What is a bad, decent, good, and excellent F1-measure range?


Solution 1

Consider sklearn.dummy.DummyClassifier(strategy='uniform'), a classifier that makes uniformly random guesses (i.e. a deliberately bad classifier). We can treat DummyClassifier as a benchmark to beat; now let's look at its f1-score.

In a binary classification problem with a balanced dataset (6,198 samples in total: 3,099 labelled 0 and 3,099 labelled 1), the f1-score is 0.5 for both classes, and the weighted average is also 0.5:

[classification_report output for strategy='uniform']
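
For reference, a minimal sketch of how such a report could be produced; the synthetic balanced dataset built with make_classification is an assumption here, not the original data:

    # A minimal sketch, assuming a synthetic balanced dataset of 6,198 samples.
    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import classification_report

    X, y = make_classification(n_samples=6198, weights=[0.5, 0.5], random_state=0)

    # 'uniform' makes uniformly random guesses, independent of the input.
    uniform = DummyClassifier(strategy='uniform', random_state=0)
    uniform.fit(X, y)

    # On a balanced dataset this gives an f1-score of about 0.5 for both classes.
    print(classification_report(y, uniform.predict(X)))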

Second example, using DummyClassifier(strategy='constant'), i.e. predicting the same label every time (label 1 in this case): the average of the f1-scores is 0.33, while the f1 for label 0 is 0.00:

[classification_report output for strategy='constant']
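
A corresponding sketch for the constant strategy, under the same assumed synthetic dataset:

    # Same assumed dataset, but always predicting label 1 via strategy='constant'.
    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import classification_report

    X, y = make_classification(n_samples=6198, weights=[0.5, 0.5], random_state=0)

    # The 'constant' parameter is required when strategy='constant'.
    constant = DummyClassifier(strategy='constant', constant=1)
    constant.fit(X, y)

    # Label 0 is never predicted, so its f1 is 0.00; the macro average is ~0.33.
    print(classification_report(y, constant.predict(X)))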

I consider these to be bad f1-scores, given the balanced dataset.

P.S. The summaries above were generated using sklearn.metrics.classification_report.

Solution 2

You did not find any reference for an F1-measure range because there is no such range. The F1 measure is a combined metric of precision and recall.

Let's say you have two algorithms, and one has higher precision but lower recall than the other. From this observation alone, you cannot tell which algorithm is better, unless your goal is specifically to maximize precision.

So, given this ambiguity about how to choose the superior algorithm of the two (one with higher recall, the other with higher precision), we use the F1-measure to choose between them.

The F1-measure is a relative term; that's why there is no absolute range that defines how good your algorithm is.
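
To make the trade-off concrete, here is a small sketch with made-up precision and recall values for two hypothetical classifiers; F1 is simply their harmonic mean:

    # F1 is the harmonic mean of precision and recall; the numbers below
    # are invented purely to illustrate the precision/recall trade-off.
    def f1(precision, recall):
        return 2 * precision * recall / (precision + recall)

    print(f1(0.90, 0.40))  # ~0.55: high precision, low recall
    print(f1(0.40, 0.90))  # ~0.55: low precision, high recall
    print(f1(0.70, 0.70))  # 0.70: a more balanced classifier scores higher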


Comments

  • KubiK888, about 3 years ago

    I understand the F1-measure is a harmonic mean of precision and recall. But what values define how good or bad an F1-measure is? I can't seem to find any references (Google or academic) answering my question.