How to interpret almost perfect accuracy and AUC-ROC but zero f1-score, precision and recall


One must understand the crucial difference between AUC ROC and "point-wise" metrics like accuracy, precision, etc. The ROC curve is a function of the decision threshold. Given a model (classifier) that outputs the probability of belonging to each class, we usually predict the class with the highest probability (support). Sometimes, however, we can get better scores by changing this rule, for example requiring one support to be two times bigger than the other before classifying a sample as that class; this is often the case for imbalanced datasets. By doing so you are effectively modifying the learned prior over the classes to better fit your data. ROC asks "what would happen if I moved this threshold through all possible values", and AUC ROC is the integral (area) under that curve.

Consequently:

  • high AUC ROC vs low F1 or other "point" metric means that your classifier currently does a bad job, but you can find a threshold at which its score is actually pretty decent (see the sketch after this list)
  • low AUC ROC and low F1 or other "point" metric means that your classifier currently does a bad job, and even fitting the threshold will not change that
  • high AUC ROC and high F1 or other "point" metric means that your classifier currently does a decent job, and it would do equally well for many other threshold values
  • low AUC ROC vs high F1 or other "point" metric means that your classifier currently does a decent job, but for many other threshold values it is pretty bad
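
For example, a minimal sketch of such a threshold scan, assuming scikit-learn and that y_test holds the true labels and y_score the continuous scores (e.g. from decision_function or predict_proba, as in the question below): it evaluates F1 at every candidate threshold returned by roc_curve and keeps the best one.

    import numpy as np
    from sklearn.metrics import roc_curve, f1_score

    # y_test: true binary labels; y_score: continuous classifier scores
    # (both assumed to exist already, as in the question)
    fpr, tpr, thresholds = roc_curve(y_test, y_score)

    # F1 of the "point" classifier obtained at each candidate threshold
    f1_at_threshold = [f1_score(y_test, (y_score >= t).astype(int), zero_division=0)
                       for t in thresholds]

    best = int(np.argmax(f1_at_threshold))
    print('best threshold:', thresholds[best],
          'F1 at that threshold:', f1_at_threshold[best])

A classifier with high AUC but zero F1 at the default cut-off will typically show a much better F1 somewhere along this scan; one with low AUC will not.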

Comments

  • KubiK888
    KubiK888 almost 2 years

    I am training a logistic regression classifier to classify two classes using Python scikit-learn. The data is extremely imbalanced (about 14300:1). I'm getting almost 100% accuracy and ROC-AUC, but 0% in precision, recall, and F1 score. I understand that accuracy is usually not useful on very imbalanced data, but why is the ROC-AUC measure close to perfect as well?

    from sklearn.metrics import roc_curve, auc
    
    # Compute the ROC curve from the continuous decision scores, then its AUC
    y_score = classifierUsed2.decision_function(X_test)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_score)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    print('AUC =', roc_auc)
    
    1= class1
    0= class2
    Class count:
    0    199979
    1        21
    
    Accuracy: 0.99992
    Classification report:
                 precision    recall  f1-score   support
    
              0       1.00      1.00      1.00     99993
              1       0.00      0.00      0.00         7
    
    avg / total       1.00      1.00      1.00    100000
    
    Confusion matrix:
    [[99992     1]
     [    7     0]]
    AUC= 0.977116255281
    

    The above is using logistic regression; below is using a decision tree. The confusion matrix looks almost identical, but the AUC is very different.

    1= class1
    0= class2
    Class count:
    0    199979
    1        21
    Accuracy: 0.99987
    Classification report:
                 precision    recall  f1-score   support
    
              0       1.00      1.00      1.00     99989
              1       0.00      0.00      0.00        11
    
    avg / total       1.00      1.00      1.00    100000
    
    Confusion matrix:
    [[99987     2]
     [   11     0]]
    AUC= 0.4999899989
    
    • cel
      cel over 8 years
      you may want to give us the confusion matrix. Intuitively, I would guess that this is not possible, but I don't have the time to do the math right now.
    • KubiK888
      KubiK888 over 8 years
      Thanks for the suggestion, I have added the code and results.
    • cel
      cel over 8 years
      Interpretation: you do not have any predictive power. You have basically no examples of class 1 and predict all of them wrong. You are better off simply predicting 0 all the time.
    • KubiK888
      KubiK888 over 8 years
      Yes I understand, but does this affect both accuracy and AUC-ROC measures? Or is my AUC-ROC calculation wrong?
    • cel
      cel over 8 years
      Yes, your model has high accuracy and high AUC, if that is what you are asking. But that is because almost all data in your test set are 0 and you basically predict only zeros. No, this does not show that your model is useful; I tried to give you an intuition for that. Compare the AUC and accuracy of a model that always predicts 0: it is obviously not a useful model, but it will still score well. This is due to the structure of the test set. Get a balanced test set and things will be much clearer.
    • KubiK888
      KubiK888 over 8 years
      I have tried testing with another classifier; using the decision tree classifier, the confusion matrix looks almost the same, but the AUC this time is much lower (see edit).
    • Anatoly Alekseev
      Anatoly Alekseev over 6 years
      This is why I stopped using 'roc_auc' as a scoring function for my optimizers and resorted to 'f1' (yes, my datasets are often imbalanced). But currently I'm thinking of a custom scorer which mixes roc_auc and f1, in the sense that the standard roc_auc is heavily penalized when at least one of the classes has zero F1 (a rough sketch of such a scorer is appended after these comments).
  • KubiK888
    KubiK888 over 8 years
    What is considered a high or low F1 score? Is 50% decent or bad?
  • lejlot
    lejlot over 8 years
    It depends on the problem at hand, but it does not seem good. F1 is the harmonic mean of precision and recall, so it is more or less on the same scale as both (it always lies between the two values). I would say that scores below 0.6 are rarely acceptable.
  • KubiK888
    KubiK888 over 8 years
    I have since done some undersampling (to a 1:1 ratio), and the precision, recall, and F-score measures drastically improved (for example, F1 went from 0.44 to 0.93). I wonder which result I should rely on more? The original distribution more closely resembles the real-world distribution; the undersampling makes sense, but doesn't it become so distant from the original distribution that it is no longer representative?
  • lejlot
    lejlot over 8 years
    You can't measure a metric on undersampled data. You only train on the resampled data - you have to test on the real data (with the actual class priors). A short sketch of this split-then-undersample setup is appended after these comments.
  • KubiK888
    KubiK888 over 8 years
    I see, that makes sense. But let's say it does perform much better on the test set (which has the original distribution) - can I say this is a good classifier and rely on the results of this undersampling-trained classifier?
  • lejlot
    lejlot over 8 years
    As long as your test set is big enough to represent the actual data - yes, it does not matter how you built the classifier. If the test data was not used in any way to build it, and it was big enough, then it is evidence of the classifier's strength.
  • Mohammadreza
    Mohammadreza almost 6 years
    I came across a case in which Classifier1 reports F1 = 80 and AUC-ROC = 70. Classifier2 reports F1 = 77 and AUC-ROC = 71. Which one is the better model to go with? Thanks!
  • Jjang
    Jjang almost 4 years
    Excellent explanation, kudos
  • xm1
    xm1 over 3 years
    @lejlot, given that the AUC does not show the best cut-off point, is it better to use F1 and class weights for tuning?
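
As a rough, hypothetical sketch of the blended scorer Anatoly Alekseev describes above: the function name, the 0.5/0.5 weighting, the 0.5 default threshold and the zero-F1 penalty rule are all assumptions, and it assumes a scikit-learn version where make_scorer still accepts needs_proba (newer releases use response_method='predict_proba' instead).

    from sklearn.metrics import roc_auc_score, f1_score, make_scorer

    def auc_with_f1_penalty(y_true, y_proba, threshold=0.5):
        # Blend ROC-AUC with F1, but collapse the score when the positive
        # class gets zero F1 at the chosen threshold (hypothetical rule)
        auc = roc_auc_score(y_true, y_proba)
        f1 = f1_score(y_true, (y_proba >= threshold).astype(int), zero_division=0)
        if f1 == 0:
            return 0.0  # heavily penalize never hitting the minority class
        return 0.5 * (auc + f1)

    # needs_proba=True makes scikit-learn pass the positive-class probabilities
    # (for binary problems) to the function above, e.g. in
    # cross_val_score(clf, X, y, scoring=blended_scorer)
    blended_scorer = make_scorer(auc_with_f1_penalty, needs_proba=True)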
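
And following lejlot's advice in the thread (undersample only the training data, but evaluate on a test set that keeps the original class priors), a minimal sketch, assuming X and y are NumPy arrays and logistic regression as in the question:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report

    # Split FIRST, so the test set keeps the original (imbalanced) priors
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=0)

    # Undersample the majority class in the TRAINING portion only (1:1 ratio)
    rng = np.random.default_rng(0)
    pos = np.where(y_train == 1)[0]
    neg = np.where(y_train == 0)[0]
    keep = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])

    clf = LogisticRegression().fit(X_train[keep], y_train[keep])

    # Report precision/recall/F1 on the untouched, imbalanced test set
    print(classification_report(y_test, clf.predict(X_test)))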