Keras: How is Accuracy Calculated for Multi-Label Classification?


Solution 1

For multi-label classification, I think it is correct to use sigmoid as the activation and binary_crossentropy as the loss.

If the output is sparse multi-label, meaning only a few positive labels and a majority of negative labels, the Keras accuracy metric will be inflated by the correctly predicted negative labels. If I remember correctly, Keras does not choose the label with the highest probability. Instead, for binary classification, the threshold is 50%: every label with a predicted probability above 0.5 becomes a 1. So the prediction would be [0, 0, 0, 0, 0, 1], and if the actual labels were [0, 0, 0, 0, 0, 0], the accuracy would be 5/6. You can test this hypothesis by creating a model that always predicts negative labels and looking at the accuracy.
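To see the effect, here is a minimal sketch (plain NumPy, not code from the question) that mimics an all-negative predictor on sparse multi-hot labels and computes per-label accuracy the same way binary_accuracy does:

import numpy as np

# Sparse multi-hot ground truth: one positive label out of six per sample
y_true = np.array([[0, 0, 0, 0, 0, 1],
                   [0, 1, 0, 0, 0, 0]], dtype=float)

# A "model" that always predicts all-negative
y_pred = np.zeros_like(y_true)

# Fraction of individual labels that match, which is what binary_accuracy measures
acc = np.mean(np.round(y_pred) == y_true)
print(acc)  # ~0.83, even though not a single positive label was predicted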

If that's indeed the case, you may try a different metric such as top_k_categorical_accuracy.
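As a sketch of how that could be wired up (the k=3 wrapper below is only an illustration, not code from the question), a top-k metric is passed to compile like any other metric:

from keras.metrics import top_k_categorical_accuracy

def top_3_accuracy(y_true, y_pred):
    # counts a sample as correct if the arg-max true class is among
    # the 3 highest-scoring predictions
    return top_k_categorical_accuracy(y_true, y_pred, k=3)

model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=[top_3_accuracy])

Note that this metric takes the arg-max of y_true, so, as a commenter points out below, it is only a rough fit when the ground truth is multi-hot encoded.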

Another remote possibility I can think of is your training data. Are the labels y somehow "leaked" into x? Just a wild guess.

Solution 2

You can refer to the Keras Metrics documentation to see all available metrics (e.g. binary_accuracy). You can also create your own custom metric (and make sure it does exactly what you expect). I wanted to make sure neurite was right about how the accuracy is computed, so this is what I did (note: activation="sigmoid"):

from keras.metrics import binary_accuracy
def custom_acc(y_true, y_pred):
    # simply forwards to Keras' built-in binary_accuracy
    return binary_accuracy(y_true, y_pred)

# ...

model.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=[
    "accuracy",
    "binary_accuracy",
    "categorical_accuracy",
    "sparse_categorical_accuracy",
    custom_acc
])

Running the training, you will see that the reported accuracy is always equal to the binary_accuracy (and therefore to the custom_acc).

Now you can refer to the Keras code on GitHub to see how it is computed:

K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)

Which confirms what neurite said (i.e. if the prediction is [0, 0, 0, 0, 0, 1] and the actual labels are [0, 0, 0, 0, 0, 0], the accuracy is 5/6).
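You can reproduce that number directly from the same expression (a quick check using the Keras backend, assuming it is imported as K):

from keras import backend as K

y_true = K.constant([[0., 0., 0., 0., 0., 0.]])
y_pred = K.constant([[0., 0., 0., 0., 0., 1.]])

# the exact expression Keras uses for binary_accuracy
acc = K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)
print(K.eval(acc))  # [0.8333334], i.e. 5/6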

Author by anon_swe

Updated on June 12, 2022

Comments

  • anon_swe
    anon_swe almost 2 years

    I'm doing the Toxic Comment Text Classification Kaggle challenge. There are 6 classes: ['threat', 'severe_toxic', 'obscene', 'insult', 'identity_hate', 'toxic']. A comment can be multiple of these classes so it's a multi-label classification problem.

    I built a basic neural network with Keras as follows:

    model = Sequential()
    model.add(Embedding(10000, 128, input_length=250))
    model.add(Flatten())
    model.add(Dense(100, activation='relu'))
    model.add(Dense(len(classes), activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    

    I run this line:

    model.fit(X_train, train_y, validation_split=0.5, epochs=3)
    

    and get 99.11% accuracy after 3 epochs.

    However, 99.11% accuracy is a good bit higher than the best Kaggle submission. This makes me think I'm either a) overfitting, b) misusing Keras's accuracy, or possibly both.

    1) Seems a bit hard to overfit when I'm using 50% of my data as a validation split and only 3 epochs.

    2) Is accuracy here just the percentage of the time the model gets each class correct?

    So if I output [0, 0, 0, 0, 0, 1] and the correct output was [0, 0, 0, 0, 0, 0], my accuracy would be 5/6?

    After a bit of thought, I sort of think the accuracy metric here is just looking at the class my model predicts with the highest confidence and comparing it against the ground truth.

    So if my model outputs [0, 0, 0.9, 0, 0, 0], it will compare the class at index 2 ('obscene') with the true value. Do you think this is what's happening?

    Thanks for any help you can offer!

  • smichaud
    smichaud almost 5 years
    I added all potential metrics to determine which one was used and compare the results. It was only experimental.
  • CMCDragonkai
    CMCDragonkai over 4 years
    There is a weighted_binary_crossentropy loss function defined here for multi-label problems where there are a lot of negative labels: stats.stackexchange.com/a/313922/198729. However, I'm looking for a weighted_binary_accuracy as well (one possible sketch is shown after these comments).
  • CMCDragonkai
    CMCDragonkai over 4 years
    Also, top_k_categorical_accuracy doesn't seem to work in this case: what would the top k be for the truths if they are multi-hot encoded?
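As a follow-up to the weighted_binary_accuracy idea in the comments, here is one possible shape for such a metric. It is only a sketch: the pos_weight argument and the weighting scheme are assumptions, not a Keras built-in.

import keras.backend as K

def weighted_binary_accuracy(pos_weight=10.0):
    # Weighs agreement on positive labels pos_weight times more
    # heavily than agreement on negative labels
    def metric(y_true, y_pred):
        correct = K.cast(K.equal(y_true, K.round(y_pred)), K.floatx())
        weights = y_true * (pos_weight - 1.0) + 1.0
        return K.sum(weights * correct, axis=-1) / K.sum(weights, axis=-1)
    return metric

It can then be passed to model.compile in the metrics list, e.g. metrics=[weighted_binary_accuracy(pos_weight=10.0)].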