Keras: How is Accuracy Calculated for Multi-Label Classification?


Solution 1

For multi-label classification, I think it is correct to use sigmoid as the activation and binary_crossentropy as the loss.

If the output is sparse multi-label, meaning only a few positive labels and a majority of negative labels, the Keras accuracy metric will be inflated by the correctly predicted negative labels. If I remember correctly, Keras does not choose the label with the highest probability. Instead, for binary classification, the threshold is 50%: every label with a predicted probability above 0.5 becomes a 1. So the prediction would be [0, 0, 0, 0, 0, 1], and if the actual labels were [0, 0, 0, 0, 0, 0], the accuracy would be 5/6. You can test this hypothesis by creating a model that always predicts negative labels and looking at the accuracy.
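To see the effect, here is a minimal sketch (plain NumPy, not code from the question) that mimics an all-negative predictor on sparse multi-hot labels and computes per-label accuracy the same way binary_accuracy does:

import numpy as np

# Sparse multi-hot ground truth: one positive label out of six per sample
y_true = np.array([[0, 0, 0, 0, 0, 1],
                   [0, 1, 0, 0, 0, 0]], dtype=float)

# A "model" that always predicts all-negative
y_pred = np.zeros_like(y_true)

# Fraction of individual labels that match, which is what binary_accuracy measures
acc = np.mean(np.round(y_pred) == y_true)
print(acc)  # ~0.83, even though not a single positive label was predicted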

If that's indeed the case, you may try a different metric such as top_k_categorical_accuracy.
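As a sketch of how that could be wired up (the k=3 wrapper below is only an illustration, not code from the question), a top-k metric is passed to compile like any other metric:

from keras.metrics import top_k_categorical_accuracy

def top_3_accuracy(y_true, y_pred):
    # counts a sample as correct if the arg-max true class is among
    # the 3 highest-scoring predictions
    return top_k_categorical_accuracy(y_true, y_pred, k=3)

model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=[top_3_accuracy])

Note that this metric takes the arg-max of y_true, so, as a commenter points out below, it is only a rough fit when the ground truth is multi-hot encoded.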

Another remote possibility I can think of is your training data. Are the labels y somehow "leaked" into x? Just a wild guess.

Solution 2

You can refer to the Keras Metrics documentation to see all available metrics (e.g. binary_accuracy). You can also create your own custom metric (and make sure it does exactly what you expect). I wanted to make sure neurite was right about how the accuracy is computed, so this is what I did (note: activation="sigmoid"):

from keras.metrics import binary_accuracy
def custom_acc(y_true, y_pred):
    # simply forwards to Keras' built-in binary_accuracy
    return binary_accuracy(y_true, y_pred)

# ...

model.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=[
    "accuracy",
    "binary_accuracy",
    "categorical_accuracy",
    "sparse_categorical_accuracy",
    custom_acc
])

Running the training, you will see that the reported accuracy is always equal to the binary_accuracy (and therefore to the custom_acc).

Now you can refer to the Keras code on GitHub to see how it is computed:

K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)

Which confirms what neurite said (i.e. if the prediction is [0, 0, 0, 0, 0, 1] and the actual labels are [0, 0, 0, 0, 0, 0], the accuracy is 5/6).
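You can reproduce that number directly from the same expression (a quick check using the Keras backend, assuming it is imported as K):

from keras import backend as K

y_true = K.constant([[0., 0., 0., 0., 0., 0.]])
y_pred = K.constant([[0., 0., 0., 0., 0., 1.]])

# the exact expression Keras uses for binary_accuracy
acc = K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)
print(K.eval(acc))  # [0.8333334], i.e. 5/6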

Author by anon_swe

Updated on June 12, 2022

Comments

  • anon_swe
    anon_swe almost 2 years

    I'm doing the Toxic Comment Text Classification Kaggle challenge. There are 6 classes: ['threat', 'severe_toxic', 'obscene', 'insult', 'identity_hate', 'toxic']. A comment can be multiple of these classes so it's a multi-label classification problem.

    I built a basic neural network with Keras as follows:

    model = Sequential()
    model.add(Embedding(10000, 128, input_length=250))
    model.add(Flatten())
    model.add(Dense(100, activation='relu'))
    model.add(Dense(len(classes), activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    

    I run this line:

    model.fit(X_train, train_y, validation_split=0.5, epochs=3)
    

    and get 99.11% accuracy after 3 epochs.

    However, 99.11% accuracy is a good bit higher than the best Kaggle submission. This makes me think I'm either a) overfitting, b) misusing Keras's accuracy, or possibly both.

    1) Seems a bit hard to overfit when I'm using 50% of my data as a validation split and only 3 epochs.

    2) Is accuracy here just the percentage of the time the model gets each class correct?

    So if I output [0, 0, 0, 0, 0, 1] and the correct output was [0, 0, 0, 0, 0, 0], my accuracy would be 5/6?

    After a bit of thought, I sort of think the accuracy metric here is just looking at the class my model predicts with the highest confidence and comparing it against the ground truth.

    So if my model outputs [0, 0, 0.9, 0, 0, 0], it will compare the class at index 2 ('obscene') with the true value. Do you think this is what's happening?

    Thanks for any help you can offer!

  • smichaud
    smichaud almost 5 years
    I added all potential metrics to determine which one was used and compare the results. It was only experimental.
  • CMCDragonkai
    CMCDragonkai over 4 years
    There is a weighted_binary_crossentropy loss function defined here for multi-label problems where there are a lot of negative labels: stats.stackexchange.com/a/313922/198729. However, I'm looking for a weighted_binary_accuracy as well (one possible sketch is shown after these comments).
  • CMCDragonkai
    CMCDragonkai over 4 years
    Also, top_k_categorical_accuracy doesn't seem to work in this case: what would the top k be for the truths if they are multi-hot encoded?
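As a follow-up to the weighted_binary_accuracy idea in the comments, here is one possible shape for such a metric. It is only a sketch: the pos_weight argument and the weighting scheme are assumptions, not a Keras built-in.

import keras.backend as K

def weighted_binary_accuracy(pos_weight=10.0):
    # Weighs agreement on positive labels pos_weight times more
    # heavily than agreement on negative labels
    def metric(y_true, y_pred):
        correct = K.cast(K.equal(y_true, K.round(y_pred)), K.floatx())
        weights = y_true * (pos_weight - 1.0) + 1.0
        return K.sum(weights * correct, axis=-1) / K.sum(weights, axis=-1)
    return metric

It can then be passed to model.compile in the metrics list, e.g. metrics=[weighted_binary_accuracy(pos_weight=10.0)].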