Neural Network: Mysterious ReLu
Prediction distribution
After playing around with it for a while, I decided to visualize the actual prediction distribution for both models:
predicted_distribution = tf.nn.softmax(logits, name='distribution')
Below are the histograms of the distributions and how they evolved over time.
With ReLu (wrong model)
Without ReLu (correct model)
The histogram of the model without ReLu makes sense: most of the probabilities are close to 0.
But the histogram of the ReLu model is suspicious: the values seem to concentrate around 0.15 after a few iterations. Printing the actual predictions confirmed this idea:
[0.14286 0.14286 0.14286 0.14286 0.14286 0.14286 0.14286]
[0.14286 0.14286 0.14286 0.14286 0.14286 0.14286 0.14286]
I had 7 classes (for 7 different languages at that moment), and 0.14286 is 1/7. It turns out that the "perfect" model learned to output 0 logits, which in turn translated into a uniform prediction.
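To make the arithmetic concrete, here is a minimal NumPy sketch (not the original TensorFlow graph; the `softmax` helper is written here for illustration). It shows that all-zero logits over 7 classes give 1/7 for every class, and that the cross-entropy of such a prediction is ln(7), roughly 1.946, which is the value the ReLu model's loss settles around in the training logs.

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

num_classes = 7
zero_logits = np.zeros(num_classes, dtype=np.float32)

# Every class gets the same probability, 1 / num_classes.
probs = softmax(zero_logits)

# For a uniform prediction the cross-entropy is ln(num_classes),
# no matter which class is the true one.
uniform_loss = -np.log(probs[0])

print(np.round(probs, 5))
print(round(float(uniform_loss), 3))
```

A loss that plateaus at ln(number of classes) is a useful warning sign that the model may have collapsed to a uniform output.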
But how can this distribution be reported as 99% accurate?
tf.nn.in_top_k
Before diving into tf.nn.in_top_k, I checked an alternative way to compute accuracy:
true_correct = tf.equal(tf.argmax(logits, 1), tf.cast(y, tf.int64))
alternative_accuracy = tf.reduce_mean(tf.cast(true_correct, tf.float32))
... which performs an honest comparison of the highest predicted class with the ground truth. The result is this:
iteration=2 loss=3.992 train-acc=0.13086 train-alt-acc=0.13086
iteration=4 loss=3.590 train-acc=0.13086 train-alt-acc=0.12207
iteration=6 loss=2.871 train-acc=0.21777 train-alt-acc=0.13672
iteration=8 loss=2.466 train-acc=0.37695 train-alt-acc=0.16211
iteration=10 loss=2.099 train-acc=0.62305 train-alt-acc=0.10742
iteration=12 loss=2.066 train-acc=0.79980 train-alt-acc=0.17090
iteration=14 loss=2.016 train-acc=0.84277 train-alt-acc=0.17285
iteration=16 loss=1.954 train-acc=0.91309 train-alt-acc=0.13574
iteration=18 loss=1.956 train-acc=0.95508 train-alt-acc=0.06445
iteration=20 loss=1.923 train-acc=0.97754 train-alt-acc=0.11328
Indeed, tf.nn.in_top_k with k=1 quickly diverged from the true accuracy and began to report fantasy values of up to 99%.
So what does it actually do? Here's what the documentation says about it:
Says whether the targets are in the top K predictions.
This outputs a batch_size bool array, an entry out[i] is true if the prediction for the target class is among the top k predictions among all predictions for example i. Note that the behavior of InTopK differs from the TopK op in its handling of ties; if multiple classes have the same prediction value and straddle the top-k boundary, all of those classes are considered to be in the top k.
That's what it is. If the probabilities are uniform (which actually means "I have no idea"), every class is tied for the top spot, so every prediction counts as correct. The situation is even worse: if the logits are almost uniform, softmax in float32 may turn them into an exactly uniform distribution, as can be seen in this simple example:
x = tf.constant([0, 1e-8, 1e-8, 1e-9])
tf.nn.softmax(x).eval()
# >>> array([0.25, 0.25, 0.25, 0.25], dtype=float32)
... which means that every nearly uniform prediction may be considered "correct" according to the tf.nn.in_top_k spec.
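The tie handling can be reproduced outside TensorFlow. The following is a small NumPy sketch that emulates the behavior described in the quoted documentation; it is not TensorFlow's implementation, and the function names `in_top_k_with_ties` and `argmax_accuracy` are made up for this illustration. On a batch of exactly uniform predictions, the tie-friendly rule marks every example as correct, while the argmax comparison only credits examples whose label happens to be class 0.

```python
import numpy as np

def in_top_k_with_ties(predictions, targets, k):
    # Score that the model assigned to the target class of each example.
    target_scores = predictions[np.arange(len(targets)), targets]
    # Count classes scored strictly higher than the target.
    # Ties are not counted against the target, mirroring the documented rule.
    num_better = np.sum(predictions > target_scores[:, None], axis=1)
    return num_better < k

def argmax_accuracy(predictions, targets):
    # Strict comparison: argmax returns the first index among ties.
    return np.mean(np.argmax(predictions, axis=1) == targets)

# Four examples, seven classes, exactly uniform predictions.
uniform = np.full((4, 7), 1.0 / 7.0, dtype=np.float32)
targets = np.array([0, 3, 5, 6])

tie_friendly = np.mean(in_top_k_with_ties(uniform, targets, k=1))
strict = argmax_accuracy(uniform, targets)

print(tie_friendly, strict)
```

Comparing the two numbers on a real batch is a cheap sanity check: a large gap between them suggests that many predictions are tied.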
Conclusion
tf.nn.in_top_k is a dangerous choice of accuracy measure in TensorFlow, because it may silently swallow wrong predictions and report them as "correct". Instead, you should always use this long but trusted expression:
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits, 1), tf.cast(y, tf.int64)), tf.float32))
Maxim almost 2 years
I've been building a programming language detector, i.e., a classifier of code snippets, as part of a bigger project. My baseline model is pretty straightforward: tokenize the input, encode the snippets as bag-of-words (or, in this case, bag-of-tokens), and build a simple NN on top of these features.
The input to the NN is a fixed-length array of counters of the most distinctive tokens, such as "def", "self", "function", "->", "const", "#include", etc., that are automatically extracted from the corpus. The idea is that these tokens are pretty unique to programming languages, so even this naive approach should get a high accuracy score.
Input: def 1 for 2 in 2 True 1 ): 3 ,: 1 ...
Output: python
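For illustration, the bag-of-tokens encoding could look roughly like the sketch below. The vocabulary `VOCAB`, the regex tokenizer, and the `encode` helper are all assumptions made for this example; the original pipeline extracts its vocabulary automatically from the corpus and its tokenizer is not shown.

```python
import re
from collections import Counter

# Hypothetical fixed vocabulary; the real one is extracted from the corpus.
VOCAB = ["def", "self", "function", "->", "const", "#include"]

def encode(snippet, vocab=VOCAB):
    # Naive tokenizer (an assumption): "#include", "->", and identifiers.
    tokens = re.findall(r"#include|->|[A-Za-z_]\w*", snippet)
    counts = Counter(tokens)
    # Fixed-length vector: one counter per vocabulary token, in vocabulary order.
    return [counts[token] for token in vocab]

features = encode("def area(self):\n    return self.w * self.h")
print(features)
```

Tokens that are not in the vocabulary are simply ignored, so every snippet maps to a vector of the same length.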
Setup
I got 99% accuracy pretty quickly and took that as a sign that it works just as expected. Here's the model (a full runnable script is here):
# Placeholders
x = tf.placeholder(shape=[None, vocab_size], dtype=tf.float32, name='x')
y = tf.placeholder(shape=[None], dtype=tf.int32, name='y')
training = tf.placeholder_with_default(False, shape=[], name='training')

# One hidden layer with dropout
reg = tf.contrib.layers.l2_regularizer(0.01)
hidden1 = tf.layers.dense(x, units=96, kernel_regularizer=reg, activation=tf.nn.elu, name='hidden1')
dropout1 = tf.layers.dropout(hidden1, rate=0.2, training=training, name='dropout1')

# Output layer
logits = tf.layers.dense(dropout1, units=classes, kernel_regularizer=reg, activation=tf.nn.relu, name='logits')

# Cross-entropy loss
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))

# Misc reports: accuracy, correct/misclassified samples, etc.
correct_predicted = tf.nn.in_top_k(logits, y, 1, name='in-top-k')
prediction = tf.argmax(logits, axis=1)
wrong_predicted = tf.logical_not(correct_predicted, name='not-in-top-k')
x_misclassified = tf.boolean_mask(x, wrong_predicted, name='misclassified')
accuracy = tf.reduce_mean(tf.cast(correct_predicted, tf.float32), name='accuracy')
The output is pretty encouraging:
iteration=5 loss=2.580 train-acc=0.34277
iteration=10 loss=2.029 train-acc=0.69434
iteration=15 loss=2.054 train-acc=0.92383
iteration=20 loss=1.934 train-acc=0.98926
iteration=25 loss=1.942 train-acc=0.99609
Files.VAL mean accuracy = 0.99121 <-- After just 1 epoch!
iteration=30 loss=1.943 train-acc=0.99414
iteration=35 loss=1.947 train-acc=0.99512
iteration=40 loss=1.946 train-acc=0.99707
iteration=45 loss=1.946 train-acc=0.99609
iteration=50 loss=1.944 train-acc=0.99902
iteration=55 loss=1.946 train-acc=0.99902
Files.VAL mean accuracy = 0.99414
Test accuracy was also around 1.0. Everything looked perfect.
Mysterious ReLu
But then I noticed that I put activation=tf.nn.relu into the final dense layer (logits), which is clearly a bug: there is no need to discard negative scores before softmax, because they indicate the classes with low probability. A zero threshold will only make these classes artificially more probable, which would be a mistake. Getting rid of it should only make the model more robust and confident in the correct class.
That's what I thought. So I replaced it with activation=None, ran the model again, and then a surprising thing happened: the performance didn't improve. At all. In fact, it degraded significantly:
iteration=5 loss=5.236 train-acc=0.16602
iteration=10 loss=4.068 train-acc=0.18750
iteration=15 loss=3.110 train-acc=0.37402
iteration=20 loss=5.149 train-acc=0.14844
iteration=25 loss=2.880 train-acc=0.18262
Files.VAL mean accuracy = 0.28711
iteration=30 loss=3.136 train-acc=0.25781
iteration=35 loss=2.916 train-acc=0.22852
iteration=40 loss=2.156 train-acc=0.39062
iteration=45 loss=1.777 train-acc=0.45312
iteration=50 loss=2.726 train-acc=0.33105
Files.VAL mean accuracy = 0.29362
The accuracy got better with training, but never surpassed 91-92%. I changed the activation back and forth several times, varying different parameters (layer size, dropout, regularizer, extra layers, anything) and always had the same outcome: the "wrong" model hit 99% immediately, while the "right" model barely achieved 90% after 50 epochs. According to tensorboard, there was no big difference in weight distribution: the gradients didn't die out and both models learned normally.
How is this possible? How can the final ReLu make a model so much superior? Especially if this ReLu is a bug?
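As a side illustration of why a ReLu on the logits counts as a bug in the first place, the NumPy sketch below compares softmax over raw scores with softmax over the same scores clipped at zero. The logit values are made up for the example, and the `softmax` helper is written here rather than taken from the original script.

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

# Made-up scores: one likely class and two unlikely ones.
logits = np.array([2.0, -1.0, -3.0])

# A final ReLu replaces every negative score with zero.
clipped = np.maximum(logits, 0.0)

p_raw = softmax(logits)
p_clipped = softmax(clipped)

print(np.round(p_raw, 3))
print(np.round(p_clipped, 3))
```

With the clipping in place, the two unlikely classes gain probability mass and the likely class loses some, which is the "artificially more probable" effect described in the question.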