Adam optimizer goes haywire after 200k batches, training loss grows


Solution 1

Yes. This is a known problem of Adam.

The equations for Adam are

t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)

m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)

where m is an exponential moving average of the mean gradient and v is an exponential moving average of the squared gradients. The problem is that when you have been training for a long time and are close to the optimum, v can become very small. If the gradients then suddenly start increasing again, they get divided by a very small number and the updates explode.

By default beta1=0.9 and beta2=0.999, so m adapts much more quickly than v: when the gradients grow again, m can become large while v is still small and cannot catch up.
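
As a rough illustration of that timescale mismatch, here is a toy, single-parameter simulation of the update rule above (the gradient magnitudes and the max_step helper are invented for illustration, not taken from the question's model): a long quiet stretch near the optimum followed by a short burst of larger gradients makes the effective step m / (sqrt(v) + epsilon) jump well above its usual value, while a larger epsilon keeps it damped.

    import math

    # Toy single-parameter Adam statistics; the beta values are the defaults,
    # the gradient magnitudes are made up purely for illustration.
    def max_step(eps, beta1=0.9, beta2=0.999,
                 n_quiet=100_000, g_quiet=1e-6,
                 n_spike=15, g_spike=1e-2):
        m = v = 0.0
        for _ in range(n_quiet):        # long stretch of tiny gradients
            m = beta1 * m + (1 - beta1) * g_quiet
            v = beta2 * v + (1 - beta2) * g_quiet * g_quiet
        worst = 0.0
        for _ in range(n_spike):        # gradients suddenly grow again
            m = beta1 * m + (1 - beta1) * g_spike
            v = beta2 * v + (1 - beta2) * g_spike * g_spike
            worst = max(worst, m / (math.sqrt(v) + eps))
        return worst

    # With the default epsilon the step is about 1 learning rate during the
    # quiet stretch, then m outruns sqrt(v) when the burst arrives:
    print(max_step(eps=1e-8))   # ~6.5
    print(max_step(eps=1e-1))   # ~0.08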

To remedy this problem you can increase epsilon, which is 1e-8 by default, so the update can no longer divide by a number that is almost 0. Depending on your network, an epsilon of 0.1, 0.01, or 0.001 might work well.
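
For instance, with the TF 1.x API that the question already uses (the concrete epsilon below is just a starting point to tune from; PyTorch's torch.optim.Adam exposes the same knob as eps):

    import tensorflow as tf  # TF 1.x, as in the question

    # epsilon raised well above the 1e-8 default; tune within roughly 1e-3..1e-1
    optimizer = tf.train.AdamOptimizer(learning_rate=1e-4, epsilon=1e-3)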

Solution 2

Yes, this could be some sort of complicated numerical-instability case, but most likely your learning rate is simply too high: your loss decreases quickly until about 25k iterations and then oscillates around the same level. Try decreasing it by a factor of 10 and see what happens; you should be able to reach an even lower loss value.
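
For example, starting from the 1e-4 used in the question, that would look something like this (the exact value is only a starting point to experiment with):

    import tensorflow as tf  # TF 1.x, as in the question

    # one order of magnitude below the 1e-4 the question currently uses
    optimizer = tf.train.AdamOptimizer(learning_rate=1e-5)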

Keep exploring! :)

Author: sunside

Updated on June 07, 2022

Comments

  • sunside almost 2 years

    I've been seeing a very strange behavior when training a network: after a couple hundred thousand iterations (8 to 10 hours) of learning fine, everything breaks down and the training loss grows:

    [Plot: the training loss explodes]

    The training data itself is randomized and spread across many .tfrecord files containing 1000 examples each, then shuffled again in the input stage and batched to 200 examples.

    The background

    I am designing a network that performs four different regression tasks at the same time, e.g. determining the likelihood of an object appearing in the image and simultaneously determining its orientation. The network starts with a couple of convolutional layers, some with residual connections, and then branches into the four fully-connected segments.

    Since the first regression results in a probability, I'm using cross entropy for the loss, whereas the others use the classical L2 distance. However, due to their nature, the probability loss is on the order of 0..1, while the orientation losses can be much larger, say 0..10. I have already normalized both input and output values and use clipping

    normalized = tf.clip_by_average_norm(inferred.sin_cos, clip_norm=2.)
    

    in cases where things can get really bad.

    I've been (successfully) using the Adam optimizer to optimize on the tensor containing all distinct losses (rather than reduce_suming them), like so:

    reg_loss = tf.reduce_sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
    loss = tf.pack([loss_probability, sin_cos_mse, magnitude_mse, pos_mse, reg_loss])
    
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate,
                                       epsilon=self.params.adam_epsilon)
    op_minimize = optimizer.minimize(loss, global_step=global_step)
    

    In order to display the results in TensorBoard, I then actually do

    loss_sum = tf.reduce_sum(loss)
    

    for a scalar summary.

    Adam is set to learning rate 1e-4 and epsilon 1e-4 (I see the same behavior with the default value for epsilon, and it breaks even faster when I keep the learning rate at 1e-3). Regularization has no influence on this either; it happens fairly consistently at some point.

    I should also add that stopping the training and restarting from the last checkpoint - implying that the training input files are shuffled again as well - results in the same behavior. The training always seems to behave similarly at that point.

  • sunside about 7 years
    I had removed a separate loss function I was using and didn't see this problem anymore ... now I learn that I simply made my model worse instead. D'oh!
  • Kerem T almost 5 years
    This is great advice. I personally use PyTorch, and the default Adam eps of 1e-8 is too low in my opinion. An eps of 1e-4 allowed me to train with a high learning rate without gradient explosion, and with no need for gradient clipping either!
  • AleB about 4 years
    Could another remedy be setting beta_1 = beta_2, or is that different from changing epsilon?
  • Ben almost 4 years
    "This is a known problem of Adam." Would you be able to link a resource where others discuss this behavior?
  • pkubik over 2 years
    I've seen this advice in several places, but I didn't expect to find such a sound explanation under a random question on SO. I think this might be an issue with the naming: epsilon just looks like some boring stability constant that you should set low and never bother with.