Is this a good learning rate for the Adam method?


Solution 1

The learning rate looks a bit high. The curve decreases too quickly for my taste and flattens out very soon. I would try 0.0005 or 0.0001 as the base learning rate if I wanted to squeeze out additional performance. You can quit after a few epochs anyway if you see that it is not working.
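
For what it's worth, lowering the base rate is a one-line change in most frameworks. A minimal sketch, assuming a PyTorch model (the question itself uses a Caffe solver, where the equivalent change would be base_lr: 0.0005):

    import torch

    model = torch.nn.Linear(10, 2)  # placeholder model; any nn.Module works here

    # Base learning rate lowered from the common 1e-3 default to 5e-4.
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)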

The question you have to ask yourself, though, is how much performance you need and how close you are to reaching it. You are presumably training the network for a specific purpose, and often you can get more performance out of it by increasing its capacity instead of fine-tuning a learning rate that is already good, if not perfect.

Solution 2

You can start with a higher learning rate (say 0.1) to escape local minima, then decay it to a very small value to let things settle down. To do this, set the step size so that the learning rate is reduced every, say, 100 iterations. These numbers are specific to your problem and depend on several factors, such as the scale of your data.
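
As a sketch of that kind of schedule in PyTorch (the starting rate of 0.1, the step size of 100, and the decay factor of 0.1 are placeholder values, not something tuned for the asker's problem):

    import torch

    model = torch.nn.Linear(10, 2)                             # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)   # start with a high rate

    # Multiply the learning rate by 0.1 every 100 optimizer steps.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)

    for step in range(1000):
        optimizer.zero_grad()
        loss = model(torch.randn(8, 10)).pow(2).mean()         # dummy loss for illustration
        loss.backward()
        optimizer.step()
        scheduler.step()                                       # apply the step decay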

Also keep an eye on the validation loss in the graph to see whether you are overfitting the data.

Solution 3

I would like to be more specific about some of the statements in Juan's answer, but my reputation is not enough, so I am posting this as an answer instead.

You should not be afraid of local minima. In practice, as far as I understand, we can classify them as 'good local minima' and 'bad local minima'. The reason we want a higher learning rate, as Juan said, is to find a better 'good local minimum'. If you set your initial learning rate too high, that is bad, because your model will likely fall into a 'bad local minimum' region, and if that happens, the practice of decaying the learning rate cannot help you.

Then how can we ensure that the weights end up in a good region? The answer is that we can't, but we can increase the probability by choosing a good set of initial weights. Once again, a too-large initial learning rate will make your initialization meaningless.
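
As an illustration of what 'choosing a good set of initial weights' can mean in code, here is a hedged PyTorch sketch using Xavier initialization; the architecture and the particular initialization scheme are only assumptions for the example:

    import torch

    def init_weights(module):
        # Xavier-uniform weights and zero biases for every linear layer.
        if isinstance(module, torch.nn.Linear):
            torch.nn.init.xavier_uniform_(module.weight)
            torch.nn.init.zeros_(module.bias)

    model = torch.nn.Sequential(
        torch.nn.Linear(10, 32),
        torch.nn.ReLU(),
        torch.nn.Linear(32, 2),
    )
    model.apply(init_weights)   # recursively applies init_weights to every submodule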

Secondly, it is always good to understand your optimizer. Take some time to look at its implementation and you will find something interesting: for example, the 'learning rate' you configure is not the raw step size that actually gets applied to each weight.
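
To make that concrete, here is a minimal, framework-free sketch of a single Adam update for one scalar parameter. The step applied to the weight is lr * m_hat / (sqrt(v_hat) + eps), so the configured rate is only a scale factor, not the step size itself:

    import math

    def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam update for a single scalar parameter (illustration only)."""
        m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
        m_hat = m / (1 - beta1 ** t)                # bias corrections for step t (t >= 1)
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)   # effective step != lr
        return w, m, v

With this update, a parameter whose gradients are consistent takes steps close to the full lr in magnitude, while a parameter with noisy gradients takes smaller ones.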

In sum: 1/ Needless to say, a small learning rate is not good, but a too-large learning rate is definitely bad. 2/ Weight initialization is your first guess, and it DOES affect your result. 3/ Taking time to understand your code may be a good practice.

Solution 4

People have done a lot of experimentation when it comes to choosing Adam's hyperparameters, and by far 3e-4 to 5e-4 are the best learning rates if you're learning the task from scratch.

Note that if you're doing transfer learning and fine-tuning a model, keep the learning rate low, because the gradients will initially be large and backpropagation will affect the pre-trained weights more drastically. You do not want that to happen at the start of training.
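
One way to act on that in PyTorch is per-parameter-group learning rates, so the pre-trained layers move much more slowly than the freshly initialized head; the ResNet-18 backbone and the specific rates below are only illustrative assumptions:

    import torch
    import torchvision

    # Pre-trained backbone with a new, randomly initialized classification head.
    model = torchvision.models.resnet18(
        weights=torchvision.models.ResNet18_Weights.IMAGENET1K_V1)
    model.fc = torch.nn.Linear(model.fc.in_features, 10)

    # Small rate for pre-trained layers, larger rate for the new head.
    optimizer = torch.optim.Adam([
        {"params": [p for name, p in model.named_parameters()
                    if not name.startswith("fc.")], "lr": 1e-5},
        {"params": model.fc.parameters(), "lr": 1e-3},
    ])

An alternative is to freeze the backbone entirely for the first few epochs and only then unfreeze it with a small rate.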

Solution 5

Adam is an optimization method; the result depends on two things: the optimizer (including its parameters) and the data (including batch size, amount of data, and data dispersion). That said, I think the curve you presented is OK.

Concerning the learning rate, TensorFlow, PyTorch and others recommend a learning rate of 0.001. But in Natural Language Processing, the best results were achieved with a learning rate between 0.002 and 0.003.

I made a graph comparing Adam (learning rates 1e-3, 2e-3, 3e-3 and 5e-3) with Proximal Adagrad and Proximal Gradient Descent. All of them are recommended for NLP, if that is your case.

[image: comparison graph of the optimizers]

Comments

  • John (almost 2 years ago):

    I am training my model and got the result below. Is this a good learning rate? If not, is it too high or too low? This is my result:

    [image: training loss curve]

    lr_policy: "step"
    gamma: 0.1
    stepsize: 10000
    power: 0.75
    # lr for unnormalized softmax
    base_lr: 0.001
    # high momentum
    momentum: 0.99
    # no gradient accumulation
    iter_size: 1
    max_iter: 100000
    weight_decay: 0.0005
    snapshot: 4000
    snapshot_prefix: "snapshot/train"
    type: "Adam"
    

    This is the reference:

    With low learning rates the improvements will be linear. With high learning rates they will start to look more exponential. Higher learning rates will decay the loss faster, but they get stuck at worse values of loss.