Need good way to choose and adjust a "learning rate"


Solution 1

Sometimes the process of decreasing the learning rate over time is called "annealing" the learning rate.

There are many possible "annealing schedules", like having the learning rate decay in inverse proportion to time:

u(t) = c / t

...where c is some constant. Or there is the "search-then-converge" schedule:

u(t) = A * (1 + (c/A)*(t/T)) / (1 + (c/A)*(t/T) + T*(t^2)/(T^2))

...which keeps the learning rate around A when t is small compared to T (the "search" phase) and then decreases the learning rate when t is large compared to T (the "converge" phase). Of course, for both of these approaches you have to tune parameters (e.g. c, A, or T) but hopefully introducing them will help more than it will hurt. :)
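
In case it's useful, here is a minimal Python sketch of both schedules (the function names and the parameter values A=0.1, c=1.0, T=1000 are placeholders of my own, not values from the papers):

    def inverse_time_rate(t, c=1.0):
        # u(t) = c / t, for iteration t >= 1: decays inversely with time.
        return c / t

    def search_then_converge_rate(t, A=0.1, c=1.0, T=1000.0):
        # Darken & Moody's search-then-converge schedule: stays near A
        # while t << T ("search"), then decays roughly like c/t ("converge").
        ratio = (c / A) * (t / T)
        return A * (1.0 + ratio) / (1.0 + ratio + T * t**2 / T**2)

    # e.g. sample the schedule at a few points in training:
    for t in (1, 100, 1000, 100000):
        print(t, search_then_converge_rate(t))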

Some references:

  • Learning Rate Schedules for Faster Stochastic Gradient Search, Christian Darken, Joseph Chang and John Moody, in Neural Networks for Signal Processing 2: Proceedings of the 1992 IEEE Workshop, IEEE Press, Piscataway, NJ, 1992.
  • A Stochastic Approximation Method, Herbert Robbins and Sutton Monro, Annals of Mathematical Statistics 22, no. 3 (September 1951), pp. 400–407.
  • Neural Networks and Learning Machines (section 3.13 in particular), Simon S. Haykin, 3rd edition (2008), ISBN 0131471392 / 978-0131471399.

Solution 2

You answered your own question when you said you need to have your learning rate change as the network learns. There are a lot of different ways you can do it.

The simplest way is to reduce the learning rate linearly with the number of iterations: every 25 iterations (or some other arbitrary interval), subtract a fixed amount from the rate until it reaches a sensible floor.

You can also decay it nonlinearly with the number of iterations. For example, multiply the learning rate by 0.99 every iteration, again until it reaches a sensible floor.
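
For concreteness, here is how both of those decay rules might look in Python (the interval, step size, decay factor, and floor values are arbitrary placeholders to tune):

    def step_decay(rate, iteration, interval=25, step=0.001, floor=1e-6):
        # Subtract a fixed amount every `interval` iterations, down to a floor.
        if iteration % interval == 0:
            rate = max(rate - step, floor)
        return rate

    def exponential_decay(rate, factor=0.99, floor=1e-6):
        # Multiply by a constant factor every iteration, down to a floor.
        return max(rate * factor, floor)

    rate = 0.01
    for iteration in range(1, 1001):
        # ... run one training step with `rate` here ...
        rate = exponential_decay(rate)   # or: step_decay(rate, iteration)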

Or you can get more crafty: use the results of the network to determine its next learning rate. The better it is doing by its fitness metric, the smaller you make the learning rate, so the network takes large steps while it is far from the target and small steps once it is close. This is probably the best way, but it costs more than the simple number-of-iterations approaches.
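
One hypothetical way to realize that idea: scale the rate by the current error, so the rate shrinks as the fitness improves (the scaling rule and constants below are my own illustration, not a standard recipe):

    def error_scaled_rate(error, base_rate=0.1, error_scale=1.0, floor=1e-6):
        # The larger the current error, the larger the rate, capped at base_rate.
        rate = base_rate * (error / error_scale)
        return min(max(rate, floor), base_rate)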

Solution 3

Have you considered other training methods that are independent of any learning rate?

There are training methods that bypass the need for a learning rate by computing (an approximation of) the Hessian matrix, like Levenberg-Marquardt, and I have also come across direct-search algorithms (like those developed by Norio Baba).
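
As a sketch of the first kind of method: SciPy exposes Levenberg-Marquardt through scipy.optimize.least_squares, so you can fit the weights of a toy model without choosing any learning rate (the one-neuron model and data below are invented for illustration):

    import numpy as np
    from scipy.optimize import least_squares

    # Toy "network": y = tanh(w0*x + w1), fit to data generated with known weights.
    x = np.linspace(-2.0, 2.0, 50)
    y_target = np.tanh(1.5 * x - 0.5)

    def residuals(w):
        return np.tanh(w[0] * x + w[1]) - y_target

    # method="lm" is Levenberg-Marquardt: it uses a Gauss-Newton approximation
    # of the Hessian internally, so there is no learning rate to tune.
    result = least_squares(residuals, x0=[0.0, 0.0], method="lm")
    print(result.x)  # should come out near [1.5, -0.5]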

Solution 4

Perhaps code a negative-feedback loop into the learning algorithm, keyed to the rate. When the learning rate starts to swing too wide, the moderating part of the feedback loop pushes it back the other way; if it then overshoots in that direction, the opposing moderating force kicks in.

The state vector will eventually settle into an equilibrium that strikes a balance between "too much" and "too little". That's how many systems in biology work.
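
One well-known heuristic in this spirit is the "bold driver" rule; the sketch below is one possible realization of such a feedback loop (the growth/shrink factors are placeholders), not the exact scheme described above:

    def bold_driver(rate, prev_error, error, grow=1.05, shrink=0.5):
        # Negative feedback on the learning rate:
        # error fell -> nudge the rate up a little;
        # error rose -> the rate swung too wide, so cut it back hard.
        return rate * grow if error < prev_error else rate * shrink

    rate, prev_error = 0.01, float("inf")
    for epoch in range(100):
        error = 1.0 / (epoch + 1)   # stand-in for the real training error
        rate = bold_driver(rate, prev_error, error)
        prev_error = error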


Comments

  • sanity (almost 2 years ago)

    In the picture below you can see a learning algorithm trying to learn to produce a desired output (the red line). The learning algorithm is similar to a backward error propagation neural network.

    The "learning rate" is a value that controls the size of the adjustments made during the training process. If the learning rate is too high, then the algorithm learns quickly but its predictions jump around a lot during the training process (green line - learning rate of 0.001), if it is lower then the predictions jump around less, but the algorithm takes a lot longer to learn (blue line - learning rate of 0.0001).

    The black lines are moving averages.

How can I adapt the learning rate so that the algorithm converges quickly toward the desired output at first, but then slows down so that it can home in on the correct value?

    learning rate graph http://img.skitch.com/20090605-pqpkse1yr1e5r869y6eehmpsym.png

  • David J. Harris (over 10 years ago)
Search-then-converge actually has a more complex definition than what you wrote here. Your formula isn't ever nearly constant. See this paper (PDF). Edited to add: it looks like the error was originally in your source material, so it wasn't your fault, but it's still worth noting.
  • Nate Kohl (over 10 years ago)
    @DavidJ.Harris good catch. I've updated the search-then-converge schedule.