Why is a target network required?


So, in summary, a target network is required because the network keeps changing at each timestep and the “target values” are being updated at each timestep?

The difference between Q-learning and DQN is that you have replaced an exact value function with a function approximator. With Q-learning you are updating exactly one state/action value at each timestep, whereas with DQN you are updating many, which you understand. The problem this causes is that you can affect the action values for the very next state you will be in, instead of them being guaranteed to be stable as they are in Q-learning.
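To make that concrete, here is a rough sketch of the contrast, with a linear approximator standing in for the deep network (the states, features, and numbers are made up purely for illustration):

    import numpy as np

    np.random.seed(0)
    n_states, n_actions, alpha, gamma = 5, 2, 0.1, 0.99
    s, a, r, s_next = 0, 1, 1.0, 2                # one observed transition

    # Tabular Q-learning: exactly one (s, a) entry changes per step.
    Q = np.zeros((n_states, n_actions))
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])      # only Q[0, 1] moves; every other entry is untouched

    # Function approximation (linear here, deep in DQN): one update moves many states at once.
    Phi = np.random.rand(n_states, 4)             # dense features shared across states
    W = np.zeros((n_actions, 4))                  # Q(s, a) = W[a] @ Phi[s]
    td_target = r + gamma * (W @ Phi[s_next]).max()
    td_error = td_target - W[a] @ Phi[s]
    W[a] += alpha * td_error * Phi[s]             # also shifts Q(s_next, a) and Q(s'', a) for every state s'' with overlapping features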

This happens basically all the time with DQN when using a standard deep network (a bunch of fully connected layers of the same size). The effect you typically see with this is referred to as "catastrophic forgetting", and it can be quite spectacular. If you are doing something like moon lander with this sort of network (the simple one, not the pixel one) and track the rolling average score over the last 100 games or so, you will likely see a nice curve up in score, then all of a sudden it completely craps out and starts making awful decisions again, even as your alpha gets small. This cycle will continue endlessly regardless of how long you let it run.

Using a stable target network as your error measure is one way of combating this effect. Conceptually it's like saying, "I have an idea of how to play this well, I'm going to try it out for a bit until I find something better", as opposed to saying, "I'm going to retrain myself how to play this entire game after every move". By giving your network time to consider many actions that have taken place recently, instead of updating the target all the time, it hopefully finds a more robust model before you start using it to choose actions.
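Here is a minimal sketch of that bookkeeping (the layer sizes, learning rate, and sync interval below are illustrative assumptions, not the values from the paper):

    import copy
    import torch
    import torch.nn as nn

    obs_dim, n_actions, gamma, sync_every = 8, 4, 0.99, 1000

    online_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    target_net = copy.deepcopy(online_net)        # the frozen "idea of how to play this well"
    optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-3)

    def td_loss(s, a, r, s_next, done):
        """TD loss of a replayed batch, measured against the frozen target network.
        a is a LongTensor of action indices; done is a 0/1 float mask."""
        q_sa = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():                     # targets are held fixed, no gradient flows through them
            q_next = target_net(s_next).max(dim=1).values
            target = r + gamma * (1 - done) * q_next
        return nn.functional.mse_loss(q_sa, target)

    # Training skeleton: the online net takes a gradient step on every batch,
    # but the yardstick it is measured against only moves every `sync_every` steps.
    # for step, (s, a, r, s_next, done) in enumerate(replay_batches):
    #     optimizer.zero_grad(); td_loss(s, a, r, s_next, done).backward(); optimizer.step()
    #     if step % sync_every == 0:
    #         target_net.load_state_dict(online_net.state_dict())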


On a side note, DQN is essentially obsolete at this point, but the themes from that paper were the fuse leading up to the RL explosion of the last few years.


Comments

  • tandem
    tandem almost 2 years

    I’m having trouble understanding why a target network is necessary in DQN. I’m reading the paper “Human-level control through deep reinforcement learning”.

    I understand Q-learning. Q-learning is a value-based reinforcement learning algorithm that learns an “optimal” state-action value function, which maximizes its long-term discounted reward over a sequence of timesteps.

    Q-learning is updated using the Bellman equation, and a single step of the Q-learning update is given by

    $$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R_{t+1} + \gamma \, Q(s', a') - Q(s, a) \right]$$
    

    where $\alpha$ and $\gamma$ are the learning rate and discount factor. I can understand that the reinforcement learning algorithm can become unstable and diverge.

    • The experience replay buffer is used so that we do not forget past experiences and to de-correlate the samples used for learning (see the sketch after this list).

    • This is where I fail.

    • Let me break the paragraph from the paper down here for discussion
      • The fact that small updates to $Q$ may significantly change the policy and therefore change the data distribution: I understood this part. Periodically changing the Q-network may lead to instability and to changes in the data distribution, for example if the policy suddenly starts always taking a left turn, or something like this.
      • and the correlations between the action-values ($Q$) and the target values $r + \gamma \max_{a'} Q(s', a')$: this target is the reward plus $\gamma$ times my prediction of the return, given that I take what I think is the best action in the next state and follow my policy from then on.
      • We used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target.
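    Here is a minimal sketch of how I picture the replay buffer from the first bullet (the capacity and batch size are just placeholders):

        import random
        from collections import deque

        class ReplayBuffer:
            def __init__(self, capacity=100_000):
                self.buffer = deque(maxlen=capacity)   # old experiences are only dropped when the buffer is full

            def push(self, s, a, r, s_next, done):
                self.buffer.append((s, a, r, s_next, done))

            def sample(self, batch_size=32):
                # Uniform random sampling breaks up the temporal correlation between
                # consecutive transitions, so each update sees a de-correlated batch.
                return random.sample(self.buffer, batch_size)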

    So, in summary, a target network is required because the network keeps changing at each timestep and the “target values” are being updated at each timestep?

    But I do not understand how this solves the problem.

    • mimoralea
      mimoralea almost 5 years
      One thing: that update looks like SARSA to me. You seem to be using the actual next action you took, a', instead of the max over the actions in the next state. At least I don't see the max in the equation.
  • tandem
    tandem almost 5 years
    That's a fantastic explanation. Thanks for that. I started looking into DQN, PPO, and A3C. Anything else you suggest?
  • Nick Larsen
    Nick Larsen almost 5 years
    Here is a playlist I highly recommend, youtube.com/playlist?list=PLAdk-EyP1ND8MqJEJnSvaoUShrAWYe51U, lecture 3 specifically covers DQN and is given by the author of the paper you referenced.
  • Nick Larsen
    Nick Larsen almost 5 years
    @tandem looks like I had it backwards, so I deleted the last comment. The target is what you use to evaluate your error, so you are updating the control network each time; you're just not changing the values you measure error against at each time step. Conceptually it's the same idea, I just had the direction of the update backwards.
  • tandem
    tandem over 4 years
    Thanks for this. I finally watched the series of lectures and it helped. One question that still bothers me. When we say TD(lambda), how do we set lambda to update the loss?
  • WaterGenie
    WaterGenie almost 4 years
    @NickLarsen can you comment more on the last point you made? What is the go-to alternative over DQN? Are you referring to more recent results like rainbow and ape-x? Or is it about the value-based approach in general? Sorry, I am quite new to the whole RL scene so I am not familiar with all the standard algorithms/approaches yet.
  • Nick Larsen
    Nick Larsen almost 4 years
    @tandem I don't know of any improvements over Sutton '88, which introduced it. Essentially you just try a range of lambda values, graph them, and choose the one that works best for your problem.
  • Nick Larsen
    Nick Larsen almost 4 years
    @Thirdwater in general, learning policies is much faster for problems where you would need to use a deep network as your function approximator, and there are now methods that learn both a policy and a value function (e.g. actor-critic methods) which learn faster for most problems.
  • tandem
    tandem almost 4 years
    @Thirdwater, I can highly recommend the youtube course linked above.
  • tandem
    tandem over 3 years
    @NickLarsen: I found a method that actually shows how to vary the lambda (arxiv.org/pdf/1703.01327.pdf). My assignment for today is to understand it.
  • PeterBe
    PeterBe about 2 years
    @NickLarsen: you wrote "with DQN you are updating many [state/action values]" --> Why does this happen, and why can't you just update one state/action value? Further question: is the real DQN not updated until a better policy is found in the target network? Does the target network change after every iteration, and does the real DQN, after some iterations, copy the target network's structure into its own if one of the many target networks gives better results? (I assume a new target network is created and saved in memory after every step.)
  • PeterBe
    PeterBe about 2 years
    And why is DQN essentially obsolete, as you wrote?