Deep Q-Network (DQN)
When to use DQN?
- DQN is sample-efficient, but often less stable than on-policy methods. DQN is off-policy; see Policy Gradient Methods for on-policy methods.
Notes from Pieter Abbeel
Unlike in regular Q-Learning, instead of only having to update $Q(s, a)$, we also have to update $\theta$ in $Q_\theta(s, a)$.
Taken from the original paper:
There are a few interesting things from the above pseudocode:
- They use two Q-functions: $Q$ (the Q that we are learning) and the target network $\hat{Q}$. This helps stabilize learning, because the weights of $\hat{Q}$ are a delayed copy of the weights of $Q$
- $\hat{Q}$ is “lagging behind” $Q$
- $\phi$ are stacked frames in the Atari game, because a single frame doesn’t have enough information (you need to know the velocity, i.e. which direction the ball is moving)
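The frame stacking mentioned above can be sketched as follows. This is a minimal illustration, not the paper's preprocessing pipeline: the class name `FrameStack`, the default of `k=4` frames, and the choice to return a plain tuple are all assumptions made for the example.

```python
from collections import deque


class FrameStack:
    """Keeps the k most recent frames so the agent can infer motion."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)  # appending to a full deque drops the oldest frame

    def reset(self, first_frame):
        # At the start of an episode, fill the stack with copies of the first frame.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return tuple(self.frames)

    def step(self, frame):
        # Add the newest frame and return the stacked observation (oldest first).
        self.frames.append(frame)
        return tuple(self.frames)
```

The frames can be any objects here; with image arrays, the returned tuple would typically be stacked into a single array before being fed to the network.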
Most DQN implementations today use Double DQN, because it reduces the overestimation of Q-values.
- Uses experience replay and fixed Q-targets
- Uses stochastic gradient descent
Achieved human-level performance on a number of Atari games.
Has two components
- Experience Replay: All the episode steps $(s, a, r, s')$ are stored in one replay memory $D$. $D$ has experience tuples over many episodes. During Q-learning updates, samples are drawn at random from the replay memory and thus one sample could be used multiple times. Experience replay improves data efficiency, removes correlations in the observation sequences, and smooths over changes in the data distribution.
- Periodically Updated Target: Q is optimized towards target values that are only periodically updated. The Q network is cloned and kept frozen as the optimization target every C steps (C is a hyperparameter). This modification makes the training more stable as it overcomes the short-term oscillations.
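The two components above can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's implementation: `ReplayMemory`, `maybe_sync_target`, the `done` flag in the tuple, and representing network parameters as a plain dictionary are all hypothetical choices made for the example.

```python
import copy
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of experience tuples gathered over many episodes."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest tuples are dropped when full

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random draw; a tuple stays in memory after sampling,
        # so it can be drawn again in later batches.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def maybe_sync_target(step, sync_every, online_params, target_params):
    """Every `sync_every` steps (C in the text), clone the online parameters."""
    if step % sync_every == 0:
        return copy.deepcopy(online_params)  # frozen copy, independent of later updates
    return target_params
```

Between syncs the target parameters stay fixed, so the regression target does not move with every gradient step.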
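Sample an experience tuple $(s, a, r, s')$ from the dataset $D$. Compute the target value $r + \gamma \max_{a'} \hat{Q}(s', a')$, where $\hat{Q}$ is the target network.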
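As a rough sketch, the target computation for one sampled tuple might look like the following. The function names, the `done` flag for terminal transitions, and the default `gamma=0.99` are assumptions for illustration; `next_q_values` stands for the target network's Q-values over all actions in the next state.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r + gamma * max over a' of Q_target(s', a').

    For a terminal transition there is no next state to bootstrap from,
    so the target is just the reward.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_error(q_sa, target):
    # Per-sample loss; the target is treated as a constant when taking gradients.
    return (q_sa - target) ** 2
```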
- Double DQN
- Use different weights, $\theta$ and $\theta^-$, to select and to evaluate the next action
- Dueling DQN
Uses the advantage function, $A(s, a) = Q(s, a) - V(s)$.
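Both variants can be illustrated with a short sketch. The function names and list-based inputs are hypothetical. The Double DQN target picks the next action with the online network's Q-values and scores it with the target network's Q-values. The dueling aggregation below subtracts the mean advantage, which is the usual formulation of the dueling architecture but is an assumption here, since these notes do not spell it out.

```python
def double_dqn_target(reward, next_q_online, next_q_target, done, gamma=0.99):
    """Select the next action with the online network, evaluate it with the target network."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a), subtracting the mean advantage."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]
```

Decoupling selection from evaluation means an action whose value is overestimated by one network is not automatically scored with that same overestimate.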