Reinforcement Learning

Reinforcement Learning Tradeoffs

Policy optimization methods are generally more stable, because they directly optimize for the objective. However, they tend to be on-policy: each update needs fresh data collected by the current policy, which makes them quite sample inefficient.
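To make the on-policy constraint concrete, here is a minimal sketch of a policy-gradient update on a toy multi-armed bandit. The bandit, the function names (`softmax`, `on_policy_gradient`), and all hyperparameters are assumptions chosen for illustration, not taken from any particular library or paper. The point to notice is that every update samples a new batch with the current policy and then throws it away.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def on_policy_gradient(arm_means, steps=500, batch=16, lr=0.1, seed=0):
    """REINFORCE-style updates on a hypothetical noisy bandit."""
    rng = random.Random(seed)
    prefs = [0.0] * len(arm_means)
    for _ in range(steps):
        probs = softmax(prefs)
        # On-policy: collect a fresh batch with the CURRENT policy.
        samples = []
        for _ in range(batch):
            a = rng.choices(range(len(probs)), weights=probs)[0]
            r = arm_means[a] + rng.gauss(0.0, 0.1)
            samples.append((a, r))
        # Batch-mean baseline to reduce variance (an illustrative choice).
        baseline = sum(r for _, r in samples) / batch
        grad = [0.0] * len(prefs)
        for a, r in samples:
            for i in range(len(prefs)):
                indicator = 1.0 if i == a else 0.0
                # Gradient of log softmax: indicator - probs[i].
                grad[i] += (r - baseline) * (indicator - probs[i]) / batch
        prefs = [p + lr * g for p, g in zip(prefs, grad)]
        # The batch is discarded here; the next update resamples.
    return softmax(prefs)


final_probs = on_policy_gradient([0.2, 0.8])
print(final_probs)
```

With these assumed settings, 500 updates consume 8,000 environment samples, none of which is used more than once.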

On the other hand, many off-policy methods are based on Q-learning. These are much more sample efficient because they can keep learning from data that was gathered by a different policy than the one being improved, including older experience. However, they tend to be unstable.

“Q-learning methods gain the advantage of being substantially more sample efficient when they do work, because they can reuse data more effectively than policy optimization techniques.”
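The data reuse described in the quote can be sketched with tabular Q-learning trained only from a fixed buffer of transitions. Everything here is an assumption made for illustration: the five-state chain environment, the names `step`, `collect_random_data`, and `q_learning_from_buffer`, and the hyperparameters. The behavior policy is uniformly random, yet the learned greedy policy differs from it, which is what off-policy learning means.

```python
import random

N_STATES = 5      # hypothetical chain: states 0..4, state 4 is terminal
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic toy dynamics; reward 1.0 only on reaching the last state."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    reward = 1.0 if done else 0.0
    return nxt, reward, done


def collect_random_data(episodes=200, max_len=50, seed=0):
    """Fill a buffer using a uniformly random behavior policy."""
    rng = random.Random(seed)
    buffer = []
    for _ in range(episodes):
        s = 0
        for _ in range(max_len):
            a = rng.choice(ACTIONS)
            s2, r, done = step(s, a)
            buffer.append((s, a, r, s2, done))
            s = s2
            if done:
                break
    return buffer


def q_learning_from_buffer(buffer, updates=20000, lr=0.1, gamma=0.9, seed=1):
    """Q-learning updates that resample stored transitions many times."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(updates):
        s, a, r, s2, done = rng.choice(buffer)  # reuse old data
        # Bootstrapped target uses the max over next actions, regardless of
        # which action the behavior policy actually took next.
        target = r if done else r + gamma * max(q[s2])
        q[s][a] += lr * (target - q[s][a])
    return q


buffer = collect_random_data()
q = q_learning_from_buffer(buffer)
greedy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(len(buffer), greedy)
```

In this sketch no new environment interaction happens after the buffer is filled; all 20,000 updates draw from the same stored transitions, so each one is typically used several times.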