Reinforcement Learning Tradeoffs
Policy optimization methods are generally more stable and reliable, because they directly optimize for the objective you care about. However, they tend to be on-policy, which makes them less sample efficient: each gradient update can only use data collected under the current version of the policy.
On the other hand, many off-policy methods are based on Q-learning. These are much more sample efficient because they can reuse data collected by older policies. However, they optimize the objective only indirectly, and so tend to be less stable.
“Q-learning methods gain the advantage of being substantially more sample efficient when they do work, because they can reuse data more effectively than policy optimization techniques.”
- From OpenAI's Spinning Up, https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html
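The data reuse that the quote refers to typically comes from a replay buffer: off-policy methods store transitions and sample them repeatedly for updates, whereas an on-policy method must discard its data after each policy update. Below is a minimal sketch of such a buffer; the class and method names are illustrative, not from any specific library.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of transitions. Off-policy methods such as
    Q-learning can sample old transitions many times, even ones
    generated by earlier versions of the policy."""

    def __init__(self, capacity):
        # deque with maxlen discards the oldest transition when full
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling: the batch may mix transitions
        # collected under many different past policies.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Collect transitions once, then reuse them across multiple updates.
buf = ReplayBuffer(capacity=1000)
for t in range(100):
    buf.add(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

batch = buf.sample(32)    # data for one gradient step
again = buf.sample(32)    # the same pool, sampled again for another step
```

An on-policy algorithm has no analogue of `again` here: once the policy changes, the stored transitions no longer come from the distribution the gradient estimate assumes, which is exactly the sample-efficiency gap the quote describes.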