World Model + RL
Inspired by Learning to Drive from a World Model.
Relevant papers:
Hypothesis:
- Value-learning / world-model learning can be off-policy
- Policy-distillation must be on-policy
- You can get this sample efficiency via the world model: do the on-policy rollouts inside the learned model instead of the real environment
We want to combine the advantages of both RL and world models:
- World model
- Value function
Idea:
- Why not have the world model also just predict the immediate reward?
A world model alone does not give you a useful learning signal for the policy. This is why, in Learning to Drive from a World Model, they condition it on future observations (very similar to goal-conditioned RL) and use that as the supervised-learning signal on expert trajectories.
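A minimal sketch of that idea as I read it: a world model conditioned on a future/goal observation that also predicts the immediate reward, trained purely with supervised losses on logged (expert) trajectories. Module names, shapes, and the MSE losses are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class GoalConditionedWorldModel(nn.Module):
    """World model conditioned on a future/goal observation.
    Predicts the next observation and the immediate reward.
    Architecture and dimensions are illustrative assumptions."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim + act_dim + obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.next_obs_head = nn.Linear(hidden, obs_dim)  # predicts s_{t+1}
        self.reward_head = nn.Linear(hidden, 1)          # predicts r_t

    def forward(self, obs, act, goal_obs):
        h = self.trunk(torch.cat([obs, act, goal_obs], dim=-1))
        return self.next_obs_head(h), self.reward_head(h).squeeze(-1)


def world_model_loss(model, obs, act, goal_obs, next_obs, reward):
    """Plain supervised regression on logged (off-policy) transitions."""
    pred_next, pred_r = model(obs, act, goal_obs)
    return (nn.functional.mse_loss(pred_next, next_obs)
            + nn.functional.mse_loss(pred_r, reward))
```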
- However, how can we actually use this to get a policy?
Offline RL is essentially all value-based learning; the most successful way to do policy extraction is DDPG + BC (maximizing Q while staying close to the expert dataset).
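For concreteness, that extraction objective is roughly the TD3+BC recipe: maximize Q at the policy's action while regressing toward the dataset action. A hedged sketch, where `q_net`, `policy`, and the alpha/normalization constants are assumptions:

```python
import torch
import torch.nn.functional as F

def ddpg_bc_policy_loss(q_net, policy, obs, dataset_act, alpha=2.5):
    """Policy-extraction loss in the DDPG + BC style (as in TD3+BC):
    maximize Q(s, pi(s)) while regressing toward the dataset action.
    The Q-normalization and alpha value follow the TD3+BC convention;
    treat the exact constants as assumptions."""
    pi_act = policy(obs)
    q = q_net(obs, pi_act)
    lam = alpha / q.abs().mean().detach()  # keep the two terms on comparable scales
    return -(lam * q).mean() + F.mse_loss(pi_act, dataset_act)
```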
Here’s an algo (a code sketch follows the notes below):
1. Train a world model W that predicts next states.
2. Train Q on the same dataset.
3. Step the policy through the world model, do best-of-N (BoN) sampling scored by Q, and steer the policy toward the Q-maximizing actions.
- Stepping the policy through the world model is extremely important, because without it you have no way of recovering once the policy drifts off the data distribution.
This is the classic problem of distribution shift leading to compounding errors.
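A rough sketch of the loop above, assuming `world_model(obs, act)` returns the predicted next observation, `q_net(obs, act)` returns a scalar Q per transition, and `policy(obs)` returns a continuous action; the horizon, candidate count, noise scale, and squared-error "steering" loss are all my own assumptions.

```python
import torch
import torch.nn.functional as F

def bon_imagination_update(world_model, q_net, policy, policy_opt, obs,
                           horizon=5, n_candidates=16, noise_std=0.1):
    """Roll the policy through the learned world model; at each imagined step,
    draw N noisy candidates around the policy's action, keep the one Q scores
    highest, and regress the policy toward it (best-of-N steering)."""
    total_loss = 0.0
    for _ in range(horizon):
        with torch.no_grad():
            base_act = policy(obs)                               # [B, A]
            cand = base_act.unsqueeze(0) + noise_std * torch.randn(
                n_candidates, *base_act.shape)                   # [N, B, A]
            obs_rep = obs.unsqueeze(0).expand(n_candidates, *obs.shape)
            scores = q_net(obs_rep.reshape(-1, obs.shape[-1]),
                           cand.reshape(-1, base_act.shape[-1]))
            scores = scores.reshape(n_candidates, obs.shape[0])  # [N, B]
            best = cand[scores.argmax(dim=0), torch.arange(obs.shape[0])]
        # Steer the policy toward the Q-maximizing action at this imagined state.
        total_loss = total_loss + F.mse_loss(policy(obs), best)
        with torch.no_grad():
            # Imagined transition: this is what lets the policy visit (and learn
            # to recover from) states outside the expert distribution.
            obs = world_model(obs, best)
    policy_opt.zero_grad()
    total_loss.backward()
    policy_opt.step()
    return total_loss.item()
```

Gradients flow only through the steering loss, so the world model and Q stay frozen during this update; they only generate imagined states and score candidates.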
Key components required to make this work:
- World model needs to be accurate enough to support multi-step rollouts
New Idea
World model that predicts trajectories and can do stitching:
- This is an action chunk. Then condition on the goal state.
Distill this trajectory down to a policy \pi via a flow-matching loss (sketched below).
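A hedged sketch of that last step: distilling a target action chunk (e.g. one produced by a stitched, goal-conditioned rollout of the world model) into a policy \pi with a conditional flow-matching loss. The linear interpolation path, the goal-conditioning interface, and the network shapes are standard flow-matching choices I am assuming, not details from the note.

```python
import torch
import torch.nn as nn

class ChunkPolicy(nn.Module):
    """Flow-matching policy head: predicts the velocity field that transports
    Gaussian noise to an action chunk, conditioned on (obs, goal).
    Shapes are illustrative assumptions."""

    def __init__(self, obs_dim: int, act_dim: int, chunk_len: int, hidden: int = 256):
        super().__init__()
        self.chunk_dim = act_dim * chunk_len
        self.net = nn.Sequential(
            nn.Linear(obs_dim + obs_dim + self.chunk_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, self.chunk_dim),
        )

    def forward(self, obs, goal, noisy_chunk, t):
        return self.net(torch.cat([obs, goal, noisy_chunk, t], dim=-1))


def flow_matching_distill_loss(policy, obs, goal, target_chunk):
    """Conditional flow matching: interpolate between noise x0 and the target
    action chunk x1, and regress the predicted velocity onto (x1 - x0)."""
    x1 = target_chunk.flatten(start_dim=1)   # [B, chunk_len * act_dim]
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], 1)           # per-sample time in [0, 1]
    xt = (1 - t) * x0 + t * x1               # linear probability path
    v_pred = policy(obs, goal, xt, t)
    return nn.functional.mse_loss(v_pred, x1 - x0)
```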