World Model + RL

Inspired by Learning to Drive from a World Model.

Relevant papers:

Hypothesis:

  • Value learning / world-model learning can be off-policy
  • Policy distillation must be on-policy
    • You can recover the sample efficiency by generating those on-policy rollouts inside the world model

We want to combine the advantages of both value-based RL and world models:

  • World model
  • Value function

Idea:

  • Why not have the world model also predict the instantaneous reward? (sketched below)
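As a minimal sketch of what that could look like (the two-head MLP architecture and names like `WorldModel` are my own illustration, not from the source), the model can share a trunk and predict both the next state and the reward from the same off-policy transitions:

```python
# Sketch (assumed architecture, not from the source): an MLP world model that,
# given a state and action, predicts both the next state and the instant reward.
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.next_state_head = nn.Linear(hidden, state_dim)  # predicts s'
        self.reward_head = nn.Linear(hidden, 1)              # predicts r

    def forward(self, state, action):
        h = self.trunk(torch.cat([state, action], dim=-1))
        return self.next_state_head(h), self.reward_head(h).squeeze(-1)

def world_model_loss(model, state, action, next_state, reward):
    # Both heads are trained from the same (off-policy) transition dataset.
    pred_next, pred_reward = model(state, action)
    return (nn.functional.mse_loss(pred_next, next_state)
            + nn.functional.mse_loss(pred_reward, reward))
```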

A world model alone does not provide a useful learning signal on its own. This is why, in Learning to Drive from a World Model, they condition the model on future observations (very similar in spirit to goal-conditioned RL) and use that as the supervised learning signal on expert trajectories.
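A sketch of that conditioning in its simplest form, assuming hindsight-style relabeling where the policy sees a future observation from the same expert trajectory and is trained with a behavior-cloning loss (the names and the `horizon` window are illustrative assumptions):

```python
# Sketch (assumed setup): condition the policy on a future observation from the
# same expert trajectory and train it with a supervised (behavior-cloning) loss.
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, obs, future_obs):
        return self.net(torch.cat([obs, future_obs], dim=-1))

def bc_loss(policy, traj_obs, traj_actions, horizon: int = 32):
    """traj_obs: (T, obs_dim), traj_actions: (T, action_dim), one expert trajectory."""
    T = traj_obs.shape[0]
    t = torch.randint(0, T - 1, (1,)).item()
    # Hindsight relabeling: a future observation from the same trajectory is the "goal".
    k = torch.randint(t + 1, min(t + 1 + horizon, T), (1,)).item()
    pred = policy(traj_obs[t], traj_obs[k])
    return nn.functional.mse_loss(pred, traj_actions[t])
```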

  • However, how can we use

Offline RL is almost entirely built on value-based learning; the most successful way to do policy extraction is DDPG + BC (maximize Q while staying close to the expert dataset).
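A sketch of that policy-extraction objective in the TD3+BC style; the Q-scale normalization and the `alpha` weight follow the common recipe, and the `q_net(obs, action)` interface is an assumption:

```python
# Sketch of DDPG + BC policy extraction (TD3+BC-style objective; the exact
# weighting is an assumption, not from the source).
import torch
import torch.nn as nn

def ddpg_bc_policy_loss(policy: nn.Module, q_net: nn.Module,
                        obs: torch.Tensor, expert_action: torch.Tensor,
                        alpha: float = 2.5) -> torch.Tensor:
    pi_action = policy(obs)
    q = q_net(obs, pi_action)
    # Normalize the Q term so the BC term has a comparable scale (as in TD3+BC).
    lam = alpha / q.abs().mean().detach()
    # Maximize Q while staying close to the expert action from the dataset.
    return -(lam * q).mean() + nn.functional.mse_loss(pi_action, expert_action)
```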

Here’s an algo:

  • Train a world model W that predicts next states
  • Train a Q-function from the same dataset
  • Step the policy through the world model, do best-of-N (BoN) sampling with Q, and steer the policy in the direction that maximizes Q (a sketch follows below)
    • Stepping the policy through the world model is extremely important, because without it you have no way of learning to recover once the policy drifts off the data distribution
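Here is a sketch of that loop, reusing the assumed `WorldModel`, Q, and policy interfaces from the sketches above; the Gaussian perturbation for generating candidates and the regression-toward-the-best-action update are my own choices, not from the source:

```python
# Sketch of the imagined-rollout loop (interfaces and noise scale are assumptions):
# roll the policy inside the world model, at each step sample N candidate actions,
# score them with Q, and regress the policy toward the best one.
import torch
import torch.nn as nn

def world_model_bon_update(world_model, q_net, policy, optimizer, init_state,
                           horizon: int = 16, n_candidates: int = 8,
                           noise_std: float = 0.1):
    state = init_state
    losses = []
    for _ in range(horizon):
        with torch.no_grad():
            base_action = policy(state)
            # Best-of-N: perturb the policy's action and keep the highest-Q candidate.
            candidates = base_action.unsqueeze(0) + noise_std * torch.randn(
                n_candidates, *base_action.shape)
            q_values = torch.stack([q_net(state, a) for a in candidates])
            best_action = candidates[q_values.argmax()]
            # Step the *world model*, not the real environment, so the rollout
            # visits the states the current policy actually reaches.
            next_state, _reward = world_model(state, best_action)
        # Steer the policy toward the Q-maximizing action at this imagined state.
        losses.append(nn.functional.mse_loss(policy(state), best_action))
        state = next_state
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```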

This is the classic problem of distribution shift leading to compounding errors.

Key components required to make this work:

  • The world model needs to be accurate, especially over multi-step rollouts (a quick check is sketched below)
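One way to make "accurate" concrete (my own suggested diagnostic, not from the source) is to measure k-step open-loop rollout error on held-out trajectories, since one-step error alone can hide compounding drift:

```python
# Sketch (my own diagnostic, not from the source): k-step open-loop rollout
# error of the world model against a held-out real trajectory.
import torch

@torch.no_grad()
def k_step_rollout_error(world_model, obs_seq, action_seq, k: int = 16):
    """obs_seq: (T, obs_dim), action_seq: (T, action_dim) from a held-out trajectory."""
    state = obs_seq[0]
    errors = []
    for t in range(min(k, obs_seq.shape[0] - 1)):
        state, _reward = world_model(state, action_seq[t])  # feed predictions back in
        errors.append((state - obs_seq[t + 1]).pow(2).mean())
    return torch.stack(errors)  # per-step MSE; watch how fast it grows with t
```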

New Idea

A world model that predicts whole trajectories and can do trajectory stitching:

  • The predicted trajectory is an action chunk; then condition it on a goal state

Distill this trajectory down to a policy π via a flow-matching loss (a sketch follows).
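A sketch of that distillation step, assuming conditional flow matching with a linear noise-to-data path over flattened action chunks; the chunk length, the (observation, goal) conditioning, and the network shape are all illustrative assumptions:

```python
# Sketch of distilling action chunks into a goal-conditioned policy with a
# flow-matching loss (linear interpolation path; details are assumptions).
import torch
import torch.nn as nn

class ChunkFlowPolicy(nn.Module):
    """Predicts the velocity field v(x_t, t | obs, goal) over flattened action chunks."""
    def __init__(self, obs_dim: int, action_dim: int, chunk_len: int, hidden: int = 256):
        super().__init__()
        self.chunk_dim = action_dim * chunk_len
        self.net = nn.Sequential(
            nn.Linear(self.chunk_dim + 2 * obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, self.chunk_dim),
        )

    def forward(self, x_t, t, obs, goal):
        return self.net(torch.cat([x_t, t, obs, goal], dim=-1))

def flow_matching_loss(policy, action_chunk, obs, goal):
    """action_chunk: (B, chunk_dim) target chunk taken from the world-model trajectory."""
    b = action_chunk.shape[0]
    t = torch.rand(b, 1)
    noise = torch.randn_like(action_chunk)
    # Linear path from noise (t=0) to the target chunk (t=1).
    x_t = (1 - t) * noise + t * action_chunk
    target_velocity = action_chunk - noise
    pred_velocity = policy(x_t, t, obs, goal)
    return nn.functional.mse_loss(pred_velocity, target_velocity)
```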