World Model + RL

Inspired by Learning to Drive from a World Model.

Relevant papers:

Hypothesis:

  • Value learning / world-model learning can be off-policy
  • Policy distillation must be on-policy
    • You can keep the sample efficiency anyway via the world model, since it supplies on-policy rollouts without new environment samples

We want to combine the advantages of both RL and world models:

  • World model
  • Value function

Idea:

  • Why not have the world model also just predict the instant reward? (see the sketch below)
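
A minimal sketch of that idea, assuming a simple MLP world model in PyTorch (the module names and sizes are my own, not from any particular paper): one shared trunk with two heads, one predicting the next state and one predicting the instant reward.

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """World model with an extra head for the instant reward."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.next_state_head = nn.Linear(hidden, state_dim)  # predicts s_{t+1}
        self.reward_head = nn.Linear(hidden, 1)              # predicts r_t

    def forward(self, state: torch.Tensor, action: torch.Tensor):
        h = self.trunk(torch.cat([state, action], dim=-1))
        return self.next_state_head(h), self.reward_head(h).squeeze(-1)
```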

A world model alone is not very useful for learning good control signals. This is why, in Learning to Drive from a World Model, they condition it on future observations (a very similar idea to goal-conditioned RL) and use that as the supervised learning signal on expert trajectories.
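
A hedged sketch of what that conditioning could look like (my guess at the shape of the setup, not the paper's actual architecture): the policy sees both the current observation and a future observation from the same expert trajectory, and the supervised signal is plain behavior cloning against the expert action.

```python
import torch
import torch.nn as nn

class FutureConditionedPolicy(nn.Module):
    """Policy conditioned on a future observation, goal-conditioned-RL style."""

    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, obs: torch.Tensor, future_obs: torch.Tensor):
        return self.net(torch.cat([obs, future_obs], dim=-1))

def bc_loss(policy, obs, future_obs, expert_action):
    # Supervised signal: match the expert action given (obs, future obs).
    return nn.functional.mse_loss(policy(obs, future_obs), expert_action)
```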

  • However, how can we use

Essentially all of offline RL is built on value-based learning; the most successful way to do policy extraction is DDPG + BC (maximizing Q while staying close to the expert dataset).
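
For concreteness, here is roughly what that objective looks like in TD3+BC style (the alpha weighting and Q-normalization come from TD3+BC, not from this note; `policy` and `q_fn` are placeholder names):

```python
import torch
import torch.nn.functional as F

def policy_extraction_loss(policy, q_fn, state, dataset_action, alpha=2.5):
    """DDPG + BC: maximize Q under the learned critic, anchored to the data."""
    pi_action = policy(state)
    q = q_fn(state, pi_action)
    # Normalize the Q term so alpha trades off stably against the BC term.
    lam = alpha / q.abs().mean().detach()
    return -lam * q.mean() + F.mse_loss(pi_action, dataset_action)
```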

Here’s an algo:

  • Train a world model W that predicts next states
  • Train Q from the same dataset
  • Step the policy through the world model, do best-of-N (BoN) sampling with Q, and steer the policy in the direction that maximizes Q (sketch after this list)
    • Stepping the policy through the world model is extremely important: without it, you have no way of recovering from the classic problem of falling out of distribution
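
A minimal sketch of the rollout step, with everything assumed: `world_model` returns `(next_state, reward)`, `policy` maps a state to an action, and `q_fn` scores a `(state, action)` pair. The best-of-N action at each step can then serve as the distillation target that steers the policy toward higher Q.

```python
import torch

@torch.no_grad()
def bon_rollout(world_model, policy, q_fn, state, horizon=10, n=8, noise=0.1):
    """Roll the policy through the world model with best-of-N search under Q."""
    trajectory = []
    for _ in range(horizon):
        # Best-of-N: perturb the policy action and rank candidates by Q.
        base = policy(state)
        candidates = base.unsqueeze(0) + noise * torch.randn(n, *base.shape)
        scores = torch.stack([q_fn(state, a) for a in candidates])
        best = candidates[scores.argmax()]
        # Step the imagined dynamics; (state, best) pairs are the
        # distillation targets used to steer the policy toward higher Q.
        next_state, reward = world_model(state, best)
        trajectory.append((state, best, reward))
        state = next_state
    return trajectory
```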

Key components required to make this work:

  • World model needs to be accurate