World Model + RL
Inspired by Learning to Drive from a World Model.
Relevant papers:
Hypothesis:
- Value-learning / world-model learning can be off-policy
- Policy-distillation must be on-policy
- You can get this sample efficiency via the world model: do the on-policy rollouts inside the learned model instead of the real environment
We want to combine the advantages of both RL and world models:
- World model
- Value function
Idea:
- Why not have the world model also just predict the immediate reward?
A world model alone does not give you a useful learning signal for the policy. This is why, in Learning to Drive from a World Model, they condition it on future observations (very similar to goal-conditioned RL) and use that as the supervised-learning signal on expert trajectories.
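A minimal sketch of that idea as I read it: a world model conditioned on a future/goal observation that also predicts the immediate reward, trained purely with supervised losses on logged (expert) trajectories. Module names, shapes, and the MSE losses are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class GoalConditionedWorldModel(nn.Module):
    """World model conditioned on a future/goal observation.
    Predicts the next observation and the immediate reward.
    Architecture and dimensions are illustrative assumptions."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim + act_dim + obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.next_obs_head = nn.Linear(hidden, obs_dim)  # predicts s_{t+1}
        self.reward_head = nn.Linear(hidden, 1)          # predicts r_t

    def forward(self, obs, act, goal_obs):
        h = self.trunk(torch.cat([obs, act, goal_obs], dim=-1))
        return self.next_obs_head(h), self.reward_head(h).squeeze(-1)


def world_model_loss(model, obs, act, goal_obs, next_obs, reward):
    """Plain supervised regression on logged (off-policy) transitions."""
    pred_next, pred_r = model(obs, act, goal_obs)
    return (nn.functional.mse_loss(pred_next, next_obs)
            + nn.functional.mse_loss(pred_r, reward))
```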
- However, how can we actually use this to get a policy?
Offline RL is essentially all value-based learning; the most successful way to do policy extraction is DDPG + BC (maximizing Q while staying close to the expert dataset).
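For concreteness, that extraction objective is roughly the TD3+BC recipe: maximize Q at the policy's action while regressing toward the dataset action. A hedged sketch, where `q_net`, `policy`, and the alpha/normalization constants are assumptions:

```python
import torch
import torch.nn.functional as F

def ddpg_bc_policy_loss(q_net, policy, obs, dataset_act, alpha=2.5):
    """Policy-extraction loss in the DDPG + BC style (as in TD3+BC):
    maximize Q(s, pi(s)) while regressing toward the dataset action.
    The Q-normalization and alpha value follow the TD3+BC convention;
    treat the exact constants as assumptions."""
    pi_act = policy(obs)
    q = q_net(obs, pi_act)
    lam = alpha / q.abs().mean().detach()  # keep the two terms on comparable scales
    return -(lam * q).mean() + F.mse_loss(pi_act, dataset_act)
```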
Here’s an algo (a code sketch follows the notes below):
1. Train a world model W that predicts next states.
2. Train Q on the same dataset.
3. Step the policy through the world model, do best-of-N (BoN) sampling scored by Q, and steer the policy toward the Q-maximizing actions.
- Stepping the policy through the world model is extremely important, because without it you have no way of recovering once the policy drifts off the data distribution.
This is the classic problem of distribution shift leading to compounding errors.
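A rough sketch of the loop above, assuming `world_model(obs, act)` returns the predicted next observation, `q_net(obs, act)` returns a scalar Q per transition, and `policy(obs)` returns a continuous action; the horizon, candidate count, noise scale, and squared-error "steering" loss are all my own assumptions.

```python
import torch
import torch.nn.functional as F

def bon_imagination_update(world_model, q_net, policy, policy_opt, obs,
                           horizon=5, n_candidates=16, noise_std=0.1):
    """Roll the policy through the learned world model; at each imagined step,
    draw N noisy candidates around the policy's action, keep the one Q scores
    highest, and regress the policy toward it (best-of-N steering)."""
    total_loss = 0.0
    for _ in range(horizon):
        with torch.no_grad():
            base_act = policy(obs)                               # [B, A]
            cand = base_act.unsqueeze(0) + noise_std * torch.randn(
                n_candidates, *base_act.shape)                   # [N, B, A]
            obs_rep = obs.unsqueeze(0).expand(n_candidates, *obs.shape)
            scores = q_net(obs_rep.reshape(-1, obs.shape[-1]),
                           cand.reshape(-1, base_act.shape[-1]))
            scores = scores.reshape(n_candidates, obs.shape[0])  # [N, B]
            best = cand[scores.argmax(dim=0), torch.arange(obs.shape[0])]
        # Steer the policy toward the Q-maximizing action at this imagined state.
        total_loss = total_loss + F.mse_loss(policy(obs), best)
        with torch.no_grad():
            # Imagined transition: this is what lets the policy visit (and learn
            # to recover from) states outside the expert distribution.
            obs = world_model(obs, best)
    policy_opt.zero_grad()
    total_loss.backward()
    policy_opt.step()
    return total_loss.item()
```

Gradients flow only through the steering loss, so the world model and Q stay frozen during this update; they only generate imagined states and score candidates.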
Key components required to make this work:
- World model needs to be accurate enough to support multi-step rollouts
New Idea
World model that predicts trajectories and can do stitching:
- This is an action chunk. Then condition on the goal state.
Distill this trajectory down to a policy \pi via a flow-matching loss (sketched below).
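A hedged sketch of that last step: distilling a target action chunk (e.g. one produced by a stitched, goal-conditioned rollout of the world model) into a policy \pi with a conditional flow-matching loss. The linear interpolation path, the goal-conditioning interface, and the network shapes are standard flow-matching choices I am assuming, not details from the note.

```python
import torch
import torch.nn as nn

class ChunkPolicy(nn.Module):
    """Flow-matching policy head: predicts the velocity field that transports
    Gaussian noise to an action chunk, conditioned on (obs, goal).
    Shapes are illustrative assumptions."""

    def __init__(self, obs_dim: int, act_dim: int, chunk_len: int, hidden: int = 256):
        super().__init__()
        self.chunk_dim = act_dim * chunk_len
        self.net = nn.Sequential(
            nn.Linear(obs_dim + obs_dim + self.chunk_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, self.chunk_dim),
        )

    def forward(self, obs, goal, noisy_chunk, t):
        return self.net(torch.cat([obs, goal, noisy_chunk, t], dim=-1))


def flow_matching_distill_loss(policy, obs, goal, target_chunk):
    """Conditional flow matching: interpolate between noise x0 and the target
    action chunk x1, and regress the predicted velocity onto (x1 - x0)."""
    x1 = target_chunk.flatten(start_dim=1)   # [B, chunk_len * act_dim]
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], 1)           # per-sample time in [0, 1]
    xt = (1 - t) * x0 + t * x1               # linear probability path
    v_pred = policy(obs, goal, xt, t)
    return nn.functional.mse_loss(v_pred, x1 - x0)
```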