Learning to Drive from a World Model
I found this paper to actually be very insightful, even though it’s mostly an applied paper.
They had an episode on RoboPapers which was really helpful.
There are many interesting things that they argue for that I haven’t thought much about before:
- On-policy vs. off-policy arguments
- off-policy here refers to treating expert demonstrations (human driving) as the ground-truth actions. They argue that this doesn’t work: if you just BC on the demonstrations, you end up with really shitty policies, because human drivers tend to drift
- For example, a driver might be 50cm off the center lane for an extended period of time. If you BC on that data, your policy will learn to do the same
- Sure, you could curate good expert demonstrations; that’s a lot of what the robot learning field already does
- An alternative is on-policy: the model learns from its own rollouts. But then where does the ground truth come from?
- Traditionally, if you look at something like REINFORCE, you just weight your own rollouts by some advantage, so you know which behaviors to reinforce
- Here, they actually use a goal-conditioned world model as the ground truth
- You can think of the policy as a distilled version of the world model
- You can actually think of this as hybrid-policy learning: the states the policy visits are on-policy (since we sample from the policy to step through the world model), but the ground-truth actions we learn from are off-policy. This provides a much stronger distillation signal for handling various out-of-distribution cases
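The REINFORCE comparison above can be sketched on a toy 1-D problem. Everything here is illustrative (not the paper's setup): a Gaussian policy whose mean is nudged by advantage-weighted log-prob gradients, with no ground-truth actions anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a 1-D Gaussian policy with learnable mean `mu`.
# REINFORCE weights each sampled action's log-prob gradient by that
# rollout's advantage -- the policy only ever learns from its own samples.
mu, sigma, lr = 0.0, 1.0, 0.1

def reward(action):
    # Hypothetical reward, peaked at action = 2.0
    return -(action - 2.0) ** 2

for step in range(200):
    actions = rng.normal(mu, sigma, size=64)       # on-policy rollouts
    returns = reward(actions)
    advantages = returns - returns.mean()          # baseline-subtracted
    # grad of log N(a; mu, sigma) w.r.t. mu is (a - mu) / sigma^2
    grad_mu = np.mean(advantages * (actions - mu) / sigma**2)
    mu += lr * grad_mu                             # gradient ascent

# mu should end up near the reward peak at 2.0
```

The point of the contrast: REINFORCE only gets a scalar weighting signal per rollout, whereas the distillation setup below gets a full target action at every visited state.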
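The hybrid on-policy/off-policy idea can be sketched as a loop: roll the *student's* actions through the world model (on-policy states), but regress onto action labels the world model supplies (off-policy targets). All names and dynamics here are toy stand-ins, not the paper's actual components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a world model that both simulates dynamics and
# supplies "ground-truth" expert actions, plus a linear student policy.
def world_model_step(state, action):
    return state + 0.1 * action             # toy 1-D dynamics

def world_model_expert_action(state, goal):
    return 0.5 * (goal - state)             # toy expert: a P-controller

w, lr, goal = 0.0, 0.02, 2.0                # student gain, learning rate

for episode in range(200):
    state = rng.normal()                    # random start
    for t in range(10):
        err = goal - state
        student_action = w * err                                 # on-policy rollout
        target_action = world_model_expert_action(state, goal)   # off-policy label
        # Supervised distillation: gradient step on (student - target)^2 / 2
        w -= lr * (student_action - target_action) * err
        state = world_model_step(state, student_action)

# The student recovers the expert's gain (w -> 0.5), and crucially it was
# trained on the state distribution induced by its OWN actions, so early
# mistakes generate exactly the off-distribution states it must learn to fix.
```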
- Policy has a multi-modal regression head: 5 hypotheses, each parameterized by the mean and scale of a Laplace distribution, trained with NLL using the MHP (multiple hypothesis prediction) loss.
- Why don’t they use Gaussian distributions? (My guess: the Laplace NLL penalizes the mean with an L1-style term, which is more robust to outliers than the Gaussian’s L2.)
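A minimal sketch of the loss described above, assuming a winner-takes-all MHP formulation (a common variant; the paper's exact loss may differ, and the names here are illustrative):

```python
import numpy as np

def laplace_nll(y, mu, b):
    # Negative log-likelihood of Laplace(mu, b): log(2b) + |y - mu| / b
    return np.log(2.0 * b) + np.abs(y - mu) / b

def mhp_laplace_loss(y, mus, bs, eps=0.05):
    # Winner-takes-all over K hypotheses: the best-fitting hypothesis
    # gets almost all the gradient; the rest get a small weight `eps`
    # spread among them so unused heads don't collapse.
    nlls = laplace_nll(y, mus, bs)                   # shape (K,)
    k_best = int(np.argmin(nlls))
    weights = np.full_like(nlls, eps / (len(nlls) - 1))
    weights[k_best] = 1.0 - eps
    return float(np.sum(weights * nlls))

# 5 hypotheses, each a (mean, scale) pair, as in the note above
mus = np.array([0.0, 1.0, 2.0, -1.0, 0.5])
bs  = np.array([0.5, 0.5, 0.5, 0.5, 0.5])
loss = mhp_laplace_loss(1.1, mus, bs)
```

Only the hypothesis nearest the target (here the one at mean 1.0) gets strongly pulled toward it, which is what lets the head stay multi-modal instead of averaging all modes together.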
Please correct me if there’s anything I got wrong here.