Imitation Learning

Behavior Cloning (BC)

Imitation Learning is often formulated as behavior cloning, which uses supervised learning to learn a policy $\pi_\theta$, parameterized by $\theta$, that maximizes the log-likelihood of the actions in a demonstration dataset $\mathcal{D}$:

MLE formulation: $\max_\theta \; \mathbb{E}_{(s,a) \sim \mathcal{D}}\left[\log \pi_\theta(a \mid s)\right]$

  • As seen in the What Matters for Batch Online Reinforcement Learning in Robotics paper
  • See Log Likelihood if confused about where the log is coming from
  • More intelligent BC can be done with
  • Maximizing the log-likelihood just means: “make the model put high probability on the demonstrated (expert) actions.”
    • In discrete spaces → cross-entropy.
    • In continuous spaces → likelihood under a distribution (e.g., Gaussian).
      • The ground-truth action is continuous, i.e. given some state, we learn a Gaussian over each action dimension (see the sketch below)
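
A minimal sketch of what the Gaussian-likelihood version can look like in code, assuming PyTorch; PolicyNet, the hidden sizes, and the log-std clamp range are made up here for illustration, not taken from the papers above:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """State -> diagonal Gaussian over actions (one mean and std per dimension)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, action_dim)     # per-dimension mean
        self.log_std = nn.Linear(hidden, action_dim)  # per-dimension log std

    def forward(self, state):
        h = self.trunk(state)
        # clamp is just for numerical stability (assumed range)
        return self.mean(h), self.log_std(h).clamp(-5.0, 2.0)

def bc_nll_loss(policy, states, expert_actions):
    """Negative log-likelihood of the expert actions under the Gaussian policy."""
    mean, log_std = policy(states)
    dist = torch.distributions.Normal(mean, log_std.exp())
    # Sum log-probs over action dimensions, average over the batch, negate:
    # minimizing this is maximizing the log-likelihood (the MLE objective).
    return -dist.log_prob(expert_actions).sum(dim=-1).mean()
```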

Regression formulation: $\min_\theta \; \mathbb{E}_{(s,a) \sim \mathcal{D}}\left[\lVert \pi_\theta(s) - a \rVert^2\right]$

  • As seen in the Robomimic paper
  • With this formulation, we lose the ability to model uncertainty: the policy outputs a single point estimate instead of a distribution (see the sketch below)
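
For contrast, a sketch of the regression formulation, assuming some deterministic policy_net that maps states directly to actions:

```python
import torch.nn.functional as F

def bc_mse_loss(policy_net, states, expert_actions):
    # Deterministic policy: one predicted action per state, no distribution,
    # so there is no notion of uncertainty to learn.
    predicted_actions = policy_net(states)
    return F.mse_loss(predicted_actions, expert_actions)
```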

In pi0, instead of cross-entropy, they use Flow Matching:

  • It looks like regression, but the model predicts a velocity field
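
A sketch of a conditional flow-matching loss for actions; the interpolation convention and the velocity_model signature are assumptions for illustration, not the exact pi0 recipe:

```python
import torch

def flow_matching_loss(velocity_model, states, expert_actions):
    # Linear path from Gaussian noise (t=0) to the expert action (t=1);
    # its time-derivative is the constant vector (action - noise).
    noise = torch.randn_like(expert_actions)
    t = torch.rand(expert_actions.shape[0], 1)
    a_t = (1.0 - t) * noise + t * expert_actions
    target_velocity = expert_actions - noise
    # The model predicts that velocity, conditioned on the state, the noisy
    # action, and the time step -- an MSE loss, hence "looks like regression".
    predicted_velocity = velocity_model(states, a_t, t)
    return ((predicted_velocity - target_velocity) ** 2).mean()
```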

If the policy “just always outputs one action,” it’s probably overfitting or collapsing due to data bias or model simplicity.

KL divergence shows up when comparing your learned policy to another (e.g., expert or prior) distribution.
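
A standard identity (not from the source) that connects this back to MLE: writing the KL from the expert policy $\pi_E$ to the learned policy $\pi_\theta$,

$$
D_{\mathrm{KL}}\!\left(\pi_E(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\right)
= \underbrace{\mathbb{E}_{a \sim \pi_E}\!\left[\log \pi_E(a \mid s)\right]}_{\text{constant in } \theta}
\;-\; \mathbb{E}_{a \sim \pi_E}\!\left[\log \pi_\theta(a \mid s)\right],
$$

the first term does not depend on $\theta$, so maximizing the expected log-likelihood of expert actions is exactly minimizing this KL.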

What happens if we drop the log?

Like why can’t we maximize ? If you maximize the raw probability, it can lead to weird behaviors:

  • For distributions like Gaussians, the “probability” is really a density and can be greater than 1, so maximizing it directly isn’t a proper scoring rule
  • It loses the property of penalizing low-probability assignments strongly.
  • You don’t recover MLE or Cross-Entropy Loss

Assume that our learned policy is just a univariate Gaussian, so we only need to learn $\mu$ and $\sigma$. If you don’t have the log, the objective is maximized by parking $\mu$ on one demonstrated action and shrinking $\sigma \to 0$: the density at that action blows up, so the average raw probability goes to infinity and the policy collapses onto a single action. The log prevents this, because the near-zero probabilities assigned to all the other actions get huge negative log-likelihoods.
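
A toy numerical check of that collapse argument (illustrative only, with a made-up 1-D dataset): fix the Gaussian mean on one “expert action” and shrink sigma.

```python
import numpy as np
from scipy.stats import norm

actions = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # toy 1-D expert actions

for sigma in [1.0, 0.1, 0.01, 0.001]:
    mean_density = norm.pdf(actions, loc=-1.0, scale=sigma).mean()
    mean_log_density = norm.logpdf(actions, loc=-1.0, scale=sigma).mean()
    print(f"sigma={sigma:<6} mean density={mean_density:9.3e}  "
          f"mean log-density={mean_log_density:9.3e}")

# The average raw density keeps growing as sigma -> 0 (the spike at -1.0
# dominates), so the no-log objective rewards collapse. The average
# log-density plummets instead, because the log strongly penalizes the
# near-zero probability assigned to the other actions.
```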