Imitation Learning

Behavior Cloning (BC)

  • More intelligent BC can be done with Advantage-Weighted Regression (AWR), which weights the log probability of each demonstrated action by its exponentiated advantage: $\max_\theta \; \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\log \pi_\theta(a \mid s)\,\exp\!\big(\tfrac{1}{\beta} A(s,a)\big)\big]$ (a sketch follows below).

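A minimal sketch of that weighted behavior-cloning loss, assuming a PyTorch policy that returns a per-sample `torch.distributions` object and that advantage estimates are already available (the policy interface, the temperature `beta`, and the weight clipping are assumptions for illustration, not something specified above):

```python
import torch

def awr_bc_loss(policy, states, actions, advantages, beta=1.0, weight_clip=20.0):
    """AWR-style behavior cloning: per-sample log-prob weighted by exp(advantage / beta)."""
    dist = policy(states)                      # assumed: returns a torch.distributions.Distribution
    log_prob = dist.log_prob(actions)          # log pi_theta(a | s), one value per sample
    weights = torch.exp(advantages / beta)     # exp(A(s, a) / beta)
    weights = weights.clamp(max=weight_clip)   # common stabilizer: cap very large weights
    # Treat the weights as constants; only the policy's log-probabilities get gradients.
    return -(weights.detach() * log_prob).mean()
```

Plain behavior cloning is the special case where every weight is 1.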
Where does this log probability come from? Negative Log Likelihood

Imitation Learning is often formulated as behavior cloning, which uses supervised learning to learn a policy $\pi_\theta$ parameterized by $\theta$ to maximize the log-likelihood of actions in a dataset $\mathcal{D}$ of state-action pairs:

$$\max_\theta \; \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\log \pi_\theta(a \mid s)\big]$$

  • Here $\pi_\theta(a \mid s)$ is the likelihood, since it’s the probability of assigning the action $a$ given the state $s$.

Maximizing $\log \pi_\theta(a \mid s)$ just means: “make the model put high probability on the demonstrated (expert) actions.”

  • In discrete spaces → cross-entropy.
  • In continuous spaces → likelihood under a distribution (e.g., Gaussian).
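A minimal sketch of both cases, assuming a PyTorch policy network whose head outputs either logits (discrete) or the mean and log-std of a diagonal Gaussian (continuous); the shapes and the Gaussian parameterization are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

# Discrete actions: cross-entropy is exactly the negative log-likelihood of the expert action.
def bc_loss_discrete(logits, expert_actions):
    # logits: (batch, num_actions); expert_actions: (batch,) integer action labels
    return F.cross_entropy(logits, expert_actions)

# Continuous actions: negative log-likelihood under a diagonal Gaussian.
def bc_loss_gaussian(mean, log_std, expert_actions):
    # mean, log_std, expert_actions: (batch, action_dim)
    dist = torch.distributions.Normal(mean, log_std.exp())
    log_prob = dist.log_prob(expert_actions).sum(dim=-1)   # sum over action dimensions
    return -log_prob.mean()
```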

If the policy “just always outputs one action,” it’s probably overfitting to a biased dataset (one action dominates the demonstrations) or collapsing to a single mode because the model is too simple to represent the action distribution.

KL divergence shows up when comparing your learned policy to another (e.g., expert or prior) distribution.
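For behavior cloning specifically, there is a tight link between the NLL objective above and the forward KL to the expert policy. This is a standard identity, written for a fixed state $s$ with $\pi_E$ denoting the expert’s action distribution:

$$\mathbb{E}_{a \sim \pi_E(\cdot \mid s)}\big[-\log \pi_\theta(a \mid s)\big] = D_{\mathrm{KL}}\big(\pi_E(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big) + \mathcal{H}\big(\pi_E(\cdot \mid s)\big)$$

Since the expert’s entropy $\mathcal{H}(\pi_E)$ does not depend on $\theta$, maximizing the log-likelihood of expert actions is the same as minimizing this (forward) KL divergence.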

What happens if we drop the log?

Like, why can’t we just maximize $\pi_\theta(a \mid s)$ itself? If you maximize the raw probability, it can lead to weird behaviors:

  • For continuous distributions like Gaussians, the density can be greater than 1, and maximizing it directly isn’t a proper scoring rule.
  • It loses the property of penalizing low-probability assignments strongly (a badly-modeled action’s contribution just bottoms out at 0, instead of blowing up to $-\infty$ as the log does).
  • You don’t recover MLE or the cross-entropy loss.

Assume that the policy we learn is just a univariate Gaussian distribution, so we only need to learn $\mu$ and $\sigma$. If you don’t have the log, the objective is maximized by collapsing to a point mass ($\sigma \to 0$) centered on the densest demonstrated actions: the near-zero density assigned to every other action costs almost nothing, while the density at the mode grows like $1/\sigma$ without bound.
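A small illustrative sketch of that collapse, assuming PyTorch and synthetic 1-D “expert actions” (the data, step counts, and learning rate here are made up for the demo): fitting $(\mu, \sigma)$ by maximizing the average raw density drives $\sigma$ far below the data’s spread, while maximizing the average log density recovers roughly the sample standard deviation.

```python
import torch

torch.manual_seed(0)
actions = torch.randn(1000)              # stand-in "expert actions", roughly N(0, 1)

def fit_gaussian(use_log, steps=2000, lr=0.01):
    mu = torch.zeros(1, requires_grad=True)
    log_sigma = torch.zeros(1, requires_grad=True)    # start at sigma = 1
    opt = torch.optim.Adam([mu, log_sigma], lr=lr)
    for _ in range(steps):
        dist = torch.distributions.Normal(mu, log_sigma.exp())
        log_prob = dist.log_prob(actions)
        # MLE objective (with log) vs. average raw density (without log)
        objective = log_prob.mean() if use_log else log_prob.exp().mean()
        opt.zero_grad()
        (-objective).backward()          # gradient ascent on the objective
        opt.step()
    return mu.item(), log_sigma.exp().item()

print("with log:   ", fit_gaussian(use_log=True))    # sigma ends up near the sample std (~1)
print("without log:", fit_gaussian(use_log=False))   # sigma shrinks toward 0 (point mass)
```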