Imitation Learning

Behavior Cloning (BC)

Imitation Learning is often formulated as behavior cloning, which uses supervised learning to learn a policy $\pi_\theta$, parameterized by $\theta$, that maximizes the log-likelihood of the actions in a demonstration dataset $\mathcal{D}$:

MLE formulation: $\max_\theta \; \mathbb{E}_{(s,a) \sim \mathcal{D}}\left[\log \pi_\theta(a \mid s)\right]$

  • As seen in the What Matters for Batch Online Reinforcement Learning in Robotics paper
  • See Log Likelihood if confused about where the log is coming from
  • More intelligent BC can be done with
  • Maximizing the log-likelihood just means: “make the model put high probability on the demonstrated (expert) actions.”
    • In discrete spaces → cross-entropy.
    • In continuous spaces → likelihood under a distribution (e.g., Gaussian).
      • The ground-truth action is continuous, i.e. given some state, we learn a Gaussian over each action dimension (see the sketch below)
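
A minimal sketch of what the Gaussian-likelihood version can look like in code, assuming PyTorch; PolicyNet, the hidden sizes, and the log-std clamp range are made up here for illustration, not taken from the papers above:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """State -> diagonal Gaussian over actions (one mean and std per dimension)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, action_dim)     # per-dimension mean
        self.log_std = nn.Linear(hidden, action_dim)  # per-dimension log std

    def forward(self, state):
        h = self.trunk(state)
        # clamp is just for numerical stability (assumed range)
        return self.mean(h), self.log_std(h).clamp(-5.0, 2.0)

def bc_nll_loss(policy, states, expert_actions):
    """Negative log-likelihood of the expert actions under the Gaussian policy."""
    mean, log_std = policy(states)
    dist = torch.distributions.Normal(mean, log_std.exp())
    # Sum log-probs over action dimensions, average over the batch, negate:
    # minimizing this is maximizing the log-likelihood (the MLE objective).
    return -dist.log_prob(expert_actions).sum(dim=-1).mean()
```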

Regression formulation: $\min_\theta \; \mathbb{E}_{(s,a) \sim \mathcal{D}}\left[\lVert \pi_\theta(s) - a \rVert^2\right]$

  • As seen in the Robomimic paper
  • With this formulation, we lose the ability to model uncertainty: the policy outputs a single point estimate instead of a distribution (see the sketch below)
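
For contrast, a sketch of the regression formulation, assuming some deterministic policy_net that maps states directly to actions:

```python
import torch.nn.functional as F

def bc_mse_loss(policy_net, states, expert_actions):
    # Deterministic policy: one predicted action per state, no distribution,
    # so there is no notion of uncertainty to learn.
    predicted_actions = policy_net(states)
    return F.mse_loss(predicted_actions, expert_actions)
```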

In pi0, instead of cross-entropy, they use Flow Matching:

  • It looks like regression, but the model predicts a velocity field
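
A sketch of a conditional flow-matching loss for actions; the interpolation convention and the velocity_model signature are assumptions for illustration, not the exact pi0 recipe:

```python
import torch

def flow_matching_loss(velocity_model, states, expert_actions):
    # Linear path from Gaussian noise (t=0) to the expert action (t=1);
    # its time-derivative is the constant vector (action - noise).
    noise = torch.randn_like(expert_actions)
    t = torch.rand(expert_actions.shape[0], 1)
    a_t = (1.0 - t) * noise + t * expert_actions
    target_velocity = expert_actions - noise
    # The model predicts that velocity, conditioned on the state, the noisy
    # action, and the time step -- an MSE loss, hence "looks like regression".
    predicted_velocity = velocity_model(states, a_t, t)
    return ((predicted_velocity - target_velocity) ** 2).mean()
```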

If the policy “just always outputs one action,” it’s probably overfitting or collapsing due to data bias or model simplicity.

KL divergence shows up when comparing your learned policy to another (e.g., expert or prior) distribution.
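
A standard identity (not from the source) that connects this back to MLE: writing the KL from the expert policy $\pi_E$ to the learned policy $\pi_\theta$,

$$
D_{\mathrm{KL}}\!\left(\pi_E(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\right)
= \underbrace{\mathbb{E}_{a \sim \pi_E}\!\left[\log \pi_E(a \mid s)\right]}_{\text{constant in } \theta}
\;-\; \mathbb{E}_{a \sim \pi_E}\!\left[\log \pi_\theta(a \mid s)\right],
$$

the first term does not depend on $\theta$, so maximizing the expected log-likelihood of expert actions is exactly minimizing this KL.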

What happens if we drop the log?

Like why can’t we maximize ? If you maximize the raw probability, it can lead to weird behaviors:

  • For distributions like Gaussians, the “probability” is really a density and can be greater than 1, so maximizing it directly isn’t a proper scoring rule
  • It loses the property of penalizing low-probability assignments strongly.
  • You don’t recover MLE or Cross-Entropy Loss

Assume that our learned policy is just a univariate Gaussian, so we only need to learn $\mu$ and $\sigma$. If you don’t have the log, the objective is maximized by parking $\mu$ on one demonstrated action and shrinking $\sigma \to 0$: the density at that action blows up, so the average raw probability goes to infinity and the policy collapses onto a single action. The log prevents this, because the near-zero probabilities assigned to all the other actions get huge negative log-likelihoods.
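
A toy numerical check of that collapse argument (illustrative only, with a made-up 1-D dataset): fix the Gaussian mean on one “expert action” and shrink sigma.

```python
import numpy as np
from scipy.stats import norm

actions = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # toy 1-D expert actions

for sigma in [1.0, 0.1, 0.01, 0.001]:
    mean_density = norm.pdf(actions, loc=-1.0, scale=sigma).mean()
    mean_log_density = norm.logpdf(actions, loc=-1.0, scale=sigma).mean()
    print(f"sigma={sigma:<6} mean density={mean_density:9.3e}  "
          f"mean log-density={mean_log_density:9.3e}")

# The average raw density keeps growing as sigma -> 0 (the spike at -1.0
# dominates), so the no-log objective rewards collapse. The average
# log-density plummets instead, because the log strongly penalizes the
# near-zero probability assigned to the other actions.
```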