Classifier-Free Diffusion Guidance (CFG)

For background, see Diffusion Models Beat GANs on Image Synthesis (Dhariwal & Nichol, 2021), which used classifier guidance to condition diffusion models on class labels.

Combine conditional and unconditional estimates:

$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\,\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big)$$

  • $\epsilon_\theta(x_t, c)$ — model’s prediction of the clean sample (or noise) conditioned on $c$
    (the conditioning signal, e.g., action, text, or latent control input)
  • $\epsilon_\theta(x_t, \varnothing)$ — model’s unconditional prediction (no conditioning)
  • $w$ — guidance weight or scale (note that the $w$ in Ho & Salimans 2021 is offset by one: their $w = 0$ is already plain conditional sampling)
  • $x_t$ — the current latent (noisy sample) at noise level $t$

The core of this technique

Classifier-Free Guidance increases controllability by amplifying the difference between the model’s conditional and unconditional predictions during denoising.

https://getimg.ai/guides/interactive-guide-to-stable-diffusion-guidance-scale-parameter “CFG Scale parameter”
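
To make that concrete, here is a minimal sketch of the guided prediction at sampling time. The signature `eps_model(x_t, t, cond)` and the `null_cond` placeholder are hypothetical names, not any particular library’s API:

```python
def cfg_prediction(eps_model, x_t, t, cond, null_cond, w):
    """Classifier-free guided noise prediction (names hypothetical).

    eps_model : network mapping (x_t, t, cond) -> predicted noise
    null_cond : placeholder input meaning "no conditioning"
    w         : guidance scale (w=0 unconditional, w=1 conditional, w>1 amplified)
    """
    eps_cond = eps_model(x_t, t, cond)         # conditional estimate
    eps_uncond = eps_model(x_t, t, null_cond)  # unconditional estimate
    # Start from the unconditional estimate and push along the direction
    # the condition moves the prediction, scaled by w.
    return eps_uncond + w * (eps_cond - eps_uncond)
```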

But why not just always do conditioning?

If my model already takes the condition $c$, why do I need this classifier-free guidance trick at all? Why not just trust the network’s conditional output?

The issue is that $\epsilon_\theta(x_t, c)$ might entirely ignore $c$. The loss we minimize is the plain MSE $\mathbb{E}\big[\lVert \epsilon - \epsilon_\theta(x_t, c) \rVert^2\big]$, and nothing in that objective forces the network to actually use $c$.

But with CFG, what we do is first measure how much the condition actually changes the prediction:

$$\Delta = \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)$$

Then we rescale that difference:

$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\,\Delta$$

If $w > 1$, this artificially amplifies the conditional influence.

It’s a post-hoc correction that compensates for the model’s tendency to ignore the condition under the plain MSE training objective.
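
In practice both predictions come from a single network: Ho & Salimans (2021) train with random condition dropout, replacing $c$ with a null token some fraction of the time so the same model also learns the unconditional case. A minimal training-step sketch under that scheme (`eps_model`, `null_cond`, and the schedule argument are hypothetical names):

```python
import torch
import torch.nn.functional as F

def cfg_training_step(eps_model, x_0, cond, null_cond, alphas_cumprod, p_uncond=0.1):
    """One denoising training step with condition dropout (names hypothetical).

    With probability p_uncond the condition is swapped for null_cond, so a
    single network learns conditional and unconditional denoising at once.
    """
    b = x_0.shape[0]
    # Sample a timestep and noise, then form the noisy input x_t.
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x_0.device)
    eps = torch.randn_like(x_0)
    a_bar = alphas_cumprod[t].view(b, *([1] * (x_0.dim() - 1)))
    x_t = a_bar.sqrt() * x_0 + (1.0 - a_bar).sqrt() * eps

    # Drop the condition for a random fraction of the batch.
    drop = torch.rand(b, device=x_0.device) < p_uncond
    mask = drop.view(b, *([1] * (cond.dim() - 1)))
    cond_in = torch.where(mask, null_cond, cond)

    # The plain MSE objective discussed above.
    return F.mse_loss(eps_model(x_t, t, cond_in), eps)
```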

💡 Intuition

The equation blends the conditional and unconditional predictions:

  • $w = 0$ → unconditional generation (no conditioning influence)
  • $w = 1$ → normal conditional generation
  • $w > 1$ → stronger conditioning effect (e.g., tighter action or prompt adherence)
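
A quick way to see the three regimes is to sweep $w$ with the `cfg_prediction` sketch from above (all names hypothetical; 7.5 is just a commonly used strong-guidance value in Stable Diffusion-style samplers):

```python
# Sweep the guidance scale; each eps_hat would feed the usual
# DDPM/DDIM update that produces x_{t-1} from x_t.
for w in (0.0, 1.0, 7.5):  # unconditional / plain conditional / strongly guided
    eps_hat = cfg_prediction(eps_model, x_t, t, cond, null_cond, w)
```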