Gaussian Policy
Learned from OpenAI's Spinning Up: a simple Gaussian policy
- Outputs only one mode (i.e., it’s unimodal).
- Has limited flexibility: the shape of the distribution is fixed to be Gaussian.
- Often uses diagonal covariance, so it can’t capture dependencies between action dimensions.
This paper discusses the problem: https://arxiv.org/pdf/2507.07986.
PPO, for example, typically learns a Gaussian policy for continuous control.
But there are more expressive policy classes (e.g., diffusion models and flow-matching models).
The paper contrasts this simple Gaussian policy class with expressive policies such as:
- Diffusion policies (which model action sequences as samples from a learned diffusion process)
- Flow-matching models (which can model complex, multimodal behaviors with learned transport maps)
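For intuition, below is a minimal sketch of a conditional flow-matching training loss with a linear (rectified) interpolant; the network, dimensions, and names are hypothetical, not taken from the paper.

```python
# Minimal flow-matching policy loss sketch (hypothetical network/dims).
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2

# Velocity network v_theta(x_t, t, obs) -> predicted velocity in action space.
velocity_net = nn.Sequential(
    nn.Linear(act_dim + 1 + obs_dim, 128), nn.ReLU(),
    nn.Linear(128, act_dim),
)

def flow_matching_loss(obs, expert_action):
    """Conditional flow-matching loss with a straight-line interpolant."""
    noise = torch.randn_like(expert_action)       # x_0 ~ N(0, I)
    t = torch.rand(expert_action.shape[0], 1)     # t ~ U(0, 1)
    x_t = (1 - t) * noise + t * expert_action     # point on the straight path
    target_velocity = expert_action - noise       # time derivative of the path
    pred = velocity_net(torch.cat([x_t, t, obs], dim=-1))
    return ((pred - target_velocity) ** 2).mean()

# Usage on a dummy batch:
obs = torch.randn(32, obs_dim)
expert_action = torch.randn(32, act_dim)
loss = flow_matching_loss(obs, expert_action)
```

At sampling time you would integrate the learned velocity field from noise to an action, which is what lets this class represent multimodal behavior that a single Gaussian cannot.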
In a discrete action space, you typically just use a softmax over logits to produce probabilities that sum to 1:
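A minimal sketch of that as a categorical policy head (observation/action sizes are made up):

```python
# Categorical policy head sketch (hypothetical dims).
import torch
import torch.nn as nn

obs_dim, n_actions = 8, 4
logits_net = nn.Linear(obs_dim, n_actions)   # logits over discrete actions

obs = torch.randn(32, obs_dim)
logits = logits_net(obs)
probs = torch.softmax(logits, dim=-1)        # each row sums to 1
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()                       # shape: (32,)
log_prob = dist.log_prob(action)             # log pi(a|s), shape: (32,)
```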
In a continuous action space, a Gaussian policy outputs a mean and a (log) standard deviation for each action dimension, and actions are sampled from that Normal distribution:
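A minimal sketch of a diagonal Gaussian policy head (sizes are made up; the state-independent log-std parameter mirrors Spinning Up's MLP Gaussian actor):

```python
# Diagonal Gaussian policy head sketch (hypothetical dims).
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
mean_net = nn.Linear(obs_dim, act_dim)               # state-dependent mean
log_std = nn.Parameter(-0.5 * torch.ones(act_dim))   # state-independent log std

obs = torch.randn(32, obs_dim)
mu = mean_net(obs)
dist = torch.distributions.Normal(mu, log_std.exp()) # diagonal covariance
action = dist.sample()                               # shape: (32, act_dim)
log_prob = dist.log_prob(action).sum(dim=-1)         # sum over action dims
```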
What loss do you use?
- In the behavioral cloning (BC) case:
  - For discrete actions: cross-entropy loss.
  - For continuous actions: still cross-entropy in spirit, i.e. the negative log-likelihood of the expert action under the Gaussian. With a diagonal covariance this sums the per-dimension log-likelihoods (one term per action dimension); with a fixed variance it reduces to MSE on the mean. See the sketch after this list.
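A minimal sketch of both BC losses (networks, sizes, and batches are made up):

```python
# Behavioral-cloning loss sketch (hypothetical nets/dims).
import torch
import torch.nn as nn

obs_dim, n_actions, act_dim = 8, 4, 2
obs = torch.randn(32, obs_dim)

# Discrete BC: cross-entropy between policy logits and expert action indices.
logits_net = nn.Linear(obs_dim, n_actions)
expert_idx = torch.randint(0, n_actions, (32,))
discrete_loss = nn.functional.cross_entropy(logits_net(obs), expert_idx)

# Continuous BC: negative log-likelihood of the expert action under the
# diagonal Gaussian; log-probs are summed over action dimensions.
mean_net = nn.Linear(obs_dim, act_dim)
log_std = nn.Parameter(torch.zeros(act_dim))
expert_act = torch.randn(32, act_dim)
dist = torch.distributions.Normal(mean_net(obs), log_std.exp())
continuous_loss = -dist.log_prob(expert_act).sum(dim=-1).mean()
# With a fixed unit variance this reduces (up to constants) to MSE on the mean.
```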