Gaussian Policy
Learned from OpenAI's Spinning Up: a simple Gaussian policy
- Outputs only one mode (i.e., it’s unimodal).
- Has limited flexibility: the shape of the distribution is fixed to be Gaussian.
- Often uses diagonal covariance, so it can’t capture dependencies between action dimensions.
This paper discusses the problem: https://arxiv.org/pdf/2507.07986.
PPO, for example, typically learns a Gaussian policy for continuous control.
But there are more expressive policy classes (e.g., diffusion models and flow-matching models).
The paper contrasts this simple Gaussian policy class with expressive policies such as:
- Diffusion policies (which model action sequences as samples from a learned diffusion process)
- Flow-matching models (which can model complex, multimodal behaviors with learned transport maps)
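For intuition, below is a minimal sketch of a conditional flow-matching training loss with a linear (rectified) interpolant; the network, dimensions, and names are hypothetical, not taken from the paper.

```python
# Minimal flow-matching policy loss sketch (hypothetical network/dims).
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2

# Velocity network v_theta(x_t, t, obs) -> predicted velocity in action space.
velocity_net = nn.Sequential(
    nn.Linear(act_dim + 1 + obs_dim, 128), nn.ReLU(),
    nn.Linear(128, act_dim),
)

def flow_matching_loss(obs, expert_action):
    """Conditional flow-matching loss with a straight-line interpolant."""
    noise = torch.randn_like(expert_action)       # x_0 ~ N(0, I)
    t = torch.rand(expert_action.shape[0], 1)     # t ~ U(0, 1)
    x_t = (1 - t) * noise + t * expert_action     # point on the straight path
    target_velocity = expert_action - noise       # time derivative of the path
    pred = velocity_net(torch.cat([x_t, t, obs], dim=-1))
    return ((pred - target_velocity) ** 2).mean()

# Usage on a dummy batch:
obs = torch.randn(32, obs_dim)
expert_action = torch.randn(32, act_dim)
loss = flow_matching_loss(obs, expert_action)
```

At sampling time you would integrate the learned velocity field from noise to an action, which is what lets this class represent multimodal behavior that a single Gaussian cannot.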
In a discrete action space, you typically just use a softmax over logits to produce probabilities that sum to 1:
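A minimal sketch of that as a categorical policy head (observation/action sizes are made up):

```python
# Categorical policy head sketch (hypothetical dims).
import torch
import torch.nn as nn

obs_dim, n_actions = 8, 4
logits_net = nn.Linear(obs_dim, n_actions)   # logits over discrete actions

obs = torch.randn(32, obs_dim)
logits = logits_net(obs)
probs = torch.softmax(logits, dim=-1)        # each row sums to 1
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()                       # shape: (32,)
log_prob = dist.log_prob(action)             # log pi(a|s), shape: (32,)
```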
In a continuous action space, a Gaussian policy outputs a mean and a (log) standard deviation for each action dimension, and actions are sampled from that Normal distribution:
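A minimal sketch of a diagonal Gaussian policy head (sizes are made up; the state-independent log-std parameter mirrors Spinning Up's MLP Gaussian actor):

```python
# Diagonal Gaussian policy head sketch (hypothetical dims).
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
mean_net = nn.Linear(obs_dim, act_dim)               # state-dependent mean
log_std = nn.Parameter(-0.5 * torch.ones(act_dim))   # state-independent log std

obs = torch.randn(32, obs_dim)
mu = mean_net(obs)
dist = torch.distributions.Normal(mu, log_std.exp()) # diagonal covariance
action = dist.sample()                               # shape: (32, act_dim)
log_prob = dist.log_prob(action).sum(dim=-1)         # sum over action dims
```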
What loss do you use?
- In the behavioral cloning (BC) case:
  - For discrete actions: cross-entropy loss.
  - For continuous actions: still cross-entropy in spirit, i.e. the negative log-likelihood of the expert action under the Gaussian. With a diagonal covariance this sums the per-dimension log-likelihoods (one term per action dimension); with a fixed variance it reduces to MSE on the mean. See the sketch after this list.
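A minimal sketch of both BC losses (networks, sizes, and batches are made up):

```python
# Behavioral-cloning loss sketch (hypothetical nets/dims).
import torch
import torch.nn as nn

obs_dim, n_actions, act_dim = 8, 4, 2
obs = torch.randn(32, obs_dim)

# Discrete BC: cross-entropy between policy logits and expert action indices.
logits_net = nn.Linear(obs_dim, n_actions)
expert_idx = torch.randint(0, n_actions, (32,))
discrete_loss = nn.functional.cross_entropy(logits_net(obs), expert_idx)

# Continuous BC: negative log-likelihood of the expert action under the
# diagonal Gaussian; log-probs are summed over action dimensions.
mean_net = nn.Linear(obs_dim, act_dim)
log_std = nn.Parameter(torch.zeros(act_dim))
expert_act = torch.randn(32, act_dim)
dist = torch.distributions.Normal(mean_net(obs), log_std.exp())
continuous_loss = -dist.log_prob(expert_act).sum(dim=-1).mean()
# With a fixed unit variance this reduces (up to constants) to MSE on the mean.
```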