🛠️ Steven Gong


Feb 11, 2026, 1 min read

Reinforcement Learning

Entropy-Regularized Reinforcement Learning

In standard reinforcement learning, the agent maximizes the expected return:

$$\max_{\pi} \; \mathbb{E}_{\pi}\left[ \sum_{t} r(s_t, a_t) \right]$$

In entropy-regularized reinforcement learning, the objective becomes:

$$\max_{\pi} \; \mathbb{E}_{\pi}\left[ \sum_{t} \Big( r(s_t, a_t) + \alpha H(\pi(\cdot \mid s_t)) \Big) \right]$$

where:

$H(\pi(\cdot \mid s_t)) = -\sum_{a} \pi(a \mid s_t) \log \pi(a \mid s_t)$ is the entropy of the policy at state $s_t$
$\alpha$ is a temperature coefficient that controls the trade-off between reward and entropy
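
To make the two definitions concrete, here is a minimal sketch that evaluates the entropy term and the regularized sum for one sampled trajectory with a discrete action set. The function names, the trajectory values, and the choice of `alpha` are illustrative assumptions, not part of the original note.

```python
import math

def policy_entropy(probs):
    """Shannon entropy H(pi(.|s)) = -sum_a pi(a|s) log pi(a|s), in nats."""
    # Zero-probability actions contribute 0 (the limit of p*log p as p -> 0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_regularized_return(rewards, action_probs, alpha):
    """Sum over t of r_t + alpha * H(pi(.|s_t)) for one sampled trajectory."""
    return sum(r + alpha * policy_entropy(p)
               for r, p in zip(rewards, action_probs))

# Hypothetical two-step trajectory with two actions per state.
rewards = [1.0, 0.0]
action_probs = [[0.5, 0.5], [1.0, 0.0]]  # uniform policy, then deterministic

# alpha = 0 recovers the standard (unregularized) return.
plain_return = entropy_regularized_return(rewards, action_probs, alpha=0.0)
# alpha > 0 adds a bonus only where the policy is still stochastic.
regularized_return = entropy_regularized_return(rewards, action_probs, alpha=0.1)
```

With `alpha=0.0` the regularized sum reduces to the plain return, and the deterministic step contributes no entropy bonus at any `alpha`.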

Controlling exploration

You can think of this as a way to get exploration when working with a continuous action space: the entropy term rewards the policy for staying stochastic instead of collapsing onto a single action.
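
As a rough illustration for the continuous case, assume a one-dimensional Gaussian policy (an assumption made here, not something the note specifies). The sum in the entropy becomes an integral, and the resulting differential entropy has the closed form $\frac{1}{2} \log(2 \pi e \sigma^2)$, so the bonus grows with the standard deviation $\sigma$. The names and the value of `alpha` below are hypothetical.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of a 1-D Gaussian N(mu, sigma^2), in nats."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

alpha = 0.2  # hypothetical temperature coefficient

# A wider action distribution earns a larger entropy bonus, so the
# objective pushes back against the policy becoming near-deterministic.
for sigma in (0.1, 0.5, 1.0):
    bonus = alpha * gaussian_entropy(sigma)
    print(f"sigma={sigma}: entropy bonus = {bonus:.3f}")
```

Note that differential entropy can be negative for small `sigma`, so the bonus acts as a penalty on very narrow policies.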

Related

n-step Reinforcement Learning
