Policy
A policy is the agent’s behaviour: it is a map from states to actions.
Deterministic policy: $a = \pi(s)$
Stochastic policy: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
Definition
A policy $\pi$ is a mapping from states to probabilities of selecting each possible action.
- A policy fully defines the behaviour of an agent
- MDP policies depend on the current state (not the history)
- i.e. policies are stationary (time-independent): $A_t \sim \pi(\cdot \mid S_t)$ for all $t > 0$
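A minimal sketch of how the two kinds of policy can be represented in code (the toy MDP sizes and probabilities below are made up, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP with 3 states and 2 actions, purely for illustration.
n_states, n_actions = 3, 2

# Deterministic policy: a = pi(s), stored as a lookup table state -> action.
deterministic_pi = np.array([0, 1, 1])

# Stochastic policy: pi(a|s), a probability distribution over actions for each state.
stochastic_pi = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])
assert np.allclose(stochastic_pi.sum(axis=1), 1.0)  # each row must sum to 1

state = 2
print("deterministic action:", deterministic_pi[state])
print("sampled action:", rng.choice(n_actions, p=stochastic_pi[state]))
```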
A policy alone is not very interesting; we want to find the optimal policy. This is known as policy search. We have several options:
- The number of deterministic policies is $|\mathcal{A}|^{|\mathcal{S}|}$. So you could just try them exhaustively and take the policy with the highest value function, but that is not efficient (see the sketch after this list)
- Try Policy Iteration or Value Iteration using Dynamic Programming
- Look at Model-Free Control
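A minimal sketch of the exhaustive option on a made-up 2-state, 2-action MDP (the transition and reward numbers are invented, and policies are compared by their summed state values just to keep the sketch short):

```python
import itertools
import numpy as np

# Hypothetical MDP: P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
n_states, n_actions = R.shape

def evaluate(policy):
    """Exactly evaluate a deterministic policy by solving V = R_pi + gamma * P_pi V."""
    P_pi = np.array([P[s, policy[s]] for s in range(n_states)])
    R_pi = np.array([R[s, policy[s]] for s in range(n_states)])
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Exhaustive search over all |A|^|S| deterministic policies.
best_policy, best_value = None, -np.inf
for policy in itertools.product(range(n_actions), repeat=n_states):
    v = evaluate(policy)
    if v.sum() > best_value:  # crude comparison: total value over states
        best_policy, best_value = policy, v.sum()

print("best deterministic policy:", best_policy)
```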
Some other terminology
There are different types of policies. Make sure you know what each of these means.
- Epsilon-greedy policies, see [[notes/Model-Free Control#epsilon -Greedy Exploration|Model-Free Control#epsilon -Greedy Exploration]]
- Soft policy: all actions have a possibility of being explored, i.e.
	- $\pi(a \mid s) > 0$ for all $s \in \mathcal{S}$ and all $a \in \mathcal{A}(s)$
- $\epsilon$-soft policy is the combination of both definitions (a short sketch follows this list):
	- $\pi(a \mid s) \geq \frac{\epsilon}{|\mathcal{A}(s)|}$ for all $s \in \mathcal{S}$, all $a \in \mathcal{A}(s)$, and for $\epsilon > 0$
- Target policy: the policy we are learning about (the one being evaluated or improved)
- Behaviour policy: the policy used to generate the data; it differs from the target policy in off-policy learning
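A minimal sketch of building an $\epsilon$-greedy policy from a Q-table and checking it satisfies the $\epsilon$-soft condition (the Q-values are made up for illustration):

```python
import numpy as np

def epsilon_greedy(Q, epsilon=0.1):
    """Build an epsilon-greedy policy pi(a|s) from a Q-table of shape (n_states, n_actions)."""
    n_states, n_actions = Q.shape
    pi = np.full((n_states, n_actions), epsilon / n_actions)    # exploration mass
    pi[np.arange(n_states), Q.argmax(axis=1)] += 1.0 - epsilon  # extra mass on the greedy action
    return pi

# Made-up Q-values, purely illustrative.
Q = np.array([[1.0, 0.5, 0.2],
              [0.1, 0.9, 0.3]])
pi = epsilon_greedy(Q, epsilon=0.3)

# An epsilon-greedy policy is epsilon-soft: every action gets at least epsilon / |A(s)|.
assert np.all(pi >= 0.3 / Q.shape[1])
print(pi)
```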
Stochastic Policies
I was trying to use the simple case of beating Rock-Paper-Scissors with RL. The policy it came up with was deterministic, e.g. always choosing scissors. I stumbled upon this link: https://ai.stackexchange.com/questions/10450/can-q-learning-be-used-to-derive-a-stochastic-policy
“Value based methods provide no mechanism to learn a correct distribution”, so we need to look into Policy-Gradient methods.
Instead you need to look into policy gradient methods, where the policy function is learned directly and can be stochastic. The most basic policy gradient algorithm is REINFORCE, and variations on Actor-Critic such as A3C are quite popular.
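A minimal REINFORCE sketch with a softmax policy, here in the simplest one-state (bandit-style) setting; the reward function and step size are made up, and it omits baselines and multi-step returns:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3                    # e.g. rock / paper / scissors
theta = np.zeros(n_actions)      # policy parameters (logits)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reward(action):
    # Made-up reward function for illustration; replace with the real environment.
    return rng.normal(loc=[0.0, 0.2, 0.1][action], scale=1.0)

alpha = 0.05
for episode in range(5000):
    probs = softmax(theta)
    a = rng.choice(n_actions, p=probs)
    G = reward(a)                     # return of this one-step episode
    grad_log_pi = -probs              # gradient of log pi(a) for a softmax policy
    grad_log_pi[a] += 1.0
    theta += alpha * G * grad_log_pi  # REINFORCE update: theta += alpha * G * grad log pi

print("learned action probabilities:", softmax(theta))
```

Because the policy itself is a distribution over actions, it can remain stochastic rather than collapsing to a single greedy action the way a policy derived from Q-values does.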
Parametrized Policies
However, when you look at papers, you will more often than not see the notation $\pi_\theta$ rather than $\pi$.
We call $\pi_\theta$ a parametrized policy, because $\theta$ represents the learnable parameters of a function that maps states to actions. You’ll see this notation in Policy Gradient Methods.
This is kind of hard to wrap my head around; you can start by watching this series: https://www.youtube.com/watch?v=8LEuyYXGQjU&list=PLoROMvodv4rOSOPzutgyCTapiGlY2Nd8u&index=8&ab_channel=StanfordOnline.
- In tabular methods / non-parametric policy representations, we use the notation $\pi(a \mid s)$
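A minimal sketch of the contrast between the two notations (the one-hot state features and sizes are illustrative assumptions):

```python
import numpy as np

n_states, n_actions = 4, 2

# Tabular / non-parametric: pi(a|s) is stored explicitly, one probability per (s, a) pair.
pi_table = np.full((n_states, n_actions), 1.0 / n_actions)

# Parametrized: pi_theta(a|s) is computed from parameters theta; here a linear softmax
# over a one-hot state encoding, so theta has one logit per (state, action).
theta = np.zeros((n_states, n_actions))

def pi_theta(state):
    logits = theta[state]                # logits for this state
    z = np.exp(logits - logits.max())
    return z / z.sum()

print(pi_table[0], pi_theta(0))
```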