# Policy

A policy is the agent’s behaviour: a map from state to action.

Deterministic policy: $a=π(s)$

Stochastic policy: $π(a∣s)=P[A_{t}=a∣S_{t}=s]$

Definition

A policy $π$ is a mapping from states to probabilities of selecting each possible action $π(a∣s)=P[A_{t}=a∣S_{t}=s]$

- A policy fully defines the behaviour of an agent
- MDP policies depend on the current state (not the history)
- i.e. Policies are stationary (time-independent), $A_{t}∼π(⋅∣S_{t}),∀t>0$
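
To make the distinction between deterministic and stochastic policies concrete, here is a toy sketch in Python (the states, actions, and probabilities are all made up for illustration):

```python
import random

# Hypothetical toy state/action spaces, purely for illustration.
states = ["s0", "s1"]
actions = ["left", "right"]

# Deterministic policy: a = pi(s), a plain state -> action map.
det_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s].
stoch_policy = {
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.2, "right": 0.8},
}

def sample_action(policy, state):
    """Draw A_t ~ pi(.|S_t)."""
    probs = policy[state]
    return random.choices(list(probs), weights=list(probs.values()))[0]

a = sample_action(stoch_policy, "s0")  # "left" with probability 0.9
```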

A policy alone is not very interesting; we want to find the Optimal Policy. This is known as policy search. We have several options:

- The number of deterministic policies is $∣A∣^{∣S∣}$. So you could just try them all exhaustively and take the policy with the highest value function. But that is not efficient
- Try Policy Iteration or Value Iteration using Dynamic Programming
- Look at Model-Free Control
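
The exhaustive option can be sketched on a tiny made-up MDP (the transitions, rewards, and $γ$ here are invented for illustration, not from any real problem):

```python
import itertools

# Tiny hypothetical MDP: 2 states, 2 actions, deterministic transitions.
# P[s][a] = (next_state, reward); the numbers are made up.
P = {
    0: {"a": (0, 0.0), "b": (1, 1.0)},
    1: {"a": (0, 2.0), "b": (1, 0.0)},
}
states, actions, gamma = [0, 1], ["a", "b"], 0.9

def evaluate(policy, iters=200):
    """Iterative policy evaluation for a deterministic policy."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: P[s][policy[s]][1] + gamma * V[P[s][policy[s]][0]]
             for s in states}
    return V

# Exhaustive search over all |A|^|S| = 4 deterministic policies.
best = max(
    (dict(zip(states, choice))
     for choice in itertools.product(actions, repeat=len(states))),
    key=lambda pi: sum(evaluate(pi).values()),
)
# best cycles between the two rewarding transitions: {0: "b", 1: "a"}
```

Fine for 4 policies; hopeless for anything realistic, which is exactly why the iterative methods below exist.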

### Some other terminologies

There are different types of policies. Make sure you know what each of these means.

- Epsilon-greedy policies, see [[notes/Model-Free Control#epsilon -Greedy Exploration|Model-Free Control#epsilon -Greedy Exploration]]
- Soft Policy: every action has a nonzero probability of being explored, i.e.
    - $π(a∣s)>0$ for all $s∈S$ and all $a∈A(s)$
- $ϵ$-soft policy: combines the $ϵ$-greedy and soft definitions:
    - $π(a∣s) \geq \frac{ϵ}{∣A(s)∣}$ for all $s∈S$, all $a∈A(s)$, and some $ϵ>0$
- Target Policy: the policy we are learning or evaluating
- Behaviour Policy: the policy used to generate the data; when it differs from the target policy, we are learning off-policy
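
A quick sketch of how $ϵ$-greedy satisfies the $ϵ$-soft condition (the Q-values and $ϵ$ here are made up):

```python
def epsilon_greedy_probs(q_values, epsilon=0.1):
    """pi(a|s) for an epsilon-greedy (hence epsilon-soft) policy:
    every action gets at least epsilon/|A(s)|; the greedy action
    additionally gets the remaining 1 - epsilon probability mass."""
    n = len(q_values)
    greedy = max(q_values, key=q_values.get)
    return {a: epsilon / n + (1 - epsilon) * (a == greedy) for a in q_values}

# Hypothetical Q-values for one state.
q = {"left": 1.0, "right": 0.5}
probs = epsilon_greedy_probs(q, epsilon=0.2)
# probs["left"] = 0.9, probs["right"] = 0.1; both >= epsilon/|A(s)| = 0.1
```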

#### Stochastic Policies

I was trying to use the simple case of beating Rock-Paper-Scissors with RL. The policy it came up with was deterministic, e.g. always choosing scissors. I stumbled upon this link: https://ai.stackexchange.com/questions/10450/can-q-learning-be-used-to-derive-a-stochastic-policy

“Value based methods provide no mechanism to learn a correct distribution”, so we need to look into Policy-Gradient methods, where the policy function is learned directly and can be stochastic. The most basic policy gradient algorithm is REINFORCE, and Actor-Critic variations such as A3C are quite popular.
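
A minimal sketch of REINFORCE on repeated Rock-Paper-Scissors, assuming an opponent that counters our historically most frequent move (the opponent model, learning rate, and step count are all invented for illustration; a real treatment would be more careful):

```python
import math
import random

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # key beats value
COUNTER = {v: k for k, v in BEATS.items()}                          # action beating key

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reward(mine, theirs):
    if mine == theirs:
        return 0.0
    return 1.0 if BEATS[mine] == theirs else -1.0

theta = [0.0, 0.0, 0.0]   # softmax logits: the learnable policy parameters
counts = [1, 1, 1]        # opponent's running count of our action frequencies
alpha = 0.05
for _ in range(5000):
    probs = softmax(theta)
    a = random.choices(range(3), weights=probs)[0]
    # Opponent plays the counter to our most frequent action so far,
    # so any deterministic policy is exploitable; uniform play is optimal.
    opp = COUNTER[ACTIONS[counts.index(max(counts))]]
    counts[a] += 1
    r = reward(ACTIONS[a], opp)
    # REINFORCE update: grad of log pi(a) for a softmax is one_hot(a) - probs.
    for i in range(3):
        theta[i] += alpha * r * ((1.0 if i == a else 0.0) - probs[i])
```

The key point is the last loop: the parameters of the distribution itself get the gradient update, so the learned policy can stay stochastic, which tabular Q-learning cannot do.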

### Parametrized Policies

However, when you look at papers, you will more often than not see the notation $π_{θ}$, rather than plain $π$.

We call $π_{θ}$ a parametrized policy, because it represents a function with learnable parameters $θ$ that maps states to actions. You’ll see this notation in Policy Gradient Methods.

This is kind of hard to wrap my head around; you can start by watching this series: https://www.youtube.com/watch?v=8LEuyYXGQjU&list=PLoROMvodv4rOSOPzutgyCTapiGlY2Nd8u&index=8&ab_channel=StanfordOnline.

- In tabular methods / non-parametric policy representations, we use the notation $π$
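
A parametrized policy $π_{θ}$ can be sketched as a softmax over a linear score $θ^{⊤}ϕ(s,a)$ (the feature function and numbers below are illustrative, not from any particular paper):

```python
import math

# Hypothetical feature function phi(s, a) and learnable parameters theta.
def phi(state, action):
    return [state * action, 1.0 if action == 1 else 0.0]

theta = [0.5, -0.2]

def pi_theta(state, actions=(0, 1)):
    """pi_theta(a|s): softmax over the linear scores theta . phi(s, a)."""
    scores = [sum(t * f for t, f in zip(theta, phi(state, a)))
              for a in actions]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return {a: e / z for a, e in zip(actions, exps)}

probs = pi_theta(2.0)  # a distribution over actions, differentiable in theta
```

Contrast with the tabular case: here nothing is stored per state; changing the two numbers in $θ$ changes the action distribution at every state at once.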