# $ϵ$-Greedy

Very simple method to ensure that we are always exploring. An $ϵ$-greedy policy is a simple method to ensure that we are continually exploring. $π(a∣s)={1−ϵϵ a=argmaxQ(s,a)otherwise $ -> Choose the greedy action with probability $1−ϵ$, and choose a random action with probability $ϵ$. (might want to divide by $m−1$ if there are $m$ actions) so that total probability = 1.

- Usually, $ϵ$ is a small number like 0.1

Greedy in the Limit of Infinite Exploration

```
title: $\epsilon$-Greedy Policy Improvement
For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $π'$ with respect to $q_π$ is an improvement, $v_{π'}(s) \geq v_π(s)$.
```

Pseudocode for MAB Algorithm https://medium.com/analytics-vidhya/multi-armed-bandits-part-1-epsilon-greedy-algorithm-with-python-code-534b9e2abc9

```
Choose epsilon; # exploration probability
Choose n; # number of iterations
for i = 1 to n do:
p = pick a random number from 0 to 1
if p < epsilon:
current_bandit = pick bandit at random # explore
else:
current_bandit = pick best bandit so far # exploit
reward = current_bandit.pull_arm()
Update estimation for what rewards the current bandit gives
```

### Cons

Epsilon-greedy has linear regret, since it continues to explore the set of all actions even long after it has gained sufficient knowledge to know which of these actions are bad actions to take.