$ϵ$ -Greedy

Very simple method to ensure that we are always exploring. An $ϵ$ -greedy policy is a simple method to ensure that we are continually exploring. $π (a ∣ s) = {\frac{ϵ}{∣ A ∣} + 1 - ϵ \frac{ϵ}{∣ A ∣} a = argmax Q (s, a) otherwise, where 0 \leq ϵ \leq 1$

Usually, $ϵ$ is a small number like 0.1

Greedy in the Limit of Infinite Exploration

$\epsilon$ -Greedy Policy Improvement

For any $ϵ$ -greedy policy $π$ , the $ϵ$ -greedy policy $π^{'}$ with respect to $q_{π}$ is an improvement, $v_{π^{'}} (s) \geq v_{π} (s)$ .

Pseudocode for MAB Algorithm https://medium.com/analytics-vidhya/multi-armed-bandits-part-1-epsilon-greedy-algorithm-with-python-code-534b9e2abc9

Choose epsilon; # exploration probability
Choose n; # number of iterations
for i = 1 to n do:
    p = pick a random number from 0 to 1 
    if p < epsilon:
        current_bandit = pick bandit at random  # explore
  else:
       current_bandit = pick best bandit so far  # exploit
  reward = current_bandit.pull_arm()
 
  Update estimation for what rewards the current bandit gives

Cons

Epsilon-greedy has linear regret, since it continues to explore the set of all actions even long after it has gained sufficient knowledge to know which of these actions are bad actions to take.

Suffers from Maximization Bias.

🛠️ Steven Gong

Table of Contents

$ϵ$ -Greedy

Cons

Graph View

Backlinks

🛠️ Steven Gong

Table of Contents

ϵ-Greedy

Cons

Related

Graph View

Backlinks

$ϵ$ -Greedy