Very simple method to ensure that we are always exploring. An -greedy policy is a simple method to ensure that we are continually exploring. Choose the greedy action with probability , and choose a random action with probability . (might want to divide by if there are actions) so that total probability = 1.

  • Usually, is a small number like 0.1

Greedy in the Limit of Infinite Exploration

title: $\epsilon$-Greedy Policy Improvement
For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $π'$ with respect to $q_π$ is an improvement, $v_{π'}(s) \geq v_π(s)$.

Pseudocode for MAB Algorithm https://medium.com/analytics-vidhya/multi-armed-bandits-part-1-epsilon-greedy-algorithm-with-python-code-534b9e2abc9

Choose epsilon; # exploration probability
Choose n; # number of iterations
for i = 1 to n do:
    p = pick a random number from 0 to 1 
    if p < epsilon:
        current_bandit = pick bandit at random  # explore
       current_bandit = pick best bandit so far  # exploit
  reward = current_bandit.pull_arm()

  Update estimation for what rewards the current bandit gives


Epsilon-greedy has linear regret, since it continues to explore the set of all actions even long after it has gained sufficient knowledge to know which of these actions are bad actions to take.