-Greedy
Very simple method to ensure that we are always exploring. An -greedy policy is a simple method to ensure that we are continually exploring. → Choose the greedy action with probability , and choose a random action with probability . (might want to divide by if there are actions) so that total probability = 1.
- Usually, is a small number like 0.1
Greedy in the Limit of Infinite Exploration
Pseudocode for MAB Algorithm https://medium.com/analytics-vidhya/multi-armed-bandits-part-1-epsilon-greedy-algorithm-with-python-code-534b9e2abc9
Choose epsilon; # exploration probability
Choose n; # number of iterations
for i = 1 to n do:
p = pick a random number from 0 to 1
if p < epsilon:
current_bandit = pick bandit at random # explore
else:
current_bandit = pick best bandit so far # exploit
reward = current_bandit.pull_arm()
Update estimation for what rewards the current bandit gives
Cons
Epsilon-greedy has linear regret, since it continues to explore the set of all actions even long after it has gained sufficient knowledge to know which of these actions are bad actions to take.