Multi-Armed Bandit

Thompson Sampling

Thompson Sampling is a heuristic for choosing actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem.

Other bandit algorithms select their actions based on the current averages of the rewards received from those actions.

However, rather than just refining an estimate of the mean reward, Thompson Sampling builds up a probability model from the obtained rewards, and then samples from this to choose an action.

In this way, not only is an increasingly accurate estimate of the possible reward obtained, but the model also provides a level of confidence in this reward, and this confidence increases as more samples are collected. This process of updating your beliefs as more evidence becomes available is known as Bayesian Inference.

There are 2 types for Thompson Sampling:

From the medium article.

Gaussian Thompson Socket

Sort of an incremental update

  • is the Precision of the actual data
  • is the number of times a particular arm has been tested
  • is the output received at each sample of this arm (equivalent to the reward that we’ve used up till now)
  • is the estimated mean (the mean of the distribution used to model the output)
  • is the total precision of the distribution used to model the output

From Ericsson AI project, we are taking some slight modifications, by capping and to ensure that we are constantly exploring.

class GaussianThompsonSocket(PowerSocket):
    def __init__(self, q):                
        self.τ_0 = 0.0001  # the posterior precision
        self.μ_0 = 1       # the posterior mean
        # pass the true reward value to the base PowerSocket             
    def sample(self):
        """ return a value from the the posterior normal distribution """
        return (np.random.randn() / np.sqrt(self.τ_0)) + self.μ_0    
    def update(self,R):
        """ update this socket after it has returned reward value 'R' """   
        # do a standard update of the estimated mean
        # update the mean and precision of the posterior
        self.μ_0 = ((self.τ_0 * self.μ_0) + (self.n * self.Q))/(self.τ_0 + self.n)        
        self.τ_0 += 1      

High level Pseudocode of the algorithm

  1. sample: Sample from your estimation. We initially start with all the arms with a similar probability.
    • We do this by sampling each arm and choosing the highest valued sample out of all arms
  2. playArm: Play this optimal arm according to yourself (the arm with the highest valued sample)
  3. update: Update for this particular arm so you get a mean closer to the actual value. Idea is that every time you are wrong (because the reward is higher), your update is going to get closer and closer to the real value.

Thompson Sampling Regret

The per-period regret of an algorithm over a time period is the difference between the mean reward of an optimal action and the action selected by the algorithm.

Bernoulli Thompson Sampling

This is what is used in the Ericsson AI project.

Ericsson AI Project for Tracking. This is to solve non-stationary problem

Weight decay facto, see Multi-Armed Bandit for details.


Other Resources

David Silver: “TS is sample-based probability matching”.