# Thompson Sampling

Thompson Sampling is a heuristic for choosing actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem.

Other bandit algorithms select their actions based on the current averages of the rewards received from those actions.

However, rather than just refining an estimate of the mean reward, Thompson Sampling builds up a probability model from the obtained rewards, and then samples from this to choose an action.

In this way, not only is an increasingly accurate estimate of the possible reward obtained, but the model also provides a level of confidence in this reward, and this confidence increases as more samples are collected. This process of updating your beliefs as more evidence becomes available is known as Bayesian Inference.

There are 2 types for Thompson Sampling:

- Gaussian Thompson Sampling (based on Gaussian Distribution)
- Bernoulli Thompson Sampling (based on Bernoulli Distribution)

From the medium article.

### Gaussian Thompson Socket

Sort of an incremental update $τ_{0}←τ_{0}+nτ$ $μ_{0}←τ_{0}+nττ_{0}μ_{0}+τ∑_{i=1}x_{i} $

- $τ$ is the Precision of the actual data
- $n$ is the number of times a particular arm has been tested
- $x_{i}$ is the output received at each sample $i$ of this arm (equivalent to the reward $R_{i}$ that we’ve used up till now)
- $μ_{0}$ is the estimated mean (the mean of the distribution used to model the output)
- $τ_{0}$ is the total precision of the distribution used to model the output

From Ericsson AI project, we are taking some slight modifications, by capping $τ_{0}$ and $N_{i}$ to ensure that we are constantly exploring.

High level Pseudocode of the algorithm

`sample`

: Sample from your estimation. We initially start with all the arms with a similar probability.- We do this by sampling each arm and choosing the highest valued sample out of all arms

`playArm`

: Play this optimal arm according to yourself (the arm with the highest valued sample)`update`

: Update for this particular arm so you get a mean closer to the actual value. Idea is that every time you are wrong (because the reward is higher), your update is going to get closer and closer to the real value.

### Thompson Sampling Regret

The per-period regret of an algorithm over a time period $t$ is the difference between the mean reward of an optimal action and the action selected by the algorithm.

#### Bernoulli Thompson Sampling

This is what is used in the Ericsson AI project.

### Ericsson AI Project for Tracking. This is to solve non-stationary problem

Weight decay facto, see Multi-Armed Bandit for details. $1−αN_{i} $

### Concepts

Other Resources

- https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf
- Multi-Armed Bandits: Part 5 | by Steve Roberts | Towards Data Science
- Multi-Armed Bandits
- For unknown mean and unknown variance (more complicated and computationally exhaustive): https://towardsdatascience.com/thompson-sampling-using-conjugate-priors-e0a18348ea2d

David Silver: “TS is sample-based probability matching”.