# Multi-Armed Bandit (MAB)

The simplest form of reinforcement learning problem.

This was an introductory chapter to RL, used to explore the exploration-vs-exploitation trade-off.

This article is a good brush up for the math behind the MAB algorithms: https://lilianweng.github.io/posts/2018-01-23-multi-armed-bandit/#exploitation-vs-exploration

Popular MAB algorithms, grouped by the idea each uses to encourage exploration:

- Random exploration
  - Epsilon-Greedy
  - Softmax
  - Gaussian Noise (in the continuous domain)
- Optimism in the face of uncertainty
  - Optimistic Greedy
  - Upper Confidence Bound (UCB)
  - Thompson Sampling (uses probability matching)
- Information state space (consider the agent's information as part of its state, and look ahead to see how information helps reward); this essentially transforms the bandit problem back into an MDP
  - Gittins indices
  - Bayes-adaptive MDPs
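As an illustration of the random-exploration family, here is a minimal epsilon-greedy action selector (a sketch; the function name and signature are my own, not from any library):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random arm (explore);
    otherwise pick the arm with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Softmax replaces the hard argmax with sampling proportional to $e^{Q(a)/\tau}$, while UCB and Thompson Sampling instead bias the choice toward uncertain arms.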

To compare the performance of various bandit algorithms, conduct a parameter study: sweep each algorithm over a range of its key parameter and compare the average reward.
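A minimal sketch of such a parameter study for epsilon-greedy on a Gaussian bandit (the helper name, arm means, and epsilon grid are all illustrative choices of mine):

```python
import random

def run_bandit(epsilon, true_means, steps=2000, seed=0):
    """Average reward of epsilon-greedy on a Gaussian bandit (unit-variance rewards)."""
    rng = random.Random(seed)
    k = len(true_means)
    q = [0.0] * k  # value estimates
    n = [0] * k    # pull counts
    total = 0.0
    for _ in range(steps):
        a = rng.randrange(k) if rng.random() < epsilon else max(range(k), key=lambda i: q[i])
        r = rng.gauss(true_means[a], 1.0)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]  # incremental mean update
        total += r
    return total / steps

# The parameter study itself: one curve point per epsilon value.
results = {eps: run_bandit(eps, [0.1, 0.5, 0.9]) for eps in (0.0, 0.01, 0.1, 0.5)}
```

Plotting average reward against the parameter (here epsilon) gives the parameter-study curves used to compare algorithms.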

**Resources to look into**

- Go back home and review k-bandits from CS239 or something, the first chapter of the book.

### For the non-stationary problem (also known as "Concept Drift"), we have

Some papers

- https://proceedings.neurips.cc/paper/2014/file/903ce9225fca3e988c2af215d4e544d3-Paper.pdf
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8004723/pdf/entropy-23-00380.pdf

Incremental implementation. This is very common; it is essentially the incremental mean.

The initial equation is $Q_{n}=\left(1-\frac{1}{N_{i}}\right)Q_{n-1}+\frac{1}{N_{i}}x_{i}$
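The incremental-mean update can be sketched in one line, rearranged into the "old estimate plus a step toward the new sample" form (names are mine):

```python
def incremental_mean(q_prev, x, n):
    """Q_n = (1 - 1/N_i) * Q_{n-1} + (1/N_i) * x,
    rearranged as Q_{n-1} + (x - Q_{n-1}) / N_i."""
    return q_prev + (x - q_prev) / n

# Running the update over a stream reproduces the plain average.
q = 0.0
for n, x in enumerate([1.0, 2.0, 3.0], start=1):
    q = incremental_mean(q, x, n)
# q is now 2.0, the mean of the stream
```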

We used two approaches:

- Cap $\frac{1}{N_{i}}$ at $\frac{1}{\alpha}$, where $\alpha=100$ for example, so the step size never shrinks below a constant
- Use a constant step size $\alpha$ (a weight-decay factor: the influence of older samples decays geometrically)

We know that $Q_{n}=\frac{\sum_{i}x_{i}}{N}$; with a general step size $\alpha_{N_{i}}$ in place of $\frac{1}{N_{i}}$, the update becomes $Q_{n}=(1-\alpha_{N_{i}})Q_{n-1}+\alpha_{N_{i}}x_{i}$.
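A sketch of the two approaches on a target that jumps halfway through, contrasted with the plain $\frac{1}{N_{i}}$ mean (the function names and the jump scenario are illustrative):

```python
def update(q_prev, x, step):
    """General update: Q_n = (1 - step) * Q_{n-1} + step * x."""
    return (1.0 - step) * q_prev + step * x

def capped_step(n, alpha=100):
    """Approach 1: use 1/N_i, but never let it fall below 1/alpha."""
    return max(1.0 / n, 1.0 / alpha)

q_mean = q_capped = q_const = 0.0
for n in range(1, 2001):
    x = 0.0 if n <= 1000 else 1.0                   # the true mean jumps from 0 to 1
    q_mean = update(q_mean, x, 1.0 / n)             # plain incremental mean
    q_capped = update(q_capped, x, capped_step(n))  # approach 1: capped step size
    q_const = update(q_const, x, 0.1)               # approach 2: constant step size
# q_mean is stuck near 0.5, while q_capped and q_const track the new mean
```

The plain mean weights all samples equally and ends up halfway between the two regimes; both modified step sizes keep reacting to recent samples and follow the drift.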