# Multi-Armed Bandit (MAB)

The simplest form of a Reinforcement Learning problem.

**Problem Formulation**

Imagine you’re in a casino with a row of slot machines, often referred to as “one-armed bandits.” Each machine has a different probability of paying out a reward when you pull its lever. However, you don’t know what these probabilities are ahead of time. Your goal is to find out which machine gives the best rewards over time by trying them out.

This was an introductory chapter to RL, used to explore the exploration vs. exploitation problem.
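The setup above can be sketched as a tiny simulator (a sketch assuming Bernoulli payouts; the class name and the probabilities are made up for illustration):

```python
import random

class BernoulliBandit:
    """A row of slot machines; arm i pays 1 with hidden probability probs[i]."""

    def __init__(self, probs):
        self.probs = probs  # the agent is not allowed to look at these

    def pull(self, arm):
        # Reward is 1 with probability probs[arm], else 0.
        return 1 if random.random() < self.probs[arm] else 0

# Illustrative payout probabilities; unknown to the agent.
bandit = BernoulliBandit([0.2, 0.5, 0.75])
reward = bandit.pull(2)  # pull the lever of machine 2
```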

**Resources**

- This article is a good brush up for the math behind the MAB algorithms: https://lilianweng.github.io/posts/2018-01-23-multi-armed-bandit/#exploitation-vs-exploration

Popular MAB algorithms, based on different ideas for encouraging exploration:

- Random exploration
    - Epsilon-Greedy
    - Softmax
    - Gaussian noise (in the continuous domain)
- Optimism in the face of uncertainty
    - Optimistic Greedy
    - Upper Confidence Bound (UCB)
    - Thompson Sampling (uses probability matching)
- Information state space (consider the agent’s information as part of its state, and look ahead to see how information helps reward); this essentially transforms the bandit problem back into an MDP
    - Gittins indices
    - Bayes-adaptive MDPs
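As an example from the first family, a minimal epsilon-greedy sketch (function and parameter names are my own; it assumes any `bandit_pull(arm) -> reward` callable):

```python
import random

def epsilon_greedy(bandit_pull, n_arms, epsilon=0.1, steps=1000):
    """Explore a random arm with probability epsilon, else exploit the best estimate."""
    Q = [0.0] * n_arms  # estimated value of each arm
    N = [0] * n_arms    # pull counts
    total = 0
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                    # explore
        else:
            arm = max(range(n_arms), key=lambda a: Q[a])      # exploit
        r = bandit_pull(arm)
        N[arm] += 1
        Q[arm] += (r - Q[arm]) / N[arm]  # incremental mean update
        total += r
    return Q, total
```

Called with any reward function, it returns the per-arm value estimates and the total reward collected.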

To compare the performance of various bandit algorithms, conduct a parameter study.
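A minimal sketch of such a parameter study, sweeping epsilon for epsilon-greedy on a made-up Bernoulli bandit (all names and numbers are illustrative):

```python
import random

def parameter_study(epsilons, runs=20, steps=300):
    """Average total reward of epsilon-greedy over several runs, per epsilon."""
    probs = [0.2, 0.5, 0.75]  # hidden payout probabilities (made up)
    results = {}
    for eps in epsilons:
        totals = []
        for _ in range(runs):
            Q = [0.0] * len(probs)
            N = [0] * len(probs)
            total = 0
            for _ in range(steps):
                if random.random() < eps:
                    arm = random.randrange(len(probs))             # explore
                else:
                    arm = max(range(len(probs)), key=lambda a: Q[a])  # exploit
                r = 1 if random.random() < probs[arm] else 0
                N[arm] += 1
                Q[arm] += (r - Q[arm]) / N[arm]
                total += r
            totals.append(total)
        results[eps] = sum(totals) / runs  # average over runs
    return results
```

Plotting `results` for a range of epsilons gives the familiar parameter-study curve, with performance peaking at intermediate exploration rates.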

**Resources to look into**

- Go back home and review k-armed bandits from CS239 or something, the first chapter of the book.

### For the non-stationary problem (also known as “concept drift”)

Some papers:

- https://proceedings.neurips.cc/paper/2014/file/903ce9225fca3e988c2af215d4e544d3-Paper.pdf
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8004723/pdf/entropy-23-00380.pdf

Incremental implementation. This is very common, and works just like an incremental mean.

The initial equation is $Q_{n}=\left(1-\frac{1}{N_{i}}\right)Q_{n-1}+\frac{1}{N_{i}}x_{i}$
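This update can be checked against the plain mean computed over all samples (a small sketch; the function name is mine):

```python
def incremental_mean(samples):
    """Running mean Q updated one sample at a time, without storing history."""
    Q, n = 0.0, 0
    for x in samples:
        n += 1
        Q += (x - Q) / n  # equivalent to Q_n = (1 - 1/n) Q_{n-1} + (1/n) x_n
    return Q
```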

We used two approaches:

- Cap $\frac{1}{N_{i}}$ at $\frac{1}{\alpha}$, where $\alpha=100$ for example
- Use the form $\frac{\alpha}{N_{i}}$ (weight decay factor)

We know that $Q_{n}=\frac{\sum x_{i}}{N}$. The modified update becomes $Q_{n}=\left(1-\frac{\alpha}{N_{i}}\right)Q_{n-1}+\frac{\alpha}{N_{i}}x_{i}$
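To see why the sample-average step $\frac{1}{N}$ struggles with concept drift, a sketch comparing it against a constant step size $\frac{1}{\alpha}$ with $\alpha=100$, in the spirit of the capping approach (all names and numbers are illustrative):

```python
import random

def track_drifting_arm(step_size=None, steps=2000):
    """Estimate the value of one arm whose payout probability drifts mid-run.

    step_size=None uses the sample-average step 1/N (slow to adapt);
    a constant step_size (e.g. 1/alpha with alpha=100) tracks the drift.
    """
    p = 0.9          # arm starts good...
    Q, N = 0.0, 0
    for t in range(steps):
        if t == steps // 2:
            p = 0.1  # ...then drifts to bad halfway through
        r = 1 if random.random() < p else 0
        N += 1
        alpha = (1 / N) if step_size is None else step_size
        Q += alpha * (r - Q)
    return Q  # true value at the end is 0.1
```

With the sample-average step the final estimate stays near the all-time mean (about 0.5 here), while the constant step forgets old data and ends near the current value.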