Multi-Armed Bandit (MAB)

The simplest form of a Reinforcement Learning problem.

Problem Formulation

Imagine you’re in a casino with a row of slot machines, often referred to as “one-armed bandits.” Each machine has a different probability of paying out a reward when you pull its lever. However, you don’t know what these probabilities are ahead of time. Your goal is to find out which machine gives the best rewards over time by trying them out.

This is an introductory topic in RL, used to explore the Exploration vs. Exploitation problem.

Resources

Popular MAB algorithms, based on different ideas for encouraging exploration:

  • Random Exploration (e.g. $\epsilon$-greedy; see the sketch after this list)
  • Optimism in the face of uncertainty (e.g. UCB)
  • Information State Space (consider the agent’s information as part of its state, and look ahead to see how that information helps future reward); this essentially transforms the bandit problem back into an MDP problem
    • Gittins indices
    • Bayes-adaptive MDPs
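
A minimal sketch contrasting the first two ideas on a Bernoulli bandit: $\epsilon$-greedy for random exploration and UCB1 for optimism in the face of uncertainty. The environment, arm probabilities, and parameter values here are illustrative assumptions, not from the source.

```python
import math
import random

def epsilon_greedy_action(q, counts, epsilon=0.1):
    """Random exploration: with probability epsilon pick a random arm."""
    if random.random() < epsilon:
        return random.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])

def ucb1_action(q, counts, t):
    """Optimism in the face of uncertainty: add an exploration bonus to each estimate."""
    for a, n in enumerate(counts):
        if n == 0:          # try every arm at least once first
            return a
    return max(range(len(q)),
               key=lambda a: q[a] + math.sqrt(2 * math.log(t) / counts[a]))

def run(select_action, probs, steps=10_000):
    """Play one bandit problem and return the average reward per step."""
    q = [0.0] * len(probs)       # value estimates Q(a)
    counts = [0] * len(probs)    # pull counts N(a)
    total = 0.0
    for t in range(1, steps + 1):
        a = select_action(q, counts, t)
        r = 1.0 if random.random() < probs[a] else 0.0   # Bernoulli reward
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]                   # incremental mean update
        total += r
    return total / steps

probs = [0.2, 0.5, 0.7]          # true payout probabilities, unknown to the agent
print("epsilon-greedy:", run(lambda q, c, t: epsilon_greedy_action(q, c), probs))
print("UCB1:", run(ucb1_action, probs))
```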

To compare the performance of various bandit algorithms, conduct a Parameter Study, as sketched below.
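
For instance, a parameter study for $\epsilon$-greedy could sweep $\epsilon$ and compare the long-run average reward. This reuses the `run` and `epsilon_greedy_action` helpers from the sketch above; the swept values are illustrative.

```python
# Sweep the exploration parameter and compare average reward per step.
for epsilon in [0.0, 0.01, 0.1, 0.3]:
    avg = run(lambda q, c, t, e=epsilon: epsilon_greedy_action(q, c, epsilon=e), probs)
    print(f"epsilon={epsilon:<4} average reward={avg:.3f}")
```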

Resources to look into

  • Go back home and review k-bandits from CS239 or something, the first chapter of the book.

For the non-stationary problem (also known as “Concept Drift”), we have:

Some papers

Incremental implementation. This is very common and works like the Incremental Mean.

The initial equation is $Q_{n+1} = Q_n + \frac{1}{n}\,(R_n - Q_n)$, i.e. the incremental mean of the rewards seen so far.

We used two approaches:

  1. Cap the step size $\frac{1}{n}$ at a constant $\alpha$, where for example $\alpha$ is a small fixed value
  2. Use the form $Q_{n+1} = Q_n + \alpha\,(R_n - Q_n)$ with a constant $\alpha$ (weight decay factor)

We know that expanding this recursion gives $Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha\,(1-\alpha)^{n-i} R_i$, so recent rewards get exponentially more weight than older ones, which is what we want when the reward distribution drifts.
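
A minimal sketch of the incremental update with a constant step size versus the sample-average ($\frac{1}{n}$) step size on a drifting reward. The drift model, $\alpha = 0.1$, and class/variable names are illustrative assumptions.

```python
import random

class ActionValueEstimate:
    """Incremental estimate of one arm's value Q."""

    def __init__(self, alpha=None):
        # alpha=None -> sample-average (1/n) step size; otherwise constant step size
        self.alpha = alpha
        self.q = 0.0   # current estimate Q_n
        self.n = 0     # number of rewards seen

    def update(self, reward):
        self.n += 1
        step = self.alpha if self.alpha is not None else 1.0 / self.n
        # Q_{n+1} = Q_n + step * (R_n - Q_n)
        self.q += step * (reward - self.q)
        return self.q


# Illustrative non-stationary reward: the true mean drifts slowly over time.
true_mean = 1.0
sample_avg = ActionValueEstimate()            # 1/n step size
const_step = ActionValueEstimate(alpha=0.1)   # constant step size (weight decay)

for t in range(1000):
    true_mean += random.gauss(0, 0.01)        # slow concept drift
    r = random.gauss(true_mean, 1.0)
    sample_avg.update(r)
    const_step.update(r)

print(f"true mean ~ {true_mean:.2f}, "
      f"1/n estimate {sample_avg.q:.2f}, "
      f"alpha=0.1 estimate {const_step.q:.2f}")
```

The constant step size tracks the drifting mean, while the sample average gives old rewards the same weight as new ones and lags behind.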