Multi-Armed Bandit (MAB)
The simplest form of a Reinforcement Learning problem.
It serves as an introduction to RL, used to explore the exploration vs. exploitation problem.
This article is a good brush up for the math behind the MAB algorithms: https://lilianweng.github.io/posts/2018-01-23-multi-armed-bandit/#exploitation-vs-exploration
Popular MAB algorithms, based on different ideas of ways to encourage exploration:
- Random Exploration
- Optimism in the face of uncertainty
- Information State Space (consider the agent's information as part of its state, and look ahead to see how gathering information helps future reward); this essentially transforms the bandit problem back into an MDP
- Gittins indices
- Bayes-adaptive MDPs
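As an illustration of "optimism in the face of uncertainty", here is a minimal UCB1 sketch on a toy Bernoulli bandit. The arm means, horizon, and seed are my own toy choices, not from the notes:

```python
import math
import random

def ucb1(true_means, steps=5000, seed=0):
    """UCB1: pick the arm maximizing Q(a) + sqrt(2 ln t / N(a))."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k      # N(a): number of pulls per arm
    values = [0.0] * k    # Q(a): sample-average reward estimates
    total = 0.0
    for t in range(1, steps + 1):
        if t <= k:
            a = t - 1  # pull each arm once to initialize
        else:
            # optimism: add an uncertainty bonus that shrinks as an arm is pulled more
            a = max(range(k),
                    key=lambda i: values[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = 1.0 if rng.random() < true_means[a] else 0.0  # Bernoulli reward
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]  # incremental sample mean
        total += r
    return total / steps  # average reward over the run
```

The uncertainty bonus makes rarely pulled arms look optimistic, so they keep getting tried until their estimates are trustworthy.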
To compare the performance of various bandit algorithms, conduct a Parameter Study.
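A minimal parameter-study sketch, assuming ε-greedy (random exploration) on toy Bernoulli arms; the swept ε values, arm means, and horizon are illustrative choices, not from the notes:

```python
import random

def eps_greedy_avg_reward(eps, true_means, steps=3000, seed=0):
    """Average reward of epsilon-greedy with sample-average Q estimates."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    values = [0.0] * k   # Q(a) estimates
    total = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.randrange(k)                          # explore
        else:
            a = max(range(k), key=lambda i: values[i])    # exploit
        r = 1.0 if rng.random() < true_means[a] else 0.0  # Bernoulli reward
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]          # incremental mean
        total += r
    return total / steps

# Parameter study: sweep the exploration rate and compare long-run average reward.
arms = [0.2, 0.5, 0.8]
results = {eps: eps_greedy_avg_reward(eps, arms) for eps in [0.0, 0.01, 0.1, 0.5]}
```

In a real study one would average each setting over many seeds and plot average reward against the parameter value.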
Resources to look into
- Go back home and review k-armed bandits from CS239 or something, the first chapter of the book.
For the non-stationary problem (also known as "Concept Drift"), we use an incremental implementation. This is very common, much like the Incremental Mean.
The initial equation is the incremental sample-mean update:
Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
We used two approaches:
- Cap the step size 1/n to a constant alpha, where for example alpha = 0.1
- Use the form Q_{n+1} = Q_n + alpha * (R_n - Q_n), where the constant alpha acts as a weight decay factor
We know that unrolling this update gives an exponential recency-weighted average:
Q_{n+1} = (1 - alpha)^n * Q_1 + sum_{i=1..n} alpha * (1 - alpha)^(n-i) * R_i
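A quick numeric sketch checking that the constant step-size update and its unrolled exponential recency-weighted form agree (the toy reward sequence and alpha = 0.1 are arbitrary choices):

```python
def constant_alpha_update(rewards, q1=0.0, alpha=0.1):
    """Apply Q_{n+1} = Q_n + alpha * (R_n - Q_n) over a reward sequence."""
    q = q1
    for r in rewards:
        q += alpha * (r - q)
    return q

def closed_form(rewards, q1=0.0, alpha=0.1):
    """(1-a)^n * Q_1 + sum_i a * (1-a)^(n-i) * R_i -- the unrolled version."""
    n = len(rewards)
    q = (1 - alpha) ** n * q1
    for i, r in enumerate(rewards, start=1):
        q += alpha * (1 - alpha) ** (n - i) * r
    return q
```

Recent rewards get weight alpha, while a reward i steps back is discounted by (1 - alpha)^(n-i), which is why the constant step size tracks a non-stationary target where the plain sample mean cannot.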