# Multi-Armed Bandit (MAB)

The simplest form of a Reinforcement Learning problem.

**Problem Formulation**

Imagine you’re in a casino with a row of slot machines, often referred to as “one-armed bandits.” Each machine has a different probability of paying out a reward when you pull its lever. However, you don’t know what these probabilities are ahead of time. Your goal is to find out which machine gives the best rewards over time by trying them out.

This was an introductory chapter to RL, used to explore the exploration vs. exploitation problem.
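The setup above can be sketched as a tiny simulator (a sketch assuming Bernoulli payouts; the class name and the probabilities are made up for illustration):

```python
import random

class BernoulliBandit:
    """A row of slot machines; arm i pays 1 with hidden probability probs[i]."""

    def __init__(self, probs):
        self.probs = probs  # the agent is not allowed to look at these

    def pull(self, arm):
        # Reward is 1 with probability probs[arm], else 0.
        return 1 if random.random() < self.probs[arm] else 0

# Illustrative payout probabilities; unknown to the agent.
bandit = BernoulliBandit([0.2, 0.5, 0.75])
reward = bandit.pull(2)  # pull the lever of machine 2
```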

**Resources**

- This article is a good brush up for the math behind the MAB algorithms: https://lilianweng.github.io/posts/2018-01-23-multi-armed-bandit/#exploitation-vs-exploration

Popular MAB algorithms, based on different ideas for encouraging exploration:

- Random exploration
    - Epsilon-Greedy
    - Softmax
    - Gaussian noise (in the continuous domain)
- Optimism in the face of uncertainty
    - Optimistic Greedy
    - Upper Confidence Bound (UCB)
    - Thompson Sampling (uses probability matching)
- Information state space (consider the agent’s information as part of its state, and look ahead to see how information helps reward); this essentially transforms the bandit problem back into an MDP
    - Gittins indices
    - Bayes-adaptive MDPs
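As an example from the first family, a minimal epsilon-greedy sketch (function and parameter names are my own; it assumes any `bandit_pull(arm) -> reward` callable):

```python
import random

def epsilon_greedy(bandit_pull, n_arms, epsilon=0.1, steps=1000):
    """Explore a random arm with probability epsilon, else exploit the best estimate."""
    Q = [0.0] * n_arms  # estimated value of each arm
    N = [0] * n_arms    # pull counts
    total = 0
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                    # explore
        else:
            arm = max(range(n_arms), key=lambda a: Q[a])      # exploit
        r = bandit_pull(arm)
        N[arm] += 1
        Q[arm] += (r - Q[arm]) / N[arm]  # incremental mean update
        total += r
    return Q, total
```

Called with any reward function, it returns the per-arm value estimates and the total reward collected.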

To compare the performance of various bandit algorithms, conduct a parameter study.
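A minimal sketch of such a parameter study, sweeping epsilon for epsilon-greedy on a made-up Bernoulli bandit (all names and numbers are illustrative):

```python
import random

def parameter_study(epsilons, runs=20, steps=300):
    """Average total reward of epsilon-greedy over several runs, per epsilon."""
    probs = [0.2, 0.5, 0.75]  # hidden payout probabilities (made up)
    results = {}
    for eps in epsilons:
        totals = []
        for _ in range(runs):
            Q = [0.0] * len(probs)
            N = [0] * len(probs)
            total = 0
            for _ in range(steps):
                if random.random() < eps:
                    arm = random.randrange(len(probs))             # explore
                else:
                    arm = max(range(len(probs)), key=lambda a: Q[a])  # exploit
                r = 1 if random.random() < probs[arm] else 0
                N[arm] += 1
                Q[arm] += (r - Q[arm]) / N[arm]
                total += r
            totals.append(total)
        results[eps] = sum(totals) / runs  # average over runs
    return results
```

Plotting `results` for a range of epsilons gives the familiar parameter-study curve, with performance peaking at intermediate exploration rates.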

**Resources to look into**

- Go back home and review k-armed bandits from CS239 or something, the first chapter of the book.

### For the non-stationary problem (also known as “concept drift”)

Some papers:

- https://proceedings.neurips.cc/paper/2014/file/903ce9225fca3e988c2af215d4e544d3-Paper.pdf
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8004723/pdf/entropy-23-00380.pdf

Incremental implementation. This is very common, and works just like an incremental mean.

The initial equation is $Q_{n}=\left(1-\frac{1}{N_{i}}\right)Q_{n-1}+\frac{1}{N_{i}}x_{i}$
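This update can be checked against the plain mean computed over all samples (a small sketch; the function name is mine):

```python
def incremental_mean(samples):
    """Running mean Q updated one sample at a time, without storing history."""
    Q, n = 0.0, 0
    for x in samples:
        n += 1
        Q += (x - Q) / n  # equivalent to Q_n = (1 - 1/n) Q_{n-1} + (1/n) x_n
    return Q
```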

We used two approaches:

- Cap $\frac{1}{N_{i}}$ at $\frac{1}{\alpha}$, where $\alpha=100$ for example
- Use the form $\frac{\alpha}{N_{i}}$ (weight decay factor)

We know that $Q_{n}=\frac{\sum x_{i}}{N}$. The modified update becomes $Q_{n}=\left(1-\frac{\alpha}{N_{i}}\right)Q_{n-1}+\frac{\alpha}{N_{i}}x_{i}$
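To see why the sample-average step $\frac{1}{N}$ struggles with concept drift, a sketch comparing it against a constant step size $\frac{1}{\alpha}$ with $\alpha=100$, in the spirit of the capping approach (all names and numbers are illustrative):

```python
import random

def track_drifting_arm(step_size=None, steps=2000):
    """Estimate the value of one arm whose payout probability drifts mid-run.

    step_size=None uses the sample-average step 1/N (slow to adapt);
    a constant step_size (e.g. 1/alpha with alpha=100) tracks the drift.
    """
    p = 0.9          # arm starts good...
    Q, N = 0.0, 0
    for t in range(steps):
        if t == steps // 2:
            p = 0.1  # ...then drifts to bad halfway through
        r = 1 if random.random() < p else 0
        N += 1
        alpha = (1 / N) if step_size is None else step_size
        Q += alpha * (r - Q)
    return Q  # true value at the end is 0.1
```

With the sample-average step the final estimate stays near the all-time mean (about 0.5 here), while the constant step forgets old data and ends near the current value.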