Multi-Armed Bandit (MAB)

The simplest form of a Reinforcement Learning problem.

Problem Formulation

Imagine you’re in a casino with a row of slot machines, often referred to as “one-armed bandits.” Each machine has a different probability of paying out a reward when you pull its lever. However, you don’t know what these probabilities are ahead of time. Your goal is to find out which machine gives the best rewards over time by trying them out.

This is an introductory topic in RL, used to explore the Exploration vs. Exploitation problem.

Resources

Popular MAB algorithms, based on different ideas for encouraging exploration:

  • Random Exploration (e.g. $\epsilon$-greedy; see the sketch after this list)
  • Optimism in the face of uncertainty (e.g. UCB)
  • Information State Space (consider the agent’s information as part of its state, and look ahead to see how that information helps future reward); this essentially transforms the bandit problem back into an MDP problem
    • Gittins indices
    • Bayes-adaptive MDPs
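
A minimal sketch contrasting the first two ideas on a Bernoulli bandit: $\epsilon$-greedy for random exploration and UCB1 for optimism in the face of uncertainty. The environment, arm probabilities, and parameter values here are illustrative assumptions, not from the source.

```python
import math
import random

def epsilon_greedy_action(q, counts, epsilon=0.1):
    """Random exploration: with probability epsilon pick a random arm."""
    if random.random() < epsilon:
        return random.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])

def ucb1_action(q, counts, t):
    """Optimism in the face of uncertainty: add an exploration bonus to each estimate."""
    for a, n in enumerate(counts):
        if n == 0:          # try every arm at least once first
            return a
    return max(range(len(q)),
               key=lambda a: q[a] + math.sqrt(2 * math.log(t) / counts[a]))

def run(select_action, probs, steps=10_000):
    """Play one bandit problem and return the average reward per step."""
    q = [0.0] * len(probs)       # value estimates Q(a)
    counts = [0] * len(probs)    # pull counts N(a)
    total = 0.0
    for t in range(1, steps + 1):
        a = select_action(q, counts, t)
        r = 1.0 if random.random() < probs[a] else 0.0   # Bernoulli reward
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]                   # incremental mean update
        total += r
    return total / steps

probs = [0.2, 0.5, 0.7]          # true payout probabilities, unknown to the agent
print("epsilon-greedy:", run(lambda q, c, t: epsilon_greedy_action(q, c), probs))
print("UCB1:", run(ucb1_action, probs))
```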

To compare the performance of various bandit algorithms, conduct a Parameter Study, as sketched below.
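
For instance, a parameter study for $\epsilon$-greedy could sweep $\epsilon$ and compare the long-run average reward. This reuses the `run` and `epsilon_greedy_action` helpers from the sketch above; the swept values are illustrative.

```python
# Sweep the exploration parameter and compare average reward per step.
for epsilon in [0.0, 0.01, 0.1, 0.3]:
    avg = run(lambda q, c, t, e=epsilon: epsilon_greedy_action(q, c, epsilon=e), probs)
    print(f"epsilon={epsilon:<4} average reward={avg:.3f}")
```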

Resources to look into

  • Go back home and review k-bandits from CS239 or something, the first chapter of the book.

For the non-stationary problem (also known as “Concept Drift”), we have:

Some papers

Incremental implementation. This is very common and works like the Incremental Mean.

The initial equation is $Q_{n+1} = Q_n + \frac{1}{n}\,(R_n - Q_n)$, i.e. the incremental mean of the rewards seen so far.

We used two approaches:

  1. Cap the step size $\frac{1}{n}$ at a constant $\alpha$, where for example $\alpha$ is a small fixed value
  2. Use the form $Q_{n+1} = Q_n + \alpha\,(R_n - Q_n)$ with a constant $\alpha$ (weight decay factor)

We know that expanding this recursion gives $Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha\,(1-\alpha)^{n-i} R_i$, so recent rewards get exponentially more weight than older ones, which is what we want when the reward distribution drifts.
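
A minimal sketch of the incremental update with a constant step size versus the sample-average ($\frac{1}{n}$) step size on a drifting reward. The drift model, $\alpha = 0.1$, and class/variable names are illustrative assumptions.

```python
import random

class ActionValueEstimate:
    """Incremental estimate of one arm's value Q."""

    def __init__(self, alpha=None):
        # alpha=None -> sample-average (1/n) step size; otherwise constant step size
        self.alpha = alpha
        self.q = 0.0   # current estimate Q_n
        self.n = 0     # number of rewards seen

    def update(self, reward):
        self.n += 1
        step = self.alpha if self.alpha is not None else 1.0 / self.n
        # Q_{n+1} = Q_n + step * (R_n - Q_n)
        self.q += step * (reward - self.q)
        return self.q


# Illustrative non-stationary reward: the true mean drifts slowly over time.
true_mean = 1.0
sample_avg = ActionValueEstimate()            # 1/n step size
const_step = ActionValueEstimate(alpha=0.1)   # constant step size (weight decay)

for t in range(1000):
    true_mean += random.gauss(0, 0.01)        # slow concept drift
    r = random.gauss(true_mean, 1.0)
    sample_avg.update(r)
    const_step.update(r)

print(f"true mean ~ {true_mean:.2f}, "
      f"1/n estimate {sample_avg.q:.2f}, "
      f"alpha=0.1 estimate {const_step.q:.2f}")
```

The constant step size tracks the drifting mean, while the sample average gives old rewards the same weight as new ones and lags behind.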