Now, instead of the state-value function, it is the action-value function Q(s, a) that we want to estimate.
GLIE Monte-Carlo Control
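- Sample the $k$th episode using the current policy $\pi$: $\{S_1, A_1, R_2, \ldots, S_T\} \sim \pi$
- For each state $S_t$ and action $A_t$ in the episode, update the visit count and the running mean of the return $G_t$:
  $$N(S_t, A_t) \leftarrow N(S_t, A_t) + 1$$
  $$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)} \left( G_t - Q(S_t, A_t) \right)$$
- Improve the policy based on the new action-value function:
  $$\epsilon \leftarrow 1/k, \qquad \pi \leftarrow \epsilon\text{-greedy}(Q)$$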
The above updates look very similar to the incremental sample-average updates used for Multi-Armed Bandits!
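The GLIE Monte-Carlo control steps listed above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: it assumes the common schedule ε = 1/k, every-visit updates, and a made-up three-state corridor environment (`corridor_step`) with a reward of -1 per step. The function names, the `max_steps` cap, and the environment are all assumptions introduced for the example.

```python
import random
from collections import defaultdict


def corridor_step(state, action, rng):
    """Hypothetical toy environment: states 0..2, terminal state 3 on the right.

    Every step costs -1, so the best policy is to always move right.
    `rng` is unused here; it is kept so stochastic environments fit the same signature.
    """
    next_state = state + 1 if action == "right" else max(state - 1, 0)
    return next_state, -1.0, next_state == 3


def epsilon_greedy(Q, state, actions, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one (ties broken randomly)."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(Q[(state, a)] for a in actions)
    return rng.choice([a for a in actions if Q[(state, a)] == best])


def glie_mc_control(step, start_state, actions, num_episodes,
                    gamma=1.0, max_steps=100, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # action-value estimates Q(s, a), initialised to 0
    N = defaultdict(int)    # visit counts N(s, a)
    epsilon = 1.0           # the first episode is fully exploratory

    for k in range(1, num_episodes + 1):
        # 1. Sample the k-th episode using the current epsilon-greedy policy.
        episode, state = [], start_state
        for _ in range(max_steps):  # assumed cap so an episode cannot run forever
            action = epsilon_greedy(Q, state, actions, epsilon, rng)
            next_state, reward, done = step(state, action, rng)
            episode.append((state, action, reward))
            state = next_state
            if done:
                break

        # 2. For each (state, action) in the episode, move Q towards the return G_t.
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]

        # 3. Improve the policy: it stays epsilon-greedy in Q, with epsilon decayed to 1/k.
        epsilon = 1.0 / k

    return Q, N


actions = ["left", "right"]
Q, N = glie_mc_control(corridor_step, start_state=0, actions=actions, num_episodes=500)
greedy_policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in range(3)}
print(greedy_policy)
```

The update in step 2 is the same incremental mean used for bandit arms, except that the counts and averages are kept per state-action pair and the target is the sampled return rather than an immediate reward.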