# Upper Confidence Bound Bandit Algorithm

UCB is an algorithm built around the idea of “optimism in the face of uncertainty”: we estimate how uncertain each action's value is, and we prefer the action whose value could plausibly be highest, i.e. the one with the highest upper confidence bound.

https://towardsdatascience.com/the-upper-confidence-bound-ucb-bandit-algorithm-c05c2bf4c13f

$A_{t}=\operatorname{argmax}_{a}\left[Q_{t}(a)+c\sqrt{\frac{\ln t}{N_{t}(a)}}\right]$

- the number $c>0$ controls the degree of exploration
- $N_{t}(a)$ is the number of times action $a$ has been selected before time $t$. It is small for actions that haven’t been tried much, which makes their exploration bonus large, so the agent is encouraged to try each action until all of them have been tried enough times
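
As a rough worked example, assume the exploration bonus has the standard form $c\sqrt{\ln t / N_{t}(a)}$ with the natural logarithm, and take the illustrative values $c=2$ and $t=100$ (these numbers are chosen here, not taken from the source). An action tried once gets a much larger bonus than one tried 50 times:

$c\sqrt{\frac{\ln 100}{1}} \approx 2\sqrt{4.605} \approx 4.29$

$c\sqrt{\frac{\ln 100}{50}} \approx 2\sqrt{0.0921} \approx 0.61$

So a rarely tried action can be selected even when its current estimate $Q_{t}(a)$ is noticeably lower.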

Intuitively, the square-root term is a measure of the uncertainty or variance in the estimate of $a$’s value.
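
The selection rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the standard bonus $c\sqrt{\ln t / N_{t}(a)}$, treats untried actions as having the highest priority (a common convention, since their bonus is undefined), and uses Gaussian rewards with unit variance. The names `ucb_select` and `run_bandit` are hypothetical.

```python
import math
import random


def ucb_select(q_values, counts, t, c=2.0):
    """Return the index of the action with the highest upper confidence bound.

    q_values: current value estimates Q_t(a)
    counts:   number of times each action has been selected, N_t(a)
    t:        current time step (1-based)
    c:        exploration coefficient (assumed default, not from the source)
    """
    # An untried action has N_t(a) = 0, so its bonus is undefined;
    # by convention we select such actions first.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [
        q + c * math.sqrt(math.log(t) / n)
        for q, n in zip(q_values, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def run_bandit(true_means, steps=2000, c=2.0, seed=0):
    """Run UCB on a Gaussian bandit (unit variance is an assumption here)."""
    rng = random.Random(seed)
    k = len(true_means)
    q_values = [0.0] * k
    counts = [0] * k
    for t in range(1, steps + 1):
        a = ucb_select(q_values, counts, t, c)
        reward = rng.gauss(true_means[a], 1.0)
        counts[a] += 1
        # Incremental sample-average update of Q_t(a).
        q_values[a] += (reward - q_values[a]) / counts[a]
    return q_values, counts
```

Calling `run_bandit([0.1, 0.5, 0.9])` returns the final estimates and pull counts. Setting `c=0.0` reduces `ucb_select` to greedy selection once every action has been tried, which makes the role of $c$ easy to see.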