Talk for Computer Laboratory

Machines Regret Their Actions Too:

A Brief Tutorial on 

Multi-armed Bandits

Alexander Terenin

https://avt.im/ · @avt_im

Multi-armed Bandits

Multi-armed Bandits

Formalism

Goal: for a function $f : X \to [0,1]$ where $|X| = K \lt \infty$ find $$ \htmlClass{fragment}{ \max_{x\in X} f(x) } $$ based on noisy observations $f(x_t) + \eps_{x_t}$ where $\eps_{x_t}$ is random.

Given observations up to time $t$, how should one choose $x_{t+1}$?

Formalism

Regret: $$ \htmlClass{fragment}{ R(T) = \sum_{t=1}^T f(x^*) - f(x_t) } $$ where $\displaystyle x^* = \argmax_{x\in X} f(x)$.

Different strategies yield different regret asymptotics

Balancing Explore-Exploit Tradeoffs

A Regret Lower Bound

Theorem. For any algorithm there is an $f$ such that $$ \htmlClass{fragment}{ \E[R(T)] \geq \Omega(\sqrt{KT}) . } $$

Some regret is always incurred in order to learn $f$

Error Bars

Result (Hoeffding). Let $y_1,..,y_t$ be an IID sequence of random variables with values in $[0,1]$. Let $\bar{y}_t$ be their sample mean. Then $$ \htmlClass{fragment}{ \P(\E(y_1) \gt \bar{y}_t + \delta) \leq e^{-2t\delta^2} . } $$

Concentration inequalities enable us to construct error bars for $f$

Error Bars

Assume $f(x_t) + \eps_{x_t}$ satisfy Hoeffding's inequality. Choose $\delta,\eta$ so that $$ \htmlClass{fragment}{ \P\Bigg( |\bar{y}_t(x) - f(x)| \leq \underset{\htmlClass{fragment}{\sigma_t(x)}}{\undergroup{\sqrt{\frac{2\ln T}{n_t(x)}}}} \Bigg) \geq 1 - \eta . } $$ where

  • $n_t(x)$ is the expected number of times $x$ is selected by time $t$, and
  • $\bar{y}_t(x_t)$ is the empirical mean of $f(x_t) + \eps_{x_t}$ up to time $t$.

Error Bars

The Upper Confidence Bound Algorithm

$\displaystyle x_{t+1} = \argmax_{x\in X} f^+_t(x_t)$

$f_t^\pm(x) = \bar{y}_t(x) \pm \sigma_t(x)$

$\bar{y}_t$: empirical mean
$\sigma_t$: error bar width

The Upper Confidence Bound Algorithm

Theorem. Hoeffding–UCB's regret satisfies $$ \htmlClass{fragment}{ \E[R(T)] \leq \widetilde{\c{O}}(\sqrt{KT}) } $$ uniformly for all $f$.

Well-calibrated error bars lead to asymptotically efficient strategies

The Upper Confidence Bound Algorithm

Key idea. With sufficiently high probability, we have $$ \begin{aligned} \htmlClass{fragment}{\Delta(x_t)} &\htmlClass{fragment}{= f(x^*) - f(x_t)} \\ &\htmlClass{fragment}{\leq f^+_t(x^*) - f^-_t(x_t)} \\ &\htmlClass{fragment}{\leq f^+_t(x_t) - f^-_t(x_t)} \\ &\htmlClass{fragment}{= 2\sigma_t(x_t)} \htmlClass{fragment}{= {\scriptstyle 2\sqrt{\frac{2\ln T}{n_t(x)}}}} \htmlClass{fragment}{\overset{t=T}{=} \widetilde{\c{O}}(n_T^{-1/2})} \end{aligned} $$ using $f^-_t(x) \leq f(x) \leq f^+_t(x)$, and $f^+_t(x_t) \geq f^+_t(x)$, $\forall x,t$.

The Upper Confidence Bound Algorithm

The Upper Confidence Bound Algorithm

This gives $$ \begin{aligned} \htmlClass{fragment}{\E[R(T)]} &\htmlClass{fragment}{= \sum_{t=1}^T \underset{\htmlClass{fragment}{\Delta(x_t)}}{\undergroup{f(x^*) - f(x_t)}}} \htmlClass{fragment}{= \sum_{x\in X} \underset{\htmlClass{fragment}{\widetilde{\c{O}}(n_T^{-1/2})}}{\undergroup{\Delta(x)}} n_T(x)} \\ &\htmlClass{fragment}{\leq \sum_{x\in X} \widetilde{\c{O}}(\sqrt{n_T(x)})} \htmlClass{fragment}{\leq \widetilde{\c{O}}\Bigg(\sqrt{K\sum_{x\in X} n_T(x)}\Bigg)} \\ &\htmlClass{fragment}{= \widetilde{\c{O}}(\sqrt{KT}) .} \end{aligned} $$

Extensions

  • $\widetilde{\c{O}}(\sqrt{KT}) \leadsto \c{O}(\sqrt{KT})$
  • Adversarial and Contextual Bandits $$ \htmlClass{fragment}{ \min_{\eps\in\c{E}} \max_{x\in X} f(x) } $$
  • Bayesian methods and Thompson Sampling $$ \htmlClass{fragment}{ x_{t+1} = \argmax_{x\in X} \phi_t(x) } \qquad \htmlClass{fragment}{ \phi_t \~ f \given y_1,..,y_t } $$
  • Partial Monitoring and Information Directed Sampling

Reinforcement Learning

References