At each time $t = 1,\ldots,T$: the learner plays $x_t \~ p_t$; the adversary reveals a reward function $y_t : X \to\R$; the learner receives reward $y_t(x_t)$
Regret: $$ R(p,y) = \E_{x_t\~p_t} \ubr{\sup_{x\in X} \sum_{t=1}^T y_t(x)}{ \htmlClass{fragment}{ \substack{\t{single best action}\\\t{in hindsight}} } } - \ubr{\sum_{t=1}^T y_t(x_t)}{ \htmlClass{fragment}{ \substack{\t{total reward of}\\\t{learner's actions}} } } $$
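A concrete reading of this formula (a minimal sketch of mine, not from the paper), on a finite action set where `rewards[t, x]` stands in for $y_t(x)$ and `probs[t]` for $p_t$:

```python
import numpy as np

def expected_regret(rewards: np.ndarray, probs: np.ndarray) -> float:
    """Expected regret of a randomized learner on a finite action set.

    rewards: (T, K) array with rewards[t, x] = y_t(x)
    probs:   (T, K) array with probs[t] = p_t, the learner's distribution at round t
    """
    best_in_hindsight = rewards.sum(axis=0).max()  # sup_x sum_t y_t(x)
    learner_total = (probs * rewards).sum()        # E[sum_t y_t(x_t)], x_t ~ p_t
    return float(best_in_hindsight - learner_total)
```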
$$ p_t = \argmax_{p\in\c{M}_1(X)} \E_{x\~p} \sum_{s=1}^{t-1} y_s(x) - \Gamma(p) $$
Follow the Regularized Leader (FTRL): parameterized by a convex regularizer $\Gamma : \c{M}_1(X) \to\R$
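As an illustration (mine, not from the paper): on a finite action set with the negative-entropy regularizer $\Gamma(p) = \frac{1}{\eta} \sum_x p(x) \log p(x)$, the update above has the familiar exponential-weights closed form:

```python
import numpy as np

def ftrl_entropy(past_rewards: np.ndarray, eta: float) -> np.ndarray:
    """FTRL step with the negative-entropy regularizer (1/eta) sum_x p(x) log p(x).

    past_rewards: (t-1, K) array of the rewards y_1, ..., y_{t-1} seen so far.
    Returns p_t, the maximizer of expected cumulative reward minus the regularizer,
    which is the exponential-weights distribution.
    """
    scores = eta * past_rewards.sum(axis=0)  # eta * sum_{s<t} y_s(x)
    scores -= scores.max()                   # shift for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()
```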
Idea: think of this game in a Bayesian way
Thompson sampling: follow the perturbed leader (FTPL) with very specific learning rates
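A sketch of this correspondence (function and parameter names are mine; assumes a finite action set and a prior under which each round's reward vector is an independent Gaussian draw, so the posterior over future rewards equals the prior): Thompson sampling plays the maximizer of observed cumulative reward plus a fresh prior sample of the remaining rewards, a perturbed leader whose perturbation scale grows like $\sqrt{T-t+1}$.

```python
import numpy as np

def thompson_ftpl_step(past_rewards: np.ndarray, t: int, T: int,
                       prior_cov: np.ndarray, rng: np.random.Generator) -> int:
    """One round of Thompson sampling, written as follow the perturbed leader.

    past_rewards: (t-1, K) array of observed rewards y_1, ..., y_{t-1}.
    prior_cov:    (K, K) covariance of a single round's reward vector under the prior.
    Returns the index of the action x_t to play.
    """
    K = prior_cov.shape[0]
    # Sampled total of the remaining rewards gamma_t + ... + gamma_T,
    # distributed as N(0, (T - t + 1) * prior_cov) under the IID Gaussian prior.
    gamma_sum = rng.multivariate_normal(np.zeros(K), (T - t + 1) * prior_cov)
    # Perturbed leader: cumulative observed reward plus the sampled future rewards.
    return int(np.argmax(past_rewards.sum(axis=0) + gamma_sum))
```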
Result (Gravin, Peres, and Sivan, 2016). If $X = \{1,2,3\}$, the (infinite-horizon discounted) online learning game's Nash equilibrium is Thompson sampling with respect to a certain optimal prior.
Regret decomposition: $$ \ubr{R(p,y)}{ \htmlClass{fragment}{ \t{regret} } } = \ubr{R(p,q^{(\gamma)})}{ \htmlClass{fragment}{ \t{prior regret} } } + \ubr{E_{q^{(\gamma)}}(p,y)}{ \htmlClass{fragment}{ \t{excess regret} } } $$ where $$ E_{q^{(\gamma)}}(p,y) = \sum_{t=1}^T \Gamma^*_{t+1}(y_{1:t}) - \Gamma^*_t(y_{1:t-1}) - \pair{y_t}{p_t} + \E\pair{\gamma_t}{\smash{p_t^{(\gamma)}}} $$ and $\displaystyle\Gamma^*_t(f) = \E\sup_{x\in X} \del{f(x) + \gamma_{t:T}(x)}$.
Strong prior over adversary:
Practical approximation: Gaussian process with matching smoothness
For a strong prior: $$ E_{q^{(\gamma)}}(p,y) \leq \sum_{t=1}^T \ubr{D_{\Gamma^*_t}(y_{1:t}\from y_{1:t-1})}{\substack{\t{Bregman divergence}\\\t{induced by }\Gamma^*_t}} $$
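For reference (standard definition, stated in my notation via the directional derivative, consistent with the Hessian formula below): $$ D_{\Gamma^*_t}(g \from f) = \Gamma^*_t(g) - \Gamma^*_t(f) - \p_{g-f}\, \Gamma^*_t(f) , $$ applied here with $g = y_{1:t}$ and $f = y_{1:t-1}$, so the increment direction is $y_t$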
Gaussian adversary: rates
Taylor form of the Bregman divergence: $$ \begin{gathered} \htmlClass{fragment}{ D_{\Gamma^*_t}(y_{1:t}\from y_{1:t-1}) = \int_0^1 (1 - \alpha) \ubr{\p^2_{y_t,y_t} \Gamma^*_t(y_{1:t-1} + \alpha y_t)}{ \htmlClass{fragment}{ \t{Gâteaux Hessian} } } \d\alpha } \\ \htmlClass{fragment}{ \p^2_{u,v} \Gamma^*_t(f) = \frac{1}{\sqrt{T-t+1}} \E u(\ubr{x^*_{f+\gamma_{t:T}}}{ \htmlClass{fragment}{ \substack{\t{perturbed}\\\t{maximizer}} } }) \pair{\gamma_{t:T}}{\smash{\ubr{\c{K}^{-1}\vphantom{x^*_f}}{ \htmlClass{fragment}{ \substack{\t{covariance}\\\t{operator}} } }}v} } \end{gathered} $$
Classical finite-dimensional argument: $\norm{\p^2 \Gamma^*_t(f)}_{L_{\infty,1}} \leq 2 \tr \p^2 \Gamma^*_t(f)$
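A sketch of that argument (my summary, reading $\norm{\cdot}_{L_{\infty,1}}$ as the $\ell_\infty \to \ell_1$ operator norm and assuming, as for perturbed-maximum potentials, that the Hessian $H = \p^2 \Gamma^*_t(f)$ has nonnegative diagonal entries, nonpositive off-diagonal entries, and rows summing to zero): $$ \norm{H}_{L_{\infty,1}} \leq \sum_{i,j} |H_{ij}| = \sum_i H_{ii} + \sum_i \sum_{j \neq i} (-H_{ij}) = \tr H + \tr H = 2 \tr H , $$ using that each row summing to zero gives $\sum_{j \neq i} (-H_{ij}) = H_{ii}$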
Idea: condition on maximizer and work with truncated normals
Theorem. Suppose the adversary satisfies $\norm{y}_\infty \leq \beta$ and the prior is IID over time with constant variance $k(x,x) = \sigma^2$. Suppose further that, for all $x, x' \in X$, $$ \sup_{y\in Y} \del{y(x) - y(x')\frac{k(x,x')}{k(x',x')}} \leq C_{Y,k} \del{1 - \frac{k(x,x')}{k(x',x')}} . $$ Then we have $$ D_{\Gamma^*_t}(y_{1:t} \from y_{1:t-1})\leq \frac{\beta^2 + \beta C_{Y,k}}{2\sigma^2\sqrt{T-t+1}} \E \sup_{x\in X} \gamma_{t:T}(x) . $$
Bounded Lipschitz $(\beta,\lambda)$ adversary, Matérn kernel with length scale $\kappa$: $$ \sup_{y\in Y} \del{y(x) - y(x')\frac{k(x,x')}{k(x',x')}} \leq \ubr{\frac{\beta (\lambda + \frac{1}{\kappa})}{\frac{1}{\kappa} \del{1 - e^{-\frac{2}{\lambda\kappa+1}}}}}{ \htmlClass{fragment}{ C_{Y,k} } } \del{1 - \frac{k(x,x')}{k(x',x')}} $$
Thompson sampling regret: $$ R(p,q) \leq \beta \del{32 + \frac{32}{1 - \frac{1}{e}}}\sqrt{T d\log\del{1 + \sqrt{d}\frac{\lambda}{\beta}}} $$
To explore via random actions: don't be too predictable, and match the adversary's smoothness
Non-discrete online learning:
Future work:
A. Terenin and J. Negrea. An Adversarial Analysis of Thompson Sampling for Full-information Online Learning: from Finite to Infinite Action Spaces. arXiv:2502.14790, 2025.