At each time $t = 1,\ldots,T$: the learner plays an action $x_t \sim p_t$ with $p_t \in \c{M}_1(X)$, the adversary chooses a reward function $y_t : X \to \R$, and the learner observes all of $y_t$ (full information) and receives reward $y_t(x_t)$
Regret: $$ R(p,y) = \ubr{\sup_{x\in X} \sum_{t=1}^T y_t(x)}{ \htmlClass{fragment}{ \substack{\t{single best action}\\\t{in hindsight}} } } - \E_{x_t\~p_t} \ubr{\sum_{t=1}^T y_t(x_t)}{ \htmlClass{fragment}{ \substack{\t{total reward of}\\\t{learner's actions}} } } $$
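For a finite action set $X = \{1,\ldots,K\}$, the regret of a realized play sequence is easy to compute directly. The sketch below is illustrative only (not from the paper); averaging it over draws $x_t \sim p_t$ estimates $R(p,y)$.

```python
import numpy as np

def realized_regret(rewards, actions):
    """Regret of a realized play sequence against the best fixed action in hindsight.

    rewards: (T, K) array with rewards[t, x] = y_t(x)
    actions: length-T integer array with actions[t] = x_t
    """
    best_in_hindsight = rewards.sum(axis=0).max()                      # sup_x sum_t y_t(x)
    learner_total = rewards[np.arange(len(actions)), actions].sum()    # sum_t y_t(x_t)
    return best_in_hindsight - learner_total
```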
Follow-the-regularized-leader (FTRL): $$ p_t = \argmax_{p\in\c{M}_1(X)} \E_{x\~p} \sum_{s=1}^{t-1} y_s(x) - \Gamma(p) $$
Parameterized by a convex regularizer $\Gamma : \c{M}_1(X) \to\R$
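As a concrete instance (an illustrative sketch, not the paper's algorithm): on a finite action set with the negative-entropy regularizer $\Gamma(p) = \eta^{-1}\sum_x p(x)\log p(x)$, the maximizer has the closed form $p_t(x) \propto \exp\bigl(\eta \sum_{s<t} y_s(x)\bigr)$, i.e. exponential weights / Hedge.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, eta = 5, 1000, 0.1              # actions, rounds, learning rate (illustrative values)

def ftrl_entropy(cum_rewards, eta):
    """FTRL with the negative-entropy regularizer: the argmax over
    distributions has the closed form p_t(x) ∝ exp(eta * sum_{s<t} y_s(x))."""
    z = eta * (cum_rewards - cum_rewards.max())   # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()

cum, total = np.zeros(K), 0.0
for t in range(T):
    p_t = ftrl_entropy(cum, eta)                  # uses only past rewards y_1, ..., y_{t-1}
    x_t = rng.choice(K, p=p_t)                    # play x_t ~ p_t
    y_t = rng.uniform(0, 1, size=K)               # stand-in adversary (full information)
    total += y_t[x_t]
    cum += y_t
print(f"realized regret: {cum.max() - total:.2f}")
```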
Idea: think of this game in a Bayesian way
Thompson sampling: follow-the-perturbed-leader (FTPL) with very specific learning rates
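To make this concrete, here is an illustrative sketch (not the paper's implementation) for a finite action set: at round $t$, sample a guess $\gamma_{t:T}$ of the remaining rewards from the prior's predictive distribution and play the action that would be best if that guess were correct. The i.i.d. Gaussian prior below is an assumption chosen so that the predictive equals the prior; note that the perturbation scale shrinks as $t \to T$, which is, loosely, the time-varying learning-rate schedule fixed by the prior.

```python
import numpy as np

rng = np.random.default_rng(1)
K, T = 5, 1000                        # illustrative values

def thompson_ftpl_step(cum_rewards, rounds_left, prior_std=1.0):
    """One Thompson-sampling step, written as follow-the-perturbed-leader:
    the perturbation is a sampled guess gamma_{t:T} of the remaining rewards,
    drawn here from an illustrative i.i.d. N(0, prior_std^2) prior."""
    gamma = rng.normal(0.0, prior_std, size=(rounds_left, len(cum_rewards)))
    return int(np.argmax(cum_rewards + gamma.sum(axis=0)))  # argmax_x y_{1:t-1}(x) + gamma_{t:T}(x)

cum = np.zeros(K)
for t in range(1, T + 1):
    x_t = thompson_ftpl_step(cum, rounds_left=T - t + 1)     # gamma_{t:T} covers rounds t, ..., T
    y_t = rng.uniform(0, 1, size=K)                          # stand-in adversary
    cum += y_t
```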
Result (Gravin, Peres, and Sivan, 2016). If $X = \{1,2,3\}$, the learner's Nash equilibrium strategy in the (infinite-horizon discounted) online learning game is Thompson sampling with respect to a certain optimal prior.
Regret decomposition: $$ \ubr{R(p,y)}{ \htmlClass{fragment}{ \t{regret} } } = \ubr{R(p,q^{(\gamma)})}{ \htmlClass{fragment}{ \t{prior regret} } } + \ubr{E_{q^{(\gamma)}}(p,y)}{ \htmlClass{fragment}{ \t{excess regret} } } $$ where $$ E_{q^{(\gamma)}}(p,y) = \sum_{t=1}^T \bigl( \Gamma^*_{t+1}(y_{1:t}) - \Gamma^*_t(y_{1:t-1}) - \pair{y_t}{p_t} + \E\pair{\gamma_t}{\smash{p_t^{(\gamma)}}} \bigr) $$ and $\displaystyle\Gamma^*_t(f) = \sup_{x\in X} \bigl( f(x) + \gamma_{t:T}(x) \bigr)$.
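A quick structural check, using only the definitions above: reading $y_{1:t}$ and $\gamma_{t:T}$ as cumulative sums and with the convention $\gamma_{T+1:T} \equiv 0$, the first two terms of the excess regret telescope, $$ \sum_{t=1}^T \Gamma^*_{t+1}(y_{1:t}) - \Gamma^*_t(y_{1:t-1}) = \Gamma^*_{T+1}(y_{1:T}) - \Gamma^*_1(y_{1:0}) $$ and the first term on the right is $\sup_{x\in X} y_{1:T}(x)$, the best-action-in-hindsight benchmark from the regret definition.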
Strong prior over adversary:
Practical approximation: Gaussian process with matching smoothness
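A toy sketch of what this could look like on a discretized interval (illustrative only: the Matérn-5/2 kernel, lengthscale, grid, and stand-in adversary are assumptions, not the paper's construction). The perturbation at each round is a sum of sampled GP paths standing in for the remaining rewards.

```python
import numpy as np

rng = np.random.default_rng(2)

def matern52(x1, x2, lengthscale=0.2, variance=1.0):
    """Matérn-5/2 kernel; its sample paths are twice differentiable, so the
    prior's smoothness can be chosen to match that of the reward functions."""
    r = np.abs(x1[:, None] - x2[None, :]) / lengthscale
    return variance * (1 + np.sqrt(5) * r + 5 * r**2 / 3) * np.exp(-np.sqrt(5) * r)

grid = np.linspace(0, 1, 200)                              # discretize X = [0, 1]
L = np.linalg.cholesky(matern52(grid, grid) + 1e-6 * np.eye(grid.size))

def sample_gp_path():
    """One sample path of the GP prior, evaluated on the grid."""
    return L @ rng.standard_normal(grid.size)

T = 50
cum = np.zeros(grid.size)                                  # observed cumulative rewards y_{1:t-1}
for t in range(1, T + 1):
    gamma = sum(sample_gp_path() for _ in range(T - t + 1))        # sampled guess of gamma_{t:T}
    x_t = grid[np.argmax(cum + gamma)]                             # perturbed leader
    y_t = np.sin(6 * grid) + 0.1 * rng.standard_normal(grid.size)  # stand-in adversary
    cum += y_t
```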
For a strong prior: $$ E_{q^{(\gamma)}}(p,y) \leq \sum_{t=1}^T \ubr{D_{\Gamma^*_t}(y_{1:t}\from y_{1:t-1})}{\substack{\t{Bregman divergence}\\\t{induced by }\Gamma^*_t}} $$
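Here $D_{\Gamma^*_t}$ is the usual Bregman divergence; written out in the quantities above (with $\nabla$ a gradient in the reward argument, assuming enough regularity), $$ D_{\Gamma^*_t}(y_{1:t} \from y_{1:t-1}) = \Gamma^*_t(y_{1:t}) - \Gamma^*_t(y_{1:t-1}) - \pair{y_t}{\nabla\Gamma^*_t(y_{1:t-1})} $$ so each summand compares the one-step change of the potential to its linearization at the rewards observed so far.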
Gaussian adversary: regret rates
A. Terenin and J. Negrea. An Adversarial Analysis of Thompson Sampling for Full-information Online Learning: from Finite to Infinite Action Spaces. arXiv:2502.14790, 2025.