Northeast Systems and Control Workshop

An Adversarial Analysis of Thompson Sampling

for Full-information Online Learning:

from Finite to Infinite Action Spaces

Joint work with Jeff Negrea

Alexander Terenin

https://avt.im/ · @avt_im

Probabilistic Decision-making

This talk: adversarial analogs of problems like this

The Online Learning Game

At each time $t = 1,..,T$:

  1. Learner picks a random action $x_t \~ p_t \in \c{M}_1(X)$.
  2. Adversary responds with a reward function $y_t : X \to \R$, chosen adaptively from a given function class.

Regret: $$ R(p,y) = \E_{x_t\~p_t} \ubr{\sup_{x\in X} \sum_{t=1}^T y_t(x)}{ \htmlClass{fragment}{ \substack{\t{single best action}\\\t{in hindsight}} } } - \ubr{\sum_{t=1}^T y_t(x_t)}{ \htmlClass{fragment}{ \substack{\t{total reward of}\\\t{learner's actions}} } } $$
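As a concrete illustration (not part of the talk), here is a minimal sketch of this protocol on a finite action set, with a placeholder uniform learner and a random rather than adaptive adversary; all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 100, 5  # horizon and number of actions (illustrative values)

total_reward = 0.0
cumulative = np.zeros(N)  # running sum of each action's rewards

for t in range(T):
    p_t = np.full(N, 1.0 / N)        # learner's distribution over actions (uniform placeholder)
    x_t = rng.choice(N, p=p_t)       # learner samples an action
    y_t = rng.uniform(-1.0, 1.0, N)  # adversary's reward function on the finite action set
    total_reward += y_t[x_t]
    cumulative += y_t

# Regret: reward of the single best action in hindsight minus the learner's total reward
regret = cumulative.max() - total_reward
print(regret)
```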

Non-discrete Online Learning

Adversarial problem: seems to have nothing to do with Bayes' Rule?

No-regret Online Learning Algorithms

Given various no-regret online learning oracles, one can obtain:

  • Confidence sets
  • Generalization bounds
  • Sample-efficient reinforcement learning
  • Equilibrium computation algorithms

Online Learning: Discrete Action Spaces

Typical algorithm: mirror descent or follow-the-regularized-leader

$$ p_t = \argmax_{p\in\c{M}_1(X)} \E_{x\~p} \sum_{\tau=1}^{t-1} y_\tau(x) - \Gamma(p) $$

Parameterized by a convex regularizer $\Gamma : \c{M}_1(X) \to\R$

  • Not obvious how to pick $\Gamma$ in non-discrete settings
  • Standard choice of KL ignores adversary's smoothness class
  • Unclear how to perform numerics in general
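For concreteness, a minimal sketch (not from the talk) of this update on a finite action set with the standard negative-entropy regularizer $\Gamma(p) = \eta^{-1}\sum_x p(x)\log p(x)$, whose maximizer has the closed form of exponential weights; the learning rate `eta` is an illustrative parameter.

```python
import numpy as np

def ftrl_entropic(past_rewards, eta):
    """FTRL on a finite action set with a negative-entropy regularizer.

    past_rewards: array of shape (t-1, N) holding the adversary's rewards so far.
    The maximizer has the closed form of exponential weights:
    p_t(x) is proportional to exp(eta * cumulative reward of x).
    """
    scores = eta * past_rewards.sum(axis=0)
    scores -= scores.max()  # shift for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()
```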

Bayesian Algorithms for Adversarial Learning

Idea: think of this game in a Bayesian way

  1. Place a prior $q^{(\gamma)} \in \c{M}_1(Y^T)$ on the adversary's future reward functions $\gamma_1,..,\gamma_T \in Y$
  2. Condition on observed rewards $\gamma_1 = y_1,..,\gamma_{t-1} = y_{t-1}$
  3. Draw a sample from the posterior, and play $$ x_t = \argmax_{x\in X} \sum_{\tau=1}^{t-1} y_\tau(x) + \sum_{\tau=t}^T \gamma_\tau(x) $$

Algorithm: Thompson sampling
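As a sketch only (illustrative assumptions, not the paper's construction): on a finite action set with an i.i.d. Gaussian prior on each future reward value, the posterior over the remaining rounds equals the prior, so the sampling step reduces to adding a Gaussian perturbation to the cumulative past rewards.

```python
import numpy as np

def thompson_action(past_rewards, T, sigma, rng):
    """One Thompson sampling step, viewed as follow-the-perturbed-leader.

    past_rewards: array of shape (t-1, N) of observed rewards.
    With independent N(0, sigma^2) priors on each future reward value, the
    posterior over rounds t..T is the prior, so the sampled future total is
    Gaussian with variance sigma^2 * (T - t + 1) for each action.
    """
    t_minus_1, num_actions = past_rewards.shape
    remaining = T - t_minus_1  # number of rounds still to be played
    perturbation = rng.normal(0.0, sigma * np.sqrt(remaining), size=num_actions)
    return int(np.argmax(past_rewards.sum(axis=0) + perturbation))
```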

Bayesian Algorithms for Online Learning

Thompson sampling: follow-the-perturbed-leader (FTPL) with very specific learning rates

  • Why should this be a good idea?

Result (Gravin, Peres, and Sivan, 2016). If $X = \{1,2,3\}$, the (infinite-horizon discounted) online learning game's Nash equilibrium is Thompson sampling with respect to a certain optimal prior.

Conjecture: Thompson sampling with strong priors is minimax-strong

This work: prove this conjecture in the finite case and the simplest infinite-dimensional case

Idea: Analyze Bayesian Algorithm in a Bayesian Way

Regret decomposition: $$ \ubr{R(p,y)}{ \htmlClass{fragment}{ \t{regret} } } = \ubr{R(p,q^{(\gamma)})}{ \htmlClass{fragment}{ \t{prior regret} } } + \ubr{E_{q^{(\gamma)}}(p,y)}{ \htmlClass{fragment}{ \t{excess regret} } } $$ where $$ E_{q^{(\gamma)}}(p,y) = \sum_{t=1}^T \del{\Gamma^*_{t+1}(y_{1:t}) - \Gamma^*_t(y_{1:t-1}) - \pair{y_t}{p_t} + \E\pair{\gamma_t}{\smash{p_t^{(\gamma)}}}} $$ and $\displaystyle\Gamma^*_t(f) = \sup_{x\in X} \del{f(x) + \gamma_{t:T}(x)}$.

Strong Priors for Adversarial Feedback

Strong prior over adversary:

  1. Equalizing: $\E \gamma(x) = \E \gamma(x')$ for all $x,x'$
  2. Certifies a sharp regret lower bound $$ R(\.,q^{(\gamma)}) \leq \min_p \max_q R(p,q) $$

Practical approximation: Gaussian process with matching smoothness

  • Matérn priors: widely used in Bayesian optimization today
  • Straightforward and well-understood numerics
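To illustrate the kind of numerics involved (a sketch under my own assumptions, not the paper's implementation): drawing a perturbation from a Matérn-3/2 Gaussian process prior on a grid discretization of $X = [0,1]$ and taking its argmax.

```python
import numpy as np

def matern32_kernel(x, lengthscale=0.2, variance=1.0):
    """Matérn-3/2 covariance matrix for one-dimensional inputs x."""
    r = np.abs(x[:, None] - x[None, :]) / lengthscale
    return variance * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 200)                         # discretization of X = [0, 1]
K = matern32_kernel(grid) + 1e-8 * np.eye(grid.size)      # jitter for numerical stability
sample = rng.multivariate_normal(np.zeros(grid.size), K)  # one draw from the GP prior
x_star = grid[np.argmax(sample)]                          # argmax of the sampled path
```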

A Bregman Divergence Bound on Excess Regret

For a strong prior: $$ E_{q^{(\gamma)}}(p,y) \leq \sum_{t=1}^T \ubr{D_{\Gamma^*_t}(y_{1:t}\from y_{1:t-1})}{\substack{\t{Bregman divergence}\\\t{induced by }\Gamma^*_t}} $$
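For reference, the divergence above is the standard Bregman divergence (stated here for differentiable convex $F$; this definition is not specific to the paper): $$ D_F(a \from b) = F(a) - F(b) - \pair{\nabla F(b)}{a - b} $$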

Gaussian adversary: rates

  • Finite $X$ with $\ell^\infty$ adversary: $R(p,q) \leq 2\sqrt{T\log N}$
  • $X=[0,1]$, BL adversary: $R(p,q) \leq \c{O}\del{\beta\sqrt{T\log(1+\lambda)}}$

Thank you!

https://avt.im/ · @avt_im

A. Terenin and J. Negrea. An Adversarial Analysis of Thompson Sampling for Full-information Online Learning: from Finite to Infinite Action Spaces. arXiv:2502.14790, 2025.