At each time $t = 1,\ldots,T$: the learner plays $x_t \~ p_t$; the adversary reveals a reward function $y_t : X \to\R$; the learner receives reward $y_t(x_t)$
Regret: $$ R(p,y) = \E_{x_t\~p_t} \ubr{\sup_{x\in X} \sum_{t=1}^T y_t(x)}{ \htmlClass{fragment}{ \substack{\t{single best action}\\\t{in hindsight}} } } - \ubr{\sum_{t=1}^T y_t(x_t)}{ \htmlClass{fragment}{ \substack{\t{total reward of}\\\t{learner's actions}} } } $$
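A concrete reading of this formula (a minimal sketch of mine, not from the paper), on a finite action set where `rewards[t, x]` stands in for $y_t(x)$ and `probs[t]` for $p_t$:

```python
import numpy as np

def expected_regret(rewards: np.ndarray, probs: np.ndarray) -> float:
    """Expected regret of a randomized learner on a finite action set.

    rewards: (T, K) array with rewards[t, x] = y_t(x)
    probs:   (T, K) array with probs[t] = p_t, the learner's distribution at round t
    """
    best_in_hindsight = rewards.sum(axis=0).max()  # sup_x sum_t y_t(x)
    learner_total = (probs * rewards).sum()        # E[sum_t y_t(x_t)], x_t ~ p_t
    return float(best_in_hindsight - learner_total)
```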
$$ p_t = \argmax_{p\in\c{M}_1(X)} \E_{x\~p} \sum_{s=1}^{t-1} y_s(x) - \Gamma(p) $$
Follow the Regularized Leader (FTRL): parameterized by a convex regularizer $\Gamma : \c{M}_1(X) \to\R$
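As an illustration (mine, not from the paper): on a finite action set with the negative-entropy regularizer $\Gamma(p) = \frac{1}{\eta} \sum_x p(x) \log p(x)$, the update above has the familiar exponential-weights closed form:

```python
import numpy as np

def ftrl_entropy(past_rewards: np.ndarray, eta: float) -> np.ndarray:
    """FTRL step with the negative-entropy regularizer (1/eta) sum_x p(x) log p(x).

    past_rewards: (t-1, K) array of the rewards y_1, ..., y_{t-1} seen so far.
    Returns p_t, the maximizer of expected cumulative reward minus the regularizer,
    which is the exponential-weights distribution.
    """
    scores = eta * past_rewards.sum(axis=0)  # eta * sum_{s<t} y_s(x)
    scores -= scores.max()                   # shift for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()
```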
Idea: think of this game in a Bayesian way
Thompson sampling: follow the perturbed leader (FTPL) with very specific learning rates
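A sketch of this correspondence (function and parameter names are mine; assumes a finite action set and a prior under which each round's reward vector is an independent Gaussian draw, so the posterior over future rewards equals the prior): Thompson sampling plays the maximizer of observed cumulative reward plus a fresh prior sample of the remaining rewards, a perturbed leader whose perturbation scale grows like $\sqrt{T-t+1}$.

```python
import numpy as np

def thompson_ftpl_step(past_rewards: np.ndarray, t: int, T: int,
                       prior_cov: np.ndarray, rng: np.random.Generator) -> int:
    """One round of Thompson sampling, written as follow the perturbed leader.

    past_rewards: (t-1, K) array of observed rewards y_1, ..., y_{t-1}.
    prior_cov:    (K, K) covariance of a single round's reward vector under the prior.
    Returns the index of the action x_t to play.
    """
    K = prior_cov.shape[0]
    # Sampled total of the remaining rewards gamma_t + ... + gamma_T,
    # distributed as N(0, (T - t + 1) * prior_cov) under the IID Gaussian prior.
    gamma_sum = rng.multivariate_normal(np.zeros(K), (T - t + 1) * prior_cov)
    # Perturbed leader: cumulative observed reward plus the sampled future rewards.
    return int(np.argmax(past_rewards.sum(axis=0) + gamma_sum))
```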
Result (Gravin, Peres, and Sivan, 2016). If $X = \{1,2,3\}$, the (infinite-horizon discounted) online learning game's Nash equilibrium is Thompson sampling with respect to a certain optimal prior.
Regret decomposition: $$ \ubr{R(p,y)}{ \htmlClass{fragment}{ \t{regret} } } = \ubr{R(p,q^{(\gamma)})}{ \htmlClass{fragment}{ \t{prior regret} } } + \ubr{E_{q^{(\gamma)}}(p,y)}{ \htmlClass{fragment}{ \t{excess regret} } } $$ where $$ E_{q^{(\gamma)}}(p,y) = \sum_{t=1}^T \Gamma^*_{t+1}(y_{1:t}) - \Gamma^*_t(y_{1:t-1}) - \pair{y_t}{p_t} + \E\pair{\gamma_t}{\smash{p_t^{(\gamma)}}} $$ and $\displaystyle\Gamma^*_t(f) = \E\sup_{x\in X} \del{f(x) + \gamma_{t:T}(x)}$.
Strong prior over adversary:
Practical approximation: Gaussian process with matching smoothness
For a strong prior: $$ E_{q^{(\gamma)}}(p,y) \leq \sum_{t=1}^T \ubr{D_{\Gamma^*_t}(y_{1:t}\from y_{1:t-1})}{\substack{\t{Bregman divergence}\\\t{induced by }\Gamma^*_t}} $$
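For reference (standard definition, stated in my notation via the directional derivative, consistent with the Hessian formula below): $$ D_{\Gamma^*_t}(g \from f) = \Gamma^*_t(g) - \Gamma^*_t(f) - \p_{g-f}\, \Gamma^*_t(f) , $$ applied here with $g = y_{1:t}$ and $f = y_{1:t-1}$, so the increment direction is $y_t$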
Gaussian adversary: rates
Taylor form of the Bregman divergence: $$ \begin{gathered} \htmlClass{fragment}{ D_{\Gamma^*_t}(y_{1:t}\from y_{1:t-1}) = \int_0^1 (1 - \alpha) \ubr{\p^2_{y_t,y_t} \Gamma^*_t(y_{1:t-1} + \alpha y_t)}{ \htmlClass{fragment}{ \t{Gâteaux Hessian} } } \d\alpha } \\ \htmlClass{fragment}{ \p^2_{u,v} \Gamma^*_t(f) = \frac{1}{\sqrt{T-t+1}} \E u(\ubr{x^*_{f+\gamma_{t:T}}}{ \htmlClass{fragment}{ \substack{\t{perturbed}\\\t{maximizer}} } }) \pair{\gamma_{t:T}}{\smash{\ubr{\c{K}^{-1}\vphantom{x^*_f}}{ \htmlClass{fragment}{ \substack{\t{covariance}\\\t{operator}} } }}v} } \end{gathered} $$
Classical finite-dimensional argument: $\norm{\p^2 \Gamma^*_t(f)}_{L_{\infty,1}} \leq 2 \tr \p^2 \Gamma^*_t(f)$
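A sketch of that argument (my summary, reading $\norm{\cdot}_{L_{\infty,1}}$ as the $\ell_\infty \to \ell_1$ operator norm and assuming, as for perturbed-maximum potentials, that the Hessian $H = \p^2 \Gamma^*_t(f)$ has nonnegative diagonal entries, nonpositive off-diagonal entries, and rows summing to zero): $$ \norm{H}_{L_{\infty,1}} \leq \sum_{i,j} |H_{ij}| = \sum_i H_{ii} + \sum_i \sum_{j \neq i} (-H_{ij}) = \tr H + \tr H = 2 \tr H , $$ using that each row summing to zero gives $\sum_{j \neq i} (-H_{ij}) = H_{ii}$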
Idea: condition on maximizer and work with truncated normals
Theorem. Suppose the adversary satisfies $\norm{y}_\infty \leq \beta$ and the prior is IID over time with constant variance $k(x,x) = \sigma^2$. Suppose further that, for all $x, x' \in X$, $$ \sup_{y\in Y} \del{y(x) - y(x')\frac{k(x,x')}{k(x',x')}} \leq C_{Y,k} \del{1 - \frac{k(x,x')}{k(x',x')}} . $$ Then we have $$ D_{\Gamma^*_t}(y_{1:t} \from y_{1:t-1})\leq \frac{\beta^2 + \beta C_{Y,k}}{2\sigma^2\sqrt{T-t+1}} \E \sup_{x\in X} \gamma_{t:T}(x) . $$
Bounded Lipschitz $(\beta,\lambda)$ adversary, Matérn kernel with length scale $\kappa$: $$ \sup_{y\in Y} \del{y(x) - y(x')\frac{k(x,x')}{k(x',x')}} \leq \ubr{\frac{\beta (\lambda + \frac{1}{\kappa})}{\frac{1}{\kappa} \del{1 - e^{-\frac{2}{\lambda\kappa+1}}}}}{ \htmlClass{fragment}{ C_{Y,k} } } \del{1 - \frac{k(x,x')}{k(x',x')}} $$
Thompson sampling regret: $$ R(p,q) \leq \beta \del{32 + \frac{32}{1 - \frac{1}{e}}}\sqrt{T d\log\del{1 + \sqrt{d}\frac{\lambda}{\beta}}} $$
To explore via random actions: don't be too predictable, and match the adversary's smoothness
Non-discrete online learning:
Future work:
A. Terenin and J. Negrea. An Adversarial Analysis of Thompson Sampling for Full-information Online Learning: from Finite to Infinite Action Spaces. arXiv:2502.14790, 2025.