Talk for ITA 2026
Joint work with Jeff Negrea
Alexander Terenin
Many problems in statistical learning and game theory:
Learning in games via repeated play
Sample-efficient reinforcement learning
Confidence sets and generalization bounds
Under-the-hood technical tool: adversarial online learning
This work: continuous action spaces
Adversary behaves arbitrarily $\leadsto$ actions need to be randomized
| | Discrete Actions | Continuous Actions |
| --- | --- | --- |
| Representation | Array of probabilities | Unnormalized density? |
| Sampling | Straightforward | Markov Chain Monte Carlo? |
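To make the table's contrast concrete, here is a minimal sketch (my illustration, not from the talk) of the two sampling problems: a normalized probability array versus an unnormalized density sampled with a generic random-walk Metropolis chain. The density used here is a hypothetical placeholder.

```python
# Sketch: sampling a randomized action in each setting.
import numpy as np

rng = np.random.default_rng(0)

# --- Discrete actions: the strategy is an explicit probability array ---
probs = np.array([0.2, 0.5, 0.3])                    # normalized strategy over {0, 1, 2}
discrete_action = rng.choice(len(probs), p=probs)    # straightforward sampling

# --- Continuous actions: the strategy is only known up to normalization ---
def unnormalized_density(x):
    # hypothetical exponential-weights-style density on [0, 1]
    return np.exp(np.sin(8.0 * x)) if 0.0 <= x <= 1.0 else 0.0

def metropolis_sample(n_steps=1000, step_size=0.1):
    """Random-walk Metropolis: one generic way to sample without knowing the normalizer."""
    x = 0.5
    for _ in range(n_steps):
        proposal = x + step_size * rng.normal()
        accept_prob = min(1.0, unnormalized_density(proposal) / unnormalized_density(x))
        if rng.uniform() < accept_prob:
            x = proposal
    return x

continuous_action = metropolis_sample()
```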
My work: continuous-action online learning via correlated FTPL
Key insight: novel Bayesian perspective
New result: novel algorithm and regret bound for bounded Lipschitz online learning
New analysis technique: probabilistic bounds for follow-the-perturbed-leader with correlated Gaussians
Approach: (i) develop a Bayesian point of view, and then (ii) use it to understand smoothness
At each time $t = 1, \ldots, T$: the learner plays $x_t \sim p_t$, the adversary plays $y_t \sim q_t$, and the learner observes $y_t$ and receives reward $y_t(x_t)$
Regret: $$ R(p,q) = \E_{\substack{x_t\sim p_t\\y_t\sim q_t}} \htmlClass{anchor-1}{\sup_{x\in X} \sum_{t=1}^{\smash{T}} y_t(x)} - \htmlClass{anchor-2}{\sum_{t=1}^{\smash{T}} y_t(x_t)} $$
single best action in hindsight
total reward of learner's actions
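For concreteness, a small sketch (my illustration, not the paper's code) of the realized version of this quantity on a grid discretization of $X = [0,1]$, with placeholder reward functions and a placeholder sequence of played actions; the definition above additionally takes an expectation over the randomization.

```python
# Sketch: realized regret = best fixed action in hindsight minus learner's total reward.
import numpy as np

def empirical_regret(reward_fns, played_actions, grid):
    cumulative = np.zeros_like(grid)     # sum_t y_t(x) at every grid point x
    learner_total = 0.0                  # sum_t y_t(x_t)
    for y, x_t in zip(reward_fns, played_actions):
        cumulative += y(grid)
        learner_total += y(x_t)
    return cumulative.max() - learner_total   # sup_x sum_t y_t(x) - sum_t y_t(x_t)

grid = np.linspace(0.0, 1.0, 256)
reward_fns = [lambda x, t=t: np.sin(3.0 * x + 0.5 * t) for t in range(10)]   # placeholder y_t
played = np.random.default_rng(1).uniform(0.0, 1.0, size=10)                 # placeholder x_t
print(empirical_regret(reward_fns, played, grid))
```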
Standard approaches and challenges:
Core method design toolkit: convex optimization
Our starting point: follow-the-perturbed-leader (FTPL) methods
Start with adversary's perspective:
Asterisk: Central Limit Theorem approximation
Algorithm: FTPL in the form of Thompson sampling (see the sketch below)
Counterintuitive: we're positing a stochastic model, but in reality we are playing a minimax game
Inspiration: if $X=\{1,2,3\}$ the Nash strategy has this form (Gravin, Peres, and Sivan, 2016)
Can the true adversary exploit this?
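A minimal finite-action sketch of this idea as I read it: follow-the-perturbed-leader whose perturbation is a Gaussian draw, optionally correlated across actions, which can be viewed as Thompson sampling under a Gaussian model of the adversary. The perturbation scale, covariance, and adversary below are illustrative placeholders, not the tuned quantities from the paper.

```python
# Sketch: FTPL with (possibly correlated) Gaussian perturbations on a finite action set.
# At round t, play argmax_i [ sum_{s<t} y_s(i) + z_t(i) ],  z_t ~ N(0, gamma^2 * Sigma).
import numpy as np

rng = np.random.default_rng(0)
n_actions, T = 3, 100
gamma = np.sqrt(T)                 # illustrative perturbation scale
Sigma = np.eye(n_actions)          # identity = classical Gaussian FTPL; swap in correlations here

cumulative = np.zeros(n_actions)   # sum of observed reward vectors so far
chol = np.linalg.cholesky(Sigma)
total_reward = 0.0

for t in range(T):
    perturbation = gamma * (chol @ rng.normal(size=n_actions))
    action = int(np.argmax(cumulative + perturbation))     # follow the perturbed leader
    y_t = rng.uniform(-1.0, 1.0, size=n_actions)           # placeholder adversary
    total_reward += y_t[action]
    cumulative += y_t                                      # full information: observe all of y_t

regret = cumulative.max() - total_reward
```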
Regret: $\forall p$, $$ \htmlData{class=fragment,fragment-index=5}{ \htmlClass{anchor-1}{R(p,q^{(\gamma)})} \,\,\leq\,\, } \htmlData{class=fragment,fragment-index=3}{ \htmlClass{anchor-2}{R(p^{(\gamma)}_\infty,q)} } \htmlData{class=fragment,fragment-index=7}{ \,\,=\,\, \htmlClass{anchor-3}{R(p,q^{(\gamma)}_\infty)} \,\,+\,\, \htmlClass{anchor-4}{E_{q^{(\gamma)}_\infty}(p^{(\gamma)}_\infty,q)} } $$
strong adversary's lower bound
true regret of our algorithm
prior regret: what we thought adversary would do
excess regret: what adversary actually did
The excess regret is the difficult term to bound
The hard part: smooth adversaries lead to correlated priors
Our approach: follow Bayesian intuition $$ \begin{aligned} \htmlData{fragment-index=5,class=fragment}{ \t{linear algebra} } \quad & \htmlData{fragment-index=5,class=fragment}{ \leftrightsquigarrow } \quad \htmlData{fragment-index=5,class=fragment}{ \t{conditional probability} } \\ \htmlData{fragment-index=6,class=fragment}{ \t{traces of matrices} } \quad & \htmlData{fragment-index=6,class=fragment}{ \leftrightsquigarrow } \quad \htmlData{fragment-index=6,class=fragment}{ \t{conditional expectations} } \\ \htmlData{fragment-index=7,class=fragment}{ \begin{gathered}L_{\infty,1} \t{matrix}\\[-1ex]\t{norm identities}\end{gathered} } \quad & \htmlData{fragment-index=7,class=fragment}{ \leftrightsquigarrow } \quad \htmlData{fragment-index=7,class=fragment}{ \begin{gathered}\t{balance of probability}\\[-1ex]\t{in truncated Gaussians}\end{gathered} } \end{aligned} $$
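One standard instance of this dictionary, stated for reference: for a zero-mean Gaussian process $f$ with kernel $k$, conditioning is linear algebra, and the ratio $\frac{k(x,x')}{k(x',x')}$ that shows up in the criterion later in the talk is exactly the conditional-expectation coefficient $$ \E\del{f(x) \mid f(x')} = \frac{k(x,x')}{k(x',x')}\, f(x') , \qquad \operatorname{Var}\del{f(x) \mid f(x')} = k(x,x) - \frac{k(x,x')^2}{k(x',x')} . $$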
New result: general criterion connecting adversary's smoothness with learner's prior and correlations
Sanity check: in classical setting, do we recover optimal rates?
Adversarial problem: not obvious that Bayesian learning makes sense
Theorem. Let $X = [0,1]^d$, let $Y$ be the set of bounded Lipschitz functions with $\norm{y}_\infty\leq\beta$ and $\abs{y}_{\f{Lip}}\leq\lambda$, and let the prior be Matérn-$\frac{1}{2}$ with variance $\beta$ and length scale $\frac{1}{\lambda}$. Then Thompson sampling achieves a regret of $$ R(p,q) \leq \beta \del{32 + \frac{32}{1 - \frac{1}{e}}}\sqrt{T d\log\del{1 + \sqrt{d}\frac{\lambda}{\beta}}} . $$
Criterion: $\displaystyle\sup_{y\in Y} \del{y(x) - y(x')\frac{k(x,x')}{k(x',x')}} \leq C_{Y,k} \del{1 - \frac{k(x,x')}{k(x',x')}} , \quad C_{Y,k} = \frac{2\beta}{\del{1 - e^{-\frac{2\beta}{\lambda\kappa+\beta}}}}$
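A minimal simulation sketch of the resulting algorithm, under assumptions made explicit here: $X = [0,1]$ discretized to a grid, the Matérn-$\frac{1}{2}$ kernel taken as $k(x,x') = \beta e^{-\lambda\abs{x-x'}}$ (one reading of "variance $\beta$, length scale $\frac{1}{\lambda}$"), a fresh prior draw used as the perturbation each round with no time-dependent rescaling, and a placeholder adversary; constants are illustrative rather than the paper's.

```python
# Sketch: Thompson sampling as GP-perturbed FTPL for full-information online learning.
# Each round: draw f ~ GP(0, k) with a Matern-1/2 kernel and play
#   x_t = argmax_x [ sum_{s<t} y_s(x) + f(x) ]   on a grid discretization of [0, 1].
import numpy as np

rng = np.random.default_rng(0)
beta, lam = 1.0, 5.0                       # reward bound and Lipschitz constant (illustrative)
grid = np.linspace(0.0, 1.0, 200)
T = 200

# Assumed kernel convention: variance beta, length scale 1/lam.
dists = np.abs(grid[:, None] - grid[None, :])
K = beta * np.exp(-lam * dists)
chol = np.linalg.cholesky(K + 1e-8 * np.eye(len(grid)))   # jitter for numerical stability

cumulative = np.zeros_like(grid)           # sum_{s<t} y_s(x) on the grid
total_reward = 0.0

for t in range(T):
    f = chol @ rng.normal(size=len(grid))                  # GP prior sample = the perturbation
    idx = int(np.argmax(cumulative + f))                   # follow the perturbed leader
    y_t = beta * np.sin(lam * grid + 0.3 * t)              # placeholder bounded-Lipschitz adversary
    total_reward += y_t[idx]
    cumulative += y_t                                      # full information: observe all of y_t

regret = cumulative.max() - total_reward
print(f"realized regret after T={T}: {regret:.2f}")
```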
Exploration by random actions:
PPO variants, entropy regularization
historical focus: optimism
Adversarial formulation directly leads to randomized policies
even in adversarial setups which are not a priori Bayesian
Future directions:
Gaussian Process Scalability (Highlight: ICML Best Paper Honorable Mention)
Bayesian Optimization via Pandora's Box (NeurIPS 2025)
Cost-aware Early Stopping (In review at ICML)
Multi-stage Decision Making via Gittins Indices (INFORMS tutorial, current)
Geometric Gaussian Processes (Highlight: AISTATS Best Student Paper)
Non-discrete Online Learning via Thompson Sampling (In review at JMLR)
Online Learning in Extensive-form Games (In review at COLT)
A. Terenin and J. Negrea. An Adversarial Analysis of Thompson Sampling for Full-information Online Learning: from Finite to Infinite Action Spaces. arXiv:2502.14790, 2025.