Talk for ITA 2026

Bayesian Algorithms for Adversarial Online Learning: from Finite to Infinite Action Spaces

Joint work with Jeff Negrea

Alexander Terenin

https://avt.im/ · @avt_im

Online Learning

Many problems in statistical learning and game theory:

Learning in games via repeated play

Sample-efficient reinforcement learning

Confidence sets and generalization bounds

Under-the-hood technical tool: adversarial online learning

This work: continuous action spaces

Challenges with Non-discrete Actions

Adversary behaves arbitrarily $\leadsto$ actions need to be randomized

                   Discrete Actions           Continuous Actions
Representation     Array of probabilities     Unnormalized density?
Sampling           Straightforward            Markov Chain Monte Carlo?

My work: continuous action online learning via correlated FTPL

Key insight: novel Bayesian perspective

Bayesian Algorithms for Adversarial Online Learning

New result: novel algorithm and regret bound for bounded Lipschitz online learning

New analysis technique: probabilistic bounds for follow-the-perturbed-leader with correlated Gaussians

Approach: (i) develop a Bayesian point of view, and then (ii) use it to understand smoothness

The Online Learning Game

At each time $t = 1,..,T$:

  1. Learner picks a random action $x_t \~ p_t \in \c{M}_1(X)$.
  2. Adversary responds with a reward function $y_t : X \to \R$, chosen adaptively from a given function class $Y$.
  3. Full feedback: the learner observes $y_t(x)$ for all $x$.

Regret: $$ R(p,q) = \E_{\substack{x_t\~p_t\\y_t\~q_t}} \sup_{x\in X} \sum_{t=1}^{T} y_t(x) - \sum_{t=1}^{T} y_t(x_t) $$

  • First term: single best action in hindsight
  • Second term: total reward of the learner's actions
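
For concreteness, the protocol above can be sketched in code on a finite (or discretized) action set. Everything below — the finite action set, the learner/adversary interfaces, and all names — is an illustrative assumption rather than the paper's implementation.

```python
import numpy as np

def play_online_learning_game(learner, adversary, num_actions, T, rng):
    """Run the full-feedback online learning protocol and return the empirical regret.

    learner(past_rewards, t, T, rng) -> probability vector p_t over the actions.
    adversary(past_rewards, past_actions, t) -> reward vector y_t over the actions.
    Both interfaces are hypothetical and only for illustration.
    """
    past_rewards = np.zeros((0, num_actions))          # rows are y_1, ..., y_{t-1}
    past_actions = []
    learner_total = 0.0

    for t in range(1, T + 1):
        p_t = learner(past_rewards, t, T, rng)          # learner picks a distribution p_t
        x_t = rng.choice(num_actions, p=p_t)            # ...and samples a random action x_t ~ p_t
        y_t = adversary(past_rewards, past_actions, t)  # adversary responds adaptively with y_t
        learner_total += y_t[x_t]                       # learner collects y_t(x_t)
        past_rewards = np.vstack([past_rewards, y_t])   # full feedback: all of y_t is revealed
        past_actions.append(x_t)

    # Regret: reward of the single best action in hindsight minus the learner's total reward.
    best_in_hindsight = past_rewards.sum(axis=0).max()
    return best_in_hindsight - learner_total
```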

Non-discrete Online Learning

Typical Online Learning Algorithms

Standard approaches and challenges:

  • Exponential weights: requires sampling via MCMC
  • Adaptive discretization: scalability with dimension
  • Mirror descent: requires non-obvious choice of a mirror map

Core method design toolkit: convex optimization

Our starting point: follow-the-perturbed-leader (FTPL) methods

  • Advantage: computation involves optimization, not sampling
  • Challenge: handling smoothness
  • Contributions: (i) algorithm design and (ii) algorithm analysis

Key Idea: Algorithm Design

Start with adversary's perspective:

  1. Find a strong strategy $q^{(\gamma)}\in\c{M}_1(Y)$ for the adversary
  2. Pretend* adversary will follow this strategy: $\gamma_1,..,\gamma_T \~ q^{(\gamma)}_\infty$
  3. Play random actions using the (Bayesian) conditional best response $$ x_t = \argmax_{x\in X} \sum_{\tau=1}^{t-1} y_\tau(x) + \sum_{\tau=t}^T \gamma_\tau(x) $$

Asterisk: Central Limit Theorem approximation

Algorithm: FTPL in the form of Thompson sampling

Counterintuitive: we're positing a stochastic model, but in reality are playing a minimax game

Inspiration: if $X=\{1,2,3\}$ the Nash strategy has this form (Gravin, Peres, and Sivan, 2016)
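
A minimal sketch of the conditional-best-response step above, on a finite or discretized action set with an i.i.d. Gaussian prior over the fictitious future rewards. The prior choice, the per-round resampling, and all names are illustrative assumptions; the paper's construction chooses the prior to match the strong adversary strategy.

```python
import numpy as np

def thompson_ftpl_action(past_rewards, t, T, sigma, rng):
    """Conditional best response: argmax_x of observed rewards plus hallucinated future rewards.

    past_rewards: array of shape (t - 1, num_actions) holding y_1, ..., y_{t-1} (t is 1-indexed).
    The fictitious future rewards gamma_t, ..., gamma_T are drawn here as i.i.d. N(0, sigma^2)
    per action and resampled every round -- one simple FTPL variant, not necessarily the
    paper's exact prior.
    """
    num_actions = past_rewards.shape[1]
    observed = past_rewards.sum(axis=0)      # sum_{tau < t} y_tau(x)
    remaining = T - t + 1                    # rounds t, t + 1, ..., T still to be played
    # The sum of `remaining` i.i.d. N(0, sigma^2) draws is N(0, remaining * sigma^2),
    # so a single scaled Gaussian sample stands in for sum_{tau >= t} gamma_tau(x).
    hallucinated = np.sqrt(remaining) * sigma * rng.standard_normal(num_actions)
    return int(np.argmax(observed + hallucinated))
```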

Can the true adversary exploit this?

Approach: Analyze Bayesian Algorithm in a Bayesian Way

Regret: $\forall p$, $$ R(p,q^{(\gamma)}) \,\,\leq\,\, R(p^{(\gamma)}_\infty,q) \,\,=\,\, R(p,q^{(\gamma)}_\infty) \,\,+\,\, E_{q^{(\gamma)}_\infty}(p^{(\gamma)}_\infty,q) $$

  • $R(p,q^{(\gamma)})$: strong adversary's lower bound
  • $R(p^{(\gamma)}_\infty,q)$: true regret of our algorithm
  • $R(p,q^{(\gamma)}_\infty)$: prior regret, what we thought the adversary would do
  • $E_{q^{(\gamma)}_\infty}(p^{(\gamma)}_\infty,q)$: excess regret, what the adversary actually did

where:

  • Prior regret: $R(p,q^{(\gamma)}) \approx R(p,q^{(\gamma)}_\infty)$ by the CLT; bound via chaining
  • Excess regret: bounded by the Bregman divergence $D_{\Gamma^*_t}(y_{1:t}\from y_{1:t-1})$, the difficult term

Technical Contribution: Local Norm Bounds for Correlated Gaussians

The hard part: smooth adversaries lead to correlated priors

  • Useful bounds on $D_{\Gamma^*_t}(y_{1:t}\from y_{1:t-1})$: open problem since Abernethy et al. (2014), even for correlated Gaussians!

Our approach: follow Bayesian intuition $$ \begin{aligned} \t{linear algebra} \quad &\leftrightsquigarrow\quad \t{conditional probability} \\ \t{traces of matrices} \quad &\leftrightsquigarrow\quad \t{conditional expectations} \\ L_{\infty,1} \t{ matrix norm identities} \quad &\leftrightsquigarrow\quad \t{balance of probability in truncated Gaussians} \end{aligned} $$

New result: general criterion connecting adversary's smoothness with learner's prior and correlations
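
The first line of this dictionary can be made concrete with standard Gaussian conditioning: the sketch below computes a conditional of a correlated Gaussian once via linear algebra (the Schur complement) and once by linearly updating joint prior samples (Matheron's rule). This is only an illustration of the correspondence, not the paper's local norm bounds.

```python
import numpy as np

rng = np.random.default_rng(0)

# A correlated Gaussian prior on three points, here with an exponential (Matern-1/2) kernel.
pts = np.array([0.0, 0.5, 1.0])
K = np.exp(-np.abs(pts[:, None] - pts[None, :]))      # k(x, x') = exp(-|x - x'|)

# Linear-algebra side: condition (gamma_2, gamma_3) on gamma_1 = 1 via the Schur complement.
y_obs = 1.0
K_aa, K_ab, K_bb = K[1:, 1:], K[1:, :1], K[:1, :1]
cond_mean = (K_ab / K_bb) * y_obs                      # K_ab K_bb^{-1} y_obs
cond_cov = K_aa - K_ab @ np.linalg.inv(K_bb) @ K_ab.T  # Schur complement

# Conditional-probability side: Matheron's rule turns conditioning into a linear update of
# joint prior samples: gamma | obs  =d  gamma + K_ab K_bb^{-1} (y_obs - gamma_obs).
samples = rng.multivariate_normal(np.zeros(3), K, size=200_000)
updated = samples[:, 1:] + (y_obs - samples[:, :1]) @ (K_ab / K_bb).T

print("conditional mean, exact:   ", cond_mean.ravel())
print("conditional mean, sampled: ", updated.mean(axis=0))
print("conditional cov, exact:\n", cond_cov)
print("conditional cov, sampled:\n", np.cov(updated.T))
```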

Classical Finite Adversary

Sanity check: in classical setting, do we recover optimal rates?

  • Prior: Gaussian with diagonal covariance matrix
  • Criterion: simplifies to $\sup_{\norm{y}_\infty\leq 1} y(x) \leq 1$
  • Regret bound: $R(p,q) \leq 4\sqrt{T\log N}$

Adversarial problem: not obvious that Bayesian learning makes sense

  • This result: learner uses strong prior $\leadsto$ strong algorithm
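
A quick numerical sanity check of this finite-action setting might look as follows. The oblivious ±1 adversary, the prior scale, and the per-round resampling are illustrative assumptions, not the worst case or the paper's exact construction; the bound $4\sqrt{T\log N}$ is the one stated above.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, sigma = 16, 5000, 1.0                        # actions, horizon, prior scale (illustrative)

rewards = rng.choice([-1.0, 1.0], size=(T, N))     # a simple oblivious +-1 adversary
cumulative = np.zeros(N)                           # running sums of observed rewards
learner_total = 0.0

for t in range(1, T + 1):
    remaining = T - t + 1
    perturbation = np.sqrt(remaining) * sigma * rng.standard_normal(N)  # diagonal Gaussian prior
    x_t = int(np.argmax(cumulative + perturbation))                     # perturbed leader
    learner_total += rewards[t - 1, x_t]
    cumulative += rewards[t - 1]                                        # full feedback

regret = cumulative.max() - learner_total
print(f"empirical regret: {regret:.1f}   bound 4 sqrt(T log N): {4 * np.sqrt(T * np.log(N)):.1f}")
```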

Bounded Lipschitz Adversary

Theorem. Let $X = [0,1]^d$, let $Y$ be bounded Lipschitz with $\norm{y}_\infty\leq\beta$ and $\abs{y}_{\f{Lip}}\leq\lambda$, and let the prior be Matérn-$\frac{1}{2}$ with variance $\beta$ and length scale $\frac{1}{\lambda}$. Then Thompson sampling achieves a regret of $$ R(p,q) \leq \beta \del{32 + \frac{32}{1 - \frac{1}{e}}}\sqrt{T d\log\del{1 + \sqrt{d}\frac{\lambda}{\beta}}} . $$

Criterion: $\displaystyle\sup_{y\in Y} y(x) - y(x')\frac{k(x,x')}{k(x',x')} \leq C_{Y,k} \del{1 - \frac{k(x,x')}{k(x',x')}} , \quad C_{Y,k} = \frac{2\beta}{\del{1 - e^{-\frac{2\beta}{\lambda\kappa+\beta}}}}$

  • Simple and easy to check
  • Exact correlation structure is critical
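
To see what the correlated prior in the theorem looks like in practice, the sketch below discretizes $X=[0,1]$ onto a grid and draws perturbations from a Matérn-$\frac{1}{2}$ kernel with variance $\beta$ and length scale $\frac{1}{\lambda}$. The grid discretization, parameter values, and function names are illustrative assumptions; the theorem itself is stated over the continuum.

```python
import numpy as np

rng = np.random.default_rng(2)
beta, lam = 1.0, 4.0                      # reward bound and Lipschitz constant (illustrative values)
grid = np.linspace(0.0, 1.0, 200)         # discretization of X = [0, 1], for illustration only

# Matern-1/2 prior with variance beta and length scale 1/lambda: k(x, x') = beta exp(-lam |x - x'|).
K = beta * np.exp(-lam * np.abs(grid[:, None] - grid[None, :]))
L = np.linalg.cholesky(K + 1e-10 * np.eye(len(grid)))   # jitter for numerical stability

def thompson_step(past_rewards_on_grid, t, T):
    """One perturbed-leader step with correlated Gaussian perturbations on the grid.

    past_rewards_on_grid: array of shape (t - 1, len(grid)) of observed rewards on the grid.
    """
    remaining = T - t + 1
    # A sum of `remaining` i.i.d. draws from GP(0, K) has covariance remaining * K.
    gamma_sum = np.sqrt(remaining) * (L @ rng.standard_normal(len(grid)))
    scores = past_rewards_on_grid.sum(axis=0) + gamma_sum
    return grid[int(np.argmax(scores))]

# Example: first round with no observations yet.
print("x_1 =", thompson_step(np.zeros((0, len(grid))), t=1, T=100))
```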

What do we learn from this?

Exploration by random actions:

  • Overwhelmingly used in empirical RL (PPO variants, entropy regularization), yet less prominent in theory (historical focus: optimism)

Adversarial formulation directly leads to randomized policies

  • This work: the fundamental design principle might be Bayes' Rule, even in adversarial setups that are not a priori Bayesian
  • Reason 1: works with smoothness
  • Reason 2: computationally practical

Future directions:

  • General forms of feedback
  • In-context decision-making

One last thing: I'm on the faculty job market!

Gaussian Process Scalability Highlight: ICML Best Paper Honorable Mention

Bayesian Optimization via Pandora's Box NeurIPS 2025

Cost-aware Early Stopping In review at ICML

Multi-stage Decision Making via Gittins Indices INFORMS tutorial, current

Geometric Gaussian Processes Highlight: AISTATS Best Student Paper

Non-discrete Online Learning via Thompson Sampling In review at JMLR

Online Learning in Extensive-form Games In review at COLT

Thank you!

https://avt.im/ · @avt_im


A. Terenin and J. Negrea. An Adversarial Analysis of Thompson Sampling for Full-information Online Learning: from Finite to Infinite Action Spaces. arXiv:2502.14790, 2025.