INFORMS Job Market Showcase

Bayesian Algorithms for Adversarial Online Learning: from Finite to Infinite Action Spaces

Alexander Terenin

https://avt.im/ · @avt_im

Announcement: INFORMS Tutorial

The Gittins Index: A Design Principle for Decision-making Under Uncertainty

Monday 2:45pm: Building A, Level 4, Room A410


Online Learning

Many problems in statistical learning and game theory:

  • Learning in games via repeated play
  • Sample-efficient reinforcement learning
  • Confidence sets and generalization bounds

Under-the-hood technical tool: adversarial online learning

This talk: non-discrete action spaces

Challenges with Non-discrete Actions

Adversary behaves arbitrarily $\leadsto$ actions need to be randomized

                   Discrete Actions          Continuous Actions
  Representation   Array of probabilities    Unnormalized density?
  Sampling         Straightforward           Markov Chain Monte Carlo?

My work: continuous online learning via correlated follow-the-perturbed-leader (FTPL)

Key insight: novel Bayesian perspective

Bayesian Algorithms for Adversarial Online Learning

New result: novel algorithm and regret bound for bounded Lipschitz online learning

New analysis technique: probabilistic bounds for follow-the-perturbed-leader with correlated Gaussians

Approach: (i) develop a Bayesian point of view, and then (ii) use it to understand smoothness

Joint work with Jeff Negrea

The Online Learning Game

At each time $t = 1,\ldots,T$:

  1. Learner picks a random action $x_t \~ p_t \in \c{M}_1(X)$.
  2. Adversary responds with a reward function $y_t : X \to \R$, chosen adaptively from a given function class $Y$.
  3. Full feedback: the learner observes $y_t(x)$ for all $x$.

Regret: $$ R(p,q) = \E_{\substack{x_t\~p_t\\y_t\~q_t}} \ubr{\sup_{x\in X} \sum_{t=1}^{T} y_t(x)}{\substack{\t{single best action}\\\t{in hindsight}}} - \ubr{\sum_{t=1}^{T} y_t(x_t)}{\substack{\t{total reward of}\\\t{learner's actions}}} $$
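For concreteness, here is a minimal Python sketch of one run of this game on a finite grid of actions; the grid and the `learner` / `adversary` callables are illustrative assumptions, not part of the formal setup.

```python
# Minimal sketch of the full-information online learning game on a finite grid of
# actions (an illustrative stand-in for X; the learner and adversary callables are
# hypothetical and supplied by the user).
import numpy as np

def play_game(learner, adversary, num_actions, T, seed=0):
    """Run T rounds and return the realized regret against the best fixed action."""
    rng = np.random.default_rng(seed)
    cumulative = np.zeros(num_actions)   # sum_t y_t(x) for each action x
    learner_total = 0.0
    history = []
    for t in range(T):
        p_t = learner(history)                    # distribution over actions
        x_t = rng.choice(num_actions, p=p_t)      # x_t ~ p_t
        y_t = adversary(history, x_t)             # full feedback: y_t(x) for every x
        learner_total += y_t[x_t]
        cumulative += y_t
        history.append((x_t, y_t))
    # Regret: best fixed action in hindsight minus the learner's total reward
    return cumulative.max() - learner_total
```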

Non-discrete Online Learning

[Figure: actions, the learner's distribution, and rewards]

Typical Online Learning Algorithms

Standard approaches and challenges:

  • Exponential weights: requires sampling via MCMC
  • Adaptive discretization: scales poorly with dimension
  • Mirror descent: requires non-obvious choice of a mirror map

Core method design toolkit: convex optimization

Our starting point: follow-the-perturbed-leader (FTPL) methods

  • Advantage: computation involves optimization, not sampling
  • Challenge: handling smoothness
  • Contributions: (i) algorithm design and (ii) algorithm analysis

Key Idea: Algorithm Design

Start with the adversary's perspective:

  1. Find a strong strategy $q^{(\gamma)}\in\c{M}_1(Y)$ for the adversary
  2. Pretend* the adversary will follow this strategy: $\gamma_1,\ldots,\gamma_T \~ q^{(\gamma)}_\infty$ (asterisk: Central Limit Theorem approximation)
  3. Play random actions using the (Bayesian) conditional best response $$ x_t = \argmax_{x\in X} \sum_{\tau=1}^{t-1} y_\tau(x) + \sum_{\tau=t}^T \gamma_\tau(x) $$

Algorithm: FTPL in the form of Thompson sampling

Inspiration: if $X=\{1,2,3\}$, the Nash strategy has this form (Gravin, Peres, and Sivan, 2016)
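As a concrete illustration of this decision rule, here is a minimal Python sketch on a finite grid of candidate actions; the grid, the vectorized reward functions, and the `sample_prior_perturbations` routine are assumptions made for illustration, not the paper's exact construction.

```python
# Sketch of the FTPL / Thompson-sampling-style decision rule on a finite grid of
# candidate actions. `sample_prior_perturbations` is a hypothetical routine that
# draws gamma_t, ..., gamma_T from the learner's Gaussian prior on the grid.
import numpy as np

def thompson_ftpl_action(t, T, past_reward_fns, candidates, sample_prior_perturbations):
    """x_t = argmax_x  sum_{tau < t} y_tau(x)  +  sum_{tau = t}^{T} gamma_tau(x)."""
    # Cumulative observed rewards on the grid: sum_{tau < t} y_tau(x)
    # (each y_tau is assumed vectorized over the candidate grid)
    past = sum(y(candidates) for y in past_reward_fns)
    # Remaining prior perturbations, shape (T - t + 1, len(candidates))
    gammas = sample_prior_perturbations(T - t + 1, candidates)
    scores = past + gammas.sum(axis=0)
    return candidates[np.argmax(scores)]
```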

Can the true adversary exploit this?

Approach: Analyze Bayesian Algorithm in a Bayesian Way

Regret: $\forall p$, $$ \ubr{R(p,q^{(\gamma)})}{\substack{\t{strong adversary's}\\\t{lower bound}}} \,\,\leq\,\, \ubr{R(p^{(\gamma)}_\infty,q)}{\substack{\t{true regret of}\\\t{our algorithm}}} \,\,=\,\, \ubr{R(p,q^{(\gamma)}_\infty)}{\substack{\t{prior regret:}\\\t{what we thought}\\\t{adversary would do}}} \,\,+\,\, \ubr{E_{q^{(\gamma)}_\infty}(p^{(\gamma)}_\infty,q)}{\substack{\t{excess regret:}\\\t{what adversary}\\\t{actually did}}} $$ where:

  • Prior regret: $R(p,q^{(\gamma)}) \approx R(p,q^{(\gamma)}_\infty)$ by CLT, bound via chaining
  • Excess regret: bounded by Bregman divergence $\ubr{D_{\Gamma^*_t}(y_{1:t}\from y_{1:t-1})}{\t{the difficult term}}$

Technical Contribution: Local Norm Bounds for Correlated Gaussians

The hard part: smooth adversaries lead to correlated priors

  • Useful bounds on $D_{\Gamma^*_t}(y_{1:t}\from y_{1:t-1})$: open problem since Abernethy et al. (2014), even for correlated Gaussians!

Our approach: follow Bayesian intuition $$ \begin{aligned} \t{linear algebra} \quad & \leftrightsquigarrow \quad \t{conditional probability} \\ \t{traces of matrices} \quad & \leftrightsquigarrow \quad \t{conditional expectations} \\ \begin{gathered}L_{\infty,1} \t{ matrix}\\[-1ex]\t{norm identities}\end{gathered} \quad & \leftrightsquigarrow \quad \begin{gathered}\t{balance of probability}\\[-1ex]\t{in truncated Gaussians}\end{gathered} \end{aligned} $$

New result: general criterion connecting adversary's smoothness with learner's prior and correlations

Classical Finite Adversary

Sanity check: in the classical setting, do we recover optimal rates?

  • Prior: Gaussian with diagonal covariance matrix
  • Criterion: simplifies to $\sup_{\norm{y}_\infty\leq 1} y(x) \leq 1$
  • Regret bound: $R(p,q) \leq 4\sqrt{T\log N}$, where $N$ is the number of actions

Adversarial problem: it is not obvious that Bayesian learning makes sense

  • This result: the learner's prior is not predictable $\leadsto$ strong algorithm
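For this finite setting, the decision rule specializes to independent Gaussian perturbations per action. The sketch below is an illustrative instantiation; the unit per-round variance is an assumption, not the tuned constant behind the $4\sqrt{T\log N}$ bound.

```python
# Illustrative sketch of the classical finite-action case: FTPL with a diagonal
# Gaussian prior over N actions. Unit per-round variance is assumed for simplicity.
import numpy as np

def finite_ftpl_action(past_rewards, t, T, rng):
    """past_rewards[i] = sum_{tau < t} y_tau(i); returns the action index to play."""
    N = len(past_rewards)
    # Sum of the remaining i.i.d. N(0, 1) perturbations gamma_t, ..., gamma_T per action
    remaining = rng.normal(scale=np.sqrt(T - t + 1), size=N)
    return int(np.argmax(past_rewards + remaining))
```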

Bounded Lipschitz Adversary

Theorem. Let $X = [0,1]^d$, let $Y$ be bounded Lipschitz with $\norm{y}_\infty\leq\beta$ and $\abs{y}_{\f{Lip}}\leq\lambda$, and let the prior be Matérn-$\frac{1}{2}$ with variance $\beta$ and length scale $\frac{1}{\lambda}$. Then Thompson sampling achieves regret $$ R(p,q) \leq \beta \del{32 + \frac{32}{1 - \frac{1}{e}}}\sqrt{T d\log\del{1 + \sqrt{d}\frac{\lambda}{\beta}}} . $$

Criterion: $\displaystyle\sup_{y\in Y} y(x) - y(x')\frac{k(x,x')}{k(x',x')} \leq C_{Y,k} \del{1 - \frac{k(x,x')}{k(x',x')}} , \quad C_{Y,k} = \frac{2\beta}{\del{1 - e^{-\frac{2\beta}{\lambda\kappa+\beta}}}}$

  • Simple and easy to check
  • Exact correlation structure is critical
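As a rough illustration of how the theorem's prior could be instantiated, the sketch below draws Matérn-$\frac{1}{2}$ perturbations with variance $\beta$ and length scale $\frac{1}{\lambda}$ on a finite discretization of $[0,1]^d$; the grid approximation and the direct multivariate-normal sampling are illustrative assumptions, not part of the theorem.

```python
# Rough sketch: Matérn-1/2 prior with variance beta and length scale 1/lambda on a
# finite grid discretizing X = [0,1]^d (an illustrative approximation; the theorem
# itself is stated over the continuous domain).
import numpy as np

def matern12_cov(points, beta, lam):
    """Matérn-1/2 covariance k(x, x') = beta * exp(-lam * ||x - x'||)."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return beta * np.exp(-lam * dists)

def lipschitz_ftpl_action(past_rewards, t, T, points, beta, lam, rng):
    """Play argmax over grid points of past rewards plus remaining prior perturbations."""
    K = matern12_cov(points, beta, lam)
    # The sum of T - t + 1 i.i.d. draws from N(0, K) equals one draw from N(0, (T-t+1) K)
    gamma_sum = rng.multivariate_normal(np.zeros(len(points)), (T - t + 1) * K)
    return points[np.argmax(past_rewards + gamma_sum)]
```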

Takeaways and Future Work

Fundamental theoretical insights:

  • To explore by random actions, consider a Bayesian perspective
  • To protect against an adversary, use priors that are not predictable
  • To handle non-discrete rewards, match smoothness

Next steps:

  • Equilibrium computation in non-discrete-action games
  • Applications in economics

Longer-term:

  • Non-discrete adversarial bandits and reinforcement learning
  • Applications in continuous control, and more broadly

Thank you!

https://avt.im/ · @avt_im

A. Terenin and J. Negrea. An Adversarial Analysis of Thompson Sampling for Full-information Online Learning: from Finite to Infinite Action Spaces. arXiv:2502.14790, 2025.

Reminder: check out our Gittins Index tutorial! Monday 2:45pm: Building A, Level 4, Room A410