Talk for ML Scholars Workshop 2026

Cost-aware Bayesian Optimization via the Pandora's Box Gittins Index

Joint work with Qian Xie, Linda Cai, Theo Brown, Raul Astudillo, Peter Frazier, and Ziv Scully

Alexander Terenin

https://avt.im/ · @avt_im

Bayesian Optimization

Automatic explore-exploit tradeoff

Performance impact

Cost-aware Bayesian optimization

Motivating application: neural network hyperparameter tuning

  • $x$: number of layers, width, learning rate, ...
  • $f(x)$: test accuracy of trained model
  • $c(x)$: total training compute

Goal: maximize test accuracy under expected compute budget $$ \htmlClass{fragment}{ \begin{aligned} &\max_{t=1,..,T} f(x_t) & &\t{subject to} & &\E\sum_{\smash{t=1}}^{\smash{T}} c(x_t) \leq B \end{aligned} } $$

Research question: how to choose the acquisition function $\alpha(x)$?

Broader goal: from a fundamental perspective, how should we think about questions like this?
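
For concreteness, here is where $\alpha$ sits in the cost-aware loop, as a minimal sketch (structure and helper names are mine, for illustration; the expected budget is handled loosely):

```python
def bayesian_optimization(f, c, candidates, alpha, budget):
    """Generic cost-aware BO loop: alpha scores candidates given the data so far."""
    data, spent = [], 0.0
    while spent < budget:
        x = max(candidates, key=lambda x: alpha(x, data))  # maximize the acquisition function
        data.append((x, f(x)))  # evaluate the objective
        spent += c(x)           # pay the evaluation cost
    return max(y for _, y in data)
```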

Prior work

Expected improvement per unit cost $$ \begin{aligned} \htmlClass{fragment}{ \alpha_{t}^{\f{EIPC}}(x) } & \htmlClass{fragment}{ = \frac{\f{EI}_{f\given y_1,..,y_t}(x; \max_{1\leq\tau\leq t} y_\tau)}{c(x)} } & \htmlClass{fragment}{ \f{EI}_\psi(x; y) } & \htmlClass{fragment}{ = \E \max(0, \psi(x) - y) } \end{aligned} $$

Direct extension of non-cost-aware state-of-the-art

  • Often works in practice, but can perform arbitrarily badly
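
Under a Gaussian posterior at $x$, EI has a closed form, so EIPC is cheap to evaluate. A sketch, with mu and sigma the posterior mean and standard deviation at the candidate (function names mine):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best):
    """Closed-form Gaussian EI: E[max(0, f(x) - best)] for f(x) ~ N(mu, sigma^2)."""
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

def eipc(mu, sigma, best, cost):
    """Expected improvement per unit cost."""
    return expected_improvement(mu, sigma, best) / cost
```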

Challenge: what should one do instead?

My work: new cost-aware acquisition function with strong performance

Key insight: novel connection to a fundamental decision problem

Simplifying cost-aware Bayesian optimization

Two assumptions:

  • Cost-per-sample setting: reward is $f(x_T) - \sum_{t=1}^T c(x_t)$
  • $X$ is discrete, $f(x)$ and $f(x')$ for $x \neq x'$ are independent

replaces budget constraint

looks really restrictive, but leads to an interesting general solution

This setting: Pandora's Box!

Should we open Pandora's Box?

Solving Pandora's Box

Key difficulty: rewards are stochastic – boxes are hard to compare

Approach: compute a deterministic fair value

Consider: one closed vs. one open box

Should we open the closed box?

Small $g$: yes

Large $g$: no

Consider: one closed vs. one open box

If both opening and not opening are co-optimal: $g$ is a fair value
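
Concretely (my rendering, using the EI notation above): not opening is worth $g$; opening costs $c(x)$, after which we keep the better of the two boxes. So we compare $$ g \quad\t{vs}\quad -c(x) + \E\max(f(x), g) = g - c(x) + \E\max(0, f(x) - g) $$ and indifference between the two options is exactly $\f{EI}_f(x; g) = c(x)$.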

Define: $\alpha^\star(x) = g$ where $g$ solves $\f{EI}_f(x; g) = c(x)$

co-optimality equation

Gittins index
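
Since $\f{EI}_f(x; g)$ is continuous and strictly decreasing in $g$, falling from $+\infty$ to $0$, the co-optimality equation has a unique root, computable by standard root-finding. A sketch for a Gaussian box (my code, not the paper's implementation):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def expected_improvement(mu, sigma, g):
    z = (mu - g) / sigma
    return (mu - g) * norm.cdf(z) + sigma * norm.pdf(z)

def gittins_index(mu, sigma, cost):
    """Solve EI(g) = cost for g, expanding the bracket until it contains the root."""
    lo, hi = mu - 10 * sigma, mu + 10 * sigma
    while expected_improvement(mu, sigma, lo) < cost:
        lo -= 10 * sigma
    while expected_improvement(mu, sigma, hi) > cost:
        hi += 10 * sigma
    return brentq(lambda g: expected_improvement(mu, sigma, g) - cost, lo, hi)
```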

Back to many boxes

Theorem (Weitzman, 1979). The index policy, which opens the box of highest Gittins index and stops once the best opened value exceeds every remaining index, is optimal in expectation.

Caveat: not true with correlated boxes!

...but the definition of a Gittins index still makes sense
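
To make the policy concrete, a toy simulation with independent Gaussian boxes (my sketch; all parameters are illustrative):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def ei(mu, sigma, g):
    z = (mu - g) / sigma
    return (mu - g) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
mu, sigma, cost = rng.normal(0, 1, 5), rng.uniform(0.5, 2, 5), rng.uniform(0.1, 1, 5)

# Gittins index of each box: the g solving EI(g) = cost
idx = [brentq(lambda g: ei(m, s, g) - c, m - 20 * s, m + 20 * s)
       for m, s, c in zip(mu, sigma, cost)]

# Weitzman: open boxes in decreasing index order; stop once the best
# opened value beats every remaining index; keep the best open box.
best, spent = -np.inf, 0.0
for i in sorted(range(5), key=lambda i: -idx[i]):
    if best >= idx[i]:
        break
    spent += cost[i]
    best = max(best, rng.normal(mu[i], sigma[i]))  # open box i, observe its value
print("selected value:", best, "total cost:", spent)
```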

Cost-aware Bayesian Optimization: Correlated Pandora's Box?

Additional difference: expected budget constraint

Expected budget constraint vs. cost per sample

Gittins index $\alpha^\star$: optimal for the cost-per-sample problem

  • What about expected budget constraint?

Theorem. Assume the expected budget constraint is feasible and active. Then there exists a $\lambda \gt 0$ and a tie-breaking rule such that the policy defined by maximizing the Gittins index acquisition function $\alpha^\star(\.)$, defined using costs $\lambda c(x)$, is Bayesian-optimal.

Proof idea: Lagrangian duality
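
In outline (my sketch of the standard relaxation, not the talk's full argument): dualize the budget constraint with multiplier $\lambda \geq 0$, $$ \max_\pi\; \E\max_{t=1,..,T} f(x_t) - \lambda\left(\E\sum_{t=1}^{T} c(x_t) - B\right) $$ for fixed $\lambda$, this is a cost-per-sample problem with costs $\lambda c(x)$, which the Gittins index policy solves; feasibility and activeness of the constraint guarantee a $\lambda$ at which the relaxed and constrained optima coincide.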

Pandora's Box Gittins Index for Bayesian Optimization

Bayesian optimization: posterior distribution is correlated

Define the Pandora's Box Gittins Index acquisition function:

$\alpha^{\f{PBGI}}_t(x) = g\quad$ where $g$ solves $\quad\f{EI}_{\htmlClass{anchor-1}{f\given y_1,..,y_t}}(x; g) = \htmlClass{anchor-2}{\lambda}\,c(x)$

correlated posterior

Lagrange multiplier from the budget constraint
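
Computationally, PBGI is the same root-finding as the cost-per-sample index, now using the posterior mean and standard deviation at each candidate and costs scaled by $\lambda$. A sketch over a discrete candidate set (my code; numbers illustrative):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def ei(mu, sigma, g):
    z = (mu - g) / sigma
    return (mu - g) * norm.cdf(z) + sigma * norm.pdf(z)

def pbgi(mu, sigma, cost, lam):
    """Solve EI_{f | data}(x; g) = lam * c(x) for g at a single candidate."""
    return brentq(lambda g: ei(mu, sigma, g) - lam * cost,
                  mu - 50 * sigma, mu + 50 * sigma)

# GP posterior (mu, sigma) at three candidates, plus their costs; lambda fixed
mu = np.array([0.1, 0.4, 0.2]); sigma = np.array([1.0, 0.3, 0.8])
cost = np.array([1.0, 2.0, 0.5]); lam = 0.1
scores = [pbgi(m, s, c, lam) for m, s, c in zip(mu, sigma, cost)]
x_next = int(np.argmax(scores))  # next point to evaluate
```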

What is $\lambda$ in $\f{EI}_{f\given y_1,..,y_t}(x; g) = \lambda\,c(x)$ doing?

Controls risk-averse vs. risk-seeking behavior

should be tuned to match budget

Limit as $\lambda\to0$: converges to UCB

with automatic learning rate
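
A back-of-envelope for the limit (my derivation sketch, leading order only; writing $\mu_t, \sigma_t$ for the posterior mean and standard deviation): as $\lambda \to 0$, the root $g$ moves far above $\mu_t(x)$, where the Gaussian tail expansion gives $\f{EI} \approx \sigma_t(x)\,\varphi(u)/u^2$ with $u = (g - \mu_t(x))/\sigma_t(x) \gg 1$; solving $\f{EI} = \lambda c(x)$ then yields, dropping lower-order terms, $$ \alpha^{\f{PBGI}}_t(x) \approx \mu_t(x) + \sigma_t(x)\sqrt{2\log\frac{\sigma_t(x)}{\lambda\,c(x)}} $$ a UCB rule whose exploration weight is set automatically by $\sigma_t$, $c$, and $\lambda$, and which grows as $\lambda \to 0$.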

Does it work?

Not an optimal policy. But recall the caveat: maybe a strong one?

Back to our counterexample

Looks promising! What about in practice?

Objective functions: sampled from the prior

Synthetic benchmark functions

  • Ackley function
  • Levy function
  • Rosenbrock function

Empirical objectives

Conclusion: Gittins indices can perform well, even where they are not optimal

Lessons for the bigger picture: the role of costs and budget, early stopping, and more...

this work

follow-up

Current and Future Work: Gittins Indices

Neural network training: proceeds over time

  • Can sometimes predict final test loss from early iterations
  • Why finish a training run we know will turn out badly?

New goal: maximize test accuracy, allowing for early stopping $$ \htmlClass{fragment}{ \begin{aligned} &\max_{k=1,..,K} f(x_{k,T_k}) & &\t{subject to} & &\E\sum_{\smash{k=1}}^{\smash{K}} \sum_{\smash{t=1}}^{\smash{T_k}} c(x_{k,t}) \leq B \end{aligned} } $$

Cost-aware freeze-thaw Bayesian optimization: this setting is novel!

Pandora's Box: what's this an example of?

Transient Markov chain: closed box $\leadsto$ open box $\leadsto$ selected box

Markov Chain Selection

Instead of choosing between Pandora's Boxes, we choose between Markov chains!

The Local MDP

Pandora's Box: solved by considering one closed and one open box

This is the general form of the Gittins index!

The Gittins Index of a Local MDP
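
In symbols (my rendering of the standard retirement formulation; see the tutorial in the references for the precise general setup): the local MDP runs the chain from state $s$ with rewards $r$, plus the option to retire at any time for a lump sum $g$, $$ V_g(s) = \max\Big\{\, g,\; r(s) + \E\, V_g(s') \,\Big\} \qquad \alpha^\star(s) = g \quad\t{where } g \t{ solves } V_g(s) = g $$ Sanity check: for a closed Pandora's Box, continuing costs $c(x)$ and reveals $f(x)$, so $V_g = \max\{g, -c(x) + \E\max(f(x), g)\}$, and indifference is exactly $\f{EI}_f(x; g) = c(x)$, recovering the earlier definition.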

Theorem. In Markov chain selection (MCS), choosing the Markov chain of maximal Gittins index is optimal.

Caveat! Optimality is fragile. Definitions are not!

Gittins indices for advanced variants of Bayesian optimization

Example goal: maximize test accuracy, allowing for early stopping $$ \htmlClass{fragment}{ \begin{aligned} &\max_{k=1,..,K} f(x_{k,T_k}) & &\t{subject to} & &\E\sum_{\smash{k=1}}^{\smash{K}} \sum_{\smash{t=1}}^{\smash{T_k}} c(x_{k,t}) \leq B \end{aligned} } $$

Admits a well-defined Gittins index:

  • Discrete simplifying case: special case of MCS – Gittins is optimal!
  • Gaussian process: correlations – not optimal. But maybe strong?

open problem: regret analysis

open problem: non-discrete numerics

Long-term question: how do we generate synthetic data for training decision foundation models?

for applications in different domains

Thank you!

https://avt.im/ · @avt_im


Q. Xie, R. Astudillo, P. Frazier, Z. Scully, and A. Terenin. Cost-aware Bayesian optimization via the Pandora's Box Gittins index. NeurIPS, 2024. INFORMS Data Mining Best Paper Finalist.

Q. Xie, L. Cai, A. Terenin, P. Frazier, and Z. Scully. Cost-aware Stopping for Bayesian Optimization. In review at NeurIPS, 2025.

Z. Scully and A. Terenin. Gittins Index: A Design Principle for Decision-making Under Uncertainty. INFORMS Tutorials in Operations Research, 2025.