Mini-tutorial for Dagstuhl on Bayesian Optimization
Alexander Terenin
Automatic explore-exploit tradeoff
Goal: optimize unknown function $f$ in as few evaluations as possible
Useful property: separation of modeling from decision-making
Algorithms have hyperparameters! Neural network training:
Goal: maximize test accuracy under expected compute budget $$ \htmlClass{fragment}{ \begin{aligned} &\max_{t=1,..,T} f(x_t) & &\t{subject to} & &\E\sum_{\smash{t=1}}^{\smash{T}} c(x_t) \leq B \end{aligned} } $$
Cost-aware baseline: expected improvement per unit cost $$ \begin{aligned} \htmlClass{fragment}{ \alpha_{t}^{\f{EIPC}}(x) } & \htmlClass{fragment}{ = \frac{\f{EI}_{f\given y_1,..,y_t}(x; \max_{1\leq\tau\leq t} y_\tau)}{c(x)} } & \htmlClass{fragment}{ \f{EI}_\psi(x; y) } & \htmlClass{fragment}{ = \E \max(0, \psi(x) - y) } \end{aligned} $$
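For concreteness, a minimal sketch of this baseline under a Gaussian surrogate, assuming `mu` and `sigma` are the posterior mean and standard deviation at a candidate $x$ and `cost` is a known evaluation-cost model (all placeholder names, not a specific library's API):

```python
# A minimal sketch of the EIPC baseline at a single candidate x, assuming a
# Gaussian posterior with mean mu and standard deviation sigma, and a known
# evaluation cost c(x) > 0.
from scipy.stats import norm

def ei(mu, sigma, y_best):
    """EI_psi(x; y) = E[max(0, psi(x) - y)] for Gaussian psi(x) ~ N(mu, sigma^2)."""
    z = (mu - y_best) / sigma
    return sigma * norm.pdf(z) + (mu - y_best) * norm.cdf(z)

def eipc(mu, sigma, y_best, cost):
    """Expected improvement per unit cost, maximized over candidates each round."""
    return ei(mu, sigma, y_best) / cost
```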
Often strong in practice, but can perform arbitrarily badly
Derivation: one-step approximation to intractable dynamic program
Expected improvement: time-based simplification
Gittins index: space-based simplification
Assumptions:
These are restrictive! But they lead to an interesting, general solution
This is just Pandora's Box!
Should we open the closed box? Maybe!
Depends on costs $c$, reward distribution $f$, and value of open box $g$
One closed vs. open box: Markov decision process
Optimal policy: open if $\f{EI}_f(x; g) \gt c(x)$
Consider how optimal policy changes as a function of $g$
If both opening and not opening are co-optimal: $g$ is a fair value
Define: $\alpha^\star(x) = g$ where $g$ solves $\f{EI}_f(x; g) = c(x)$
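Since $\f{EI}_f(x; \cdot)$ is continuous and strictly decreasing, the fair value can be computed by one-dimensional root-finding. A minimal sketch, assuming a Gaussian prior $f(x) \sim \c{N}(\mu, \sigma^2)$:

```python
# A minimal sketch of the fair value g solving EI_f(x; g) = c(x), assuming a
# Gaussian prior f(x) ~ N(mu, sigma^2). EI is continuous and strictly
# decreasing in g, so the solution is unique.
from scipy.optimize import brentq
from scipy.stats import norm

def ei(mu, sigma, g):
    """E[max(0, f(x) - g)] for f(x) ~ N(mu, sigma^2)."""
    z = (mu - g) / sigma
    return sigma * norm.pdf(z) + (mu - g) * norm.cdf(z)

def fair_value(mu, sigma, cost):
    lo, hi = mu - 10 * sigma, mu + 10 * sigma
    while ei(mu, sigma, lo) < cost:   # expand bracket until EI(lo) >= cost
        lo -= 10 * sigma
    while ei(mu, sigma, hi) > cost:   # expand bracket until EI(hi) <= cost
        hi += 10 * sigma
    return brentq(lambda g: ei(mu, sigma, g) - cost, lo, hi)

# Cheap boxes get a higher fair value than expensive ones with the same prior:
print(fair_value(0.0, 1.0, 0.01), fair_value(0.0, 1.0, 0.5))
```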
Theorem (Weitzman, 1979). This policy is optimal in expectation.
Caveat! Optimality theorems are fragile. Definitions are not!
Difference so far: expected budget constraint
Gittins index $\alpha^\star$: optimal for cost-per-sample problem
Theorem (Aminian et al., 2024; Xie et al., 2024). Assume the expected budget constraint is feasible and active. Then there exists a $\lambda \gt 0$ and a tie-breaking rule such that the policy defined by maximizing the Gittins index acquisition function $\alpha^\star(\.)$, defined using costs $\lambda c(x)$, is Bayesian-optimal.
Proof idea: Lagrangian duality
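A sketch of the argument (my paraphrase, in the notation above): introduce a multiplier $\lambda \geq 0$ for the expected budget constraint and consider the relaxed problem
$$ \max_{\pi}\; \E\Big[\max_{t=1,..,T} f(x_t)\Big] - \lambda \Big( \E\sum_{t=1}^{T} c(x_t) - B \Big). $$
For fixed $\lambda$, this is the cost-per-sample problem with costs $\lambda c(x)$, for which the Gittins index policy is optimal; choosing $\lambda$ so that the budget constraint is active recovers optimality for the original constrained problem by complementary slackness.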
Bayesian optimization: posterior distribution is correlated
Define the Pandora's Box Gittins Index (PBGI) acquisition function:
$\alpha^{\f{PBGI}}(x) = g\quad$ where $g$ solves $\quad\f{EI}_{f\given y}(x; g) = \htmlClass{fragment}{\ubr{\lambda}{\htmlClass{fragment}{\substack{\t{Lagrange multiplier}\\\t{from budget constraint}}}}} c(x)$
Correlations: incorporated into acquisition function via the posterior
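A minimal sketch of one acquisition step, assuming `mu` and `sigma` are the GP posterior mean and standard deviation at a discrete candidate set (from any GP library), `costs` holds $c(x)$, and `lam` is the Lagrange multiplier $\lambda$; all names are placeholders, not a specific package's API:

```python
# A minimal sketch of one PBGI step over a discrete candidate set.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def ei(mu, sigma, g):
    z = (mu - g) / sigma
    return sigma * norm.pdf(z) + (mu - g) * norm.cdf(z)

def pbgi(mu, sigma, cost, lam):
    """Solve EI_{f|y}(x; g) = lam * c(x) for g at a single candidate."""
    target = lam * cost
    lo, hi = mu - 10 * sigma, mu + 10 * sigma
    while ei(mu, sigma, lo) < target:
        lo -= 10 * sigma
    while ei(mu, sigma, hi) > target:
        hi += 10 * sigma
    return brentq(lambda g: ei(mu, sigma, g) - target, lo, hi)

def next_point(candidates, mu, sigma, costs, lam):
    """Pick the candidate with the largest Gittins-index acquisition value."""
    scores = [pbgi(m, s, c, lam) for m, s, c in zip(mu, sigma, costs)]
    return candidates[int(np.argmax(scores))]
```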
Not an optimal policy. But, recall the caveat: maybe a strong one?
Performance counterexample for EIPC baseline:
Pandora's Box Gittins Index: performs well
The multiplier $\lambda$ controls risk-averse vs. risk-seeking behavior
Optimal tuning of $\lambda$: depends on the expected budget
Limit as $\lambda\to0$: converges to UCB with automatic learning rate
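A small numerical check of this limit (my own illustration, consistent with the definitions above): for a Gaussian posterior and small $\lambda$, the PBGI value equals $\mu(x) + \beta\,\sigma(x)$, where $\beta$ solves $\sigma(x)\big(\varphi(\beta) - \beta\Phi(-\beta)\big) = \lambda c(x)$, so it is a UCB rule whose coefficient $\beta$ is set automatically by $\lambda$:

```python
# A small numerical check (not from the slides): for a Gaussian posterior and
# small lam, the PBGI value is mu + beta * sigma, where beta solves
# sigma * (phi(beta) - beta * Phi(-beta)) = lam * c.
from scipy.optimize import brentq
from scipy.stats import norm

def beta(lam, cost, sigma):
    rhs = lam * cost / sigma
    return brentq(lambda b: norm.pdf(b) - b * norm.cdf(-b) - rhs, 0.0, 50.0)

for lam in [1e-1, 1e-2, 1e-4, 1e-8]:
    print(f"lam={lam:g}  beta={beta(lam, cost=1.0, sigma=1.0):.3f}")
# beta grows slowly as lam -> 0, acting like an automatically-scheduled
# UCB exploration coefficient.
```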
Experiments on benchmark objectives: Ackley, Levy, and Rosenbrock functions
Neural network training: proceeds over time
New goal: maximize test accuracy, allowing for early stopping $$ \htmlClass{fragment}{ \begin{aligned} &\max_{k=1,..,K} f(x_{k,T_k}) & &\t{subject to} & &\E\sum_{\smash{k=1}}^{\smash{K}} \sum_{\smash{t=1}}^{\smash{T_k}} c(x_{k,t}) \leq B \end{aligned} } $$
Cost-aware freeze-thaw Bayesian optimization: this setting is novel!
Transient Markov chain: closed box $\leadsto$ open box $\leadsto$ selected box
State space: $S = \{\boxtimes\}\u\R\u\{\checkmark\}$, with terminal states $\p S = \{\checkmark\}$
Transition kernel: $p : S \to \c{P}(S)$
Reward function: $r : S \takeaway \p S \to \R$
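As a concrete illustration (my rendering of the chain above, not code from the talk), a single Pandora's Box written as such a chain:

```python
# A minimal sketch: one Pandora's Box as a transient Markov chain
# closed -> open(value) -> selected. Advancing from the closed state pays the
# opening cost; advancing from an open state collects the revealed value and
# moves to the terminal (selected) state.
import numpy as np

CLOSED, SELECTED = "closed", "selected"   # the boxtimes and checkmark states

def make_box(cost, sample_value, rng):
    """Return (reward, step) for one box; sample_value draws from its prior."""
    def reward(state):
        return -cost if state == CLOSED else state   # open states are reals
    def step(state):
        return sample_value(rng) if state == CLOSED else SELECTED
    return reward, step

# Example: a box with opening cost 0.5 and a standard normal value prior.
reward, step = make_box(0.5, lambda rng: rng.normal(), np.random.default_rng(0))
```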
Definition. Given mutually independent Markov chains $(S_i,\p S_i, p_i, r_i)$ for $i=1,..,n$, define a Markov decision process:
Unifies Pandora's Box with discounted bandits, various queues, ...
Discounted case: reduces to undiscounted case
Pandora's Box: solved by considering one closed and one open box
Let's generalize this!
Definition. Given a Markov chain $(S,\p S, p, r)$, an alternative option $\alpha\in\R$, and an initial state $s\in S$, define the $(s,\alpha)$-local MDP:
Definition. Given a Markov chain $(S,\p S, p, r)$, define its Gittins index $G : S \to \R\u\{\infty\}$ by letting $G(s)$ be the unique $\alpha$ for which $\triangleright$ and $\htmlClass{footnotesize}{\htmlStyle{font-weight:bold;}{\square}}$ are co-optimal actions in the $(s,\alpha)$-local MDP (or $\infty$ if no such $\alpha$ exists).
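A minimal sketch (my own illustration, consistent with the definitions above) of computing this index for a small finite, acyclic transient chain by root-finding over $\alpha$ in the $(s,\alpha)$-local MDP:

```python
# A minimal sketch, assuming a finite *acyclic* transient chain so the recursion
# terminates. rewards maps non-terminal states to r(s); transitions maps them to
# {next_state: probability}; terminal is the set of terminal states.
import math
from scipy.optimize import brentq

def local_value(s, alpha, rewards, transitions, terminal):
    """Value of the (s, alpha)-local MDP: stop for alpha, or advance the chain."""
    if s in terminal:
        return 0.0
    advance = rewards[s] + sum(
        p * local_value(t, alpha, rewards, transitions, terminal)
        for t, p in transitions[s].items())
    return max(alpha, advance)

def gittins_index(s, rewards, transitions, terminal, lo=-1e6, hi=1e6):
    """The alpha making stop and advance co-optimal at s (math.inf if none)."""
    def gap(alpha):  # value of advancing minus value of stopping
        advance = rewards[s] + sum(
            p * local_value(t, alpha, rewards, transitions, terminal)
            for t, p in transitions[s].items())
        return advance - alpha
    return math.inf if gap(hi) > 0 else brentq(gap, lo, hi)

# Pandora's Box with opening cost 0.5 and value uniform on {0, 2}: the fair
# value solves EI(g) = 0.5, i.e. g = 1, and the Gittins index recovers it.
rewards = {"closed": -0.5, 0.0: 0.0, 2.0: 2.0}
transitions = {"closed": {0.0: 0.5, 2.0: 0.5}, 0.0: {"done": 1.0}, 2.0: {"done": 1.0}}
print(gittins_index("closed", rewards, transitions, {"done"}))  # -> 1.0
```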
Theorem. In Markov chain selection (MCS), choosing the Markov chain of maximal Gittins index is optimal.
Caveat! Optimality is fragile. Definitions are not!
Example goal: maximize test accuracy, allowing for early stopping $$ \htmlClass{fragment}{ \begin{aligned} &\max_{k=1,..,K} f(x_{k,T_k}) & &\t{subject to} & &\E\sum_{\smash{k=1}}^{\smash{K}} \sum_{\smash{t=1}}^{\smash{T_k}} c(x_{k,t}) \leq B \end{aligned} } $$
Admits a well-defined Gittins index:
Open challenges: non-discrete numerics, regret analysis, novel applications
Q. Xie, R. Astudillo, P. Frazier, Z. Scully, and A. Terenin. Cost-aware Bayesian optimization via the Pandora's Box Gittins index. NeurIPS, 2024. INFORMS Data Mining Best Paper Finalist.
Q. Xie, L. Cai, A. Terenin, P. Frazier, and Z. Scully. Cost-aware Stopping for Bayesian Optimization. In review at NeurIPS, 2025.
Z. Scully and A. Terenin. Gittins Index: A Design Principle for Decision-making Under Uncertainty. INFORMS Tutorials in Operations Research, 2025.