Research interests: decision-making under uncertainty
This tutorial: Gittins indices for Bayesian optimization
Talk for job market showcase: Bayesian algorithms for adversarial online learning
Both slides are available on my website!
Part I: Introducing Gittins Indices via Pandora's Box
Part II: Gittins Indices for Bayesian Optimization
Part III: Tail Scheduling
Goal: optimize unknown function $f$ in as few evaluations as possible
Algorithms have hyperparameters! Neural network training:
Goal: maximize test accuracy under expected compute budget $$ \htmlClass{fragment}{ \begin{aligned} &\max_{t=1,..,T} f(x_t) & &\t{subject to} & &\E\sum_{\smash{t=1}}^{\smash{T}} c(x_t) \leq B \end{aligned} } $$
Cost-aware baseline: expected improvement per unit cost $$ \begin{aligned} \htmlClass{fragment}{ \alpha_{t}^{\f{EIPC}}(x) } & \htmlClass{fragment}{ = \frac{\f{EI}_{f\given y_1,..,y_t}(x; \max_{1\leq\tau\leq t} y_\tau)}{c(x)} } & \htmlClass{fragment}{ \f{EI}_\psi(x; y) } & \htmlClass{fragment}{ = \E \max(0, \psi(x) - y) } \end{aligned} $$
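As a concrete sketch of this baseline (assuming a Gaussian posterior at the candidate point, with posterior mean `mu` and standard deviation `sigma` as hypothetical inputs), EIPC admits a closed form:

```python
import math

def expected_improvement(mu, sigma, best):
    """Closed-form EI for a Gaussian posterior N(mu, sigma^2),
    relative to the incumbent value `best`."""
    if sigma <= 0.0:
        return max(0.0, mu - best)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * cdf + pdf)

def eipc(mu, sigma, best, cost):
    """Expected improvement per unit cost at a candidate point."""
    return expected_improvement(mu, sigma, best) / cost
```

In a full Bayesian optimization loop, `mu` and `sigma` would come from the Gaussian process posterior conditioned on $y_1,..,y_t$, and `best` would be $\max_{1\leq\tau\leq t} y_\tau$.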
Often strong in practice, but can perform arbitrarily badly
Derivation: one-step approximation to intractable dynamic program
Assumptions:
These are restrictive! But they lead to an interesting, general solution
Theorem (Weitzman, 1979). This policy is optimal in expectation.
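Weitzman's policy opens boxes in decreasing order of their index (reservation value) and stops as soon as the best observed reward beats every remaining index. A minimal sketch of the mechanics (the `boxes` interface, with precomputed indices, is illustrative rather than from the source):

```python
def weitzman(boxes):
    """boxes: list of (index g, cost c, sampler) triples.
    Opens boxes in decreasing index order, stopping once the best
    observed value exceeds every remaining index."""
    boxes = sorted(boxes, key=lambda b: b[0], reverse=True)
    best, total_cost = float("-inf"), 0.0
    for g, c, sample in boxes:
        if best >= g:          # stop: no remaining box is worth opening
            break
        total_cost += c        # pay to open the box
        best = max(best, sample())
    return best - total_cost   # realized net value
```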
Caveat! Optimality theorems are fragile. Definitions are not!
Gittins index $\alpha^\star$: optimal for cost-per-sample problem
Theorem (Aminian et al., 2024; Xie et al., 2024). Assume the expected budget constraint is feasible and active. Then there exist a $\lambda \gt 0$ and a tie-breaking rule such that the policy defined by maximizing the Gittins index acquisition function $\alpha^\star(\.)$, defined using costs $\lambda c(x)$, is Bayesian-optimal.
Proof idea: Lagrangian duality
Bayesian optimization: posterior distribution is correlated
Define Pandora's Box Gittins Index acquisition function:
$\alpha^{\f{PBGI}}(x) = g\quad$ where $g$ solves $\quad\f{EI}_{f\given y}(x; g) = \htmlClass{fragment}{\ubr{\lambda}{\htmlClass{fragment}{\substack{\t{Lagrange multiplier}\\\t{from budget constraint}}}}} c(x)$
Correlations: incorporated into acquisition function via the posterior
Not an optimal policy. But, recall the caveat: maybe a strong one?
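Since $\f{EI}_{f\given y}(x; g)$ is continuous and strictly decreasing in $g$, the defining equation $\f{EI}_{f\given y}(x; g) = \lambda c(x)$ has a unique root, which can be found by bisection. A sketch for a Gaussian posterior (all names are illustrative):

```python
import math

def _ei(mu, sigma, g):
    # Closed-form Gaussian expected improvement: E max(0, N(mu, sigma^2) - g).
    z = (mu - g) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * cdf + pdf)

def pbgi(mu, sigma, lam, cost, tol=1e-10):
    """Pandora's Box Gittins index: the unique g with EI(g) = lam * cost."""
    target = lam * cost
    lo, hi = mu, mu
    while _ei(mu, sigma, lo) <= target:  # expand left until EI(lo) > target
        lo -= sigma
    while _ei(mu, sigma, hi) >= target:  # expand right until EI(hi) < target
        hi += sigma
    while hi - lo > tol:                 # bisect: EI is decreasing in g
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if _ei(mu, sigma, mid) > target else (lo, mid)
    return 0.5 * (lo + hi)
```

Note that only the marginal posterior at $x$ is needed to evaluate $\alpha^{\f{PBGI}}(x)$, so the per-point cost is comparable to EI itself.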
Performance counterexample for EIPC baseline:
Neural network training: proceeds over time
New goal: maximize test accuracy, allowing for early stopping $$ \htmlClass{fragment}{ \begin{aligned} &\max_{k=1,..,K} f(x_{k,T_k}) & &\t{subject to} & &\E\sum_{\smash{k=1}}^{\smash{K}} \sum_{\smash{t=1}}^{\smash{T_k}} c(x_{k,t}) \leq B \end{aligned} } $$
Cost-aware freeze-thaw Bayesian optimization: this setting is novel!
Transient Markov chain: closed box $\leadsto$ open box $\leadsto$ selected box
State space: $S = \{\boxtimes\}\u\R\u\{\checkmark\}$, with terminal states $\p S = \{\checkmark\}$
Transition kernel: $p : S \to \c{P}(S)$
Reward function: $r : S \takeaway \p S \to \R$
Definition. Given mutually independent Markov chains $(S_i,\p S_i, p_i, r_i)$ for $i=1,..,n$, define a Markov decision process:
Unifies Pandora's Box with discounted bandits, various queues, ...
Discounted case: reduces to undiscounted case
Pandora's Box: solved by considering one closed and one open box
Let's generalize this!
Definition. Given a Markov chain $(S,\p S, p, r)$, an alternative option $\alpha\in\R$, and an initial state $s\in S$, define the $(s,\alpha)$-local MDP:
Definition. Given a Markov chain $(S,\p S, p, r)$, define its Gittins index $G : S \to \R\u\{\infty\}$ by letting $G(s)$ be the unique $g$ for which $\triangleright$ and $\htmlClass{footnotesize}{\htmlStyle{font-weight:bold;}{\square}}$ are co-optimal actions in the $(s,g)$-local MDP (or $\infty$ if no such $g$ exists).
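For a finite-state chain this definition is directly computable: value-iterate the local MDP at a fixed retirement value, then bisect for the value at which retiring and continuing tie. A sketch under that finite-state assumption (state encoding and all names are illustrative):

```python
import numpy as np

def continue_value(P, r, terminal, alpha, iters=10_000):
    """Values in the local MDP with retirement option alpha: at every
    non-terminal state, retire (collect alpha) or continue (collect r(s),
    then transition via P). Terminal states are worth 0."""
    n = len(r)
    V = np.zeros(n)
    for _ in range(iters):
        cont = r + P @ V  # value of the continue action
        V_new = np.where(terminal, 0.0, np.maximum(alpha, cont))
        if np.max(np.abs(V_new - V)) < 1e-12:
            break
        V = V_new
    return r + P @ V      # continue-action value at each state

def gittins_index(P, r, terminal, s, lo=-100.0, hi=100.0, tol=1e-8):
    """Bisect for the alpha at which retiring and continuing are
    co-optimal at state s: continue_value(alpha)[s] == alpha."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if continue_value(P, r, terminal, mid)[s] > mid:
            lo = mid      # continuing still beats retiring
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

On a two-outcome Pandora's Box encoded as a four-state chain (closed box, two open-box values, selected), this recovers the reservation value solving $\E\max(0, X - g) = c$.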
Theorem. In the Markov chain selection problem, choosing the Markov chain of maximal Gittins index is optimal.
Caveat! Optimality is fragile. Definitions are not!
Example goal: maximize test accuracy, allowing for early stopping $$ \htmlClass{fragment}{ \begin{aligned} &\max_{k=1,..,K} f(x_{k,T_k}) & &\t{subject to} & &\E\sum_{\smash{k=1}}^{\smash{K}} \sum_{\smash{t=1}}^{\smash{T_k}} c(x_{k,t}) \leq B \end{aligned} } $$
Admits a well-defined Gittins index:
Open challenges: non-discrete numerics, regret analysis, novel applications