Talk for Dagstuhl on Bayesian Optimization

How are transformers and potential AGI affecting Bayesian optimization research?

Alexander Terenin

https://avt.im/ · @avt_im

Bayesian optimization

Automatic explore-exploit tradeoff

Today, a nontrivial fraction of ML researchers think AGI is imminent

  • One can certainly argue they're wrong
  • What if they're right?

I won't answer these questions directly

  • Goal: make them precise and technical
  • Approach: Transformer Theory

Long-run goal: framework for thinking about transformer-based tools in BayesOpt context

Transformers

Sequence-to-sequence models: $f : \ubr{S_n(\Sigma)}{\htmlClass{fragment}{\substack{\t{sequences}\\\t{of length }n}}} \to S_n(\ubr{\Sigma}{\htmlClass{fragment}{\t{alphabet}}})$

Training via classification, sampling via softmax scores
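To make "sampling via softmax scores" concrete, here is a minimal sketch; the vocabulary and scores below are made-up placeholders rather than anything from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["a", "b", "c"]
scores = np.array([2.0, 0.5, -1.0])            # model's scores (logits) for the next token
probs = np.exp(scores) / np.exp(scores).sum()  # softmax turns scores into a distribution
next_token = rng.choice(vocab, p=probs)        # sampling = drawing from that distribution
print(next_token, probs)
```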

What can one do using these capabilities?

Structure of transformers

One layer's (simplified) sequence-to-sequence map: $$ x_t^{(\ell+1)} = x_t^{(\ell)} + \ubr{f_{\f{MLP}}(x_t^{(\ell)})}{\htmlClass{fragment}{\t{token-by-token}}} \quad+\quad \ubr{f_{\f{attn}}(x_{1:t}^{(\ell)})}{\substack{\htmlClass{fragment}{\t{causally masked input seq.}}\\\htmlClass{fragment}{\t{with position encoding}}}} $$

$t$: token indices

$\ell$: layers

$x_t^{(\ell)}$: $d$-dimensional vector

Attention: involves a matrix over all token pairs, which is empirically sparse!
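As a concrete reading of the layer equation above, here is a minimal numpy sketch of one simplified layer: a residual stream, a token-by-token MLP, and a single causally masked attention head. The weight shapes are arbitrary, and layer norm and the position encoding are omitted; this is an illustrative sketch, not the construction used later.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def f_mlp(x, W1, W2):
    # token-by-token: acts on each position independently
    return np.maximum(x @ W1, 0.0) @ W2

def f_attn(x, Wq, Wk, Wv):
    # x: (n, d); attention scores form a matrix over all token pairs,
    # with future positions masked out (causal masking)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    causal = np.tril(np.ones_like(scores, dtype=bool))
    return softmax(np.where(causal, scores, -np.inf), axis=-1) @ v

def layer(x, p):
    # x_t^(l+1) = x_t^(l) + f_MLP(x_t^(l)) + f_attn(x_{1:t}^(l))
    return x + f_mlp(x, p["W1"], p["W2"]) + f_attn(x, p["Wq"], p["Wk"], p["Wv"])

rng = np.random.default_rng(0)
n, d = 8, 16
p = {w: 0.1 * rng.standard_normal((d, d)) for w in ["W1", "W2", "Wq", "Wk", "Wv"]}
x = rng.standard_normal((n, d))
print(layer(x, p).shape)  # (8, 16): sequence in, sequence of the same shape out
```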

Transformer Theory: multi-hop reasoning tasks

This part: heavily inspired by Matus Telgarsky's short course on deep learning theory

From $1$-hop to $K$-hop

Definition. The induction head task, or $1$-hop task, is the sequence-to-sequence mapping task $S_n(\Sigma) \to S_n(\Sigma\u\{\bot\})$ defined as follows. Given $(x_i)_{i=1}^n$, compute $(y_i)_{i=1}^n$ element-by-element:

  1. Find the previous occurrence of $x_i$, namely $$ j = \max\{j' \lt i : x_{j'} = x_i\} $$
  2. Let $$ y_i = \begin{cases} \htmlClass{fragment}{\bot} & \htmlClass{fragment}{\t{if previous occurrence does not exist}} \\ \htmlClass{fragment}{x_{j+1}} & \htmlClass{fragment}{\t{if previous occurrence exists.}} \end{cases} $$

$K$-hop: the same lookup applied recursively, $K$ times
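To make the definition concrete, here is a direct Python implementation; `None` plays the role of $\bot$, and the $K$-hop variant below chases the same lookup $K$ times inside the original sequence, which is one way of reading "repeated recursively".

```python
def k_hop(xs, K=1):
    """K = 1 recovers the induction head task from the definition above."""
    def prev_occurrence(i):
        # max{j < i : x_j = x_i}, or None if no previous occurrence exists
        matches = [j for j in range(i) if xs[j] == xs[i]]
        return matches[-1] if matches else None

    ys = []
    for i in range(len(xs)):
        pos = i
        for _ in range(K):
            j = prev_occurrence(pos)
            if j is None:
                pos = None
                break
            pos = j + 1  # y_i = x_{j+1}: step to the token after the previous occurrence
        ys.append(xs[pos] if pos is not None else None)
    return ys

# At the last 'b', the previous 'b' is at index 1, so 1-hop returns its successor 'a'.
print(k_hop(list("abab"), K=1))  # [None, None, 'b', 'a']
```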

Visualizing $1$-hop

Why $K$-hop?

At first glance this looks contrived, and yet:

  • Captures the random access patterns seen in attention matrices
  • Resembles low-level linguistic tasks, such as reference lookups
  • Clearly solvable in $K+1$ layers

Also solvable in $2 + \lfloor\log_2 K\rfloor$ layers!

  • And, gradient descent learns this solution from data!

Transformers and $K$-hop

Theorem. There exists a transformer that solves the $K$-hop task with $2 + \lfloor\log_2 K\rfloor$ layers, $\c{O}(\log(n) + \log(|\Sigma|))$-bit precision, where $n$ is the context window size and $\Sigma$ is the alphabet, a single attention head, and representation dimension $d = \c{O}(1)$.

Proof idea: guess the attention matrices.

One can also show:

  • Lower bounds on how large the transformer needs to be to solve $1$-hop, via a clever reduction to a communication complexity statement
  • Including for $K$-hop, conditional on certain TCS conjectures
  • Memorization lower bounds for purely recurrent models

Significance: transformers can learn algorithms!

Transformers: technical capabilities

Significance: transformers can learn algorithms!

The Magic of Classification

From training to inference

Consider a transformer $f : \ubr{S_n(\Sigma)}{\htmlClass{fragment}{\t{sequences}}} \to \ubr{Y\vphantom{|}}{\htmlClass{fragment}{\t{classes}}}$

Goal of training: learn correct class

Optimization objective: cross-entropy loss

Result: learn full distribution $\ubr{p(y \given s_{1:n})}{\htmlClass{fragment}{\t{softmax scores}}}$ ... wait, what?

Obvious, but unintuitive: square loss is not like this!

Critical for transformers' generative capabilities

Where is this coming from?

To build intuition, let's consider the trivial setting: $f : S_1(\{1\}) \to Y$

  • This setting: learning a distribution over a finite alphabet
  • Obvious estimator: empirical histogram
  • Cross-entropy minimization is perfectly well-defined (with each label $y_i$ one-hot encoded): $$ L(\h{y};y) \htmlClass{fragment}{ = -\frac{1}{N} \sum_{i=1}^N \sum_{k=1}^{|Y|} y_{ik} \log \h{y}_k } $$

How does cross-entropy work here?

Solving cross-entropy minimization

Differentiating $L$: $$ \frac{\p}{\p\h{y}_j} L(\h{y}; y) \htmlClass{fragment}{ = - \frac{\p}{\p\h{y}_j}\frac{1}{N} \sum_{i=1}^N \sum_{k=1}^{|Y|} y_{ik} \log \h{y}_k } \htmlClass{fragment}{ = -\frac{1}{N} \sum_{i=1}^N \frac{y_{ij}}{\h{y}_j} } \htmlClass{fragment}{ = -\frac{\frac{1}{N} \sum_{i=1}^N y_{ij}}{\h{y}_j} } $$ Constraint: $\sum_{k=1}^{|Y|} \h{y}_k = 1$, solve via Lagrange multipliers
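For completeness, here is the Lagrange step that gets skipped over, a routine computation using that each label $y_i$ is one-hot: $$ \frac{\p}{\p\h{y}_j}\left[ L(\h{y};y) + \lambda\Big(\sum_{k=1}^{|Y|}\h{y}_k - 1\Big) \right] = 0 \implies \h{y}_j = \frac{1}{\lambda N}\sum_{i=1}^N y_{ij}, \qquad \sum_{j=1}^{|Y|}\h{y}_j = 1 \implies \lambda = 1 $$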

Result: $\displaystyle\frac{\sum_{i=1}^N y_{i(\.)}}{N} = \argmin_{\h{y}\in\c{M}_1(Y)} L(\h{y};y)$

Cross-entropy is a distribution-learning objective
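As a quick numerical sanity check of this claim (a sketch with made-up class probabilities): minimizing the cross-entropy above by gradient descent, with $\h{y}$ parameterized by a softmax so the simplex constraint holds automatically, recovers the empirical histogram.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, N = 4, 1000
labels = rng.choice(num_classes, size=N, p=[0.1, 0.2, 0.3, 0.4])
one_hot = np.eye(num_classes)[labels]          # y_{ik}, one-hot labels

theta = np.zeros(num_classes)                  # parameterize y_hat = softmax(theta)
for _ in range(2000):
    y_hat = np.exp(theta) / np.exp(theta).sum()
    grad = y_hat - one_hot.mean(axis=0)        # gradient of the mean cross-entropy in theta
    theta -= 0.5 * grad

print(np.round(y_hat, 3))                      # cross-entropy minimizer
print(np.round(one_hot.mean(axis=0), 3))       # empirical histogram: the two agree
```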

What can transformers do?

Transformers are sequence-to-sequence models:

  • Before deep learning, wasn't obvious this was possible in practice

Transformers can internally represent algorithms:

  • $K$-hop: one of the simplest non-trivial algorithmic tasks

Transformers learn conditional distributions over discrete spaces:

  • Unlike square loss, cross-entropy does more than just classification

Implications for Bayesian Optimization

I don't know! But remember:

  • We started from the concept of AGI
  • The capabilities I've presented are precisely defined

These technical primitives are clearly very powerful

  • What can we build using them?

Some vague ideas:

  • The only attempt from the BayesOpt community that I know of: PFNs (prior-data fitted networks)
  • Discrete BayesOpt beyond GPs?
  • Decision-theoretic techniques for transformer theory?

Thank you!

https://avt.im/ · @avt_im


References

Anthropic. A Mathematical Framework for Transformer Circuits. 2021.

C. Sanford, D. Hsu, M. Telgarsky. Representational Strengths and Limitations of Transformers. NeurIPS 2023.

Anthropic. In-context Learning and Induction Heads. 2022.

C. Sanford, D. Hsu, M. Telgarsky. One-layer transformers fail to solve the induction heads task. 2024.

Final announcement: I'm on the faculty job market!

Research interests: decision-making under uncertainty

Recent work: Gittins indices for Bayesian optimization, adversarial online learning

Talks about both are available on my website!