Talk for Dagstuhl on Bayesian Optimization
Alexander Terenin
Automatic explore-exploit tradeoff
Today, a nontrivial fraction of ML researchers think AGI is imminent
I won't answer these questions directly
Long-run goal: framework for thinking about transformer-based tools in BayesOpt context
Sequence-to-sequence models: $f : \ubr{S_n(\Sigma)}{\htmlClass{fragment}{\substack{\t{sequences}\\\t{of length } n}}} \to S_n(\ubr{\Sigma}{\htmlClass{fragment}{\t{alphabet}}})$
Training via classification, sampling via softmax scores
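As a concrete, minimal sketch of what sampling via softmax scores means, suppose the model emits a vector of logits over the alphabet at each position; the function names and the temperature parameter below are illustrative, not part of any particular implementation:

```python
import numpy as np

def softmax(logits):
    # subtract the max for numerical stability before exponentiating
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def sample_next_token(logits, temperature=1.0, rng=None):
    # turn logits over the alphabet into a categorical distribution, then draw one symbol
    rng = rng if rng is not None else np.random.default_rng()
    probs = softmax(logits / temperature)
    return rng.choice(len(probs), p=probs)

# example: logits over an alphabet of 4 symbols
print(sample_next_token(np.array([2.0, 0.5, -1.0, 0.0])))
```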
What can one do using these capabilities?
One layer's (simplified) sequence-to-sequence map: $$ x_t^{(\ell+1)} = x_t^{(\ell)} + \ubr{f_{\f{MLP}}(x_t^{(\ell)})}{\htmlClass{fragment}{\t{token-by-token}}} \quad+\quad \ubr{f_{\f{attn}}(x_{1:t}^{(\ell)})}{\substack{\htmlClass{fragment}{\t{causally masked input seq.}}\\\htmlClass{fragment}{\t{with position encoding}}}} $$
$t$: token indices
$\ell$: layers
$x_t^{(\ell)}$: $d$-dimensional vector
Attention: involves a matrix over all token pairs – empirically sparse!
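A minimal NumPy sketch of this simplified layer, with a single causally masked attention head; the weight names and shapes are illustrative assumptions (position encodings and normalization omitted):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def f_mlp(x, W1, b1, W2, b2):
    # token-by-token map: applied independently at each position t
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def f_attn(x, Wq, Wk, Wv):
    # each position t attends only to positions 1..t (causal mask);
    # position encodings omitted for brevity
    n, _ = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # strictly above the diagonal
    scores = np.where(future, -np.inf, scores)          # block attention to future tokens
    return softmax(scores, axis=-1) @ v                 # the n-by-n attention matrix acts here

def layer(x, mlp_params, attn_params):
    # x_t^{(l+1)} = x_t^{(l)} + f_MLP(x_t^{(l)}) + f_attn(x_{1:t}^{(l)})
    return x + f_mlp(x, *mlp_params) + f_attn(x, *attn_params)

# example: n = 5 tokens, representation dimension d = 8
rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.normal(size=(n, d))
mlp_params = (rng.normal(size=(d, 4 * d)), np.zeros(4 * d),
              rng.normal(size=(4 * d, d)), np.zeros(d))
attn_params = tuple(rng.normal(size=(d, d)) for _ in range(3))
print(layer(x, mlp_params, attn_params).shape)  # (5, 8)
```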
This part: heavily inspired by Matus Telgarsky's short course on deep learning theory
Definition. The induction head task or 1-hop task is a sequence-to-sequence mapping task $S_n(\Sigma) \to S_n(\Sigma\u\{\bot\})$ defined as follows. Given $(x_i)_{i=1}^n$, compute $(y_i)_{i=1}^n$ element-by-element: $y_i = x_{j+1}$, where $j < i$ is the largest index with $x_j = x_i$, and $y_i = \bot$ if no such $j$ exists.
$K$-hop: same task repeated recursively
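A minimal sketch of the task itself (not of a transformer that solves it), reading "repeated recursively" as composing the 1-hop map with itself $K$ times; $\bot$ is represented by None:

```python
def hop1(xs, bot=None):
    # induction head / 1-hop task: at position i, output the token that followed
    # the most recent earlier occurrence of xs[i]; output bot if there is none
    ys = []
    for i, x in enumerate(xs):
        if x == bot:
            ys.append(bot)
            continue
        js = [j for j in range(i) if xs[j] == x]
        ys.append(xs[js[-1] + 1] if js else bot)
    return ys

def hopk(xs, k, bot=None):
    # K-hop task, read here as the 1-hop map applied recursively K times
    for _ in range(k):
        xs = hop1(xs, bot=bot)
    return xs

print(hop1(list("abcab")))      # [None, None, None, 'b', 'c']
print(hopk(list("ababab"), 2))  # [None, None, None, None, 'a', 'b']
```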
At first looks contrived, and yet:
Also solvable in $2 + \lfloor\log_2 K\rfloor$ layers!
Theorem. There exists a transformer that solves the $K$-hop task with $2 + \lfloor\log_2 K\rfloor$ layers, $\c{O}(\log(n) + \log(|\Sigma|))$ bits of precision, where $n$ is the context window size and $\Sigma$ is the alphabet, a single attention head, and representation dimension $d = \c{O}(1)$.
Proof idea: guess the attention matrices.
One can also show:
Significance: transformers can learn algorithms!
Consider a transformer $f : \ubr{S_n(\Sigma)}{\htmlClass{fragment}{\t{sequences}}} \to \ubr{Y\vphantom{|}}{\htmlClass{fragment}{\t{classes}}}$
Goal of training: learn correct class
Optimization objective: cross-entropy loss
Result: learn full distribution $\ubr{p(y \given s_{1:n})}{\htmlClass{fragment}{\t{softmax scores}}}$ – wait, what?
Obvious in hindsight, but easy to miss: square loss is not like this – it recovers only the conditional mean!
Critical for transformers' generative capabilities
To build intuition, let's consider the trivial setting: $f : S_1(\{1\}) \to Y$
How does cross-entropy work here?
Differentiating $L$: $$ \frac{\p}{\p\h{y}_j} L(\h{y}; y) \htmlClass{fragment}{ = - \frac{\p}{\p\h{y}_j}\frac{1}{N} \sum_{i=1}^N \sum_{k=1}^{|Y|} y_{ik} \log \h{y}_k } \htmlClass{fragment}{ = -\frac{1}{N} \sum_{i=1}^N \frac{y_{ij}}{\h{y}_j} } \htmlClass{fragment}{ = -\frac{\frac{1}{N} \sum_{i=1}^N y_{ij}}{\h{y}_j} } $$ Constraint: $\sum_{k=1}^{|Y|} \h{y}_k = 1$, solve via Lagrange multipliers
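Filling in that step (here $\lambda$ is the multiplier and $\bar{y}_j = \frac{1}{N}\sum_{i=1}^N y_{ij}$ is the empirical frequency of class $j$): $$ \frac{\p}{\p\h{y}_j}\left( L(\h{y}; y) + \lambda\left( \sum_{k=1}^{|Y|} \h{y}_k - 1 \right) \right) = -\frac{\bar{y}_j}{\h{y}_j} + \lambda = 0 \quad\implies\quad \h{y}_j = \frac{\bar{y}_j}{\lambda} $$ Summing over $j$ and using $\sum_{k=1}^{|Y|} \h{y}_k = 1 = \sum_{k=1}^{|Y|} \bar{y}_k$ gives $\lambda = 1$, hence $\h{y}_j = \bar{y}_j$.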
Result: $\displaystyle\frac{\sum_{i=1}^N y_{i(\.)}}{N} = \argmin_{\h{y}\in\c{M}_1(Y)} L(\h{y};y)$
Cross-entropy is a distribution-learning objective
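A quick numerical check of this claim, assuming one-hot labels and a softmax-parameterized constant predictor trained by plain gradient descent (the names and constants below are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
num_classes, N = 4, 1000
labels = rng.choice(num_classes, size=N, p=[0.1, 0.2, 0.3, 0.4])
Y = np.eye(num_classes)[labels]   # one-hot labels y_{ik}
freq = Y.mean(axis=0)             # empirical frequencies (1/N) sum_i y_{ij}

theta = np.zeros(num_classes)     # parameterize y_hat = softmax(theta)
for _ in range(2000):
    # gradient of the cross-entropy loss w.r.t. theta under this parameterization
    theta -= 0.5 * (softmax(theta) - freq)

print(np.round(softmax(theta), 3))  # minimizer of the cross-entropy ...
print(np.round(freq, 3))            # ... matches the empirical class frequencies
```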
Transformers are sequence-to-sequence models:
Transformers can internally represent algorithms:
Transformers learn conditional distributions over discrete spaces:
I don't know! But remember:
These technical primitives are clearly very powerful
Some vague ideas:
References
Anthropic. A Mathematical Framework for Transformer Circuits. 2021.
C. Sanford, D. Hsu, M. Telgarsky. Representational Strengths and Limitations of Transformers. NeurIPS 2023.
Anthropic. In-context Learning and Induction Heads. 2022.
C. Sanford, D. Hsu, M. Telgarsky. One-Layer Transformers Fail to Solve the Induction Heads Task. 2024.
Research interests: decision-making under uncertainty
Recent work: Gittins indices for Bayesian optimization, adversarial online learning
Talks about both are available on my website!