Talk for ISBA

Sampling from

Gaussian Process

Posteriors using

Stochastic Gradient Descent

Alexander Terenin

https://avt.im/ · @avt_im

Gaussian Process

es

Probabilistic formulation provides uncertainty

Model: $y_i = f(x_i) + \eps_i$ with

$\eps_i\~\f{N}(0,\sigma^2)$, $f\~\f{GP}(0,k)$

Kernel matrix: $\m{K}_{\v{x}\v{x}}$

with entries $k(x_i,x_j)$

Bayesian Optimization

Automatic explore-exploit tradeoff

Pathwise Conditioning

$$ \htmlData{class=fragment,fragment-index=2} { (f\given\v{y})(\.) = } \htmlData{class=fragment,fragment-index=0} { f(\.) } \htmlData{class=fragment,fragment-index=1} { + \m{K}_{(\.)\v{x}} } \htmlData{class=fragment,fragment-index=1} { \mathrlap{\m{K}_{\v{x}\v{x}}^{-1}} } \htmlData{class=fragment,fragment-index=3} { \ubr{\phantom{\m{K}_{\v{x}\v{x}}^{-1}}}{\c{O}(N^3)} } \htmlData{class=fragment,fragment-index=1} { (\v{y} - f(\v{x})) } $$

$$ (f\given\v{y})(\.) = f(\.) + \sum_{i=1}^N v_i k(x_i,\.) \qquad \htmlData{class=fragment,fragment-index=5} { \v{v} = \m{K}_{\v{x}\v{x}}^{-1}(\v{y} - f(\v{x})) } $$

$\v{v}$: representer weights

$k(x_i,\.)$: canonical basis functions

Conjugate Gradients

Refinement of gradient descent for solving linear systems $\m{A}^{-1}\v{b}$

Convergence rate generally much faster than gradient descent

Numerical Stability

What does convergence rate of CG depend on? Condition number.

$$ \htmlClass{fragment}{ \f{cond}(\m{A}) } \htmlClass{fragment}{ = \lim_{\eps\to0} \sup_{\norm{\v\delta} \leq \eps\norm{\v{b}}} \mathrlap{\frac{\norm{\m{A}^{-1}(\v{b} + \v\delta) - \m{A}^{-1}\v{b}}_2}{\eps\norm{\m{A}^{-1}\v{b}}_2}} } \htmlClass{fragment}{ \ubr{\phantom{\frac{\norm{\m{A}^{-1}(\v{b} + \v\delta) - \m{A}^{-1}\v{b}}_2}{\eps\norm{\m{A}^{-1}\v{b}}_2}}}{\t{stability to perturbation by}\delta} } \htmlClass{fragment}{ = \frac{\lambda_{\max}(\m{A})}{\lambda_{\min}(\m{A})} } $$

$\lambda_{\min},\lambda_{\max}$: eigenvalues

Condition Numbers of Kernel Matrices

Are kernel matrices always numerically stable? No.

One-dimensional time series on grid: Kac–Murdock–Szegö matrix $$ \m{K}_{\v{x}\v{x}} = \scriptsize\begin{pmatrix} 1 & \rho &\rho^2 &\dots & \rho^{n-1} \\ \rho & 1 & \rho & \dots & \rho^{n-2} \\ \vdots & \vdots &\ddots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \dots & 1 \end{pmatrix} $$ for which $\frac{1+ 2\rho + 2\rho\eps + \rho^2}{1 - 2\rho - 2\rho\eps + \rho^2} \leq \f{cond}(\m{K}_{\v{x}\v{x}}) \leq \frac{(1+\rho)^2}{(1-\rho)^2}$, where $\eps = \frac{\pi^2}{N+1}$.

Condition Numbers of Kernel Matrices

Problem: too much correlation $\leadsto$ points too close by

Idea: limit maximum correlation $\leadsto$ gain performance?

Minimum Separation

Separation: minimum distance between distinct $z_i$ and $z_j$

Proposition. Assuming spatial decay and stationarity, separation controls $\f{cond}(\m{K}_{\v{z}\v{z}})$ uniformly in $M$.

Approach: use this to select numerically stable inducing points

Sampling from Gaussian Process Posteriors

using Stochastic Gradient Descent

Jihao Andreas Lin,* Javier Antorán,* Shreyas Padhy,*

David Janz, José Miguel Hernández-Lobato, Alexander Terenin

Have you tried stochastic gradient descent?

Conventional wisdom in deep learning:

SGD variants are empirically often the best optimization algorithms
ADAM is extremely effective, even on non-convex problems
Minibatch-based training critical part of scalability

Why not try it out for Gaussian process posterior sample paths?

Gaussian Process Posteriors via Randomized Optimization Objectives

Split into posterior mean and uncertainty reduction terms $$ \begin{aligned} \htmlData{class=fragment,fragment-index=1}{ (f\given\v{y})(\.) } & \htmlData{class=fragment,fragment-index=1}{ = f(\.) + \m{K}_{(\.)\v{x}}(\m{K}_{\v{x}\v{x}} + \m\Sigma)^{-1}(\v{y} - f(\v{x}) - \v\eps) } \\ & \htmlData{class=fragment,fragment-index=2}{ = f(\.) + \sum_{i=1}^N v_i^* k(x_i,\.) + \sum_{i=1}^N \alpha_i^* k(x_i,\.) } \end{aligned} $$ where $$ \begin{aligned} \htmlData{class=fragment,fragment-index=4}{ \v{v}^* } & \htmlData{class=fragment,fragment-index=4}{ = \argmin_{\v{v} \in \R^N} \sum_{i=1}^N \frac{(y_i - \m{K}_{x_i\v{x}}\v{v})^2}{\Sigma_{ii}} + \v{v}^T\m{K}_{\v{x}\v{x}}\v{v} } \\ \htmlData{class=fragment,fragment-index=5}{ \v\alpha^* } & \htmlData{class=fragment,fragment-index=5}{ = \argmin_{\v\alpha \in \R^N} \sum_{i=1}^N \frac{(f(x_i) + \eps_i - \m{K}_{x_i\v{x}}\v\alpha)^2}{\Sigma_{ii}} + \v\alpha^T\m{K}_{\v{x}\v{x}}\v\alpha . } \end{aligned} $$

Gaussian Process Posteriors via Randomized Optimization Objectives

$$ \begin{aligned} \v{v}^* &= \argmin_{\v{v} \in \R^N} \sum_{i=1}^N \frac{(y_i - \m{K}_{x_i\v{x}}\v{v})^2}{\Sigma_{ii}} + \v{v}^T\m{K}_{\v{x}\v{x}}\v{v} \\ \v\alpha^* &= \argmin_{\v\alpha \in \R^N} \sum_{i=1}^N \frac{(f(x_i) + \eps_i - \m{K}_{x_i\v{x}}\v\alpha)^2}{\Sigma_{ii}} + \v\alpha^T\m{K}_{\v{x}\v{x}}\v\alpha . \end{aligned} $$

First term: apply mini-batch estimation

Second term: evaluate stochastically via random Fourier features

Variance reduction trick: shift $\eps_i$ into regularizer

Use stochastic gradient descent with Polyak averaging

Stochastic Gradient Descent

It works better than conjugate gradients on test data? Wait, what?

What happens in one dimension?

Performance depends on data-generation asymptotics

What happens in one dimension?

SGD does not converge to the correct solution,

but still produces reasonable error bars

$\leadsto$ implicit bias?

Convergence: Euclidean and RKHS Norms

No real convergence in representer weight space

or in the reproducing kernel Hilbert space

Good test

performance

Convergence: Euclidean and RKHS Norms

Performance not significantly affected by noise

Unstable optimization problem $\leadsto$ benign non-convergence

$\leadsto$ implicit bias

Where does SGD's implicit bias affect predictions?

Error seems to concentrate away from data, but not too far away?

Where does SGD's implicit bias affect predictions?

Interpolation region

Extrapolation region

Prior region

Where does SGD's implicit bias affect predictions?

Interpolation region

Extrapolation region

Prior region

Where does SGD's implicit bias affect predictions?

Interpolation region

Extrapolation region

Prior region

The Prior Region

$$ (f\given\v{y})(\.) = f(\.) + \sum_{i=1}^N v_i k(x_i,\.) $$

Kernel decays in space $\leadsto$ predictions revert to prior

The Interpolation Region

Low approximation error where data is dense

The Interpolation Region

Idea: maybe SGD (a) converges fast on a subspace, and (b) obtains something arbitrary but benign on the rest of the space?

Let $\m{K}_{\v{x}\v{x}} = \m{U}\m\Lambda\m{U}^T$ be the eigendecomposition of the kernel matrix. Define the spectral basis functions $$ u^{(i)}(\.) = \sum_{j=1}^N \frac{U_{ji}}{\sqrt{\lambda_i}} k(x_j, \.) . $$

The Interpolation Region

Large-eigenvalue spectral basis functions concentrate on data

The Interpolation Region

Proposition. With probability $1-\delta$, we have $$ \norm{\text{proj}_{u^{(i)}} \mu_{f\given\v{y}} - \text{proj}_{u^{(i)}} \mu_{\f{SGD}}}_{H_k} \leq \frac{1}{\sqrt{\lambda_i t}}\del{\frac{\norm{\v{y}}_2}{\eta\sigma^2} + G\sqrt{2\eta\sigma^2 \log\frac{N}{\delta}}} . $$

$\eta$: learning rate

$\sigma^2$: noise variance

$\lambda_i$: kernel matrix eigenvalues

$G$: gradient's sub-Gaussian coefficient

SGD converges fast with respect to top spectral basis functions

The Interpolation Region

Where do the top spectral basis functions concentrate?

Idea: lift Courant–Fischer eigenvector characterization to the RKHS
Result: RKHS view of kernel PCA

Proposition. The spectral basis functions can be written $$ u^{(i)}(\.) \htmlClass{fragment}{ = \argmax_{ \htmlClass{fragment}{ u \in H_k } } \cbr{ \htmlClass{fragment}{ \sum_{i=1}^N u(x_i)^2 } \htmlClass{fragment}{ : } \begin{aligned} & \htmlClass{fragment}{ \norm{u}_{H_k} = 1 } \\ & \htmlClass{fragment}{ \langle u,u^{(j)}\rangle_{H_k} = 0 } \htmlClass{fragment}{, \forall j \lt i } \end{aligned} }. } $$

Spectral basis functions concentrate on the data as much as possible, while remaining orthogonal to those with larger eigenvalues

The Extrapolation Region

High error:

between the interpolation and prior regions

where small-eigenvalue spectral basis functions concentrate

SGD's implicit bias for Gaussian processes

Implicit bias: SGD converges quickly near the data, and causes no harm far from the data

Error concentrates in regions (a) without much data, which also (b) aren't located too far from the data:

Lack of data $\leadsto$ predictions are mostly arbitrary
Empirically: functions shrink to prior faster than exact posterior

Benign non-convergence $\leadsto$ robustness to instability

Performance

Conjugate gradients: non-monotonic test error, in spite of monotonic convergence

SGD: almost always monotonic decrease in test error

Performance

Strong predictive performance at sufficient scale

Parallel Thompson Sampling: Stochastic Gradient Descent

Uncertainty: strong decision-making performance

Parallel Thompson Sampling: Stochastic Dual Descent

Uncertainty: strong decision-making performance

Reflections

Numerical analysis conventional wisdom:

Don't run gradient descent on quadratic objectives
- It's slow, conjugate gradient works much better
If CG is slow then your problem is unstable
- Unstable problems are ill-posed, you should reformulate

This work: a very different way of looking at things

Don't solve the linear system approximately if the solution isn't inherently needed
Instead of a well-posed problem, a well-posed subproblem might be good enough
Try SGD! It might work well, including for counterintuitive reasons
A kernel matrix's eigenvectors carry information about data-density
- To see this: function-analytic view via spectral basis functions and kernel PCA

Applications and Current Work

Bayesian Optimization with

Function Networks for Molecule Design

Use Bayesian optimization to find good candidates in generative model's latent space

Challenges: need a lot of samples (this talk), multi-stage evaluation with partial feedback

Pandora's Box Gittins Index

Novel technical framework for deriving acquisition functions

Next step: balance explore-exploit tradeoffs in settings such as function network feedback

Thank you!

https://avt.im/· @avt_im

Thank you!

https://avt.im/· @avt_im

J. A. Lin,* S. Padhy,* J. Antorán,* A. Tripp, A. Terenin, Cs. Szepesvári, J. M. Hernández-Lobato, D. Janz. Stochastic Gradient Descent for Gaussian Processes Done Right. ICLR, 2024.

J. A. Lin,* J. Antorán,* S. Padhy,* D. Janz, J. M. Hernández-Lobato, A. Terenin. Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent. NeurIPS, 2023. Selected for Oral Presentation.

A. Terenin,* D. R. Burt,* A. Artemev, S. Flaxman, M. van der Wilk, C. E. Rasmussen, H. Ge. Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees. JMLR, 2024.

J. T. Wilson,* V. Borovitskiy,* P. Mostowsky,* A. Terenin,* M. P. Deisenroth. Efficiently Sampling Functions from Gaussian Process Posteriors. ICML, 2020. Honorable Mention for Outstanding Paper Award.

J. T. Wilson,* V. Borovitskiy,* P. Mostowsky,* A. Terenin,* M. P. Deisenroth. Pathwise Conditioning of Gaussian Process. JMLR, 2021.

Q. Xie, R. Astudillo, P. Frazier, Z. Scully, A. Terenin. Cost-aware Bayesian Optimization via the Pandora's Box Gittins Index. arXiv:2406.20062, 2024.

*Equal contribution