Talk for ETH Zürich

Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent

Alexander Terenin

https://avt.im/ · @avt_im

Gaussian Processes

Probabilistic formulation provides uncertainty

Bayesian Optimization

Automatic explore-exploit tradeoff

From Bayesian Optimization to Bayesian Interactive Decision-making

Pathwise Conditioning

prior term + data term = posterior Gaussian process

$$ \htmlData{class=fragment,fragment-index=2} { (f\given\v{y})(\.) = } \htmlData{class=fragment,fragment-index=0} { f(\.) } \htmlData{class=fragment,fragment-index=1} { + \m{K}_{(\.)\v{x}} } \htmlData{class=fragment,fragment-index=1} { \mathrlap{\m{K}_{\v{x}\v{x}}^{-1}} } \htmlData{class=fragment,fragment-index=3} { \ubr{\phantom{\m{K}_{\v{x}\v{x}}^{-1}}}{\c{O}(N^3)} } \htmlData{class=fragment,fragment-index=1} { (\v{y} - f(\v{x})) } $$

$$ (f\given\v{y})(\.) = f(\.) + \sum_{i=1}^N v_i k(x_i,\.) \qquad \htmlData{class=fragment,fragment-index=5} { \v{v} = \m{K}_{\v{x}\v{x}}^{-1}(\v{y} - f(\v{x})) } $$

$\v{v}$: representer weights

$k(x_i,\.)$: canonical basis functions
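
To make the pathwise formula concrete, here is a minimal NumPy sketch (not from the talk): the squared exponential kernel, the toy data, and the direct $\c{O}(N^3)$ solve for the representer weights are illustrative assumptions.

```python
import numpy as np

def k(a, b, lengthscale=1.0):
    # Squared exponential kernel (an illustrative choice, not fixed by the talk)
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=20)                      # training inputs
y = np.sin(x) + 0.1 * rng.standard_normal(len(x))    # observations
x_test = np.linspace(-5, 5, 200)                     # where we want the posterior

# Draw one prior sample jointly at training and test locations
x_all = np.concatenate([x, x_test])
K_all = k(x_all, x_all) + 1e-6 * np.eye(len(x_all))  # jitter for stability
f_all = np.linalg.cholesky(K_all) @ rng.standard_normal(len(x_all))
f_x, f_test = f_all[:len(x)], f_all[len(x):]

# Representer weights v = K_xx^{-1} (y - f(x)): the O(N^3) bottleneck
K_xx = k(x, x) + 1e-6 * np.eye(len(x))
v = np.linalg.solve(K_xx, y - f_x)

# Pathwise conditioning: (f | y)(.) = f(.) + sum_i v_i k(x_i, .)
posterior_sample = f_test + k(x_test, x) @ v
```

The same structure carries over to noisy observations by replacing $\m{K}_{\v{x}\v{x}}$ with $\m{K}_{\v{x}\v{x}} + \m\Sigma$ and perturbing the targets, as in the objectives later in the talk.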

Conjugate Gradients

Refinement of gradient descent for computing linear system solutions $\m{A}^{-1}\v{b}$

Convergence rate is much faster than gradient descent

Precise rate depends mainly on $\f{cond}(\m{A})$
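
For concreteness, a minimal conjugate gradient iteration for symmetric positive definite $\m{A}$ (a sketch; the residual-based stopping rule and tolerance are illustrative assumptions, not from the talk).

```python
import numpy as np

def conjugate_gradients(A, b, tol=1e-8, max_iters=None):
    """Solve A x = b for symmetric positive definite A."""
    x = np.zeros_like(b)
    r = b - A @ x                      # residual
    p = r.copy()                       # current search direction
    rs = r @ r
    for _ in range(max_iters or len(b)):
        Ap = A @ p
        alpha = rs / (p @ Ap)          # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:      # stop once the residual is small
            break
        p = r + (rs_new / rs) * p      # new direction, A-conjugate to previous ones
        rs = rs_new
    return x
```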

Numerical Stability

Condition number: quantifies difficulty of solving $\m{A}^{-1}\v{b}$

$$ \htmlClass{fragment}{ \f{cond}(\m{A}) } \htmlClass{fragment}{ = \lim_{\eps\to0} \sup_{\norm{\v\delta} \leq \eps\norm{\v{b}}} \frac{\norm{\m{A}^{-1}(\v{b} + \v\delta) - \m{A}^{-1}\v{b}}_2}{\eps\norm{\m{A}^{-1}\v{b}}_2} } \htmlClass{fragment}{ = \frac{\lambda_{\max}(\m{A})}{\lambda_{\min}(\m{A})} } $$

$\lambda_{\min},\lambda_{\max}$: smallest and largest eigenvalues of $\m{A}$

Condition Numbers of Kernel Matrices

Are kernel matrices always well-conditioned? No.

One-dimensional time series on grid: Kac–Murdock–Szegö matrix $$ \m{K}_{\v{x}\v{x}} = \scriptsize\begin{pmatrix} 1 & \rho &\rho^2 &\dots & \rho^{N-1} \\ \rho & 1 & \rho & \dots & \rho^{N-2} \\ \vdots & \vdots &\ddots & \ddots & \vdots \\ \rho^{N-1} & \rho^{N-2} & \rho^{N-3} & \dots & 1 \end{pmatrix} $$ for which $\frac{1 + 2\rho - 2\rho\eps + \rho^2}{1 - 2\rho + 2\rho\eps + \rho^2} \leq \f{cond}(\m{K}_{\v{x}\v{x}}) \leq \frac{(1+\rho)^2}{(1-\rho)^2}$, where $\eps = \frac{\pi^2}{N+1}$.
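
A quick numerical sanity check of this bound (a sketch; the grid size and the values of $\rho$ are arbitrary choices):

```python
import numpy as np

N = 500
idx = np.arange(N)
for rho in [0.5, 0.9, 0.99]:
    K = rho ** np.abs(idx[:, None] - idx[None, :])   # Kac-Murdock-Szego matrix
    cond = np.linalg.cond(K)                         # 2-norm condition number
    upper = (1 + rho) ** 2 / (1 - rho) ** 2
    print(f"rho={rho}: cond(K) = {cond:.1f} <= {upper:.1f}")
```

As $\rho \to 1$, i.e. as neighboring grid points become more correlated, the condition number blows up even though the matrix stays positive definite.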

Condition Numbers of Kernel Matrices

Problem: too much correlation $\leadsto$ points too close together

Minimum Separation

Separation: minimum distance between distinct $z_i$ and $z_j$

Proposition. Assuming spatial decay and stationarity, separation controls $\f{cond}(\m{K}_{\v{z}\v{z}})$ uniformly in $M$.

Idea: use this to select numerically stable inducing points

It turns out: stability isn't needed

There's a different way to gain performance

Sampling from Gaussian Process Posteriors

using Stochastic Gradient Descent

Jihao Andreas Lin,* Javier Antorán,* Shreyas Padhy,*

David Janz, José Miguel Hernández-Lobato, Alexander Terenin

*equal contribution

Have you tried stochastic gradient descent?

Conventional wisdom in deep learning:

  • SGD variants are empirically often the best optimization algorithms
  • Adam is extremely effective, even on non-convex problems
  • Minibatch-based training is a critical part of scalability

Why not try it out for Gaussian process posterior sample paths?

Gaussian Process Posteriors via Randomized Optimization Objectives

Split into posterior mean and uncertainty reduction terms $$ \begin{aligned} \htmlData{class=fragment,fragment-index=1}{ (f\given\v{y})(\.) } & \htmlData{class=fragment,fragment-index=1}{ = f(\.) + \m{K}_{(\.)\v{x}}(\m{K}_{\v{x}\v{x}} + \m\Sigma)^{-1}(\v{y} - f(\v{x}) - \v\eps) } \\ & \htmlData{class=fragment,fragment-index=2}{ = f(\.) + \sum_{i=1}^N v_i^* k(x_i,\.) - \sum_{i=1}^N \alpha_i^* k(x_i,\.) } \end{aligned} $$ where $$ \begin{aligned} \htmlData{class=fragment,fragment-index=4}{ \v{v}^* } & \htmlData{class=fragment,fragment-index=4}{ = \argmin_{\v{v} \in \R^N} \sum_{i=1}^N \frac{(y_i - \m{K}_{x_i\v{x}}\v{v})^2}{\Sigma_{ii}} + \v{v}^T\m{K}_{\v{x}\v{x}}\v{v} } \\ \htmlData{class=fragment,fragment-index=5}{ \v\alpha^* } & \htmlData{class=fragment,fragment-index=5}{ = \argmin_{\v\alpha \in \R^N} \sum_{i=1}^N \frac{(f(x_i) + \eps_i - \m{K}_{x_i\v{x}}\v\alpha)^2}{\Sigma_{ii}} + \v\alpha^T\m{K}_{\v{x}\v{x}}\v\alpha . } \end{aligned} $$

Gaussian Process Posteriors via Randomized Optimization Objectives

$$ \begin{aligned} \v{v}^* &= \argmin_{\v{v} \in \R^N} \sum_{i=1}^N \frac{(y_i - \m{K}_{x_i\v{x}}\v{v})^2}{\Sigma_{ii}} + \v{v}^T\m{K}_{\v{x}\v{x}}\v{v} \\ \v\alpha^* &= \argmin_{\v\alpha \in \R^N} \sum_{i=1}^N \frac{(f(x_i) + \eps_i - \m{K}_{x_i\v{x}}\v\alpha)^2}{\Sigma_{ii}} + \v\alpha^T\m{K}_{\v{x}\v{x}}\v\alpha . \end{aligned} $$

First term: apply mini-batch estimation

Second term: evaluate stochastically via random Fourier features

Variance reduction trick: shift $\eps_i$ into regularizer

Use stochastic gradient descent with Polyak averaging
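
A minimal sketch of the mean-term optimization (not the paper's implementation): minibatch SGD on the first objective with Polyak averaging. The step size, batch size, kernel, and toy data are assumptions, and the regularizer gradient is estimated by subsampling kernel columns rather than with random Fourier features as in the talk.

```python
import numpy as np

def k(a, b, lengthscale=1.0):
    # Squared exponential kernel (an illustrative choice)
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / lengthscale) ** 2)

rng = np.random.default_rng(0)
N = 1000
x = rng.uniform(-3, 3, size=N)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(N)
noise = 0.1 ** 2                          # Sigma = noise * I, assumed homoscedastic

v = np.zeros(N)                           # representer weights for the mean term
v_avg = np.zeros(N)                       # Polyak (running) average of the iterates
lr, B, n_steps = 1e-5, 64, 2000           # illustrative, untuned choices

for t in range(1, n_steps + 1):
    idx = rng.choice(N, size=B, replace=False)        # minibatch of data points
    K_bx = k(x[idx], x)                               # the B sampled rows of K_xx
    resid = y[idx] - K_bx @ v
    # Unbiased minibatch gradient of (1/N) sum_i (y_i - K_{x_i x} v)^2 / noise
    grad_fit = -(2.0 / (noise * B)) * (K_bx.T @ resid)
    # Unbiased minibatch gradient of (1/N) v^T K_xx v, via sampled kernel columns
    grad_reg = (2.0 / B) * (K_bx.T @ v[idx])
    v -= lr * (grad_fit + grad_reg)
    v_avg += (v - v_avg) / t                          # Polyak averaging

posterior_mean = lambda z: k(z, x) @ v_avg            # evaluate the mean anywhere
```

The second set of weights $\v\alpha^*$ is obtained the same way, with targets $f(x_i) + \eps_i$ in place of $y_i$.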

Stochastic Gradient Descent

It works better than conjugate gradients on test data? Wait, what?

What happens in one dimension?

Performance depends on data-generation asymptotics

What happens in one dimension?

SGD does not converge to the correct solution,

but still produces reasonable error bars

$\leadsto$ implicit bias?

Convergence: Euclidean and RKHS Norms

No convergence in representer weight space

or in the reproducing kernel Hilbert space

Good test performance

Convergence: Euclidean and RKHS Norms

Performance not significantly affected by noise

Unstable optimization problem $\leadsto$ benign non-convergence

$\leadsto$ implicit bias

Where does SGD's implicit bias affect predictions?

Error seems to concentrate away from data, but not too far away?

Where does SGD's implicit bias affect predictions?

Interpolation region

Where does SGD's implicit bias affect predictions?

Extrapolation region

Where does SGD's implicit bias affect predictions?

Far-away region

The Far-away Region

$$ (f\given\v{y})(\.) = f(\.) + \sum_{i=1}^N v_i k(x_i,\.) $$

Kernel decays in space $\leadsto$ predictions revert to prior

The Interpolation Region

Low approximation error where data is dense

The Interpolation Region

Idea: maybe SGD (a) converges fast on a subspace, and (b) obtains something arbitrary but benign on the rest of the space?

Let $\m{K}_{\v{x}\v{x}} = \m{U}\m\Lambda\m{U}^T$ be the eigendecomposition of the kernel matrix. Define the spectral basis functions $$ u^{(i)}(\.) = \sum_{j=1}^N \frac{U_{ji}}{\sqrt{\lambda_i}} k(x_j, \.) . $$
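
A small sketch of how the spectral basis functions can be computed and evaluated (kernel, inputs, and jitter are illustrative assumptions, not from the slides):

```python
import numpy as np

def k(a, b, lengthscale=1.0):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, size=200))             # training inputs
K_xx = k(x, x) + 1e-6 * np.eye(len(x))                # jitter keeps eigenvalues positive

# Eigendecomposition K_xx = U diag(lam) U^T, sorted by decreasing eigenvalue
lam, U = np.linalg.eigh(K_xx)
lam, U = lam[::-1], U[:, ::-1]

def spectral_basis(i, z):
    # u^{(i)}(.) = sum_j U_{ji} / sqrt(lam_i) * k(x_j, .)
    return k(z, x) @ (U[:, i] / np.sqrt(lam[i]))

z = np.linspace(-5, 5, 500)
u_top = spectral_basis(0, z)      # large eigenvalue: concentrates where the data is
u_tail = spectral_basis(150, z)   # small eigenvalue: mass between and away from data
```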

The Interpolation Region

Large-eigenvalue spectral basis functions concentrate on data

The Interpolation Region

Proposition. With probability $1-\delta$, we have $$ \norm{\text{proj}_{u^{(i)}} \mu_{f\given\v{y}} - \text{proj}_{u^{(i)}} \mu_{\f{SGD}}}_{H_k} \leq \frac{1}{\sqrt{\lambda_i t}}\del{\frac{\norm{\v{y}}_2}{\eta\sigma^2} + G\sqrt{2\eta\sigma^2 \log\frac{N}{\delta}}} . $$

$\eta$: learning rate

$\sigma^2$: noise variance

$\lambda_i$: kernel matrix eigenvalues

$G$: gradient's sub-Gaussian coefficient

SGD converges fast with respect to top spectral basis functions
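
To unpack what the bound controls, a short calculation (not on the slides) using the reproducing property: for any $\mu(\.) = \sum_{j=1}^N v_j k(x_j, \.)$,

$$ \langle \mu, u^{(i)} \rangle_{H_k} = \sum_{j=1}^N v_j\, u^{(i)}(x_j) = \sqrt{\lambda_i}\, (\m{U}^T \v{v})_i , \qquad\text{since}\qquad u^{(i)}(x_j) = \sqrt{\lambda_i}\, U_{ji} . $$

Writing $\v{v}_{\f{SGD}}$ for the SGD iterate's representer weights, the left-hand side of the proposition equals $\sqrt{\lambda_i}\,|(\m{U}^T(\v{v}^* - \v{v}_{\f{SGD}}))_i|$, so the weight error along the $i$-th eigenvector is bounded by a constant times $1/(\lambda_i \sqrt{t})$: directions with large eigenvalues are learned quickly.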

The Interpolation Region

Where do the top spectral basis functions concentrate?

  • Idea: lift Courant–Fischer eigenvector characterization to the RKHS

Proposition. The spectral basis functions can be written $$ u^{(i)}(\.) \htmlClass{fragment}{ = \argmax_{ \htmlClass{fragment}{ u \in H_k } } \cbr{ \htmlClass{fragment}{ \sum_{n=1}^N u(x_n)^2 } \htmlClass{fragment}{ : } \begin{aligned} & \htmlClass{fragment}{ \norm{u}_{H_k} = 1 } \\ & \htmlClass{fragment}{ \langle u,u^{(j)}\rangle_{H_k} = 0 } \htmlClass{fragment}{, \forall j \lt i } \end{aligned} }. } $$

Spectral basis functions concentrate on the data as much as possible, while remaining orthogonal to those with larger eigenvalues
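
One way to quantify "as much as possible" (a short check, not on the slides): using $u^{(i)}(x_j) = \sqrt{\lambda_i}\, U_{ji}$ from the calculation above,

$$ \sum_{j=1}^N u^{(i)}(x_j)^2 = \lambda_i \sum_{j=1}^N U_{ji}^2 = \lambda_i , $$

so the maximum attained in the proposition is exactly $\lambda_i$: each unit-norm spectral basis function puts as much of its mass on the training inputs as its eigenvalue allows.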

The Extrapolation Region

High error: 

between the interpolation and far-away regions

where small-eigenvalue spectral basis functions concentrate

SGD's implicit bias for Gaussian processes

Implicit bias: SGD converges quickly near the data, and causes no harm far from the data

Error concentrates in regions (a) without much data, which also (b) aren't located too far from the data:

  • Lack of data $\leadsto$ predictions are mostly arbitrary
  • Empirically: predictions revert to the prior faster than under the exact posterior

Benign non-convergence $\leadsto$ robustness to instability

Performance

Conjugate gradients: non-monotonic test error, in spite of monotonic convergence

SGD: almost always monotonic decrease in test error

Performance

Strong predictive performance at sufficient scale

Parallel Thompson Sampling

Uncertainty: strong decision-making performance
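
For context, a minimal sketch of one parallel Thompson sampling step built on posterior function samples; the helper `sample_posterior_path`, the candidate grid, and the batch size are hypothetical stand-ins, not the paper's interface.

```python
import numpy as np

def parallel_thompson_step(sample_posterior_path, candidates, batch_size, rng):
    """Choose a batch of query points by maximizing independent posterior samples.

    sample_posterior_path(z, rng): returns one posterior sample path evaluated
    at the points z, e.g. via pathwise conditioning with SGD-computed weights.
    """
    queries = []
    for _ in range(batch_size):
        path = sample_posterior_path(candidates, rng)    # draw one sample path
        queries.append(candidates[np.argmax(path)])      # maximize that sample
    return np.array(queries)
```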

Reflections

Numerical analysis conventional wisdom:

  • Don't run gradient descent on quadratic objectives
    • It's slow; conjugate gradients work much better
  • If CG is slow, then your problem is unstable
    • Unstable problems are ill-posed, so you should reformulate

This work: a very different way of looking at things

  • Don't solve the linear system approximately if the solution isn't inherently needed
  • Instead of a well-posed problem, a well-posed subproblem might be good enough
  • Try SGD! It might work well, including for counterintuitive reasons
  • A kernel matrix's eigenvectors carry information about data density
    • To see this, adopt a function-analytic view given by the spectral basis functions

“When solving a given problem, try to avoid solving a more general problem as an intermediate step.” — Vladimir Vapnik, The Nature of Statistical Learning Theory

Thank you!

https://avt.im/ · @avt_im

J. A. Lin,* J. Antorán,* S. Padhy,* D. Janz, J. M. Hernández-Lobato, A. Terenin. Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent. arXiv:2306.11589, 2023.

A. Terenin,* D. R. Burt,* A. Artemev, S. Flaxman, M. van der Wilk, C. E. Rasmussen, H. Ge. Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees. arXiv:2210.07893, 2022.

J. T. Wilson,* V. Borovitskiy,* P. Mostowsky,* A. Terenin,* M. P. Deisenroth. Efficiently Sampling Functions from Gaussian Process Posteriors. International Conference on Machine Learning, 2020. Honorable Mention for Outstanding Paper Award.

J. T. Wilson,* V. Borovitskiy,* P. Mostowsky,* A. Terenin,* M. P. Deisenroth. Pathwise Conditioning of Gaussian Processes. Journal of Machine Learning Research, 2021.

*Equal contribution