Talk for Cornell SCAN Seminar

Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent

Alexander Terenin

https://avt.im/ · @avt_im

Gaussian Processes

Probabilistic formulation provides uncertainty

Bayesian Optimization

Automatic explore-exploit tradeoff

From Bayesian Optimization to Bayesian Interactive Decision-making

Pathwise Conditioning

prior term + data term = posterior Gaussian process

$$ \htmlData{class=fragment,fragment-index=2} { (f\given\v{y})(\.) = } \htmlData{class=fragment,fragment-index=0} { f(\.) } \htmlData{class=fragment,fragment-index=1} { + \m{K}_{(\.)\v{x}} } \htmlData{class=fragment,fragment-index=1} { \mathrlap{\m{K}_{\v{x}\v{x}}^{-1}} } \htmlData{class=fragment,fragment-index=3} { \ubr{\phantom{\m{K}_{\v{x}\v{x}}^{-1}}}{\c{O}(N^3)} } \htmlData{class=fragment,fragment-index=1} { (\v{y} - f(\v{x})) } $$

$$ (f\given\v{y})(\.) = f(\.) + \sum_{i=1}^N v_i k(x_i,\.) \qquad \htmlData{class=fragment,fragment-index=5} { \v{v} = \m{K}_{\v{x}\v{x}}^{-1}(\v{y} - f(\v{x})) } $$

$\v{v}$: representer weights

$k(x_i,\.)$: canonical basis functions
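
To make the formula concrete, here is a minimal NumPy sketch of noiseless pathwise conditioning with an RBF kernel. All names, sizes, and hyperparameters below are illustrative assumptions, not code from the talk.

```python
import numpy as np

def rbf(a, b, lengthscale=0.5):
    # k(a, b) = exp(-|a - b|^2 / (2 * lengthscale^2)) on 1-d inputs
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
x_train = rng.uniform(-2, 2, size=20)
y_train = np.sin(3 * x_train) + 0.1 * rng.standard_normal(20)
x_test = np.linspace(-3, 3, 200)

# Draw a joint prior sample f(.) at train and test locations
x_all = np.concatenate([x_train, x_test])
K_all = rbf(x_all, x_all) + 1e-6 * np.eye(x_all.size)   # jitter for numerical safety
f_all = np.linalg.cholesky(K_all) @ rng.standard_normal(x_all.size)
f_train, f_test = f_all[:x_train.size], f_all[x_train.size:]

# Representer weights: v = K_xx^{-1} (y - f(x))  -- the O(N^3) step
K_xx = rbf(x_train, x_train) + 1e-6 * np.eye(x_train.size)
v = np.linalg.solve(K_xx, y_train - f_train)

# Posterior sample: (f | y)(.) = f(.) + sum_i v_i k(x_i, .)
posterior_sample = f_test + rbf(x_test, x_train) @ v
```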

Conjugate Gradients

Refinement of gradient descent for computing solutions $\m{A}^{-1}\v{b}$ of linear systems

Convergence rate generally much faster than gradient descent
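
As a quick illustration (an assumed setup, not the talk's): the representer weights can be obtained iteratively with SciPy's conjugate gradient solver, which only touches the kernel matrix through matrix-vector products.

```python
import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=500)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(500)

# RBF kernel matrix with a small jitter so the system K v = y is well defined
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.5) ** 2) + 1e-4 * np.eye(500)

v_cg, info = cg(K, y)                  # iterative solve; info == 0 means converged
v_direct = np.linalg.solve(K, y)       # direct O(N^3) solve, for comparison
print(info, np.max(np.abs(v_cg - v_direct)))
```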

Numerical Stability

What does the convergence rate of CG depend on? The condition number.

$$ \htmlClass{fragment}{ \f{cond}(\m{A}) } \htmlClass{fragment}{ = \lim_{\eps\to0} \sup_{\norm{\v\delta} \leq \eps\norm{\v{b}}} \mathrlap{\frac{\norm{\m{A}^{-1}(\v{b} + \v\delta) - \m{A}^{-1}\v{b}}_2}{\eps\norm{\m{A}^{-1}\v{b}}_2}} } \htmlClass{fragment}{ \ubr{\phantom{\frac{\norm{\m{A}^{-1}(\v{b} + \v\delta) - \m{A}^{-1}\v{b}}_2}{\eps\norm{\m{A}^{-1}\v{b}}_2}}}{\t{stability to perturbation by } \v\delta} } \htmlClass{fragment}{ = \frac{\lambda_{\max}(\m{A})}{\lambda_{\min}(\m{A})} } $$

$\lambda_{\min},\lambda_{\max}$: smallest and largest eigenvalues of $\m{A}$
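
As a sanity check (illustrative, not from the talk), the spectral definition can be compared against NumPy's built-in condition number for a symmetric positive definite kernel matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, size=200))
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.2) ** 2) + 1e-6 * np.eye(200)

eigvals = np.linalg.eigvalsh(K)        # ascending eigenvalues
print(eigvals[-1] / eigvals[0])        # lambda_max / lambda_min
print(np.linalg.cond(K))               # 2-norm condition number, computed via SVD
```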

Condition Numbers of Kernel Matrices

Are kernel matrices always numerically stable? No.

One-dimensional time series on a grid: Kac–Murdock–Szegö matrix $$ \m{K}_{\v{x}\v{x}} = \scriptsize\begin{pmatrix} 1 & \rho &\rho^2 &\dots & \rho^{N-1} \\ \rho & 1 & \rho & \dots & \rho^{N-2} \\ \vdots & \vdots &\ddots & \ddots & \vdots \\ \rho^{N-1} & \rho^{N-2} & \rho^{N-3} & \dots & 1 \end{pmatrix} $$ for which $\frac{1+ 2\rho - 2\rho\eps + \rho^2}{1 - 2\rho + 2\rho\eps + \rho^2} \leq \f{cond}(\m{K}_{\v{x}\v{x}}) \leq \frac{(1+\rho)^2}{(1-\rho)^2}$, where $\eps = \frac{\pi^2}{N+1}$.
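
A short numerical illustration (assumed, not from the talk): building the Kac–Murdock–Szegö matrix directly and comparing its condition number to the $(1+\rho)^2/(1-\rho)^2$ upper bound as $\rho \to 1$.

```python
import numpy as np

def kms(rho, n):
    # K_ij = rho^|i - j|: the covariance of a gridded AR(1) / Markov process
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

for rho in [0.5, 0.9, 0.99]:
    K = kms(rho, 500)
    print(rho, np.linalg.cond(K), (1 + rho) ** 2 / (1 - rho) ** 2)
```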

Condition Numbers of Kernel Matrices

Problem: too much correlation $\leadsto$ points too close together

Idea: limit maximum correlation $\leadsto$ gain performance?

Minimum Separation

Separation: minimum distance between distinct $z_i$ and $z_j$

Proposition. Assuming spatial decay and stationarity, separation controls $\f{cond}(\m{K}_{\v{z}\v{z}})$ uniformly in $M$.

Approach: use this to select numerically stable inducing points
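
The paper's cover-tree construction is more involved, but a greedy one-dimensional sketch (all choices below are assumptions) already shows the effect of enforcing a minimum separation on the condition number.

```python
import numpy as np

def rbf(a, b, ell=0.1):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, size=300))

# Greedily keep points at least `sep` apart (x is sorted, so a single pass works)
sep = 0.1
z = [x[0]]
for xi in x[1:]:
    if xi - z[-1] >= sep:
        z.append(xi)
z = np.array(z)

print(np.linalg.cond(rbf(x, x) + 1e-8 * np.eye(x.size)))   # huge: points nearly coincide
print(np.linalg.cond(rbf(z, z) + 1e-8 * np.eye(z.size)))   # modest: separation bounds it
```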

It turns out: stability isn't needed

There's a different way to gain performance

Sampling from Gaussian Process Posteriors

using Stochastic Gradient Descent

Jihao Andreas Lin,* Javier Antorán,* Shreyas Padhy,*

David Janz, José Miguel Hernández-Lobato, Alexander Terenin

*equal contribution

NeurIPS 2023 Oral

Have you tried stochastic gradient descent?

Conventional wisdom in deep learning:

  • SGD variants are empirically often the best optimization algorithms
  • Adam is extremely effective, even on non-convex problems
  • Minibatch-based training is a critical part of scalability

Why not try it out for Gaussian process posterior sample paths?

Gaussian Process Posteriors via Randomized Optimization Objectives

Split into posterior mean and uncertainty reduction terms $$ \begin{aligned} \htmlData{class=fragment,fragment-index=1}{ (f\given\v{y})(\.) } & \htmlData{class=fragment,fragment-index=1}{ = f(\.) + \m{K}_{(\.)\v{x}}(\m{K}_{\v{x}\v{x}} + \m\Sigma)^{-1}(\v{y} - f(\v{x}) - \v\eps) } \\ & \htmlData{class=fragment,fragment-index=2}{ = f(\.) + \sum_{i=1}^N v_i^* k(x_i,\.) + \sum_{i=1}^N \alpha_i^* k(x_i,\.) } \end{aligned} $$ where $$ \begin{aligned} \htmlData{class=fragment,fragment-index=4}{ \v{v}^* } & \htmlData{class=fragment,fragment-index=4}{ = \argmin_{\v{v} \in \R^N} \sum_{i=1}^N \frac{(y_i - \m{K}_{x_i\v{x}}\v{v})^2}{\Sigma_{ii}} + \v{v}^T\m{K}_{\v{x}\v{x}}\v{v} } \\ \htmlData{class=fragment,fragment-index=5}{ \v\alpha^* } & \htmlData{class=fragment,fragment-index=5}{ = \argmin_{\v\alpha \in \R^N} \sum_{i=1}^N \frac{(f(x_i) + \eps_i - \m{K}_{x_i\v{x}}\v\alpha)^2}{\Sigma_{ii}} + \v\alpha^T\m{K}_{\v{x}\v{x}}\v\alpha . } \end{aligned} $$

Gaussian Process Posteriors via Randomized Optimization Objectives

$$ \begin{aligned} \v{v}^* &= \argmin_{\v{v} \in \R^N} \sum_{i=1}^N \frac{(y_i - \m{K}_{x_i\v{x}}\v{v})^2}{\Sigma_{ii}} + \v{v}^T\m{K}_{\v{x}\v{x}}\v{v} \\ \v\alpha^* &= \argmin_{\v\alpha \in \R^N} \sum_{i=1}^N \frac{(f(x_i) + \eps_i - \m{K}_{x_i\v{x}}\v\alpha)^2}{\Sigma_{ii}} + \v\alpha^T\m{K}_{\v{x}\v{x}}\v\alpha . \end{aligned} $$

First term: apply mini-batch estimation

Second term: evaluate stochastically via random Fourier features

Variance reduction trick: shift $\eps_i$ into regularizer

Use stochastic gradient descent with Polyak averaging
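
A minimal sketch of the posterior-mean part of this recipe, assuming an RBF kernel, plain SGD with an untuned fixed learning rate, and Polyak averaging; the uncertainty-reduction weights $\v\alpha$ would be fit the same way with targets $f(x_i)+\eps_i$, and the variance-reduction trick is omitted. Everything below (sizes, hyperparameters, function names) is illustrative rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, ell, noise_var = 200, 0.5, 1.0
x = rng.uniform(-3, 3, size=N)
y = np.sin(2 * x) + np.sqrt(noise_var) * rng.standard_normal(N)

def k(a, b):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def rff(a, n_features=200):
    # phi(a) phi(a)^T approximates the RBF kernel matrix in expectation
    w = rng.standard_normal(n_features) / ell
    b = rng.uniform(0, 2 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(a[:, None] * w[None, :] + b)

v, v_avg = np.zeros(N), np.zeros(N)
lr, batch = 1e-4, 20
for t in range(1, 5001):
    idx = rng.choice(N, size=batch, replace=False)
    resid = y[idx] - k(x[idx], x) @ v                       # mini-batch residuals
    grad_data = -(N / batch) * k(x, x[idx]) @ (resid / noise_var)
    phi = rff(x)                                            # fresh features each step
    grad_reg = phi @ (phi.T @ v)                            # stochastic estimate of K_xx v
    v -= lr * 2 * (grad_data + grad_reg)
    v_avg += (v - v_avg) / t                                # Polyak (running) average

def posterior_mean(s):
    return k(s, x) @ v_avg                                  # sum_i v_i k(x_i, s)
```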

Stochastic Gradient Descent

It works better than conjugate gradients on test data? Wait, what?

What happens in one dimension?

Performance depends on data-generation asymptotics

What happens in one dimension?

SGD does not converge to the correct solution,

but still produces reasonable error bars

$\leadsto$ implicit bias?

Convergence: Euclidean and RKHS Norms

No real convergence in representer weight space

or in the reproducing kernel Hilbert space

Good test performance

Convergence: Euclidean and RKHS Norms

Performance not significantly affected by noise

Unstable optimization problem $\leadsto$ benign non-convergence

$\leadsto$ implicit bias

Where does SGD's implicit bias affect predictions?

Error seems to concentrate away from data, but not too far away?

Where does SGD's implicit bias affect predictions?

Interpolation region

Where does SGD's implicit bias affect predictions?

Extrapolation region

Where does SGD's implicit bias affect predictions?

Prior region

The Prior Region

$$ (f\given\v{y})(\.) = f(\.) + \sum_{i=1}^N v_i k(x_i,\.) $$

Kernel decays in space $\leadsto$ predictions revert to prior

The Interpolation Region

Low approximation error where data is dense

The Interpolation Region

Idea: maybe SGD (a) converges fast on a subspace, and (b) obtains something arbitrary but benign on the rest of the space?

Let $\m{K}_{\v{x}\v{x}} = \m{U}\m\Lambda\m{U}^T$ be the eigendecomposition of the kernel matrix. Define the spectral basis functions $$ u^{(i)}(\.) = \sum_{j=1}^N \frac{U_{ji}}{\sqrt{\lambda_i}} k(x_j, \.) . $$
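
Concretely (an illustrative sketch; the kernel, jitter, and evaluation grid are assumptions), the spectral basis functions can be evaluated on a grid directly from the eigendecomposition. Plotting a few columns of `u` should reproduce the behavior on the next slides: large-eigenvalue functions concentrating on the data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, size=100))
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.2) ** 2) + 1e-8 * np.eye(100)

lam, U = np.linalg.eigh(K)            # eigenvalues in ascending order
lam, U = lam[::-1], U[:, ::-1]        # reorder so column 0 has the largest eigenvalue

s = np.linspace(-2, 2, 400)           # evaluation grid
K_sx = np.exp(-0.5 * ((s[:, None] - x[None, :]) / 0.2) ** 2)
u = K_sx @ (U / np.sqrt(lam))         # u[:, i] = u^(i)(s) = sum_j (U_ji / sqrt(lam_i)) k(x_j, s)
```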

The Interpolation Region

Large-eigenvalue spectral basis functions concentrate on data

The Interpolation Region

Proposition. With probability $1-\delta$, we have $$ \norm{\text{proj}_{u^{(i)}} \mu_{f\given\v{y}} - \text{proj}_{u^{(i)}} \mu_{\f{SGD}}}_{H_k} \leq \frac{1}{\sqrt{\lambda_i t}}\del{\frac{\norm{\v{y}}_2}{\eta\sigma^2} + G\sqrt{2\eta\sigma^2 \log\frac{N}{\delta}}} . $$

$\eta$: learning rate

$\sigma^2$: noise variance

$\lambda_i$: kernel matrix eigenvalues

$G$: gradient's sub-Gaussian coefficient

SGD converges fast with respect to top spectral basis functions

The Interpolation Region

Where do the top spectral basis functions concentrate?

  • Idea: lift Courant–Fischer eigenvector characterization to the RKHS

Proposition. The spectral basis functions can be written $$ u^{(i)}(\.) \htmlClass{fragment}{ = \argmax_{ \htmlClass{fragment}{ u \in H_k } } \cbr{ \htmlClass{fragment}{ \sum_{m=1}^N u(x_m)^2 } \htmlClass{fragment}{ : } \begin{aligned} & \htmlClass{fragment}{ \norm{u}_{H_k} = 1 } \\ & \htmlClass{fragment}{ \langle u,u^{(j)}\rangle_{H_k} = 0 } \htmlClass{fragment}{, \forall j \lt i } \end{aligned} }. } $$

Spectral basis functions concentrate on the data as much as possible, while remaining orthogonal to those with larger eigenvalues
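
A quick numerical check of this picture, self-contained and using the same assumed kernel as the sketch above: with coefficients $U_{ji}/\sqrt{\lambda_i}$, each $u^{(i)}$ has unit RKHS norm, and its mass on the data, $\sum_m u^{(i)}(x_m)^2$, equals $\lambda_i$, so the top functions indeed carry the most mass on the data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, size=100))
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.2) ** 2) + 1e-8 * np.eye(100)  # jitter for safety

lam, U = np.linalg.eigh(K)
C = U / np.sqrt(lam)                                   # column i: coefficients of u^(i)
print(np.allclose(np.diag(C.T @ K @ C), 1.0))          # unit RKHS norm for every u^(i)
print(np.allclose(((K @ C) ** 2).sum(axis=0), lam))    # sum_m u^(i)(x_m)^2 = lam_i
```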

The Extrapolation Region

High error: between the interpolation and prior regions, where the small-eigenvalue spectral basis functions concentrate

SGD's implicit bias for Gaussian processes

Implicit bias: SGD converges quickly near the data, and causes no harm far from the data

Error concentrates in regions (a) without much data, which also (b) aren't located too far from the data:

  • Lack of data $\leadsto$ predictions are mostly arbitrary
  • Empirically: functions shrink to prior faster than exact posterior

Benign non-convergence $\leadsto$ robustness to instability

Performance

Conjugate gradients: non-monotonic test error, in spite of monotonic convergence

SGD: almost always monotonic decrease in test error

Performance

Strong predictive performance at sufficient scale

Parallel Thompson Sampling: Stochastic Gradient Descent

Uncertainty: strong decision-making performance

Parallel Thompson Sampling: Stochastic Dual Descent

Uncertainty: strong decision-making performance

Reflections

Numerical analysis conventional wisdom:

  • Don't run gradient descent on quadratic objectives
    • It's slow; conjugate gradients work much better
  • If CG is slow, then your problem is unstable
    • Unstable problems are ill-posed: you should reformulate

This work: a very different way of looking at things

  • Don't insist on solving the linear system accurately if the solution isn't inherently needed
  • Instead of a well-posed problem, a well-posed subproblem might be good enough
  • Try SGD! It might work well, including for counterintuitive reasons
  • A kernel matrix's eigenvectors carry information about data-density
    • To see this, adopt a function-analytic view given by the spectral basis functions

“When solving a given problem, try to avoid solving a more general problem as an intermediate step.” — Vladimir Vapnik, The Nature of Statistical Learning Theory

Thank you!

https://avt.im/ · @avt_im

J. A. Lin,* S. Padhy,* J. Antorán,* A. Tripp, A. Terenin, Cs. Szepesvári, J. M. Hernández-Lobato, D. Janz. Stochastic Gradient Descent for Gaussian Processes Done Right. International Conference on Learning Representations, 2024.

J. A. Lin,* J. Antorán,* S. Padhy,* D. Janz, J. M. Hernández-Lobato, A. Terenin. Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent. Advances in Neural Information Processing Systems, 2023. Selected for Oral Presentation.

A. Terenin,* D. R. Burt,* A. Artemev, S. Flaxman, M. van der Wilk, C. E. Rasmussen, H. Ge. Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees. Journal of Machine Learning Research, 2024.

J. T. Wilson,* V. Borovitskiy,* P. Mostowsky,* A. Terenin,* M. P. Deisenroth. Efficiently Sampling Functions from Gaussian Process Posteriors. International Conference on Machine Learning, 2020. Honorable Mention for Outstanding Paper Award.

J. T. Wilson,* V. Borovitskiy,* P. Mostowsky,* A. Terenin,* M. P. Deisenroth. Pathwise Conditioning of Gaussian Processes. Journal of Machine Learning Research, 2021.

*Equal contribution