CBL Research Talk, University of Cambridge

Understanding and Scaling Up

Sparse Gaussian Processes

Alexander Terenin

https://avt.im/ · @avt_im

Gaussian Processes

Bayesian Optimization

Goal: minimize unknown function $\phi$ in as few evaluations as possible.

  1. Build posterior $f \given \v{y}$ using data $(x_1,\phi(x_1)),..,(x_n,\phi(x_n))$.
  2. Choose $$ \htmlClass{fragment}{ x_{n+1} = \argmin_{x\in\c{X}} \alpha_n(x) } $$ by minimizing an acquisition function $\alpha_n$, for instance a Thompson sample $\alpha_n \~ f\given\v{y}$ (sketched below).

Automatic explore-exploit tradeoff.
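
A minimal sketch of this loop in NumPy, assuming an RBF kernel, a fixed candidate grid, and a toy objective `phi` (all illustrative choices, not the setup from the talk): each iteration draws one posterior sample on the grid and evaluates $\phi$ at its minimizer.

```python
import numpy as np

def rbf(a, b, lengthscale=0.2):
    """Squared-exponential kernel matrix between 1D point sets a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def thompson_step(x, y, grid, noise=1e-4):
    """Draw one posterior sample on the candidate grid and return its minimizer."""
    K_xx = rbf(x, x) + noise * np.eye(len(x))
    K_gx = rbf(grid, x)
    mean = K_gx @ np.linalg.solve(K_xx, y)
    cov = rbf(grid, grid) - K_gx @ np.linalg.solve(K_xx, K_gx.T)
    sample = np.random.multivariate_normal(mean, cov + 1e-8 * np.eye(len(grid)))
    return grid[np.argmin(sample)]                  # acquisition alpha_n ~ f | y

# hypothetical black-box objective and candidate grid
phi = lambda x: np.sin(3 * x) + 0.5 * x
grid = np.linspace(0, 2, 200)
x = np.random.uniform(0, 2, size=3)
y = phi(x)
for _ in range(10):
    x_next = thompson_step(x, y, grid)
    x, y = np.append(x, x_next), np.append(y, phi(x_next))
print("best point found:", x[np.argmin(y)])
```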

Bayesian Optimization

Modeling Dynamical Systems with Uncertainty

$$ \htmlData{fragment-index=0,class=fragment}{ x_0 } \qquad \htmlData{fragment-index=1,class=fragment}{ x_1 = x_0 + f(x_0)\Delta t } \qquad \htmlData{fragment-index=2,class=fragment}{ x_2 = x_1 + f(x_1)\Delta t } \qquad \htmlData{fragment-index=3,class=fragment}{ .. } $$
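
As a toy illustration (not from the talk), the sketch below propagates uncertainty by running the Euler recursion above once per sampled drift; `sample_drift` is a hypothetical stand-in for drawing $f$ from a Gaussian process posterior.

```python
import numpy as np

def euler_rollout(sample_drift, x0, dt=0.1, steps=50):
    """Roll out x_{k+1} = x_k + f(x_k) * dt for one sampled drift function f."""
    x = np.empty(steps + 1)
    x[0] = x0
    f = sample_drift()                 # one function sample, reused along the whole path
    for k in range(steps):
        x[k + 1] = x[k] + f(x[k]) * dt
    return x

# stand-in for drawing f ~ f | y: a nominal drift plus a random offset
sample_drift = lambda: (lambda x, a=np.random.randn() * 0.1: -np.sin(x) + a)
trajectories = np.stack([euler_rollout(sample_drift, x0=1.0) for _ in range(20)])
print("state uncertainty at final time:", trajectories[:, -1].std())
```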

Geospatial Learning

Large-scale probabilistic interpolation
 

Numerically Stable Sparse Gaussian Processes

via Minimum Separation using Cover Trees

Alexander Terenin, David R. Burt, Artem Artemev,

Seth Flaxman, Mark van der Wilk, Carl Rasmussen, and Hong Ge

Sparse Gaussian Processes

 via Pathwise Conditioning

prior term + data term = posterior Gaussian process

$$ \htmlData{class=fragment,fragment-index=2} { (f\given\v{y})(\.) = } \htmlData{class=fragment,fragment-index=0} { f(\.) } \htmlData{class=fragment,fragment-index=1} { + \m{K}_{(\.)\v{x}} (\m{K}_{\v{x}\v{x}} + \m\Sigma)^{-1} (\v{y} - f(\v{x}) - \v\eps) } $$
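
A sketch of this decomposition, assuming an RBF kernel and drawing the prior term $f(\.)$ jointly on the training and test inputs via a Cholesky factor; all names and settings are illustrative.

```python
import numpy as np

def rbf(a, b, lengthscale=0.5):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def pathwise_posterior_sample(x, y, x_test, noise=0.1):
    """(f|y)(.) = f(.) + K_(.)x (K_xx + Sigma)^{-1} (y - f(x) - eps)."""
    joint = np.concatenate([x, x_test])
    K = rbf(joint, joint) + 1e-8 * np.eye(len(joint))
    f_joint = np.linalg.cholesky(K) @ np.random.randn(len(joint))   # prior sample f(.)
    f_x, f_test = f_joint[:len(x)], f_joint[len(x):]
    eps = np.sqrt(noise) * np.random.randn(len(x))                  # eps ~ N(0, Sigma)
    K_xx = rbf(x, x) + noise * np.eye(len(x))                       # K_xx + Sigma
    v = np.linalg.solve(K_xx, y - f_x - eps)                        # data-term weights
    return f_test + rbf(x_test, x) @ v

x = np.random.uniform(-3, 3, size=20)
y = np.sin(x) + 0.3 * np.random.randn(20)
x_test = np.linspace(-3, 3, 100)
samples = np.stack([pathwise_posterior_sample(x, y, x_test) for _ in range(5)])
```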

Sparse Gaussian Processes

$$ (f\given\v{y})(\.) = f(\.) + \sum_{i=1}^N v_i^{(\v{y})} k(x_i,\.) $$

$$ (f\given\v{y})(\.) \approx f(\.) + \sum_{j=1}^M v_j^{(\v{u})} k(z_j,\.) $$

Sparse Gaussian Processes

$$ \begin{gathered} (f\given\v{y})(\.) \approx f(\.) + \sum_{j=1}^M v_j^{(\v{u})} k(z_j,\.) \\ \htmlClass{fragment}{ \v{v}^{(\v{u})} = \m{K}_{\v{z}\v{z}}^{-1}(\v{u} - f(\v{z})) } \qquad \htmlClass{fragment}{ \v{u} \~\f{N}(\v{m},\m{S}) } \end{gathered} $$

Optimal $\v{m}, \m{S}$: closed-form expression

Efficient and works well in many settings
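
A sketch of the sparse pathwise update, assuming the variational parameters $\v{m}, \m{S}$ have already been fit elsewhere and are simply passed in; only the sampling step $f(\.) + \m{K}_{(\.)\v{z}}\m{K}_{\v{z}\v{z}}^{-1}(\v{u} - f(\v{z}))$ with $\v{u} \~ \f{N}(\v{m},\m{S})$ is shown.

```python
import numpy as np

def rbf(a, b, lengthscale=0.5):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def sparse_pathwise_sample(z, m, S, x_test, jitter=1e-6):
    """f(.) + sum_j v_j^(u) k(z_j, .), with v^(u) = K_zz^{-1}(u - f(z)), u ~ N(m, S)."""
    joint = np.concatenate([z, x_test])
    K = rbf(joint, joint) + jitter * np.eye(len(joint))
    f_joint = np.linalg.cholesky(K) @ np.random.randn(len(joint))      # prior sample f(.)
    f_z, f_test = f_joint[:len(z)], f_joint[len(z):]
    u = m + np.linalg.cholesky(S + jitter * np.eye(len(z))) @ np.random.randn(len(z))
    v = np.linalg.solve(rbf(z, z) + jitter * np.eye(len(z)), u - f_z)  # K_zz^{-1}(u - f(z))
    return f_test + rbf(x_test, z) @ v

# illustrative variational parameters (in practice these come from training)
z = np.linspace(-3, 3, 10)
m, S = np.sin(z), 0.05 * np.eye(10)
x_test = np.linspace(-2.9, 2.9, 200)
samples = np.stack([sparse_pathwise_sample(z, m, S, x_test) for _ in range(5)])
```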

Numerical Stability

Condition number: quantifies the difficulty of accurately computing $\m{A}^{-1}\v{b}$

$$ \htmlClass{fragment}{ \f{cond}(\m{A}) } \htmlClass{fragment}{ = \lim_{\eps\to0} \sup_{\norm{\v\delta} \leq \eps\norm{\v{b}}} \frac{\norm{\m{A}^{-1}(\v{b} + \v\delta) - \m{A}^{-1}\v{b}}_2}{\eps\norm{\m{A}^{-1}\v{b}}_2} } \htmlClass{fragment}{ = \frac{\lambda_{\max}(\m{A})}{\lambda_{\min}(\m{A})} } $$

$\lambda_{\min},\lambda_{\max}$: smallest and largest eigenvalues (for symmetric positive definite $\m{A}$)
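
For instance (illustrative numbers only), the eigenvalue ratio can be checked directly against NumPy's built-in condition number for a small RBF kernel matrix:

```python
import numpy as np

# RBF kernel matrix on 10 evenly spaced points: symmetric positive definite
x = np.linspace(0, 1, 10)
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.2) ** 2)

eig = np.linalg.eigvalsh(K)                        # real eigenvalues, ascending order
print("lambda_max / lambda_min:", eig[-1] / eig[0])
print("np.linalg.cond (2-norm):", np.linalg.cond(K))
```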

Cholesky Factorization

Result. Let $\m{A}$ be a symmetric positive definite matrix of size $N \x N$, where $N \gt 10$. Assume that $$ \f{cond}(\m{A}) \leq \frac{1}{2^{-t} \x 3.9 N^{3/2}} $$ where $t$ is the length of the floating point mantissa, and that $3N2^{-t} \lt 0.1$. Then floating point Cholesky factorization will succeed, producing a matrix $\m{L}$ satisfying $$ \htmlClass{fragment}{ \m{L}\m{L}^T = \m{A} + \m{E} } \qquad\qquad \htmlClass{fragment}{ \norm{\m{E}}_2 \leq 2^{-t} \x 1.38 N^{3/2} \norm{\m{A}}_2 . } $$
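
In practice this failure mode is often handled by retrying with added diagonal jitter; a minimal sketch (not the paper's procedure) on a nearly singular kernel matrix:

```python
import numpy as np

def chol_with_jitter(A, jitters=(0.0, 1e-8, 1e-6, 1e-4)):
    """Attempt Cholesky, adding increasing diagonal jitter on failure."""
    for j in jitters:
        try:
            return np.linalg.cholesky(A + j * np.eye(len(A))), j
        except np.linalg.LinAlgError:
            continue
    raise np.linalg.LinAlgError("Cholesky failed even with maximum jitter")

# nearly singular kernel matrix: two almost-identical input points
x = np.array([0.0, 1e-9, 1.0])
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
L, used_jitter = chol_with_jitter(K)
print("jitter needed:", used_jitter)
```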

Conjugate Gradients

Refinement of gradient descent for computing $\m{A}^{-1}\v{b}$

Convergence rate depends on $\f{cond}(\m{A})$
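
A bare-bones CG solver (a standard textbook sketch, not the implementation from the talk) that makes the dependence on conditioning visible by counting iterations; the kernel and noise levels below are illustrative.

```python
import numpy as np

def conjugate_gradients(A, b, tol=1e-6, max_iter=None):
    """Solve A x = b for symmetric positive definite A; iterations grow with cond(A)."""
    n = len(b)
    max_iter = max_iter or n
    x = np.zeros(n)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for i in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            return x, i + 1
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iter

x_pts = np.linspace(0, 1, 200)
K = np.exp(-0.5 * ((x_pts[:, None] - x_pts[None, :]) / 0.5) ** 2)
b = np.sin(3 * x_pts)
for noise in (1e-2, 1e-6):                      # larger noise => smaller cond(A)
    A = K + noise * np.eye(200)
    _, iters = conjugate_gradients(A, b)
    print(f"noise {noise:g}: cond(A) = {np.linalg.cond(A):.1e}, CG iterations = {iters}")
```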

Condition Numbers of Kernel Matrices

Are kernel matrices always well-conditioned? No.

The Kac–Murdock–Szegö matrix $$ \m{K}_{\v{x}\v{x}} = \scriptsize\begin{pmatrix} 1 & \rho &\rho^2 &\dots & \rho^{N-1} \\ \rho & 1 & \rho & \dots & \rho^{N-2} \\ \vdots & \vdots &\ddots & \ddots & \vdots \\ \rho^{N-1} & \rho^{N-2} & \rho^{N-3} & \dots & 1 \end{pmatrix} $$ satisfies $\frac{1 + 2\rho - 2\rho\eps + \rho^2}{1 - 2\rho + 2\rho\eps + \rho^2} \leq \f{cond}(\m{K}_{\v{x}\v{x}}) \leq \frac{(1+\rho)^2}{(1-\rho)^2}$, where $\eps = \frac{\pi^2}{N+1}$.
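
The upper bound is easy to check numerically; a small sketch constructing the matrix and comparing against the closed-form expression:

```python
import numpy as np

def kms_matrix(N, rho):
    """Kac-Murdock-Szego matrix: entries rho^|i - j|."""
    idx = np.arange(N)
    return rho ** np.abs(idx[:, None] - idx[None, :])

for rho in (0.5, 0.9, 0.99):
    K = kms_matrix(100, rho)
    upper = (1 + rho) ** 2 / (1 - rho) ** 2
    print(f"rho = {rho}: cond = {np.linalg.cond(K):.1f}, upper bound = {upper:.1f}")
```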

Condition Numbers of Kernel Matrices

Problem: points placed too close together $\leadsto$ too much correlation $\leadsto$ ill-conditioning

Minimum Separation

Separation: minimum distance between distinct inducing points $z_i$ and $z_j$

Proposition. Assuming spatial decay and stationarity, separation controls $\f{cond}(\m{K}_{\v{z}\v{z}})$ uniformly in $M$.

Idea: use this to select numerically stable inducing points
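
An illustrative check (an RBF kernel and synthetic 1D points, not the talk's experiments): for the same number of inducing points, larger minimum separation gives a much smaller condition number.

```python
import numpy as np

def rbf(a, b, lengthscale=0.2):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def min_separation(z):
    """Smallest distance between distinct points in z."""
    d = np.abs(z[:, None] - z[None, :])
    return d[d > 0].min()

z_clumped = np.random.uniform(0, 0.1, size=30)     # 30 points crammed into [0, 0.1]
z_spread = np.linspace(0, 1, 30)                   # 30 well-separated points
for name, z in [("clumped", z_clumped), ("spread ", z_spread)]:
    print(name, "separation:", f"{min_separation(z):.4f}",
          " cond(K_zz):", f"{np.linalg.cond(rbf(z, z)):.2e}")
```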

Inducing Point Selection via Cover Trees

Spatial resolution: maximum distance from any data point $x_i$ to its nearest inducing point $z_j$

Cover tree algorithm: guarantees both spatial resolution and separation
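
The paper's construction uses cover trees; as a conceptual stand-in only, the greedy $\eps$-net sketch below enforces the same two properties (no two selected points closer than $\eps$, every data point within $\eps$ of a selected point), though without the cover tree's complexity guarantees.

```python
import numpy as np

def greedy_epsilon_net(x, eps):
    """Greedy eps-net: selected points are pairwise >= eps apart (separation),
    and every data point lies within eps of some selected point (resolution)."""
    z = []
    for xi in x:
        if all(np.linalg.norm(xi - zj) >= eps for zj in z):
            z.append(xi)
    return np.array(z)

x = np.random.uniform(0, 1, size=(2000, 2))
for eps in (0.09, 0.06, 0.03):
    print(f"eps = {eps}: M = {len(greedy_epsilon_net(x, eps))} inducing points")
```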

The Clustered-data Approximation

$$ \htmlData{class=fragment,fragment-index=0}{ \f{cl}(x) = \argmin_{z_j : j=1,..,M} \norm{x - z_j} } \qquad \htmlData{class=fragment,fragment-index=2}{ N_{\f{cl}}(z) = |\{x_i \in \v{x} : \f{cl}(x_i) = z\}| } $$

Replace large dataset with re-weighted sparser dataset
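
A sketch of the cluster assignments for Euclidean inputs, with $\f{cl}$ represented by the index of the nearest inducing point; the data below is hypothetical.

```python
import numpy as np

def cluster_assignments(x, z):
    """Return, for each x_i, the index of its nearest z_j, and the cluster sizes N_cl."""
    d = np.linalg.norm(x[:, None, :] - z[None, :, :], axis=-1)   # (N, M) pairwise distances
    cl = d.argmin(axis=1)                                        # cl(x_i) as an index into z
    counts = np.bincount(cl, minlength=len(z))                   # N_cl(z_j) for each j
    return cl, counts

x = np.random.uniform(0, 1, size=(1000, 2))
z = np.random.uniform(0, 1, size=(50, 2))
cl, counts = cluster_assignments(x, z)
print("largest cluster size:", counts.max(), " empty clusters:", (counts == 0).sum())
```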

The Clustered-data Approximation

$$ \htmlData{class=fragment,fragment-index=0}{ u_j = \frac{1}{N_{\f{cl}}(z_j)}\sum_{\f{cl}(x_i) = z_j} y_i } \qquad \begin{aligned} \htmlData{class=fragment,fragment-index=1}{ y_i \given f } & \htmlData{class=fragment,fragment-index=1}{ \~\f{N}(f(\f{cl}(x_i)),\sigma^2) } \\ \htmlData{class=fragment,fragment-index=2}{ u_i \given f } & \htmlData{class=fragment,fragment-index=2}{ \~\f{N}\del{f(z_i), \tfrac{\sigma^2}{N_{\f{cl}}(z_i)}} } \end{aligned} $$ Then $(f\given\v{u})(\.) = (f\given\v{y})(\.)$ in distribution, and scales with $M \ll N$
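
Continuing the sketch (hypothetical assignments and observations), the cluster-averaged targets and the heteroscedastic noise matrix $\m\Lambda$ can be formed with a couple of bincounts; this assumes every cluster is non-empty.

```python
import numpy as np

N, M, sigma2 = 1000, 50, 0.1
cl = np.random.randint(0, M, size=N)          # stand-in for nearest-inducing-point indices
y = np.random.randn(N)                        # stand-in observations

counts = np.bincount(cl, minlength=M)                         # N_cl(z_j), assumed > 0
u = np.bincount(cl, weights=y, minlength=M) / counts          # u_j: cluster-averaged targets
Lam = np.diag(sigma2 / counts)                                # Lambda_jj = sigma^2 / N_cl(z_j)
```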

The Clustered-data Approximation

$$ \htmlClass{fragment}{ (f\given\v{y})(\.) = f(\.) + \m{K}_{(\.)\v{z}} (\m{K}_{\v{z}\v{z}} + \m\Lambda)^{-1} (\v{u} - f(\v{z}) - \v\epsilon) } \qquad \htmlClass{fragment}{ \v\epsilon \~\f{N}(\v{0},\m\Lambda) } $$

Minimum eigenvalue: $\lambda_{\min}(\m{K}_{\v{z}\v{z}} + \m\Lambda) \geq {\displaystyle\min_{i=1,..,M}} \Lambda_{ii}$

Clustered-data approximation results in extra stability
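
A sketch of the resulting posterior mean and of the eigenvalue bound above, with synthetic inducing points, cluster sizes, and averaged targets (all illustrative):

```python
import numpy as np

def rbf(a, b, lengthscale=0.2):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

M, sigma2 = 50, 0.1
z = np.linspace(0, 1, M)                                       # inducing points
counts = np.random.randint(1, 30, size=M)                      # N_cl(z_j), all non-empty
u = np.sin(3 * z) + np.sqrt(sigma2 / counts) * np.random.randn(M)
Lam = np.diag(sigma2 / counts)

A = rbf(z, z) + Lam
# the diagonal noise term keeps the smallest eigenvalue bounded away from zero
print("lambda_min:", np.linalg.eigvalsh(A).min(), ">= min Lambda_ii:", (sigma2 / counts).min())

x_test = np.linspace(0, 1, 200)
posterior_mean = rbf(x_test, z) @ np.linalg.solve(A, u)        # K_(.)z (K_zz + Lambda)^{-1} u
```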

Empirical Accuracy and Computational Cost

Spatial resolution directly controls error and computational cost

Inducing Point Selection Comparison

Good performance-stability tradeoff in geospatial settings

Geospatial Learning

Interpolation at three spatial resolutions: $\eps = 0.09,\, M = 902$; $\eps = 0.06,\, M = 1934$; $\eps = 0.03,\, M = 6851$

Fine-scale error and computational cost vary with spatial resolution

Approximation Error and Numerical Stability

Dashed line: increased jitter due to Cholesky failure in floating point

Slightly lower approximation accuracy, but better stability

Thank you!

https://avt.im/ · @avt_im

A. Terenin, D. R. Burt, A. Artemev, S. Flaxman, M. van der Wilk, C. E. Rasmussen, and H. Ge. Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees. arXiv:2210.07893, 2022.