$$ \htmlData{class=fragment,fragment-index=2} { (f\given\v{y})(\.) = } \htmlData{class=fragment,fragment-index=0} { f(\.) } \htmlData{class=fragment,fragment-index=1} { + \m{K}_{(\.)\v{x}} } \htmlData{class=fragment,fragment-index=1} { \mathrlap{\m{K}_{\v{x}\v{x}}^{-1}} } \htmlData{class=fragment,fragment-index=3} { \ubr{\phantom{\m{K}_{\v{x}\v{x}}^{-1}}}{\c{O}(N^3)} } \htmlData{class=fragment,fragment-index=1} { (\v{y} - f(\v{x})) } $$
$$ (f\given\v{y})(\.) = f(\.) + \sum_{i=1}^N v_i k(x_i,\.) \qquad \htmlData{class=fragment,fragment-index=5} { \v{v} = \m{K}_{\v{x}\v{x}}^{-1}(\v{y} - f(\v{x})) } $$
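For concreteness, a minimal NumPy sketch of the direct computation (the `kernel` and `prior_mean` callables and the data arrays are assumed for illustration, not part of the slides); the Cholesky factorization is the $\c{O}(N^3)$ step above:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def representer_weights(K_xx, y, f_x):
    """Direct solve for v = K_xx^{-1} (y - f(x)): the O(N^3) step."""
    chol = cho_factor(K_xx, lower=True)   # O(N^3) Cholesky factorization
    return cho_solve(chol, y - f_x)       # O(N^2) triangular solves

def posterior_mean(x_new, x_train, v, kernel, prior_mean):
    """(f | y)(x_new) = f(x_new) + sum_i v_i k(x_i, x_new)."""
    K_new_x = kernel(x_new, x_train)      # cross-covariances K_{(.)x}
    return prior_mean(x_new) + K_new_x @ v
```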
What does the convergence rate of CG depend on? The condition number.
$$ \htmlClass{fragment}{ \f{cond}(\m{A}) } \htmlClass{fragment}{ = \lim_{\eps\to0} \sup_{\norm{\v\delta} \leq \eps\norm{\v{b}}} \mathrlap{\frac{\norm{\m{A}^{-1}(\v{b} + \v\delta) - \m{A}^{-1}\v{b}}_2}{\eps\norm{\m{A}^{-1}\v{b}}_2}} } \htmlClass{fragment}{ \ubr{\phantom{\frac{\norm{\m{A}^{-1}(\v{b} + \v\delta) - \m{A}^{-1}\v{b}}_2}{\eps\norm{\m{A}^{-1}\v{b}}_2}}}{\t{stability to perturbation by } \v\delta} } \htmlClass{fragment}{ = \frac{\lambda_{\max}(\m{A})}{\lambda_{\min}(\m{A})} } $$
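For reference, the standard CG guarantee for solving $\m{A}\v{v} = \v{b}$ with symmetric positive definite $\m{A}$ makes this dependence explicit: writing $\kappa = \f{cond}(\m{A})$,
$$ \norm{\v{v}_t - \v{v}_*}_{\m{A}} \leq 2 \del{\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}}^t \norm{\v{v}_0 - \v{v}_*}_{\m{A}} $$
so the larger the condition number, the slower the geometric rate.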
Are kernel matrices always numerically stable? No.
One-dimensional time series on a grid: Kac–Murdock–Szegö matrix $$ \m{K}_{\v{x}\v{x}} = \scriptsize\begin{pmatrix} 1 & \rho &\rho^2 &\dots & \rho^{N-1} \\ \rho & 1 & \rho & \dots & \rho^{N-2} \\ \vdots & \vdots &\ddots & \ddots & \vdots \\ \rho^{N-1} & \rho^{N-2} & \rho^{N-3} & \dots & 1 \end{pmatrix} $$ for which $\frac{1 + 2\rho - 2\rho\eps + \rho^2}{1 - 2\rho + 2\rho\eps + \rho^2} \leq \f{cond}(\m{K}_{\v{x}\v{x}}) \leq \frac{(1+\rho)^2}{(1-\rho)^2}$, where $\eps = \frac{\pi^2}{N+1}$.
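A quick numerical check (a minimal NumPy sketch; the grid size and values of $\rho$ are arbitrary illustrative choices):

```python
import numpy as np

def kms_matrix(N, rho):
    """Kac-Murdock-Szego matrix: entry (i, j) equals rho^|i - j|."""
    idx = np.arange(N)
    return rho ** np.abs(idx[:, None] - idx[None, :])

for rho in (0.5, 0.9, 0.99):
    K = kms_matrix(256, rho)
    # Compare the actual condition number with the (1+rho)^2 / (1-rho)^2 upper bound.
    print(rho, np.linalg.cond(K), (1 + rho) ** 2 / (1 - rho) ** 2)
```

As $\rho \to 1$, corresponding to densely-sampled, highly-correlated observations, the condition number blows up.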
Proposition. Assuming spatial decay and stationarity of the kernel, the minimum separation between the inducing points $\v{z}$ controls $\f{cond}(\m{K}_{\v{z}\v{z}})$ uniformly in the number of inducing points $M$.
Conventional wisdom in deep learning:
Split into posterior mean and uncertainty reduction terms: $$ \begin{aligned} \htmlData{class=fragment,fragment-index=1}{ (f\given\v{y})(\.) } & \htmlData{class=fragment,fragment-index=1}{ = f(\.) + \m{K}_{(\.)\v{x}}(\m{K}_{\v{x}\v{x}} + \m\Sigma)^{-1}(\v{y} - f(\v{x}) - \v\eps) } \\ & \htmlData{class=fragment,fragment-index=2}{ = f(\.) + \sum_{i=1}^N v_i^* k(x_i,\.) - \sum_{i=1}^N \alpha_i^* k(x_i,\.) } \end{aligned} $$ where $$ \begin{aligned} \htmlData{class=fragment,fragment-index=4}{ \v{v}^* } & \htmlData{class=fragment,fragment-index=4}{ = \argmin_{\v{v} \in \R^N} \sum_{i=1}^N \frac{(y_i - \m{K}_{x_i\v{x}}\v{v})^2}{\Sigma_{ii}} + \v{v}^T\m{K}_{\v{x}\v{x}}\v{v} } \\ \htmlData{class=fragment,fragment-index=5}{ \v\alpha^* } & \htmlData{class=fragment,fragment-index=5}{ = \argmin_{\v\alpha \in \R^N} \sum_{i=1}^N \frac{(f(x_i) + \eps_i - \m{K}_{x_i\v{x}}\v\alpha)^2}{\Sigma_{ii}} + \v\alpha^T\m{K}_{\v{x}\v{x}}\v\alpha . } \end{aligned} $$
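To see that these quadratic problems recover the exact representer weights, set the gradient of the first objective to zero (assuming $\m{K}_{\v{x}\v{x}}$ is invertible):
$$ \m{K}_{\v{x}\v{x}} \m\Sigma^{-1} \del{\m{K}_{\v{x}\v{x}} \v{v}^* - \v{y}} + \m{K}_{\v{x}\v{x}} \v{v}^* = \v{0} \implies \v{v}^* = \del{\m{K}_{\v{x}\v{x}} + \m\Sigma}^{-1} \v{y} $$
and analogously $\v\alpha^* = \del{\m{K}_{\v{x}\v{x}} + \m\Sigma}^{-1} \del{f(\v{x}) + \v\eps}$, so that $\v{v}^* - \v\alpha^* = \del{\m{K}_{\v{x}\v{x}} + \m\Sigma}^{-1} \del{\v{y} - f(\v{x}) - \v\eps}$ matches the first line exactly.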
$$ (f\given\v{y})(\.) = f(\.) + \sum_{i=1}^N v_i k(x_i,\.) $$
Idea: maybe SGD (a) converges fast on a subspace, and (b) obtains something arbitrary but benign on the rest of the space?
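As a concrete instance of this idea, a minimal NumPy sketch of SGD on the objective for $\v{v}$ above (homoscedastic noise, constant step size, uniform minibatches, and iterate averaging are illustrative choices here, not the exact algorithms of the papers listed at the end):

```python
import numpy as np

def sgd_representer_weights(K, y, noise_var, n_steps=10_000,
                            batch_size=64, lr=1e-3, seed=0):
    """SGD on  sum_i (y_i - K[i] @ v)^2 / noise_var  +  v @ K @ v.

    Both the data-fit term and the regularizer are estimated from random
    minibatches, so each step touches only O(N * batch_size) kernel entries.
    Returns the running average of the iterates.
    """
    rng = np.random.default_rng(seed)
    N = y.shape[0]
    v = np.zeros(N)
    v_avg = np.zeros(N)
    for t in range(n_steps):
        i = rng.integers(0, N, size=batch_size)   # minibatch for the data-fit term
        j = rng.integers(0, N, size=batch_size)   # minibatch for the regularizer
        grad_fit = (N / batch_size) * K[:, i] @ (2 * (K[i] @ v - y[i]) / noise_var)
        grad_reg = (N / batch_size) * 2 * K[:, j] @ v[j]
        v -= lr * (grad_fit + grad_reg)
        v_avg += (v - v_avg) / (t + 1)            # running iterate average
    return v_avg
```

The question the next slides address is which parts of $\v{v}$ such an iteration actually gets right.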
Let $\m{K}_{\v{x}\v{x}} = \m{U}\m\Lambda\m{U}^T$ be the eigendecomposition of the kernel matrix. Define the spectral basis functions $$ u^{(i)}(\.) = \sum_{j=1}^N \frac{U_{ji}}{\sqrt{\lambda_i}} k(x_j, \.) . $$
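By the reproducing property $\langle k(x_l,\.), k(x_m,\.)\rangle_{H_k} = k(x_l, x_m)$, these are orthonormal in $H_k$:
$$ \langle u^{(i)}, u^{(j)} \rangle_{H_k} = \sum_{l=1}^N \sum_{m=1}^N \frac{U_{li} U_{mj}}{\sqrt{\lambda_i \lambda_j}} k(x_l, x_m) = \frac{\del{\m{U}^T \m{K}_{\v{x}\v{x}} \m{U}}_{ij}}{\sqrt{\lambda_i \lambda_j}} = \delta_{ij} . $$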
Proposition. With probability $1-\delta$, we have $$ \norm{\f{proj}_{u^{(i)}} \mu_{f\given\v{y}} - \f{proj}_{u^{(i)}} \mu_{\f{SGD}}}_{H_k} \leq \frac{1}{\sqrt{\lambda_i t}}\del{\frac{\norm{\v{y}}_2}{\eta\sigma^2} + G\sqrt{2\eta\sigma^2 \log\frac{N}{\delta}}} . $$
Where do the top spectral basis functions concentrate?
Proposition. The spectral basis functions can be written $$ u^{(i)}(\.) \htmlClass{fragment}{ = \argmax_{ \htmlClass{fragment}{ u \in H_k } } \cbr{ \htmlClass{fragment}{ \sum_{n=1}^N u(x_n)^2 } \htmlClass{fragment}{ : } \begin{aligned} & \htmlClass{fragment}{ \norm{u}_{H_k} = 1 } \\ & \htmlClass{fragment}{ \langle u,u^{(j)}\rangle_{H_k} = 0 } \htmlClass{fragment}{, \forall j \lt i } \end{aligned} }. } $$
Spectral basis functions concentrate on the data as much as possible, while remaining orthogonal to those with larger eigenvalues
Implicit bias: SGD converges quickly near the data, and causes no harm far from the data
Error concentrates in regions (a) without much data, which also (b) aren't located too far from the data:
Benign non-convergence $\leadsto$ robustness to instability
Numerical analysis conventional wisdom:
This work: a very different way of looking at things
J. A. Lin,* S. Padhy,* J. Antorán,* A. Tripp, A. Terenin, Cs. Szepesvári, J. M. Hernández-Lobato, D. Janz. Stochastic Gradient Descent for Gaussian Processes Done Right. ICLR, 2024.
J. A. Lin,* J. Antorán,* S. Padhy,* D. Janz, J. M. Hernández-Lobato, A. Terenin. Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent. NeurIPS, 2023. Selected for Oral Presentation.
A. Terenin,* D. R. Burt,* A. Artemev, S. Flaxman, M. van der Wilk, C. E. Rasmussen, H. Ge. Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees. JMLR, 2024.
J. T. Wilson,* V. Borovitskiy,* P. Mostowsky,* A. Terenin,* M. P. Deisenroth. Efficiently Sampling Functions from Gaussian Process Posteriors. ICML, 2020. Honorable Mention for Outstanding Paper Award.
J. T. Wilson,* V. Borovitskiy,* P. Mostowsky,* A. Terenin,* M. P. Deisenroth. Pathwise Conditioning of Gaussian Processes. JMLR, 2021.
*Equal contribution