$$x_0, \qquad x_1 = x_0 + f(x_0)\,\Delta t, \qquad x_2 = x_1 + f(x_1)\,\Delta t, \qquad \ldots$$
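A minimal NumPy sketch of this Euler discretization; the dynamics function, step size, and initial state below are illustrative placeholders, not part of the original material.

```python
import numpy as np

def euler_rollout(f, x0, dt, num_steps):
    """Roll out x_{t+1} = x_t + f(x_t) * dt from an initial state x0."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(num_steps):
        xs.append(xs[-1] + f(xs[-1]) * dt)
    return np.stack(xs)

# Toy dynamics (illustrative): a damped linear system.
trajectory = euler_rollout(lambda x: -0.5 * x, x0=np.array([1.0, -2.0]), dt=0.1, num_steps=50)
```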
$(\theta, y) \sim \mathrm{N}(\mu, \Sigma)$
$(\theta \mid y = \gamma) \sim \mathrm{N}(\mu_{\theta \mid y}, \Sigma_{\theta \mid y})$
$$\mu_{\theta \mid y} = \mu_\theta + \Sigma_{\theta y} \Sigma_{yy}^{-1} (\gamma - \mu_y) \qquad\qquad \Sigma_{\theta \mid y} = \Sigma_{\theta\theta} - \Sigma_{\theta y} \Sigma_{yy}^{-1} \Sigma_{y\theta}$$
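A minimal NumPy sketch of these conditioning formulas; the block sizes and the example values of $\mu$ and $\Sigma$ below are purely illustrative.

```python
import numpy as np

def gaussian_condition(mu, Sigma, idx_theta, idx_y, gamma):
    """Condition a joint Gaussian N(mu, Sigma) on the block y = gamma."""
    mu_t, mu_y = mu[idx_theta], mu[idx_y]
    S_tt = Sigma[np.ix_(idx_theta, idx_theta)]
    S_ty = Sigma[np.ix_(idx_theta, idx_y)]
    S_yy = Sigma[np.ix_(idx_y, idx_y)]
    # Solve against S_yy rather than forming an explicit inverse.
    mu_cond = mu_t + S_ty @ np.linalg.solve(S_yy, gamma - mu_y)
    Sigma_cond = S_tt - S_ty @ np.linalg.solve(S_yy, S_ty.T)
    return mu_cond, Sigma_cond

# Illustrative 3D joint with theta = first two coordinates, y = last one.
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.5, 0.3], [0.5, 1.0, 0.2], [0.3, 0.2, 1.0]])
mu_c, Sigma_c = gaussian_condition(mu, Sigma, [0, 1], [2], np.array([0.7]))
```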
$(\theta, y) \sim \mathrm{N}(\mu, \Sigma)$
$(\theta \mid y = \gamma) = \theta + \Sigma_{\theta y} \Sigma_{yy}^{-1} (\gamma - y)$ (Matheron's rule: equality in distribution, where $(\theta, y)$ is a joint prior draw)
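The same toy example, now sampling the conditional by transforming a joint prior draw as in the rule above; all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative joint Gaussian over (theta, y): theta is 2D, y is 1D.
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.5, 0.3], [0.5, 1.0, 0.2], [0.3, 0.2, 1.0]])
gamma = np.array([0.7])  # observed value of y

# Draw (theta, y) jointly from the prior.
theta_y = rng.multivariate_normal(mu, Sigma)
theta, y = theta_y[:2], theta_y[2:]

# Matheron-style update: transform the prior draw into a conditional draw.
S_ty = Sigma[:2, 2:]
S_yy = Sigma[2:, 2:]
theta_given_y = theta + S_ty @ np.linalg.solve(S_yy, gamma - y)
```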
$$(f \mid y)(\cdot) = \underbrace{f(\cdot)}_{O(N_*^3)} + K_{(\cdot)x}\,\underbrace{K_{xx}^{-1}}_{O(N^3)}\,(y - f(x))$$
$$(f \mid y)(\cdot) \approx \underbrace{\sum_{i=1}^{\ell} w_i \phi_i(\cdot)}_{\text{basis function prior approx.}} + \sum_{j=1}^{N} v_j\, k(x_j, \cdot), \qquad v = K_{xx}^{-1}\bigl(y - \Phi^{\top} w\bigr) \qquad \text{($O(N_*)$ complexity in num. test points)}$$
Thompson sampling: choose $x_{n+1} = \operatorname*{arg\,min}_{x \in X} \alpha_n(x)$, where $\alpha_n \sim f \mid y$.
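A sketch of the decoupled sampling scheme above in NumPy, using random Fourier features as the basis function prior approximation of a squared exponential kernel and a grid search for the Thompson sampling step; the data, hyperparameters, and candidate grid are illustrative, not the paper's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Squared exponential kernel and a matching random Fourier feature (RFF) prior approximation.
kappa = 0.5
def k(a, b):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * kappa ** 2))

L = 500                                    # number of basis functions
omega = rng.normal(0.0, 1.0 / kappa, L)    # RFF frequencies for the SE kernel
tau = rng.uniform(0.0, 2 * np.pi, L)       # random phases
def phi(a):
    return np.sqrt(2.0 / L) * np.cos(a[:, None] * omega + tau)   # (len(a), L) feature matrix

# Toy data: noisy observations of a latent function.
x = np.linspace(0.0, 1.0, 20)
y = np.sin(6 * x) + 0.1 * rng.normal(size=x.shape)
noise = 0.1 ** 2

# Pathwise posterior sample: weight-space prior draw + kernel-basis data update.
w = rng.normal(size=L)                     # prior weights
f_prior_x = phi(x) @ w                     # prior sample evaluated at the data
eps = np.sqrt(noise) * rng.normal(size=x.shape)
v = np.linalg.solve(k(x, x) + noise * np.eye(len(x)), y - f_prior_x - eps)

def posterior_sample(a):
    return phi(a) @ w + k(a, x) @ v        # cost is linear in the number of test points

# Thompson sampling step: minimize the sampled path over a candidate grid.
grid = np.linspace(0.0, 1.0, 200)
x_next = grid[np.argmin(posterior_sample(grid))]
```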
$$(f \mid y)(\cdot) = f(\cdot) + \sum_{i=1}^{N} v_i(y)\, k(x_i, \cdot) \qquad \text{Exact Gaussian process: } O(N^3)$$
$$(f \mid y)(\cdot) \approx f(\cdot) + \sum_{j=1}^{M} v_j(u)\, k(z_j, \cdot) \qquad \text{Sparse Gaussian process: } O(NM^2)$$
$$\operatorname{cond}(A) = \lim_{\varepsilon \to 0}\, \sup_{\|\delta\| \le \varepsilon \|b\|} \frac{\bigl\|A^{-1}(b + \delta) - A^{-1} b\bigr\|_2}{\varepsilon\, \|A^{-1} b\|_2} = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}$$
Result. Let $A$ be a symmetric positive definite matrix of size $N \times N$, where $N > 10$. Assume that
$$\operatorname{cond}(A) \le \frac{1}{2^{-t} \cdot 3.9\, N^{3/2}},$$
where $t$ is the length of the floating point mantissa, and that $3N\, 2^{-t} < 0.1$. Then floating point Cholesky factorization will succeed, producing a matrix $L$ satisfying
$$LL^{\top} = A + E, \qquad \|E\|_2 \le 2^{-t} \cdot 1.38\, N^{3/2}\, \|A\|_2.$$
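A small NumPy check of this sufficient condition before attempting a Cholesky factorization; the mantissa length $t = 52$ corresponds to double precision, and the test matrix is illustrative.

```python
import numpy as np

def cholesky_bound_ok(A, t=52):
    """Check the sufficient condition cond(A) <= 1 / (2^-t * 3.9 * N^(3/2))."""
    N = A.shape[0]
    eigs = np.linalg.eigvalsh(A)
    if eigs[0] <= 0:            # numerically indefinite: the bound cannot hold
        return False, np.inf
    cond = eigs[-1] / eigs[0]
    return cond <= 1.0 / (2.0 ** (-t) * 3.9 * N ** 1.5), cond

# Illustrative kernel-style matrix: squared exponential Gram matrix on a dense grid.
x = np.linspace(0, 1, 200)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.2 ** 2))
ok, cond = cholesky_bound_ok(K)
# For densely packed points this check typically fails, and np.linalg.cholesky(K)
# may raise LinAlgError unless jitter is added to the diagonal.
```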
Are kernel matrices always well-conditioned? No.
One-dimensional time series on a grid: Kac–Murdock–Szegő matrix
$$K_{xx} = \begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{N-1} \\ \rho & 1 & \rho & \cdots & \rho^{N-2} \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ \rho^{N-1} & \rho^{N-2} & \rho^{N-3} & \cdots & 1 \end{pmatrix}$$
for which
$$\frac{1 + 2\rho - 2\rho\varepsilon + \rho^2}{1 - 2\rho + 2\rho\varepsilon + \rho^2} \le \operatorname{cond}(K_{xx}) \le \frac{(1+\rho)^2}{(1-\rho)^2}, \qquad \text{where } \varepsilon = \frac{\pi^2}{2(N+1)^2}.$$
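A quick numerical check of this example: build the KMS matrix and compare its condition number to the bounds above; the values of $\rho$ and $N$ are illustrative.

```python
import numpy as np

def kms_matrix(N, rho):
    """Kac-Murdock-Szego matrix: (K)_ij = rho^|i - j|."""
    idx = np.arange(N)
    return rho ** np.abs(idx[:, None] - idx[None, :])

N, rho = 200, 0.9
K = kms_matrix(N, rho)
cond = np.linalg.cond(K)

eps = np.pi ** 2 / (2 * (N + 1) ** 2)
lower = (1 + 2 * rho - 2 * rho * eps + rho ** 2) / (1 - 2 * rho + 2 * rho * eps + rho ** 2)
upper = (1 + rho) ** 2 / (1 - rho) ** 2
# cond falls between lower and upper, and approaches the upper bound as N grows.
```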
Proposition. Assuming the kernel is stationary and spatially decaying, a minimum separation between the inducing points $z$ controls $\operatorname{cond}(K_{zz})$ uniformly in the number of inducing points $M$.
$$(f \mid y)(\cdot) = f(\cdot) + K_{(\cdot)z}\,(K_{zz} + \Lambda)^{-1}\,(u - f(z) - \epsilon), \qquad \epsilon \sim \mathrm{N}(0, \Lambda)$$
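A sketch of this inducing point update in NumPy, reusing a random-feature prior sample; the inducing locations, $\Lambda$, and the variational distribution of $u$ below are placeholders rather than the output of an actual variational fit.

```python
import numpy as np

rng = np.random.default_rng(2)

kappa = 0.5
def k(a, b):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * kappa ** 2))

# Random Fourier feature prior sample (same construction as in the earlier sketch).
L = 500
omega = rng.normal(0.0, 1.0 / kappa, L)
tau = rng.uniform(0.0, 2 * np.pi, L)
phi = lambda a: np.sqrt(2.0 / L) * np.cos(a[:, None] * omega + tau)
w = rng.normal(size=L)

# Inducing points and a placeholder variational posterior q(u) = N(m, S).
z = np.linspace(0.0, 1.0, 10)
m = np.sin(6 * z)                          # placeholder variational mean
S = 0.01 * np.eye(len(z))                  # placeholder variational covariance
Lambda = 1e-6 * np.eye(len(z))             # placeholder Lambda term

# u ~ q(u), eps ~ N(0, Lambda), then the update from the formula above.
u = rng.multivariate_normal(m, S)
eps = rng.multivariate_normal(np.zeros(len(z)), Lambda)
v = np.linalg.solve(k(z, z) + Lambda, u - phi(z) @ w - eps)

def sparse_posterior_sample(a):
    return phi(a) @ w + k(a, z) @ v        # cost scales with M, not N, at test time

values = sparse_posterior_sample(np.linspace(0.0, 1.0, 100))
```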
[Figure: Matérn Gaussian process samples for $\nu = 1/2,\ 3/2,\ 5/2,\ \infty$]
$$k_\nu(x, x') = \sigma^2\, \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\sqrt{2\nu}\,\frac{\|x - x'\|}{\kappa}\right)^{\!\nu} K_\nu\!\left(\sqrt{2\nu}\,\frac{\|x - x'\|}{\kappa}\right) \qquad\qquad k_\infty(x, x') = \sigma^2 \exp\!\left(-\frac{\|x - x'\|^2}{2\kappa^2}\right)$$
$\sigma^2$: variance
$\kappa$: length scale
$\nu$: smoothness
$\nu \to \infty$: recovers the squared exponential kernel
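A small SciPy/NumPy implementation of the kernels above; the modified Bessel function $K_\nu$ is scipy.special.kv, and the test inputs are illustrative.

```python
import numpy as np
from scipy.special import gamma, kv

def matern_kernel(x1, x2, nu=1.5, kappa=1.0, sigma2=1.0):
    """Matern kernel k_nu(x, x'); rows of x1, x2 are input points."""
    dists = np.linalg.norm(x1[:, None, :] - x2[None, :, :], axis=-1)
    scaled = np.sqrt(2 * nu) * dists / kappa
    K = sigma2 * (2 ** (1 - nu) / gamma(nu)) * scaled ** nu * kv(nu, scaled)
    K[scaled == 0.0] = sigma2      # kv(nu, 0) is inf; the limit at zero distance is sigma2
    return K

def sq_exp_kernel(x1, x2, kappa=1.0, sigma2=1.0):
    """Squared exponential kernel, the nu -> infinity limit."""
    dists = np.linalg.norm(x1[:, None, :] - x2[None, :, :], axis=-1)
    return sigma2 * np.exp(-dists ** 2 / (2 * kappa ** 2))

x = np.linspace(0, 1, 5)[:, None]  # 5 one-dimensional inputs
K_matern = matern_kernel(x, x, nu=2.5)
K_se = sq_exp_kernel(x, x)
```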
$$k_\infty^{(d_g)}(x, x') = \sigma^2 \exp\!\left(-\frac{d_g(x, x')^2}{2\kappa^2}\right)$$
Theorem (Feragen et al.). Let $M$ be a complete Riemannian manifold without boundary. If $k_\infty^{(d_g)}$ is positive semi-definite for all $\kappa$, then $M$ is isometric to a Euclidean space.
$$\text{Matérn:} \quad \left(\frac{2\nu}{\kappa^2} - \Delta\right)^{\frac{\nu}{2} + \frac{d}{4}} f = W \qquad\qquad \text{squared exponential:} \quad e^{-\frac{\kappa^2}{4}\Delta} f = W$$
$\Delta$: Laplacian
$W$: (rescaled) white noise
$e^{-\frac{\kappa^2}{4}\Delta}$: (rescaled) heat semigroup
$$k_\nu(x, x') = \frac{\sigma^2}{C_\nu} \sum_{n=0}^{\infty} \left(\frac{2\nu}{\kappa^2} + \lambda_n\right)^{-\nu - \frac{d}{2}} f_n(x)\, f_n(x')$$
$\lambda_n, f_n$: eigenvalues and eigenfunctions of (minus) the Laplace–Beltrami operator
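As a concrete instance of this spectral sum, a sketch on the circle $S^1$ ($d = 1$), whose Laplacian eigenfunctions are Fourier modes; the truncation level and the handling of the normalization constant $C_\nu$ are illustrative choices.

```python
import numpy as np

def matern_kernel_circle(theta1, theta2, nu=1.5, kappa=0.5, sigma2=1.0, n_max=500):
    """Truncated spectral Matern kernel on the circle S^1 (d = 1)."""
    n = np.arange(1, n_max + 1)
    weights = (2 * nu / kappa ** 2 + n ** 2) ** (-nu - 0.5)   # spectral coefficients
    w0 = (2 * nu / kappa ** 2) ** (-nu - 0.5)                 # constant (n = 0) mode
    diff = theta1[:, None] - theta2[None, :]
    # cos(n theta) cos(n theta') + sin(n theta) sin(n theta') = cos(n (theta - theta'))
    k = w0 / (2 * np.pi) + (np.cos(n * diff[..., None]) @ weights) / np.pi
    C = w0 / (2 * np.pi) + weights.sum() / np.pi              # so that k(theta, theta) = sigma2
    return sigma2 * k / C

theta = np.linspace(0, 2 * np.pi, 50)
K = matern_kernel_circle(theta, theta)
```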
$f\colon G \to \mathbb{R}$ [figure: example functions defined on graphs]
$$(\Delta f)(x) = \sum_{x' \sim x} w_{xx'}\bigl(f(x) - f(x')\bigr), \qquad \Delta = D - W$$
D: degree matrix
W: (weighted) adjacency matrix
$$\text{Matérn:} \quad \left(\frac{2\nu}{\kappa^2} + \Delta\right)^{\frac{\nu}{2}} f = \mathbf{W} \qquad\qquad \text{squared exponential:} \quad e^{\frac{\kappa^2}{4}\Delta} f = \mathbf{W}$$
Δ: graph Laplacian
$\mathbf{W}$: standard Gaussian
$$\text{Matérn:} \quad f \sim \mathrm{N}\!\left(0, \left(\tfrac{2\nu}{\kappa^2} + \Delta\right)^{-\nu}\right) \qquad\qquad \text{squared exponential:} \quad f \sim \mathrm{N}\!\left(0, e^{-\frac{\kappa^2}{2}\Delta}\right)$$
$$k_\nu(x, x') = \frac{\sigma^2}{C_\nu} \sum_{n=1}^{|G|} \left(\frac{2\nu}{\kappa^2} + \lambda_n\right)^{-\nu} f_n(x)\, f_n(x')$$
$\lambda_n, f_n$: eigenvalues and eigenvectors of the graph Laplacian
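A compact NumPy sketch of this construction: form the Laplacian of a small graph, eigendecompose it, and assemble the kernel matrix; the example graph (a cycle) and the hyperparameters are illustrative.

```python
import numpy as np

def graph_matern_kernel(W_adj, nu=1.5, kappa=2.0, sigma2=1.0):
    """Graph Matern kernel built from a (weighted) adjacency matrix W_adj."""
    D = np.diag(W_adj.sum(axis=1))
    Delta = D - W_adj                                  # graph Laplacian
    lam, U = np.linalg.eigh(Delta)                     # eigenvalues / eigenvectors f_n
    weights = (2 * nu / kappa ** 2 + lam) ** (-nu)     # spectral coefficients
    K = (U * weights) @ U.T                            # sum_n weights_n f_n f_n^T
    C = np.mean(np.diag(K))                            # normalize the average variance to sigma2
    return sigma2 * K / C

# Illustrative graph: a cycle on 20 nodes.
n = 20
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
K = graph_matern_kernel(A)
```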
$$k(g_1, g_2) = \sum_{n=1}^{\infty} a(\lambda_n)\, f_n(g_1)\, f_n(g_2) = \sum_{\lambda \in \Lambda} a(\lambda)\, \operatorname{Re} \chi^{(\lambda)}\!\bigl(g_2^{-1} \cdot g_1\bigr)$$
J. T. Wilson,* V. Borovitskiy,* P. Mostowsky,* A. Terenin,* M. P. Deisenroth. Efficiently Sampling Functions from Gaussian Process Posteriors. International Conference on Machine Learning, 2020. Honorable Mention for Outstanding Paper Award.
J. T. Wilson,* V. Borovitskiy,* P. Mostowsky,* A. Terenin,* M. P. Deisenroth. Pathwise Conditioning of Gaussian Processes. Journal of Machine Learning Research, 2021.
V. Borovitskiy,* P. Mostowsky,* A. Terenin,* M. P. Deisenroth. Matérn Gaussian Processes on Riemannian Manifolds. Advances in Neural Information Processing Systems, 2020.
V. Borovitskiy,* I. Azangulov,* P. Mostowsky,* A. Terenin,* M. P. Deisenroth, N. Durrande. Matérn Gaussian Processes on Graphs. Artificial Intelligence and Statistics, 2021. Best Student Paper Award.
M. J. Hutchinson,* A. Terenin,* V. Borovitskiy,* S. Takao,* Y. W. Teh, M. P. Deisenroth. Vector-valued Gaussian Processes on Riemannian Manifolds via Gauge Independent Projected Kernels. Advances in Neural Information Processing Systems, 2021.
N. Jaquier,* V. Borovitskiy,* A. Smolensky, A. Terenin, T. Asfour, L. Rozo. Geometry-aware Bayesian Optimization in Robotics using Riemannian Matérn Kernels. Conference on Robot Learning, 2021.
I. Azangulov, A. Smolensky, A. Terenin, V. Borovitskiy. Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the Compact Case. arXiv: 2208.14960, 2022.
A. Terenin,* D. R. Burt,* A. Artemev, S. Flaxman, M. van der Wilk, C. E. Rasmussen, H. Ge. Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees. arXiv: 2210.07893, 2022.
*Equal contribution