Bayes linear regression and basis-functions in Gaussian process regression

a.k.a Fixed Rank Kriging, weight space GPs

February 22, 2022 — July 27, 2022

graphical models
Hilbert space
kernel tricks
machine learning
Figure 1: No, these are officers. You want low rank Gaussian processes.

Another way of cunningly chopping up the work of fitting a Gaussian process is to represent the process as a random function comprising basis functions \(\phi=\left(\phi_{1}, \ldots, \phi_{\ell}\right)\) with the Gaussian random weight vector \(w\) so that \[ f^{(w)}(\cdot)=\sum_{i=1}^{\ell} w_{i} \phi_{i}(\cdot) \quad \boldsymbol{w} \sim \mathcal{N}\left(\mathbf{0}, \boldsymbol{\Sigma}_{\boldsymbol{w}}\right). \] \(f^{(w)}\) is a random function satisfying \(\boldsymbol{f}^{(\boldsymbol{w})} \sim \mathcal{N}\left(\mathbf{0}, \boldsymbol{\Phi}_{n} \boldsymbol{\Sigma}_{\boldsymbol{w}} \boldsymbol{\Phi}^{\top}\right)\), where \(\boldsymbol{\Phi}_{n}=\boldsymbol{\phi}(\mathbf{X})\) is a \(|\mathbf{X}| \times \ell\) matrix of features. This is referred to as a weight space approach in ML.

TODO: I just assumed centred weights here, but that is crazy. Update to relax that assumption.

We might imagine this representation would be exact if we had countably many basis functions, and under sane conditions it is. We would like to know, further, that we can find a basis such that we need not too many basis functions to represent the process. Looking at the Karhunen-Loève theorem theorem we might imagine that this can sometimes work out fine, and indeed it does, sometimes.

This is a classic; see Chapter 3 of Bishop (2006) is classic and nicely clear. Cressie and Wikle (2011) targets for the spatiotemporal context.

Hijinks ensue when selecting the basis functions. If we were to treat the natural Hilbert space here seriously we could consider identifying the bases as eigenfunctions of the kernel. This is not generally easy. We tend to use either global bases such as Fourier bases or more generally Karhunen-Loéve bases, or construct local bases of limited overlap (usually piecewise polynomials AFAICT).

The kernel trick writes a kernel \(k\) as an inner product in a corresponding reproducing kernel Hilbert space (RKHS) \(\mathcal{H}_{k}\) with a feature map \(\varphi: \mathcal{X} \rightarrow \mathcal{H}_{k} .\) In sufficiently nice cases the kernel is well approximated \[ k\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)=\left\langle\varphi(\boldsymbol{x}), \varphi\left(\boldsymbol{x}^{\prime}\right)\right\rangle_{\mathcal{H}_{k}} \approx \boldsymbol{\phi}(\boldsymbol{x})^{\top} \overline{\boldsymbol{\phi}\left(\boldsymbol{x}^{\prime}\right)} \] where \(\boldsymbol{\phi}: \mathcal{X} \rightarrow \mathbb{C}^{\ell}\) is a finite-dimensional feature map. TODO: What is the actual guarantee here?

1 Fourier features

When the Fourier basis is natural for the problem we are in a pretty good situation. We can use the Wiener Khintchine relations to analyse and simulate the process. Connection perhaps to Fourier features in neural nets?

2 Random Fourier features

The random Fourier features method (Rahimi and Recht 2007, 2008) constructs a Monte Carlo estimate to a stationary kernel by representing the inner product in terms of \(\ell\) complex exponential basis functions \(\phi_{j}(\boldsymbol{x})=\ell^{-1 / 2} \exp \left(i \boldsymbol{\omega}_{j}^{\top} \boldsymbol{x}\right)\) with frequency parameters \(\boldsymbol{\omega}_{j}\) sampled proportionally to the spectral density \(\rho\left(\boldsymbol{\omega}_{j}\right).\)

This sometimes has a favourable error rate (Sutherland and Schneider 2015).

3 K-L basis

We recall from the Karhunen-Loéve notebook that the mean-square-optimal \(f^{(w)}\) for approximating a Gaussian process \(f\) is found by truncating the Karhunen-Loéve expansion \[ f(\cdot)=\sum_{i=1}^{\infty} w_{i} \phi_{i}(\cdot) \quad w_{i} \sim \mathcal{N}\left(0, \lambda_{i}\right) \] where \(\phi_{i}\) and \(\lambda_{i}\) are, respectively, the \(i\)-th (orthogonal) eigenfunction and eigenvalue of the covariance operator \(\psi \mapsto \int_{\mathcal{X}} \psi(\boldsymbol{x}) k(\boldsymbol{x}, \cdot) \mathrm{d} \boldsymbol{x}\), written in decreasing order of \(\lambda_{i}\). What is the orthogonal basis \(\{\phi_{i}\}_i\) though? That depends on the problem and can be a lot of work to calculate.

In the case that our field is stationary on a “nice” domain, though, this can easy — we simply have the Fourier features as the natural basis.

Figure 2

4 Compactly-supported basis functions

As seen in GPs as SDEs and FEMs (Lindgren, Rue, and Lindström 2011; Lord, Powell, and Shardlow 2014).

5 “Decoupled” bases

Cheng and Boots (2017);Salimbeni et al. (2018);Shi, Titsias, and Mnih (2020);Wilson et al. (2020).

6 References

Ambikasaran, Foreman-Mackey, Greengard, et al. 2015. Fast Direct Methods for Gaussian Processes.” arXiv:1403.6015 [Astro-Ph, Stat].
Bishop. 2006. Pattern Recognition and Machine Learning. Information Science and Statistics.
Cheng, and Boots. 2017. Variational Inference for Gaussian Process Models with Linear Complexity.” In Advances in Neural Information Processing Systems.
Cressie, and Huang. 1999. Classes of Nonseparable, Spatio-Temporal Stationary Covariance Functions.” Journal of the American Statistical Association.
Cressie, and Johannesson. 2008. Fixed Rank Kriging for Very Large Spatial Data Sets.” Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Cressie, Shi, and Kang. 2010. Fixed Rank Filtering for Spatio-Temporal Data.” Journal of Computational and Graphical Statistics.
Cressie, and Wikle. 2011. Statistics for Spatio-Temporal Data. Wiley Series in Probability and Statistics 2.0.
———. 2014. Space-Time Kalman Filter.” In Wiley StatsRef: Statistics Reference Online.
Dahl, and Bonilla. 2019. Sparse Grouped Gaussian Processes for Solar Power Forecasting.” arXiv:1903.03986 [Cs, Stat].
Dubrule. 2018. Kriging, Splines, Conditional Simulation, Bayesian Inversion and Ensemble Kalman Filtering.” In Handbook of Mathematical Geosciences: Fifty Years of IAMG.
Finley, Banerjee, and Gelfand. 2015. spBayes for Large Univariate and Multivariate Point-Referenced Spatio-Temporal Data Models.” Journal of Statistical Software.
Ghanem, and Spanos. 1990. Polynomial Chaos in Stochastic Finite Elements.” Journal of Applied Mechanics.
Gilboa, Saatçi, and Cunningham. 2015. Scaling Multidimensional Inference for Structured Gaussian Processes.” IEEE Transactions on Pattern Analysis and Machine Intelligence.
Gulian, Frankel, and Swiler. 2020. Gaussian Process Regression Constrained by Boundary Value Problems.” arXiv:2012.11857 [Cs, Math, Stat].
Hu, Steinsland, Simpson, et al. 2013. Spatial Modelling of Temperature and Humidity Using Systems of Stochastic Partial Differential Equations.”
Lei, Li, Gao, et al. 2018. A Data-Driven Framework for Sparsity-Enhanced Surrogates with Arbitrary Mutually Dependent Randomness.”
Le, Sarlós, and Smola. 2013. Fastfood-Approximating Kernel Expansions in Loglinear Time.” In Proceedings of the International Conference on Machine Learning.
Lindgren, Rue, and Lindström. 2011. An Explicit Link Between Gaussian Fields and Gaussian Markov Random Fields: The Stochastic Partial Differential Equation Approach.” Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Liu, Ray, and Hooker. 2014. Functional Principal Components Analysis of Spatially Correlated Data.” arXiv:1411.4681 [Math, Stat].
Lord, Powell, and Shardlow. 2014. An Introduction to Computational Stochastic PDEs. Cambridge Texts in Applied Mathematics.
Luo. 2006. Wiener Chaos Expansion and Numerical Solutions of Stochastic Partial Differential Equations.”
Miller, Glennie, and Seaton. 2020. Understanding the Stochastic Partial Differential Equation Approach to Smoothing.” Journal of Agricultural, Biological and Environmental Statistics.
Nguyen, Cressie, and Braverman. 2012. Spatial Statistical Data Fusion for Remote Sensing Applications.” Journal of the American Statistical Association.
Nowak, and Litvinenko. 2013. Kriging and Spatial Design Accelerated by Orders of Magnitude: Combining Low-Rank Covariance Approximations with FFT-Techniques.” Mathematical Geosciences.
O’Hagan. 2013. “Polynomial Chaos: A Tutorial and Critique from a Statistician’s Perspective.”
Petra, Martin, Stadler, et al. 2014. A Computational Framework for Infinite-Dimensional Bayesian Inverse Problems, Part II: Stochastic Newton MCMC with Application to Ice Sheet Flow Inverse Problems.” SIAM Journal on Scientific Computing.
Phillips, Seror, Hutchinson, et al. 2022. Spectral Diffusion Processes.” In.
Queipo, Haftka, Shyy, et al. 2005. Surrogate-Based Analysis and Optimization.” Progress in Aerospace Sciences.
Rahimi, and Recht. 2007. Random Features for Large-Scale Kernel Machines.” In Advances in Neural Information Processing Systems.
———. 2008. Uniform Approximation of Functions with Random Bases.” In 2008 46th Annual Allerton Conference on Communication, Control, and Computing.
———. 2009. Weighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning.” In Advances in Neural Information Processing Systems.
Riutort-Mayol, Bürkner, Andersen, et al. 2020. Practical Hilbert Space Approximate Bayesian Gaussian Processes for Probabilistic Programming.” arXiv:2004.11408 [Stat].
Salimbeni, Cheng, Boots, et al. 2018. Orthogonally Decoupled Variational Gaussian Processes.” In Proceedings of the 32nd International Conference on Neural Information Processing Systems. NIPS’18.
Särkkä, Solin, and Hartikainen. 2013. Spatiotemporal Learning via Infinite-Dimensional Bayesian Filtering and Smoothing: A Look at Gaussian Process Regression Through Kalman Filtering.” IEEE Signal Processing Magazine.
Shi, Titsias, and Mnih. 2020. Sparse Orthogonal Variational Inference for Gaussian Processes.” In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics.
Solin, and Kok. 2019. Know Your Boundaries: Constraining Gaussian Processes by Variational Harmonic Features.” In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics.
Solin, and Särkkä. 2020. Hilbert Space Methods for Reduced-Rank Gaussian Process Regression.” Statistics and Computing.
Stein. 2008. A Modeling Approach for Large Spatial Datasets.” Journal of the Korean Statistical Society.
Sutherland, and Schneider. 2015. On the Error of Random Fourier Features.” In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence. UAI’15.
Tiao, Dutordoir, and Picheny. 2023. Spherical Inducing Features for Orthogonally-Decoupled Gaussian Processes.” In.
Wilson, Borovitskiy, Terenin, et al. 2020. Efficiently Sampling Functions from Gaussian Process Posteriors.” In Proceedings of the 37th International Conference on Machine Learning.
Zammit-Mangion, and Cressie. 2021. FRK: An R Package for Spatial and Spatio-Temporal Prediction with Large Datasets.” Journal of Statistical Software.