# Bayes linear regression and basis-functions in Gaussian process regression

a.k.a Fixed Rank Kriging, weight space GPs

February 22, 2022 — July 27, 2022

algebra
approximation
Gaussian
generative
graphical models
Hilbert space
kernel tricks
machine learning
networks
optimization
probability
statistics

Another way of cunningly chopping up the work of fitting a Gaussian process is to represent the process as a random function comprising basis functions $$\phi=\left(\phi_{1}, \ldots, \phi_{\ell}\right)$$ with the Gaussian random weight vector $$w$$ so that $f^{(w)}(\cdot)=\sum_{i=1}^{\ell} w_{i} \phi_{i}(\cdot) \quad \boldsymbol{w} \sim \mathcal{N}\left(\mathbf{0}, \boldsymbol{\Sigma}_{\boldsymbol{w}}\right).$ $$f^{(w)}$$ is a random function satisfying $$\boldsymbol{f}^{(\boldsymbol{w})} \sim \mathcal{N}\left(\mathbf{0}, \boldsymbol{\Phi}_{n} \boldsymbol{\Sigma}_{\boldsymbol{w}} \boldsymbol{\Phi}^{\top}\right)$$, where $$\boldsymbol{\Phi}_{n}=\boldsymbol{\phi}(\mathbf{X})$$ is a $$|\mathbf{X}| \times \ell$$ matrix of features. This is referred to as a weight space approach in ML.

TODO: I just assumed centred weights here, but that is crazy. Update to relax that assumption.

We might imagine this representation would be exact if we had countably many basis functions, and under sane conditions it is. We would like to know, further, that we can find a basis such that we need not too many basis functions to represent the process. Looking at the Karhunen-Loève theorem theorem we might imagine that this can sometimes work out fine, and indeed it does, sometimes.

This is a classic; see Chapter 3 of Bishop (2006) is classic and nicely clear. Cressie and Wikle (2011) targets for the spatiotemporal context.

Hijinks ensue when selecting the basis functions. If we were to treat the natural Hilbert space here seriously we could consider identifying the bases as eigenfunctions of the kernel. This is not generally easy. We tend to use either global bases such as Fourier bases or more generally Karhunen-Loéve bases, or construct local bases of limited overlap (usually piecewise polynomials AFAICT).

The kernel trick writes a kernel $$k$$ as an inner product in a corresponding reproducing kernel Hilbert space (RKHS) $$\mathcal{H}_{k}$$ with a feature map $$\varphi: \mathcal{X} \rightarrow \mathcal{H}_{k} .$$ In sufficiently nice cases the kernel is well approximated $k\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)=\left\langle\varphi(\boldsymbol{x}), \varphi\left(\boldsymbol{x}^{\prime}\right)\right\rangle_{\mathcal{H}_{k}} \approx \boldsymbol{\phi}(\boldsymbol{x})^{\top} \overline{\boldsymbol{\phi}\left(\boldsymbol{x}^{\prime}\right)}$ where $$\boldsymbol{\phi}: \mathcal{X} \rightarrow \mathbb{C}^{\ell}$$ is a finite-dimensional feature map. TODO: What is the actual guarantee here?

## 1 Fourier features

When the Fourier basis is natural for the problem we are in a pretty good situation. We can use the Wiener Khintchine relations to analyse and simulate the process. Connection perhaps to Fourier features in neural nets?

## 2 Random Fourier features

The random Fourier features method constructs a Monte Carlo estimate to a stationary kernel by representing the inner product in terms of $$\ell$$ complex exponential basis functions $$\phi_{j}(\boldsymbol{x})=\ell^{-1 / 2} \exp \left(i \boldsymbol{\omega}_{j}^{\top} \boldsymbol{x}\right)$$ with frequency parameters $$\boldsymbol{\omega}_{j}$$ sampled proportionally to the spectral density $$\rho\left(\boldsymbol{\omega}_{j}\right).$$

This sometimes has a favourable error rate .

## 3 K-L basis

We recall from the Karhunen-Loéve notebook that the mean-square-optimal $$f^{(w)}$$ for approximating a Gaussian process $$f$$ is found by truncating the Karhunen-Loéve expansion $f(\cdot)=\sum_{i=1}^{\infty} w_{i} \phi_{i}(\cdot) \quad w_{i} \sim \mathcal{N}\left(0, \lambda_{i}\right)$ where $$\phi_{i}$$ and $$\lambda_{i}$$ are, respectively, the $$i$$-th (orthogonal) eigenfunction and eigenvalue of the covariance operator $$\psi \mapsto \int_{\mathcal{X}} \psi(\boldsymbol{x}) k(\boldsymbol{x}, \cdot) \mathrm{d} \boldsymbol{x}$$, written in decreasing order of $$\lambda_{i}$$. What is the orthogonal basis $$\{\phi_{i}\}_i$$ though? That depends on the problem and can be a lot of work to calculate.

In the case that our field is stationary on a “nice” domain, though, this can easy — we simply have the Fourier features as the natural basis.

## 4 Compactly-supported basis functions

As seen in GPs as SDEs and FEMs .

## 5 “Decoupled” bases

Cheng and Boots (2017);Salimbeni et al. (2018);Shi, Titsias, and Mnih (2020);Wilson et al. (2020).

## 6 References

Ambikasaran, Foreman-Mackey, Greengard, et al. 2015. arXiv:1403.6015 [Astro-Ph, Stat].
Bishop. 2006. Pattern Recognition and Machine Learning. Information Science and Statistics.
Cheng, and Boots. 2017. In Advances in Neural Information Processing Systems.
Cressie, and Huang. 1999. Journal of the American Statistical Association.
Cressie, and Johannesson. 2008. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Cressie, Shi, and Kang. 2010. Journal of Computational and Graphical Statistics.
Cressie, and Wikle. 2011. Statistics for Spatio-Temporal Data. Wiley Series in Probability and Statistics 2.0.
———. 2014. In Wiley StatsRef: Statistics Reference Online.
Dahl, and Bonilla. 2019. arXiv:1903.03986 [Cs, Stat].
Dubrule. 2018. In Handbook of Mathematical Geosciences: Fifty Years of IAMG.
Finley, Banerjee, and Gelfand. 2015. Journal of Statistical Software.
Ghanem, and Spanos. 1990. Journal of Applied Mechanics.
Gilboa, Saatçi, and Cunningham. 2015. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Gulian, Frankel, and Swiler. 2020. arXiv:2012.11857 [Cs, Math, Stat].
Hu, Steinsland, Simpson, et al. 2013.
Lei, Li, Gao, et al. 2018.
Le, Sarlós, and Smola. 2013. In Proceedings of the International Conference on Machine Learning.
Lindgren, Rue, and Lindström. 2011. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Liu, Ray, and Hooker. 2014. arXiv:1411.4681 [Math, Stat].
Lord, Powell, and Shardlow. 2014. An Introduction to Computational Stochastic PDEs. Cambridge Texts in Applied Mathematics.
Miller, Glennie, and Seaton. 2020. Journal of Agricultural, Biological and Environmental Statistics.
Nguyen, Cressie, and Braverman. 2012. Journal of the American Statistical Association.
Nowak, and Litvinenko. 2013. Mathematical Geosciences.
O’Hagan. 2013. “Polynomial Chaos: A Tutorial and Critique from a Statistician’s Perspective.”
Petra, Martin, Stadler, et al. 2014. SIAM Journal on Scientific Computing.
Phillips, Seror, Hutchinson, et al. 2022. In.
Queipo, Haftka, Shyy, et al. 2005. Progress in Aerospace Sciences.
Rahimi, and Recht. 2007. In Advances in Neural Information Processing Systems.
———. 2008. In 2008 46th Annual Allerton Conference on Communication, Control, and Computing.
———. 2009. In Advances in Neural Information Processing Systems.
Riutort-Mayol, Bürkner, Andersen, et al. 2020. arXiv:2004.11408 [Stat].
Salimbeni, Cheng, Boots, et al. 2018. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. NIPS’18.
Särkkä, Solin, and Hartikainen. 2013. IEEE Signal Processing Magazine.
Shi, Titsias, and Mnih. 2020. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics.
Solin, and Kok. 2019. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics.
Solin, and Särkkä. 2020. Statistics and Computing.
Stein. 2008. Journal of the Korean Statistical Society.
Sutherland, and Schneider. 2015. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence. UAI’15.
Tiao, Dutordoir, and Picheny. 2023. In.
Wilson, Borovitskiy, Terenin, et al. 2020. In Proceedings of the 37th International Conference on Machine Learning.
Zammit-Mangion, and Cressie. 2021. Journal of Statistical Software.