# Efficient factoring of GP likelihoods

October 16, 2020 — October 26, 2020

algebra
approximation
Gaussian
generative
graphical models
Hilbert space
kernel tricks
machine learning
networks
optimization
probability
statistics

There are many ways to cleverly slice up GP likelihoods so that inference is cheap.

This page is about some of them, especially the union of sparse and variational tricks. Scalable Gaussian process regressions choose cunning factorisations such that the model collapses down to a lower-dimensional thing than it might have seemed to need, at least approximately. There is a compilation of tricks to make this go: variational approximations to the model, and sparse GP models where there are a small number of inducing points. You might suspect yourself of using such a method if you find that some important high-dimensional expectation can be evaluated by some function of univariate Gaussians.
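To make the remark about univariate Gaussians concrete, here is a minimal sketch under my own assumptions (it is not taken from any of the cited papers). If the likelihood factorises over observations and the approximate posterior is Gaussian, the expected log-likelihood depends only on the marginal means and variances, and each term is a one-dimensional Gaussian integral that Gauss-Hermite quadrature can handle. The names `expected_loglik` and `gaussian_log_lik` are illustrative.

```python
import numpy as np

def expected_loglik(log_lik, y, mean, var, n_points=20):
    """Approximate sum_n E_{N(f_n; mean_n, var_n)}[log_lik(y_n, f_n)]."""
    # Nodes t and weights w for integrals against exp(-t^2).
    t, w = np.polynomial.hermite.hermgauss(n_points)
    # Change of variables f = mean + sqrt(2 * var) * t, one row per observation.
    f = mean[:, None] + np.sqrt(2.0 * var)[:, None] * t[None, :]
    vals = log_lik(y[:, None], f)          # shape (N, n_points)
    return np.sum(vals @ w) / np.sqrt(np.pi)

def gaussian_log_lik(y, f, noise_var=0.1):
    # Illustrative likelihood; any pointwise log-density would do here.
    return -0.5 * np.log(2 * np.pi * noise_var) - 0.5 * (y - f) ** 2 / noise_var

rng = np.random.default_rng(0)
y = rng.normal(size=5)
mean = rng.normal(size=5)               # marginal means of q(f_n)
var = rng.uniform(0.1, 1.0, size=5)     # marginal variances of q(f_n)

approx = expected_loglik(gaussian_log_lik, y, mean, var)

# For a Gaussian likelihood the expectation is available in closed form,
# which gives us something to compare against.
exact = np.sum(-0.5 * np.log(2 * np.pi * 0.1)
               - 0.5 * ((y - mean) ** 2 + var) / 0.1)
```

Only the $$N$$ marginal variances of the approximate posterior enter, never its full $$N\times N$$ covariance, which is where the saving comes from.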

This is a related notion to other tricks which factorise a distribution cleverly, such as message-passing inference. There are indeed a lot of different factorisations that can be done here; see filtering GPs for one which factorises over a single input axis. Also, Toeplitz and related structures work out nicely for, e.g., lattice-distributed inputs and some other situations I forget right now. I will bet you they can all be used together.
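As a small illustration of the Toeplitz remark (a sketch with a kernel and data of my own choosing): a stationary kernel evaluated on a regular one-dimensional lattice gives a Gram matrix whose entries depend only on the index difference, so the first column determines the whole matrix, and `scipy.linalg.solve_toeplitz` can solve the regression system without forming it.

```python
import numpy as np
from scipy.linalg import solve_toeplitz, toeplitz

def rbf(r, lengthscale=1.0):
    # Squared-exponential kernel as a function of the lag r (illustrative choice).
    return np.exp(-0.5 * (r / lengthscale) ** 2)

n = 200
x = np.linspace(0.0, 10.0, n)            # regular lattice of inputs
noise_var = 0.1
rng = np.random.default_rng(1)
y_obs = np.sin(x) + np.sqrt(noise_var) * rng.normal(size=n)

# First column of K + noise_var * I; symmetry supplies the first row.
first_col = rbf(x - x[0])
first_col[0] += noise_var

# Levinson recursion: O(n^2) time, and only the first column is stored.
alpha_fast = solve_toeplitz(first_col, y_obs)

# Dense reference solve, for comparison only.
alpha_dense = np.linalg.solve(toeplitz(first_col), y_obs)
```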

## 3 Spectral and rank sparsity

Loosely speaking, these are the cases where the functions can be represented in a small number of basis functions.
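One standard example of such a basis, offered as an assumption about what is meant rather than as the method of any particular reference: random Fourier features for the squared-exponential kernel. Frequencies are sampled from the kernel's spectral density, and a finite set of cosines and sines stands in for the kernel. The helper `rff_features` is a hypothetical name.

```python
import numpy as np

def rff_features(x, n_freqs, lengthscale, rng):
    """Random cosine/sine features whose inner products approximate an RBF kernel."""
    # The spectral density of the RBF kernel is Gaussian with std 1 / lengthscale.
    W = rng.normal(scale=1.0 / lengthscale, size=(x.shape[1], n_freqs))
    proj = x @ W
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(n_freqs)

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=(50, 2))
lengthscale = 1.5

Phi = rff_features(x, 5000, lengthscale, rng)   # shape (50, 10000)
K_approx = Phi @ Phi.T

# Exact Gram matrix for comparison.
sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-0.5 * sq_dists / lengthscale**2)

max_err = np.max(np.abs(K_approx - K_exact))
```

The error shrinks like one over the square root of the number of frequencies, so the budget of basis functions trades accuracy against cost.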

## 4 SVI for Gaussian processes

As seen in Hensman, Fusi, and Lawrence (2013); Salimbeni and Deisenroth (2017).
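For orientation, here is the general shape of the sparse variational bound as I understand it, in notation of my own choosing: $$\mathbf{u}$$ are the function values at $$M$$ inducing inputs, $$\mathbf{K}_{\mathbf{uu}}$$ is their Gram matrix, $$\mathbf{k}_n$$ is the vector of covariances between $$f(\mathbf{x}_n)$$ and $$\mathbf{u}$$, and the variational posterior is $$q(\mathbf{u})=\mathcal{N}(\mathbf{m},\mathbf{S})$$.

$$
\begin{aligned}
\mathcal{L}(\mathbf{m}, \mathbf{S})
&=\sum_{n=1}^{N} \mathbb{E}_{q(f_n)}\left[\log p\left(y_n \mid f_n\right)\right]-\operatorname{KL}\left[q(\mathbf{u}) \,\|\, p(\mathbf{u})\right], \\
q(f_n) &=\mathcal{N}\left(f_n ; \mu_n, \sigma_n^2\right), \\
\mu_n &=\mathbf{k}_n^{\top} \mathbf{K}_{\mathbf{uu}}^{-1} \mathbf{m}, \\
\sigma_n^2 &=\kappa\left(\mathbf{x}_n, \mathbf{x}_n\right)+\mathbf{k}_n^{\top} \mathbf{K}_{\mathbf{uu}}^{-1}\left(\mathbf{S}-\mathbf{K}_{\mathbf{uu}}\right) \mathbf{K}_{\mathbf{uu}}^{-1} \mathbf{k}_n .
\end{aligned}
$$

The first term is a sum over data points, so an unbiased estimate of it can be formed from a minibatch, and that is what makes stochastic optimisation possible. The KL term involves only $$M$$-dimensional Gaussians.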

## 5 Low rank methods

Represent the GP in terms of a controlled budget of basis functions. See low-rank Gaussian processes.
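A minimal sketch of why a small basis budget helps, assuming a Nyström-style feature map built from inducing inputs (my choice for illustration; other bases work the same way): if the Gram matrix is approximated by $$\boldsymbol{\Phi}\boldsymbol{\Phi}^{\top}$$ with $$\boldsymbol{\Phi}$$ of size $$N\times M$$, the Woodbury identity reduces the regression solve from $$\mathcal{O}(N^3)$$ to $$\mathcal{O}(NM^2)$$.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def low_rank_solve(Phi, noise_var, y):
    """Solve (Phi Phi^T + noise_var * I) alpha = y via the Woodbury identity."""
    M = Phi.shape[1]
    A = noise_var * np.eye(M) + Phi.T @ Phi          # only an M x M system
    return (y - Phi @ np.linalg.solve(A, Phi.T @ y)) / noise_var

rng = np.random.default_rng(3)
N, M = 300, 15
x = rng.uniform(-3, 3, size=(N, 1))
z = np.linspace(-3, 3, M)[:, None]     # inducing inputs (hypothetical placement)
y = np.sin(2 * x[:, 0]) + 0.1 * rng.normal(size=N)
noise_var = 0.01

# Nystrom features: Phi Phi^T = K_xu K_uu^{-1} K_ux (with a little jitter on K_uu).
Kuu = rbf_kernel(z, z) + 1e-6 * np.eye(M)
Kxu = rbf_kernel(x, z)
L = np.linalg.cholesky(Kuu)
Phi = np.linalg.solve(L, Kxu.T).T

alpha = low_rank_solve(Phi, noise_var, y)

# Dense reference solve with the same approximate Gram matrix, for comparison only.
alpha_dense = np.linalg.solve(Phi @ Phi.T + noise_var * np.eye(N), y)
```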

## 6 Vecchia factorisation

Approximate the precision matrix by one with a sparse Cholesky factorisation. See Vecchia factorization.
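A sketch of the idea, with an ordering and neighbour rule that I have chosen for simplicity: write the joint density as a product of conditionals $$p(y_i \mid y_{1:i-1})$$ and then condition each point on only a few of its predecessors. Each truncated conditional contributes one sparse row to the Cholesky factor of the approximate precision matrix. Here the inputs are ordered along a line and the conditioning set is the preceding `n_neighbours` points.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def vecchia_loglik(y, K, n_neighbours):
    """sum_i log p(y_i | y_c(i)), with c(i) the (at most) n_neighbours preceding indices."""
    total = 0.0
    for i in range(len(y)):
        c = np.arange(max(0, i - n_neighbours), i)
        if c.size == 0:
            mean, var = 0.0, K[i, i]
        else:
            # Gaussian conditioning on the small neighbour set only.
            w = np.linalg.solve(K[np.ix_(c, c)], K[c, i])
            mean = w @ y[c]
            var = K[i, i] - w @ K[c, i]
        total += norm.logpdf(y[i], loc=mean, scale=np.sqrt(var))
    return total

n = 40
x = np.linspace(0.0, 10.0, n)
# Exponential kernel plus observation noise (illustrative choice).
K = np.exp(-np.abs(x[:, None] - x[None, :])) + 0.1 * np.eye(n)
rng = np.random.default_rng(4)
y = rng.multivariate_normal(np.zeros(n), K)

exact = multivariate_normal(mean=np.zeros(n), cov=K).logpdf(y)
approx = vecchia_loglik(y, K, n_neighbours=5)      # cheap approximation
full = vecchia_loglik(y, K, n_neighbours=n - 1)    # conditions on everything
```

With `n_neighbours = n - 1` nothing is dropped and the chain rule recovers the exact log-likelihood. With a small neighbour set the cost is $$\mathcal{O}(N m^3)$$ for $$m$$ neighbours, at the price of an approximation.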

## 8 Latent Gaussian Process models

The setup of Bonilla, Krauth, and Dezfouli (2019) for Latent Gaussian Process models (“LGPMs”) goes as follows:

We are learning a mapping $$\boldsymbol{f}:\mathbb{R}^D\to\mathbb{R}^P$$ from data. The dataset looks like $$\mathcal{D}=\left\{\mathbf{x}_{n}, \mathbf{y}_{n}\right\}_{n=1}^{N}\equiv \left\{\mathbf{x}, \mathbf{y}\right\},$$ where $$\mathbf{x}_{n}\in \mathbb{R}^D$$ is an input vector and $$\mathbf{y}_{n}\in\mathbb{R}^P$$ is an output. We decree that the mapping from inputs to outputs may be expressed by $$Q$$ underlying latent functions $$\left\{f_{j}\right\}_{j=1}^{Q}.$$ We assume that the $$Q$$ latent functions $$\left\{f_{j}\right\}$$ are drawn from (a priori) independent zero-mean Gaussian processes.

$$
\begin{aligned}
p\left(f_{j} \mid \boldsymbol{\theta}_{j}\right) & \sim \mathcal{GP}\left(0, \kappa_{j}\left(\cdot, \cdot ; \boldsymbol{\theta}_{j}\right)\right), \quad j=1, \ldots, Q, \quad \text{and} \\
p(\mathbf{f} \mid \boldsymbol{\theta}) &=\prod_{j=1}^{Q} p\left(\mathbf{f}_{\cdot j} \mid \boldsymbol{\theta}_{j}\right) \\
&=\prod_{j=1}^{Q} \mathcal{N}\left(\mathbf{f}_{\cdot j} ; \mathbf{0}, \mathbf{K}_{\mathbf{x x}}^{j}\right).
\end{aligned}
$$

Here $$\mathbf{f}$$ is the set of all latent function values; $$\mathbf{f}_{\cdot j}=\left\{f_{j}\left(\mathbf{x}_{n}\right)\right\}_{n=1}^{N}$$ denotes the values of latent function $$j$$. The Gram matrix $$\mathbf{K}_{\mathbf{x x}}^{j}$$ is induced by the covariance kernel $$\kappa_{j}\left(\cdot, \cdot ; \boldsymbol{\theta}_{j}\right)$$. We call the parameters of all kernel functions $$\boldsymbol{\theta}=\left\{\boldsymbol{\theta}_{j}\right\}.$$

Our observation model can have various likelihoods; we call the corresponding parameter $$\boldsymbol{\phi}$$. We assume that our multi-dimensional observations $$\left\{\mathbf{y}_{n}\right\}$$ are i.i.d. given the latent functions $$\left\{\mathbf{f}_{n \cdot}\right\},$$ so that

$$
p(\mathbf{y} \mid \mathbf{f}, \boldsymbol{\phi})=\prod_{n=1}^{N} p\left(\mathbf{y}_{n} \mid \mathbf{f}_{n \cdot}, \boldsymbol{\phi}\right).
$$

Here $$\mathbf{f}_{n\cdot}=\{f_{j}(\mathbf{x}_n)\}_{j=1}^{Q}$$ is the set of latent function values upon which $$\mathbf{y}_{n}$$ depends.

There are several factorisations to note here:

1. The prior is factored over latent functions, i.e. per latent coordinate $$j$$.
2. The conditional likelihood is factored over observations (i.e. the noise is independent across observations).
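Both factorisations can be checked numerically in a toy case. The sketch below makes several assumptions for illustration only: RBF kernels with made-up lengthscales, and a Gaussian likelihood $$\mathbf{y}_n \sim \mathcal{N}(\mathbf{W}\mathbf{f}_{n\cdot}, \sigma^2 \mathbf{I})$$ with a hypothetical mixing matrix `W`. The latent values are stored as an array `f` with `f[n, j]` equal to $$f_j(\mathbf{x}_n)$$.

```python
import numpy as np
from scipy.linalg import block_diag
from scipy.stats import multivariate_normal, norm

def rbf_gram(x, lengthscale):
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / lengthscale**2) + 1e-6 * np.eye(len(x))

rng = np.random.default_rng(5)
N, D, Q, P = 6, 2, 3, 2
x = rng.normal(size=(N, D))
lengthscales = [0.3, 0.7, 1.2]                 # stand-ins for theta_j
grams = [rbf_gram(x, ell) for ell in lengthscales]
f = rng.normal(size=(N, Q))                    # f[n, j] = f_j(x_n)

# Factorisation 1: the prior is a product over the Q latent functions.
log_prior_factored = sum(
    multivariate_normal(np.zeros(N), grams[j]).logpdf(f[:, j]) for j in range(Q)
)
# Equivalent joint Gaussian over all N * Q values, with block-diagonal covariance.
log_prior_joint = multivariate_normal(
    np.zeros(N * Q), block_diag(*grams)
).logpdf(f.T.reshape(-1))

# Factorisation 2: the likelihood is a product over the N observations,
# and y_n depends on the latent values only through the row f[n, :].
W = rng.normal(size=(P, Q))                    # hypothetical mixing matrix
sigma = 0.5
y = f @ W.T + sigma * rng.normal(size=(N, P))
log_lik_terms = norm.logpdf(y, loc=f @ W.T, scale=sigma).sum(axis=1)  # one term per n
log_lik = log_lik_terms.sum()
```

The factored prior costs $$Q$$ solves of size $$N$$ rather than one of size $$NQ$$, and the per-observation likelihood terms are what later allow the expectations in the ELBO to be taken one observation at a time.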

If we further choose a variational approximation that itself factorises in some way, e.g. a mixture of Gaussians, this is going to work out well for us when we later try to devise a system of inference that maximises the ELBO. TBC.

For now, though, let us examine exactly tractable inference.

## 9 References

Ameli, and Shadden. 2023. Applied Mathematics and Computation.
Barfoot. 2020.
Bonilla, Edwin V. 2017. “Variational Learning of GP Models.”
Bonilla, Edwin V., Chai, and Williams. 2007. In Proceedings of the 20th International Conference on Neural Information Processing Systems. NIPS’07.
Bonilla, Edwin V., Krauth, and Dezfouli. 2019. Journal of Machine Learning Research.
Bruinsma, Perim, Tebbutt, et al. 2020. In International Conference on Machine Learning.
Dahl, and Bonilla. 2017. In Data Analytics for Renewable Energy Integration: Informing the Generation and Distribution of Renewable Energy. Lecture Notes in Computer Science.
Dahl, and Bonilla. 2019. arXiv:1903.03986 [Cs, Stat].
Dezfouli, and Bonilla. 2015. In Advances in Neural Information Processing Systems 28. NIPS’15.
Dutordoir, Hensman, van der Wilk, et al. 2021. In arXiv:2105.04504 [Cs, Stat].
Galy-Fajou, Perrone, and Opper. 2021. Entropy.
Hensman, Fusi, and Lawrence. 2013. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence. UAI’13.
Krauth, Bonilla, Cutajar, et al. 2016. In UAI17.
Lázaro-Gredilla, and Figueiras-Vidal. 2009. In Advances in Neural Information Processing Systems.
Leibfried, Dutordoir, John, et al. 2022.
Lemercier, Salvi, Cass, et al. 2021. In Proceedings of the 38th International Conference on Machine Learning.
Matthews. 2017.
Meanti, Carratino, Rosasco, et al. 2020. In Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS’20.
Nguyen, and Bonilla. 2014. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1. NIPS’14.
Nowak, and Litvinenko. 2013. Mathematical Geosciences.
Qi, Abdel-Gawad, and Minka. 2010. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence. UAI’10.
Ritter, Kukla, Zhang, et al. 2021. arXiv:2105.14594 [Cs, Stat].
Rossi, Heinonen, Bonilla, et al. 2021. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics.
Saatçi, Turner, and Rasmussen. 2010. In Proceedings of the 27th International Conference on International Conference on Machine Learning. ICML’10.
Salimbeni, and Deisenroth. 2017. In Advances In Neural Information Processing Systems.
Shi, Titsias, and Mnih. 2020. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics.
Snelson, and Ghahramani. 2007. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics.
Solin, and Särkkä. 2020. Statistics and Computing.
Tiao, Dutordoir, and Picheny. 2023.
Titsias. 2009. In International Conference on Artificial Intelligence and Statistics.
Wilson, Knowles, and Ghahramani. 2012. In Proceedings of the 29th International Coference on International Conference on Machine Learning. ICML’12.
Zammit-Mangion, and Cressie. 2021. Journal of Statistical Software.
Zhang, Yufeng, Liu, Chen, et al. 2022.
Zhang, Rui, Walder, Bonilla, et al. 2020. In Proceedings of NeurIPS 2020.