Efficient factoring of GP likelihoods



There are many ways to cleverly slice up GP likelihoods so that inference is cheap.

This page is about some of them, especially the union of sparse and variational tricks. Scalable Gaussian process regressions choose cunning factorisations such that the model collapses down to a lower-dimensional thing than it might have seemed to need, at least approximately. There is a comptilation of tricks to make this go β€” variational approximations a model, sparse GP models where there are a small number of inducing points (Dezfouli and Bonilla 2015; Edwin V. Bonilla, Krauth, and Dezfouli 2019; Krauth et al. 2016; Hensman, Fusi, and Lawrence 2013; Salimbeni and Deisenroth 2017). You might suspect yourself of using such a method if you find that some important high-dimensional expectation can be evaluated by some function of univariate Gaussians.

This is a related notion to other tricks which factorise a distribution cleverly, such message-passing inference. There are indeed a lot of different factorisations that can be done here; See filtering GPs for one which factorizes over a single input axis. Also Toeplitz and related structures work out nicely for, e.g. lattice-distributed inputs and some other situations I forget right now. I will bet you they can all be used together.

Inducing variables

Sparse Gaussian processes are ones where you approximate the target posterior by summarising the data with a short list of inducing points that have nearly the same similar posterior density, by some metric. These have been invented in various forms by various people, but most of the differences we can ignore. Based on citations, thethe Right Way is the version of Titsias (2009), which we use in all the subsequent things as a building-block.

Here is my explanation, based on Hensman, Fusi, and Lawrence (2013) summarizing Titsias (2009).

We consider an output vector \(\mathbf{y},\) where each entry \(y_{i}\) is a noisy observation of the function \(f,\) i.e. \(y_{i}=f(\mathbf{x}_{i})+\varepsilon_{i}\) for all the points \(\mathbf{X}=\left\{\mathbf{x}_{i}\right\}_{i=1}^{n}.\) We assume the noises \(\varepsilon_{i}\sim\mathcal{N}(0,1/\beta)\) to be independent. We introduce a Gaussian process prior over \(f(\cdot).\) Now let the vector \(\mathbf{f}=\{f(x_{i})\}_{i=1}^{n}\) contain values of the function evaluated at the inputs. We introduce a set of inducing variables, defined similarly: let the vector \(\mathbf{u}=\{f(\mathbf{z}_{i})\}_{i=1}^{m}\). Standard properties of Gaussian processes allow us to write \[ \begin{aligned} p(\mathbf{y} \mid \mathbf{f}) &=\mathcal{N}\left(\mathbf{y} \mid \mathbf{f}, \beta^{-1} \mathbf{I}\right) \\ p(\mathbf{f} \mid \mathbf{u}) &=\mathcal{N}\left(\mathbf{f} \mid \mathbf{K}_{n m} \mathbf{K}_{m m}^{-1} \mathbf{u}, \tilde{\mathbf{K}}\right) \\ p(\mathbf{u}) &=\mathcal{N}\left(\mathbf{u} \mid \mathbf{0}, \mathbf{K}_{m m}\right) \end{aligned} \] where \(\mathbf{K}_{m m}\) is the covariance function evaluated between all the inducing points and \(\mathbf{K}_{n m}\) is the covariance function between all inducing points and training points and we have defined with \(\tilde{\mathbf{K}}=\mathbf{K}_{n n}-\) \(\mathbf{K}_{n m} \mathbf{K}_{m m}^{-1} \mathbf{K}_{m n}\)

Quoting them:

We first apply Jensen’s inequality on the conditional probability \(p(\mathbf{y} \mid \mathbf{u})\) \[ \begin{aligned} \log p(\mathbf{y} \mid \mathbf{u}) &=\log \langle p(\mathbf{y} \mid \mathbf{f})\rangle_{p(\mathbf{f} \mid \mathbf{u})} \\ & \geq\langle\log p(\mathbf{y} \mid \mathbf{f})\rangle_{p(\mathbf{f} \mid \mathbf{u})} \triangleq \mathcal{L}_{1} \end{aligned} \] where \(\langle\cdot\rangle_{p(x)}\) denotes an expectation under \(p(x).\) For Gaussian noise taking the expectation inside the log is tractable, but it results in an expression containing \(\mathbf{K}_{n n}^{-1},\) which has a computational complexity of \(\mathcal{O}\left(n^{3}\right).\) Bringing the expectation outside the log gives a lower bound, \(\mathcal{L}_{1},\) which can be computed with has complexity \(\mathcal{O}\left(m^{3}\right).\) Further, when \(p(\mathbf{y} \mid \mathbf{f})\) factorises across the data, \[ p(\mathbf{y} \mid \mathbf{f})=\prod_{i=1}^{n} p\left(y_{i} \mid f_{i}\right) \] then this lower bound can be shown to be separable across y giving \[ \exp \left(\mathcal{L}_{1}\right)=\prod_{i=1}^{n} \mathcal{N}\left(y_{i} \mid \mu_{i}, \beta^{-1}\right) \exp \left(-\frac{1}{2} \beta \tilde{k}_{i, i}\right) \] where \(\boldsymbol{\mu}=\mathbf{K}_{n m} \mathbf{K}_{m m}^{-1} \mathbf{u}\) and \(\tilde{k}_{i, i}\) is the \(i\) th diagonal element of \(\widetilde{\mathbf{K}}\). Note that the difference between our bound and the original log likelihood is given by the Kullback Leibler (KL) divergence between the posterior over the mapping function given the data and the inducing variables and the posterior of the mapping function given the inducing variables only, \[ \mathrm{KL}(p(\mathbf{f} \mid \mathbf{u}) \| p(\mathbf{f} \mid \mathbf{u}, \mathbf{y})) \] This KL divergence is minimized when there are \(m=n\) inducing variables and they are placed at the training data locations. This means that \(\mathbf{u}=\mathbf{f}\), \(\mathbf{K}_{m m}=\mathbf{K}_{n m}=\mathbf{K}_{n n}\) meaning that \(\tilde{\mathbf{K}}=\mathbf{0} .\) In this case we recover \(\exp \left(\mathcal{L}_{1}\right)=p(\mathbf{y} \mid \mathbf{f})\) and the bound becomes equality because \(p(\mathbf{f} \mid \mathbf{u})\) is degenerate. However, since \(m=n\) and that there would be no computational or storage advantage from the representation. When \(m<n\) the bound can be maximised with respect to \(\mathbf{Z}\) (which are variational parameters). This minimises the KL divergence and ensures that \(\mathbf{Z}\) are distributed amongst the training data \(\mathbf{X}\) such that all \(\tilde{k}_{i, i}\) are small. In practice this means that the expectations in (1) are only taken across a narrow domain (\(\tilde{k}_{i, i}\) is the marginal variance of \(p\left(f_{i} \mid \mathbf{u}\right)\) ), keeping Jensen’s bound tight.

…we recover the bound of Titsias (2009) by marginalising the inducing variables, \[ \begin{aligned} \log p(\mathbf{y} \mid \mathbf{X}) &=\log \int p(\mathbf{y} \mid \mathbf{u}) p(\mathbf{u}) \mathrm{d} \mathbf{u} \\ & \geq \log \int \exp \left\{\mathcal{L}_{1}\right\} p(\mathbf{u}) \mathrm{d} \mathbf{u} \triangleq \mathcal{L}_{2} \end{aligned} \] which with some linear algebraic manipulation leads to \[ \mathcal{L}_{2}=\log \mathcal{N}\left(\mathbf{y} \mid \mathbf{0}, \mathbf{K}_{n m} \mathbf{K}_{m m}^{-1} \mathbf{K}_{m n}+\beta^{-1} \mathbf{I}\right)-\frac{1}{2} \beta \operatorname{tr}(\widetilde{\mathbf{K}}) \] matching the result of Titsias, with the implicit approximating distribution \(q(\mathbf{u})\) having precision \[ \mathbf{\Lambda}=\beta \mathbf{K}_{m m}^{-1} \mathbf{K}_{m n} \mathbf{K}_{n m} \mathbf{K}_{m m}^{-1}+\mathbf{K}_{m m}^{-1} \] and mean \[ \hat{\mathbf{u}}=\beta \mathbf{\Lambda}^{-1} \mathbf{K}_{m m}^{-1} \mathbf{K}_{m n} \mathbf{y} \]

Spectral and rank sparsity

Loosely speaking, where the functions can be represented in a small number of basis functions. See, for, example (Adam et al. 2020; Zammit-Mangion and Cressie 2021).

SVI for Gaussian processes

As seen in Hensman, Fusi, and Lawrence (2013);Salimbeni and Deisenroth (2017).

Low rank methods

Represent the GP in terms of a controlled budget of basis functions. See low-rank Gaussian processes.

Vecchia factorisation

Approximate the precision matrix by one with a sparse cholesky factorisation. See Vecchia factorization.

Latent Gaussian Process models

Here is the Edwin V. Bonilla, Krauth, and Dezfouli (2019) set up for Latent Gaussian Process models (β€œLGPMs”).

We are learning a mapping \(\boldsymbol{f}:\mathbb{R}^D\to\mathbb{R}^P\) from data. The dataset looks like \(\mathcal{D}=\left\{\mathbf{x}_{n}, \mathbf{y}_{n}\right\}_{n=1}^{N}\equiv \left\{\mathbf{x}, \mathbf{y}\right\}.\) \(\mathbf{x}_{n}\in \mathbb{R}^D\) is an input vector and \(\mathbf{y}_{n}\in\mathbb{R}^P\) is an output. We decree that the mapping from inputs to outputs, may be expressed by \(Q\) underlying latent functions \(\left\{f_{j}\right\}_{j=1}^{Q}.\) We assume that the \(Q\) latent functions \(\left\{f_{j}\right\}\) are drawn from (a priori) independent zero-mean Gaussian processes.

\[ \begin{aligned} p\left(f_{j} \mid \boldsymbol{\theta}_{j}\right) & \sim \mathcal{G} \mathcal{P}\left(0, \kappa_{j}\left(\cdot, \cdot ; \boldsymbol{\theta}_{j}\right)\right), \quad j=1, \ldots Q, \quad \text { and } \\ p(\mathbf{f} \mid \boldsymbol{\theta}) &=\prod_{j=1}^{Q} p\left(\mathbf{f}_{\cdot j} \mid \boldsymbol{\theta}_{j}\right) \\ &=\prod_{j=1}^{Q} \mathcal{N}\left(\mathbf{f}_{\cdot j} ; \mathbf{0}, \mathbf{K}_{\mathbf{x x}}^{j}\right). \end{aligned} \] Here \(\mathbf{f}\) is the set of all latent function values; \(\mathbf{f}_{\cdot j}=\left\{f_{j}\left(\mathbf{x}_{n}\right)\right\}_{n=1}^{N}\) denotes the values of latent function \(j\). The Gram matrix is \(\mathbf{K}_{\mathrm{xx}}^{j}\), induced by a covariance kernel, \(\kappa_{j}\left(\cdot, \cdot ; \boldsymbol{\theta}_{j}\right)\). The parameters of all kernel functions we call \(\boldsymbol{\theta}=\left\{\boldsymbol{\theta}_{j}\right\}.\) Our observation model can have various likelihoods; We call the corresponding parameter \(\boldsymbol{\phi}\). We assume that our multi-dimensional observations \(\left\{\mathbf{y}_{n}\right\}\) are i.i.d. given the latent functions \(\left\{\mathbf{f}_{n}\right\},\) so that \[ p(\mathbf{y} \mid \mathbf{f}, \boldsymbol{\phi})=\prod_{n=1}^{N} p\left(\mathbf{y}_{n} \mid \mathbf{f}_{n \cdot}, \boldsymbol{\phi}\right) \] \(\mathbf{f}_{n\cdot}=\{f_{j}(\boldsymbol{x}_n)\}_{j=1}^{q}\) is the set of latent \(\boldsymbol{f}\) values upon which \(\mathbf{y}_{n}\) depends.

There are several factorizations to note here

  1. The prior is factored into latent functions per-coordinate
  2. the conditional likelihood is factored over observations (i.e. nosie in independent)

If we further factorise the variational approximation in some way this will work out nicely, e.g. into Gaussian mixtures. This is going to work out well for us when we try to devise a system of inference later to minimise the ELBO. TBC.

For now, though, let us examine exactly tractable inference

References

Adam, Vincent, Stefanos Eleftheriadis, Nicolas Durrande, Artem Artemev, and James Hensman. 2020. β€œDoubly Sparse Variational Gaussian Processes.” In AISTATS.
Bonilla, Edwin V. 2017. β€œVariational Learning of GP Models.” October 11.
Bonilla, Edwin V., Kian Ming A. Chai, and Christopher K. I. Williams. 2007. β€œMulti-Task Gaussian Process Prediction.” In Proceedings of the 20th International Conference on Neural Information Processing Systems, 153–60. NIPS’07. USA: Curran Associates Inc.
Bonilla, Edwin V., Karl Krauth, and Amir Dezfouli. 2019. β€œGeneric Inference in Latent Gaussian Process Models.” Journal of Machine Learning Research 20 (117): 1–63.
Bruinsma, Wessel, Eric Perim, William Tebbutt, Scott Hosking, Arno Solin, and Richard Turner. 2020. β€œScalable Exact Inference in Multi-Output Gaussian Processes.” In International Conference on Machine Learning, 1190–1201. PMLR.
Dahl, Astrid, and Edwin Bonilla. 2017. β€œScalable Gaussian Process Models for Solar Power Forecasting.” In Data Analytics for Renewable Energy Integration: Informing the Generation and Distribution of Renewable Energy, edited by Wei Lee Woon, Zeyar Aung, Oliver Kramer, and Stuart Madnick, 94–106. Lecture Notes in Computer Science. Cham: Springer International Publishing.
Dahl, Astrid, and Edwin V. Bonilla. 2019. β€œSparse Grouped Gaussian Processes for Solar Power Forecasting.” arXiv:1903.03986 [Cs, Stat], March.
Dezfouli, Amir, and Edwin V. Bonilla. 2015. β€œScalable Inference for Gaussian Process Models with Black-Box Likelihoods.” In Advances in Neural Information Processing Systems 28, 1414–22. NIPS’15. Cambridge, MA, USA: MIT Press.
Galy-Fajou, ThΓ©o, Valerio Perrone, and Manfred Opper. 2021. β€œFlexible and Efficient Inference with Particles for the Variational Gaussian Approximation.” Entropy 23 (8): 990.
Hensman, James, NicolΓ² Fusi, and Neil D. Lawrence. 2013. β€œGaussian Processes for Big Data.” In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, 282–90. UAI’13. Arlington, Virginia, USA: AUAI Press.
Krauth, Karl, Edwin V. Bonilla, Kurt Cutajar, and Maurizio Filippone. 2016. β€œAutoGP: Exploring the Capabilities and Limitations of Gaussian Process Models.” In UAI17.
Leibfried, Felix, Vincent Dutordoir, S. T. John, and Nicolas Durrande. 2021. β€œA Tutorial on Sparse Gaussian Processes and Variational Inference.” arXiv:2012.13962 [Cs, Stat], June.
Lemercier, Maud, Cristopher Salvi, Thomas Cass, Edwin V. Bonilla, Theodoros Damoulas, and Terry J. Lyons. 2021. β€œSigGPDE: Scaling Sparse Gaussian Processes on Sequential Data.” In.
Matthews, Alexander Graeme de Garis. 2017. β€œScalable Gaussian Process Inference Using Variational Methods.” Thesis, University of Cambridge.
Meanti, Giacomo. n.d. β€œKernel Methods Through the Roof: Handling Billions of Points Efficiently,” 33.
Nguyen, Trung V., and Edwin V. Bonilla. 2014. β€œAutomated Variational Inference for Gaussian Process Models.” In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, 1404–12. NIPS’14. Cambridge, MA, USA: MIT Press.
Nowak, W., and A. Litvinenko. 2013. β€œKriging and Spatial Design Accelerated by Orders of Magnitude: Combining Low-Rank Covariance Approximations with FFT-Techniques.” Mathematical Geosciences 45 (4): 411–35.
Ritter, Hippolyt, Martin Kukla, Cheng Zhang, and Yingzhen Li. 2021. β€œSparse Uncertainty Representation in Deep Learning with Inducing Weights.” arXiv:2105.14594 [Cs, Stat], May.
Rossi, Simone, Markus Heinonen, Edwin V. Bonilla, Zheyang Shen, and Maurizio Filippone. 2020. β€œRethinking Sparse Gaussian Processes: Bayesian Approaches to Inducing-Variable Approximations,” March.
Rossi, Simone, Markus Heinonen, Edwin Bonilla, Zheyang Shen, and Maurizio Filippone. 2021. β€œSparse Gaussian Processes Revisited: Bayesian Approaches to Inducing-Variable Approximations.” In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, 1837–45. PMLR.
SaatΓ§i, Yunus, Ryan Turner, and Carl Edward Rasmussen. 2010. β€œGaussian Process Change Point Models.” In Proceedings of the 27th International Conference on International Conference on Machine Learning, 927–34. ICML’10. Madison, WI, USA: Omnipress.
Salimbeni, Hugh, and Marc Deisenroth. 2017. β€œDoubly Stochastic Variational Inference for Deep Gaussian Processes.” In Advances In Neural Information Processing Systems.
Titsias, Michalis K. 2009. β€œVariational Learning of Inducing Variables in Sparse Gaussian Processes.” In International Conference on Artificial Intelligence and Statistics, 567–74. PMLR.
Wilson, Andrew Gordon, David A. Knowles, and Zoubin Ghahramani. 2012. β€œGaussian Process Regression Networks.” In Proceedings of the 29th International Coference on International Conference on Machine Learning, 1139–46. ICML’12. Madison, WI, USA: Omnipress.
Zammit-Mangion, Andrew, and Noel Cressie. 2021. β€œFRK: An R Package for Spatial and Spatio-Temporal Prediction with Large Datasets.” Journal of Statistical Software 98 (May): 1–48.
Zhang, Rui, Christian Walder, Edwin V. Bonilla, Marian-Andrei Rizoiu, and Lexing Xie. 2020. β€œQuantile Propagation for Wasserstein-Approximate Gaussian Processes.” In Proceedings of NeurIPS 2020.
Zhang, Yufeng, Wanwei Liu, Zhenbang Chen, Ji Wang, and Kenli Li. 2022. β€œOn the Properties of Kullback-Leibler Divergence Between Multivariate Gaussian Distributions.” arXiv.

No comments yet. Why not leave one?

GitHub-flavored Markdown & a sane subset of HTML is supported.