GP inducing variables

2020-10-16 — 2024-11-26

Wherein sparse Gaussian-process posteriors are approximated by inducing variables, Jensen’s inequality is applied to obtain a variational lower bound computable in O(m^3), and a trace term penalises residual covariance.

algebra

approximation

functional analysis

Gaussian

generative

graphical models

Hilbert space

kernel tricks

machine learning

optimization

probability

statistics

A classic trick in factoring GP likelihoods. Sparse Gaussian processes approximate the target posterior by summarising the data with a short list of inducing points that have nearly the same posterior density, by some metric; i.e. this is how data summarization works in the context of Gaussian processes. As with many general concepts specialised upon Gaussian processes, data summarisation ends up being especially feasible in this context and pairs nicely with variational inference.

These have been invented in various forms by various people, but most of the differences we can ignore. Based on citations, the Ground Zero for the Right Way is the version of Titsias (2009), which is the foundation stone of all subsequent methods. A summary of Titsias (2009) from Hensman, Fusi, and Lawrence (2013):

We consider an output vector \(\mathbf{y},\) where each entry \(y_{i}\) is a noisy observation of the function \(f,\) i.e. \(y_{i}=f(\mathbf{x}_{i})+\varepsilon_{i}\) for all the points \(\mathbf{X}=\left\{\mathbf{x}_{i}\right\}_{i=1}^{n}.\) We assume the noises \(\varepsilon_{i}\sim\mathcal{N}(0,1/\beta)\) to be independent. We introduce a Gaussian process prior over \(f(\cdot).\) Now let the vector \(\mathbf{f}=\{f(x_{i})\}_{i=1}^{n}\) contain values of the function evaluated at the inputs. We introduce a set of inducing variables, defined similarly: let the vector \(\mathbf{u}=\{f(\mathbf{z}_{i})\}_{i=1}^{m}\). Standard properties of Gaussian processes allow us to write \[ \begin{aligned} p(\mathbf{y} \mid \mathbf{f}) &=\mathcal{N}\left(\mathbf{y} \mid \mathbf{f}, \beta^{-1} \mathbf{I}\right) \\ p(\mathbf{f} \mid \mathbf{u}) &=\mathcal{N}\left(\mathbf{f} \mid \mathbf{K}_{n m} \mathbf{K}_{m m}^{-1} \mathbf{u}, \tilde{\mathbf{K}}\right) \\ p(\mathbf{u}) &=\mathcal{N}\left(\mathbf{u} \mid \mathbf{0}, \mathbf{K}_{m m}\right) \end{aligned} \] where \(\mathbf{K}_{m m}\) is the covariance function evaluated between all the inducing points and \(\mathbf{K}_{n m}\) is the covariance function between all inducing points and training points and we have defined \(\tilde{\mathbf{K}}=\mathbf{K}_{n n}-\mathbf{K}_{n m} \mathbf{K}_{m m}^{-1} \mathbf{K}_{m n}\)

Quoting them:

We first apply Jensen’s inequality on the conditional probability \(p(\mathbf{y} \mid \mathbf{u})\) \[ \begin{aligned} \log p(\mathbf{y} \mid \mathbf{u}) &=\log \langle p(\mathbf{y} \mid \mathbf{f})\rangle_{p(\mathbf{f} \mid \mathbf{u})} \\ & \geq\langle\log p(\mathbf{y} \mid \mathbf{f})\rangle_{p(\mathbf{f} \mid \mathbf{u})} \triangleq \mathcal{L}_{1} \end{aligned} \] where \(\langle\cdot\rangle_{p(x)}\) denotes an expectation under \(p(x).\) For Gaussian noise, taking the expectation inside the log is tractable, but it results in an expression containing \(\mathbf{K}_{n n}^{-1},\) which has a computational complexity of \(\mathcal{O}\left(n^{3}\right).\) Bringing the expectation outside the log gives a lower bound, \(\mathcal{L}_{1},\) which can be computed with complexity \(\mathcal{O}\left(m^{3}\right).\) Further, when \(p(\mathbf{y} \mid \mathbf{f})\) factorises across the data, \[ p(\mathbf{y} \mid \mathbf{f})=\prod_{i=1}^{n} p\left(y_{i} \mid f_{i}\right) \] then this lower bound can be shown to be separable across y giving \[ \exp \left(\mathcal{L}_{1}\right)=\prod_{i=1}^{n} \mathcal{N}\left(y_{i} \mid \mu_{i}, \beta^{-1}\right) \exp \left(-\frac{1}{2} \beta \tilde{k}_{i, i}\right) \] where \(\boldsymbol{\mu}=\mathbf{K}_{n m} \mathbf{K}_{m m}^{-1} \mathbf{u}\) and \(\tilde{k}_{i, i}\) is the \(i\)th diagonal element of \(\widetilde{\mathbf{K}}\). Note that the difference between our bound and the original log likelihood is given by the Kullback-Leibler (KL) divergence between the posterior over the mapping function given the data and the inducing variables and the posterior of the mapping function given the inducing variables only, \[ \mathrm{KL}(p(\mathbf{f} \mid \mathbf{u}) \| p(\mathbf{f} \mid \mathbf{u}, \mathbf{y})) \] This KL divergence is minimized when there are \(m=n\) inducing variables and they are placed at the training data locations. This means that \(\mathbf{u}=\mathbf{f}\), \(\mathbf{K}_{m m}=\mathbf{K}_{n m}=\mathbf{K}_{n n}\) meaning that \(\tilde{\mathbf{K}}=\mathbf{0}.\) In this case we recover \(\exp \left(\mathcal{L}_{1}\right)=p(\mathbf{y} \mid \mathbf{f})\) and the bound becomes equality because \(p(\mathbf{f} \mid \mathbf{u})\) is degenerate. However, since \(m=n\) there would be no computational or storage advantage from the representation. When \(m<n\) the bound can be maximised with respect to \(\mathbf{Z}\) (which are variational parameters). This minimises the KL divergence and ensures that \(\mathbf{Z}\) are distributed amongst the training data \(\mathbf{X}\) such that all \(\tilde{k}_{i, i}\) are small. In practice this means that the expectations in (1) are only taken across a narrow domain (\(\tilde{k}_{i, i}\) is the marginal variance of \(p\left(f_{i} \mid \mathbf{u}\right)\) ), keeping Jensen’s bound tight.

…we recover the bound of Titsias (2009) by marginalising the inducing variables, \[ \begin{aligned} \log p(\mathbf{y} \mid \mathbf{X}) &=\log \int p(\mathbf{y} \mid \mathbf{u}) p(\mathbf{u}) \mathrm{d} \mathbf{u} \\ & \geq \log \int \exp \left\{\mathcal{L}_{1}\right\} p(\mathbf{u}) \mathrm{d} \mathbf{u} \triangleq \mathcal{L}_{2} \end{aligned} \] which with some linear algebraic manipulation leads to \[ \mathcal{L}_{2}=\log \mathcal{N}\left(\mathbf{y} \mid \mathbf{0}, \mathbf{K}_{n m} \mathbf{K}_{m m}^{-1} \mathbf{K}_{m n}+\beta^{-1} \mathbf{I}\right)-\frac{1}{2} \beta \operatorname{tr}(\widetilde{\mathbf{K}}) \] matching the result of Titsias, with the implicit approximating distribution \(q(\mathbf{u})\) having precision \[ \mathbf{\Lambda}=\beta \mathbf{K}_{m m}^{-1} \mathbf{K}_{m n} \mathbf{K}_{n m} \mathbf{K}_{m m}^{-1}+\mathbf{K}_{m m}^{-1} \] and mean \[ \hat{\mathbf{u}}=\beta \mathbf{\Lambda}^{-1} \mathbf{K}_{m m}^{-1} \mathbf{K}_{m n} \mathbf{y} \]

1 References

Adam, Eleftheriadis, Durrande, et al. 2020. “Doubly Sparse Variational Gaussian Processes.” In AISTATS.

Hensman, Fusi, and Lawrence. 2013. “Gaussian Processes for Big Data.” In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence. UAI’13.

Lemercier, Salvi, Cass, et al. 2021. “SigGPDE: Scaling Sparse Gaussian Processes on Sequential Data.” In Proceedings of the 38th International Conference on Machine Learning.

Ritter, Kukla, Zhang, et al. 2021. “Sparse Uncertainty Representation in Deep Learning with Inducing Weights.” arXiv:2105.14594 [Cs, Stat].

Rossi, Heinonen, Bonilla, et al. 2021. “Sparse Gaussian Processes Revisited: Bayesian Approaches to Inducing-Variable Approximations.” In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics.

Titsias. 2009. “Variational Learning of Inducing Variables in Sparse Gaussian Processes.” In International Conference on Artificial Intelligence and Statistics.