# GP inducing variables

October 16, 2020 — October 26, 2020

algebra
approximation
Gaussian
generative
graphical models
Hilbert space
kernel tricks
machine learning
networks
optimization
probability
statistics

A classic trick for factoring GP likelihoods.

Sparse Gaussian processes are ones where you approximate the target posterior by summarising the data with a short list of *inducing points* whose posterior density is close, by some metric, to the full one. These have been invented in various forms by various people, but we can ignore most of the differences. Judging by citations, the Right Way is the version of Titsias (2009), which we use as a building block in everything that follows.

An explanation, based on Hensman, Fusi, and Lawrence (2013) summarizing Titsias (2009):

We consider an output vector $$\mathbf{y},$$ where each entry $$y_{i}$$ is a noisy observation of the function $$f,$$ i.e. $$y_{i}=f(\mathbf{x}_{i})+\varepsilon_{i},$$ at each of the points $$\mathbf{X}=\left\{\mathbf{x}_{i}\right\}_{i=1}^{n}.$$ We assume the noises $$\varepsilon_{i}\sim\mathcal{N}(0,1/\beta)$$ are independent. We introduce a Gaussian process prior over $$f(\cdot)$$ and let the vector $$\mathbf{f}=\{f(\mathbf{x}_{i})\}_{i=1}^{n}$$ contain the values of the function evaluated at the inputs. We introduce a set of inducing variables, defined similarly: the vector $$\mathbf{u}=\{f(\mathbf{z}_{i})\}_{i=1}^{m}$$ contains the values of the function at a set of inducing inputs $$\mathbf{Z}=\{\mathbf{z}_{i}\}_{i=1}^{m}.$$ Standard properties of Gaussian processes allow us to write $$\begin{aligned} p(\mathbf{y} \mid \mathbf{f}) &=\mathcal{N}\left(\mathbf{y} \mid \mathbf{f}, \beta^{-1} \mathbf{I}\right) \\ p(\mathbf{f} \mid \mathbf{u}) &=\mathcal{N}\left(\mathbf{f} \mid \mathbf{K}_{n m} \mathbf{K}_{m m}^{-1} \mathbf{u}, \tilde{\mathbf{K}}\right) \\ p(\mathbf{u}) &=\mathcal{N}\left(\mathbf{u} \mid \mathbf{0}, \mathbf{K}_{m m}\right) \end{aligned}$$ where $$\mathbf{K}_{m m}$$ is the covariance function evaluated between all the inducing points, $$\mathbf{K}_{n m}$$ is the covariance function between the training points and the inducing points, and we have defined $$\tilde{\mathbf{K}}=\mathbf{K}_{n n}-\mathbf{K}_{n m} \mathbf{K}_{m m}^{-1} \mathbf{K}_{m n}.$$
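To make the notation concrete, here is a minimal NumPy sketch (my own toy example, not from the paper) that builds $$\mathbf{K}_{nn},$$ $$\mathbf{K}_{nm},$$ $$\mathbf{K}_{mm}$$ for an assumed squared-exponential kernel and forms the conditional covariance $$\tilde{\mathbf{K}}$$:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))   # training inputs
Z = np.linspace(-3, 3, 7)[:, None]     # inducing inputs

K_nn = rbf(X, X)
K_nm = rbf(X, Z)
K_mm = rbf(Z, Z) + 1e-8 * np.eye(len(Z))  # jitter for numerical stability

# Conditional covariance K~ = K_nn - K_nm K_mm^{-1} K_mn
Ktilde = K_nn - K_nm @ np.linalg.solve(K_mm, K_nm.T)
```

$$\tilde{\mathbf{K}}$$ is positive semi-definite, and its diagonal shrinks toward zero wherever the inducing inputs cover the training inputs well, which is what the later tightness argument turns on.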

Quoting them:

We first apply Jensen’s inequality on the conditional probability $$p(\mathbf{y} \mid \mathbf{u})$$: $$\begin{aligned} \log p(\mathbf{y} \mid \mathbf{u}) &=\log \langle p(\mathbf{y} \mid \mathbf{f})\rangle_{p(\mathbf{f} \mid \mathbf{u})} \\ & \geq\langle\log p(\mathbf{y} \mid \mathbf{f})\rangle_{p(\mathbf{f} \mid \mathbf{u})} \triangleq \mathcal{L}_{1} \end{aligned}$$ where $$\langle\cdot\rangle_{p(x)}$$ denotes an expectation under $$p(x).$$ For Gaussian noise, taking the expectation inside the log is tractable, but it results in an expression containing $$\mathbf{K}_{n n}^{-1},$$ which has a computational complexity of $$\mathcal{O}\left(n^{3}\right).$$ Bringing the expectation outside the log gives a lower bound, $$\mathcal{L}_{1},$$ which can be computed with complexity $$\mathcal{O}\left(m^{3}\right).$$ Further, when $$p(\mathbf{y} \mid \mathbf{f})$$ factorises across the data, $$p(\mathbf{y} \mid \mathbf{f})=\prod_{i=1}^{n} p\left(y_{i} \mid f_{i}\right),$$ then this lower bound can be shown to be separable across $$\mathbf{y},$$ giving $$\exp \left(\mathcal{L}_{1}\right)=\prod_{i=1}^{n} \mathcal{N}\left(y_{i} \mid \mu_{i}, \beta^{-1}\right) \exp \left(-\frac{1}{2} \beta \tilde{k}_{i, i}\right)$$ where $$\boldsymbol{\mu}=\mathbf{K}_{n m} \mathbf{K}_{m m}^{-1} \mathbf{u}$$ and $$\tilde{k}_{i, i}$$ is the $$i$$th diagonal element of $$\tilde{\mathbf{K}}.$$ Note that the difference between our bound and the original log likelihood is given by the Kullback-Leibler (KL) divergence between the posterior over the mapping function given the inducing variables only and the posterior given both the data and the inducing variables, $$\mathrm{KL}(p(\mathbf{f} \mid \mathbf{u}) \| p(\mathbf{f} \mid \mathbf{u}, \mathbf{y})).$$ This KL divergence is minimized when there are $$m=n$$ inducing variables and they are placed at the training data locations.
This means that $$\mathbf{u}=\mathbf{f}$$ and $$\mathbf{K}_{m m}=\mathbf{K}_{n m}=\mathbf{K}_{n n},$$ so $$\tilde{\mathbf{K}}=\mathbf{0}.$$ In this case we recover $$\exp \left(\mathcal{L}_{1}\right)=p(\mathbf{y} \mid \mathbf{f})$$ and the bound becomes an equality because $$p(\mathbf{f} \mid \mathbf{u})$$ is degenerate. However, with $$m=n$$ there would be no computational or storage advantage from the representation. When $$m<n$$ the bound can be maximised with respect to $$\mathbf{Z}$$ (which are variational parameters). This minimises the KL divergence and ensures that $$\mathbf{Z}$$ are distributed amongst the training data $$\mathbf{X}$$ such that all $$\tilde{k}_{i, i}$$ are small. In practice this means that the expectations above are only taken across a narrow domain ($$\tilde{k}_{i, i}$$ is the marginal variance of $$p\left(f_{i} \mid \mathbf{u}\right)$$), keeping Jensen’s bound tight.
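Jensen's inequality here is easy to check numerically. A toy sketch under my own assumptions (squared-exponential kernel, arbitrary $$\mathbf{u}$$ and $$\mathbf{y}$$): the factorised bound $$\mathcal{L}_1$$ should never exceed the exact $$\log p(\mathbf{y} \mid \mathbf{u})=\log \mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}, \tilde{\mathbf{K}}+\beta^{-1}\mathbf{I}).$$

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def rbf(A, B, ell=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ell**2)

rng = np.random.default_rng(1)
n, m, beta = 40, 5, 4.0
X = rng.uniform(-3, 3, (n, 1))
Z = np.linspace(-3, 3, m)[:, None]
u = rng.normal(size=m)           # an arbitrary value of the inducing variables
y = rng.normal(size=n)           # placeholder observations

K_nm = rbf(X, Z)
K_mm = rbf(Z, Z) + 1e-8 * np.eye(m)
A = np.linalg.solve(K_mm, K_nm.T).T          # K_nm K_mm^{-1}
mu = A @ u                                   # conditional mean of f given u
Ktilde = rbf(X, X) - A @ K_nm.T

# Exact conditional likelihood: log p(y|u) = log N(y | mu, K~ + beta^{-1} I)
exact = mvn(mu, Ktilde + np.eye(n) / beta, allow_singular=True).logpdf(y)

# Jensen lower bound: L1 = sum_i [ log N(y_i | mu_i, beta^{-1}) - beta k~_ii / 2 ]
L1 = sum(mvn(mu[i], 1 / beta).logpdf(y[i]) - 0.5 * beta * Ktilde[i, i]
         for i in range(n))
```

The gap `exact - L1` shrinks as the diagonal of `Ktilde` shrinks, i.e. as the inducing inputs cover the training inputs, which is the tightness argument in prose above.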

…we recover the bound of Titsias (2009) by marginalising the inducing variables, $$\begin{aligned} \log p(\mathbf{y} \mid \mathbf{X}) &=\log \int p(\mathbf{y} \mid \mathbf{u}) p(\mathbf{u}) \mathrm{d} \mathbf{u} \\ & \geq \log \int \exp \left\{\mathcal{L}_{1}\right\} p(\mathbf{u}) \mathrm{d} \mathbf{u} \triangleq \mathcal{L}_{2} \end{aligned}$$ which with some linear algebraic manipulation leads to $$\mathcal{L}_{2}=\log \mathcal{N}\left(\mathbf{y} \mid \mathbf{0}, \mathbf{K}_{n m} \mathbf{K}_{m m}^{-1} \mathbf{K}_{m n}+\beta^{-1} \mathbf{I}\right)-\frac{1}{2} \beta \operatorname{tr}(\tilde{\mathbf{K}}),$$ matching the result of Titsias, with the implicit approximating distribution $$q(\mathbf{u})$$ having precision $$\mathbf{\Lambda}=\beta \mathbf{K}_{m m}^{-1} \mathbf{K}_{m n} \mathbf{K}_{n m} \mathbf{K}_{m m}^{-1}+\mathbf{K}_{m m}^{-1}$$ and mean $$\hat{\mathbf{u}}=\beta \mathbf{\Lambda}^{-1} \mathbf{K}_{m m}^{-1} \mathbf{K}_{m n} \mathbf{y}.$$
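The collapsed bound $$\mathcal{L}_{2}$$ is a few lines of linear algebra. A sketch under the same toy assumptions as before (squared-exponential kernel, arbitrary $$\mathbf{y}$$): it is always below the exact log marginal likelihood $$\log \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{K}_{nn}+\beta^{-1}\mathbf{I}),$$ and placing $$\mathbf{Z}=\mathbf{X}$$ recovers it.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def rbf(A, B, ell=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ell**2)

def titsias_bound(X, y, Z, beta, ell=1.0):
    """Collapsed lower bound L2 on the log marginal likelihood."""
    n, m = len(X), len(Z)
    K_nm = rbf(X, Z, ell)
    K_mm = rbf(Z, Z, ell) + 1e-8 * np.eye(m)      # jitter
    Q_nn = K_nm @ np.linalg.solve(K_mm, K_nm.T)   # K_nm K_mm^{-1} K_mn
    tr_Ktilde = np.trace(rbf(X, X, ell)) - np.trace(Q_nn)
    fit = mvn(np.zeros(n), Q_nn + np.eye(n) / beta).logpdf(y)
    return fit - 0.5 * beta * tr_Ktilde           # L2

rng = np.random.default_rng(2)
n, beta = 30, 4.0
X = rng.uniform(-3, 3, (n, 1))
y = rng.normal(size=n)

exact = mvn(np.zeros(n), rbf(X, X) + np.eye(n) / beta).logpdf(y)
L2 = titsias_bound(X, y, np.linspace(-3, 3, 6)[:, None], beta)
```

Note the cost: `titsias_bound` only ever inverts the $$m \times m$$ matrix $$\mathbf{K}_{mm},$$ which is the whole point of the construction (the dense `Q_nn` here is for clarity; production implementations keep everything in low-rank form).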