Gaussian Processes as stochastic differential equations
Imposing time on things
September 18, 2019 — November 25, 2021
🏗️🏗️🏗️ Under heavy construction 🏗️🏗️🏗️
Classic flavours together: Gaussian processes meet state filters and stochastic differential equations, i.e. random fields represented as stochastic differential equations.
Not covered here, although it shares the same keywords and is a distinct concept: using Gaussian processes to define the state-process dynamics or the observation distribution.
1 GP regression via state filtering
I am interested in the trick which makes certain Gaussian process regression problems soluble by making them local, i.e. Markov, with respect to some assumed hidden state, in much the same way that the Kalman filter localises the Wiener filter. This means you get to solve a GP regression problem as an SDE using a state filter.
The GP-filtering trick is explained in intro articles (Särkkä, Solin, and Hartikainen 2013; Lindgren, Bolin, and Rue 2021), based on various antecedents (O’Hagan 1978; S. Reece and Roberts 2010; Lindgren, Rue, and Lindström 2011; Särkkä and Hartikainen 2012; J. Hartikainen and Särkkä 2010; Solin 2016; Huber 2014), possibly also (Csató and Opper 2002). Aside: O’Hagan (1978) is an incredible paper that invented several research areas at once (GP regression, surrogate models for experiment design, as well as this one) and AFAICT no one noticed at the time. Whittle also did some foundational work, but I cannot find the original paper to read it.
The idea is that if the spectral density of your GP covariance kernel is (or can be well approximated by) a rational function, then the kernel can be factorised into a tractable state space model, using a duality between random fields and stochastic differential equations. That sounds simple enough conceptually; I wonder about the practice. Of course, when you want some complications, such as non-stationary kernels or hierarchical models, this state space inference trick gets more complicated, and posterior distributions are no longer so simple. But possibly it can still work. (This is a research interest of mine.)
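To make the duality concrete, here is a minimal sketch for the simplest case: the exponential (Matérn-1/2) kernel, whose state-space form is the Ornstein–Uhlenbeck SDE. This is my own toy in plain NumPy with illustrative variable names, not code from the cited papers. Batch GP regression and Kalman filtering plus RTS smoothing on the exactly discretised SDE give the same posterior mean at the training inputs, the former in O(n³), the latter in O(n).

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 10.0, size=200))      # irregular 1-D inputs
y = np.sin(t) + 0.3 * rng.standard_normal(t.size)  # noisy observations

ell, sigma2, noise_var = 1.5, 1.0, 0.09            # kernel and noise hyperparameters

# --- Batch GP regression, O(n^3): posterior mean at the training inputs ---
K = sigma2 * np.exp(-np.abs(t[:, None] - t[None, :]) / ell)
gp_mean = K @ np.linalg.solve(K + noise_var * np.eye(t.size), y)

# --- Same model as an SDE: df = -(1/ell) f dt + sqrt(2 sigma2 / ell) dW ---
# Exact discretisation between consecutive inputs, then Kalman filter + RTS smoother, O(n).
dt = np.diff(t, prepend=t[0])
A = np.exp(-dt / ell)                  # transition coefficients
Q = sigma2 * (1.0 - A**2)              # process-noise variances
m_f, P_f = np.zeros(t.size), np.zeros(t.size)
m, P = 0.0, sigma2                     # stationary prior as the initial state
for k in range(t.size):
    m, P = A[k] * m, A[k]**2 * P + Q[k]           # predict
    S = P + noise_var
    gain = P / S
    m, P = m + gain * (y[k] - m), (1 - gain) * P  # update
    m_f[k], P_f[k] = m, P
m_s = m_f.copy()
for k in range(t.size - 2, -1, -1):               # RTS smoother (backward pass)
    P_pred = A[k + 1]**2 * P_f[k] + Q[k + 1]
    G = P_f[k] * A[k + 1] / P_pred
    m_s[k] = m_f[k] + G * (m_s[k + 1] - A[k + 1] * m_f[k])

print(np.max(np.abs(m_s - gp_mean)))  # agrees to numerical precision
```

The same recipe extends to other kernels with rational spectral densities (e.g. half-integer Matérn, and sums and products thereof), at the cost of a higher-dimensional latent state.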
William J. Wilkinson et al. (2020) introduces a computational toolkit and many worked examples of inference algorithms. Cox, van de Laar, and de Vries (2019) looks like it might be solving a similar problem but I do not yet understand their framing.
This complements, perhaps, the trick of fast Gaussian process calculations on lattices.
Nickisch, Solin, and Grigorevskiy (2018) tries to introduce a vocabulary for inference based on this insight, by discussing it in terms of computational primitives:
In time-series data, with D = 1, the data sets tend to become long (or unbounded) when observations accumulate over time. For these time-series models, leveraging sequential state space methods from signal processing makes it possible to solve GP inference problems in linear time complexity O(n) if the underlying GP has Markovian structure (S. Reece and Roberts 2010; J. Hartikainen and Särkkä 2010). This reformulation is exact for Markovian covariance functions (see, e.g., Solin (2016)) such as the exponential, half-integer Matérn, noise, constant, linear, polynomial, Wiener, etc. (and their sums and products).…
While existing literature has focused on the connection between GP regression and state space methods, the computational primitives allowing for inference using general likelihoods in combination with the Laplace approximation (LA), variational Bayes (VB), and assumed density filtering (ADF, a.k.a. single-sweep expectation propagation, EP) schemes have been largely overlooked.… We present a unifying framework for solving computational primitives for non-Gaussian inference schemes in the state space setting, thus directly enabling inference to be done through LA, VB, KL, and ADF/EP.
The following computational primitives allow us to cast the covariance approximation in more generic terms:

1. Linear system with “regularized” covariance: \(\operatorname{solve}_{\mathbf{K}}(\mathbf{W}, \mathbf{r}):=\left(\mathbf{K}+\mathbf{W}^{-1}\right)^{-1} \mathbf{r}\).
2. Matrix-vector multiplications: \(\operatorname{mvm}_{\mathbf{K}}(\mathbf{r}):=\mathbf{K} \mathbf{r}\). For learning we also need \(\frac{\partial \operatorname{mvm}_{\mathbf{K}}(\mathbf{r})}{\partial \boldsymbol{\theta}}\).
3. Log-determinants: \(\operatorname{ld}_{\mathbf{K}}(\mathbf{W}):=\log |\mathbf{B}|\) with symmetric and well-conditioned \(\mathbf{B}=\mathbf{I}+\mathbf{W}^{\frac{1}{2}} \mathbf{K} \mathbf{W}^{\frac{1}{2}}\). For learning, we need the derivatives \(\frac{\partial \operatorname{ld}_{\mathbf{K}}(\mathbf{W})}{\partial \boldsymbol{\theta}}\) and \(\frac{\partial \operatorname{ld}_{\mathbf{K}}(\mathbf{W})}{\partial \mathbf{W}}\).
4. Predictions need the latent mean \(\mathbb{E}\left[f_{*}\right]\) and variance \(\mathbb{V}\left[f_{*}\right]\).
Using these primitives, GP regression can be compactly written as \(\mathbf{W}=\mathbf{I} / \sigma_{n}^{2}\), \(\boldsymbol{\alpha}=\operatorname{solve}_{\mathbf{K}}(\mathbf{W}, \mathbf{y}-\mathbf{m})\), and
\[
\log Z_{\mathrm{GPR}} = -\frac{1}{2}\left[\boldsymbol{\alpha}^{\top} \operatorname{mvm}_{\mathbf{K}}(\boldsymbol{\alpha})+\operatorname{ld}_{\mathbf{K}}(\mathbf{W})+n \log \left(2 \pi \sigma_{n}^{2}\right)\right].
\]
Approximate inference (LA, VB, KL, ADF/EP), needed in the case of non-Gaussian likelihoods, requires these primitives as necessary building blocks. Depending on the covariance approximation method, e.g. exact, sparse, grid-based, or state space, the four primitives differ in their implementation and computational complexity.
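For concreteness, here is a dense (“exact covariance”) instantiation of the primitives, a NumPy sketch of my own rather than the paper’s state-space implementation; function and variable names are illustrative. I write the quadratic term of the log marginal likelihood as \(\mathbf{y}^{\top}\boldsymbol{\alpha}\) (zero prior mean), the usual dense GP-regression expression, and cross-check against the direct Gaussian density.

```python
import numpy as np

def solve_K(K, W, r):
    # solve_K(W, r) := (K + W^{-1})^{-1} r, with W stored as a vector of diagonal weights
    return np.linalg.solve(K + np.diag(1.0 / W), r)

def mvm_K(K, r):
    # mvm_K(r) := K r
    return K @ r

def ld_K(K, W):
    # ld_K(W) := log|B|, B = I + W^{1/2} K W^{1/2}, via a Cholesky factor of B
    sqrtW = np.sqrt(W)
    B = np.eye(len(W)) + sqrtW[:, None] * K * sqrtW[None, :]
    return 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(B))))

# Toy GP regression with an exponential kernel and Gaussian noise.
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0.0, 10.0, 50))
y = np.sin(t) + 0.3 * rng.standard_normal(t.size)
K = np.exp(-np.abs(t[:, None] - t[None, :]) / 1.5)
sigma_n2 = 0.09
n = t.size

W = np.full(n, 1.0 / sigma_n2)          # W = I / sigma_n^2
alpha = solve_K(K, W, y)                # zero prior mean, m = 0
post_mean = mvm_K(K, alpha)             # primitive 4: E[f] at the training inputs

# Log marginal likelihood written with the primitives.
logZ = -0.5 * (y @ alpha + ld_K(K, W) + n * np.log(2.0 * np.pi * sigma_n2))

# Cross-check against the direct Gaussian density N(y | 0, K + sigma_n^2 I).
C = K + sigma_n2 * np.eye(n)
L = np.linalg.cholesky(C)
direct = (-0.5 * y @ np.linalg.solve(C, y)
          - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2.0 * np.pi))
print(logZ, direct)                     # agree to numerical precision
```

The point of the abstraction is that the same four calls can be backed by the O(n) state-space machinery instead of dense Cholesky factorisations, without changing the inference code above them.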
Recent works I should also inspect include (Chang et al. 2020; Gorad, Zhao, and Särkkä 2020; Nickisch, Solin, and Grigorevskiy 2018; Remes, Heinonen, and Kaski 2018; Solin 2016; William J. Wilkinson et al. 2019a; William J. Wilkinson et al. 2019c; Karvonen and Särkkä 2016).
Ambikasaran et al. (2015) seems to be related but not quite the same — it operates time-wise over inputs but then constructs the GP posterior using rank-1 updates.
2 Spatio-temporal usage
Solving PDEs by filtering! (Cressie and Wikle 2014; Karvonen and Särkkä 2016; Lindgren, Rue, and Lindström 2011; Särkkä and Hartikainen 2012; Särkkä, Solin, and Hartikainen 2013; Solin and Särkkä 2013; Solin 2016; Zammit-Mangion and Wikle 2020)
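As a toy illustration of the mechanics only (my own sketch, much cruder than the SPDE/Matérn constructions in the papers above; all constants and names are illustrative): spatially discretise a one-dimensional stochastic heat equation with finite differences, which turns it into a linear-Gaussian state-space model over the grid values, then assimilate sparse, noisy point observations of the field with an ordinary Kalman filter.

```python
import numpy as np

nx, nu, dt, nt = 30, 0.05, 0.01, 200      # grid size, diffusivity, time step, number of steps
q, r = 0.1, 0.05                          # process- and observation-noise scales
x = np.linspace(0.0, 1.0, nx)
dx = x[1] - x[0]

# Finite-difference Laplacian (zero boundary conditions): the spatial discretisation
# turns the SPDE du/dt = nu * Laplacian(u) + noise into a finite-dimensional linear SDE.
Lap = (np.diag(-2.0 * np.ones(nx)) + np.diag(np.ones(nx - 1), 1)
       + np.diag(np.ones(nx - 1), -1)) / dx**2
A = np.eye(nx) + dt * nu * Lap            # crude Euler transition matrix
Q = q * dt * np.eye(nx)                   # crude process-noise covariance

obs_idx = np.arange(0, nx, 5)             # observe every 5th grid point
H = np.eye(nx)[obs_idx]                   # observation matrix
R = r * np.eye(obs_idx.size)

rng = np.random.default_rng(2)
u_true = np.exp(-((x - 0.5) / 0.1) ** 2)  # initial bump, unknown to the filter
m, P = np.zeros(nx), np.eye(nx)           # filter prior over the whole field

for _ in range(nt):
    # Simulate the "true" field forward and take a noisy, sparse observation of it.
    u_true = A @ u_true + rng.multivariate_normal(np.zeros(nx), Q)
    y = H @ u_true + rng.multivariate_normal(np.zeros(obs_idx.size), R)
    # Kalman predict / update over the whole spatial field at once.
    m, P = A @ m, A @ P @ A.T + Q
    S = H @ P @ H.T + R
    Kg = np.linalg.solve(S, H @ P).T      # Kalman gain P H^T S^{-1}
    m, P = m + Kg @ (y - H @ m), (np.eye(nx) - Kg @ H) @ P

print(np.sqrt(np.mean((m - u_true) ** 2)))  # filtered mean tracks the simulated field
```

The cited papers do something far more careful about the choice of SPDE, its discretisation, and the implied spatio-temporal covariance, but the state-filter plumbing is essentially this.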
3 Latent force models
I am going to argue that some latent force models fit in this classification, if I ever get time to define them (M. Álvarez, Luengo, and Lawrence 2009; M. A. Álvarez, Luengo, and Lawrence 2013; Jouni Hartikainen and Särkkä 2011; Jouni Hartikainen, Seppänen, and Särkkä 2012; Steven Reece et al. 2014; Särkkä, Álvarez, and Lawrence 2019).
4 Miscellaneous notes towards implementation
- Julia’s DynamicPolynomials.jl implements the MultivariatePolynomials.jl interface; it appears to be differentiable and handles rational polynomials.
- TemporalGPs.jl, introduced by Will Tebbutt, is a Julia implementation of this approach.
- The abstract ForneyLab.jl system might relate to this, behind its abstruse framing (Cox, van de Laar, and de Vries 2019).
- https://pyro.ai/examples/dkl.html
- https://pyro.ai/examples/gp.html
- https://pyro.ai/examples/ekf.html
- https://julialang.org/blog/2019/01/fluxdiffeq
- https://github.com/FluxML/model-zoo/tree/master/contrib/diffeq
- http://pyro.ai/examples/gplvm.html
- http://pyro.ai/examples/dmm.html
- http://docs.pyro.ai/en/stable/contrib.gp.html
- https://nbviewer.jupyter.org/github/SheffieldML/notebook/blob/master/GPy/index.ipynb
- https://blog.dominodatalab.com/fitting-gaussian-process-models-python/
See also
- AaltoML/BayesNewton: Bayes-Newton—A Gaussian process library in JAX, with a unifying view of approximate Bayesian inference as variants of Newton’s method. (William J. Wilkinson, Särkkä, and Solin 2021)
- AaltoML/nonstationary-audio-gp: End-to-End Probabilistic Inference for Nonstationary Audio Analysis
- wil-j-wil/gp-time-freq: Gaussian process time-frequency analysis toy examples
- wil-j-wil/unifying-prob-time-freq: Unifying probabilistic models for time-frequency analysis. Spectral Mixture Gaussian processes in state space form.