Gaussian random fields
are stochastic processes/fields with jointly Gaussian
distributions of observations.
While “Gaussian process regression” is not wrong *per se*, there is a common convention in stochastic process theory (and also in pedagogy) to use *process* for something notionally time-indexed and *field* for something that has a space-like index without a presumption of an arrow of time.
This leads to much confusion, because Gaussian *field* regression is what we usually want to talk about here: the Gaussian processes of machine learning do not presume an arrow of time.
Although they can use one.
Regardless, hereafter I’ll use “field” and “process” interchangeably.

In machine learning, Gaussian fields are often used as a means of regression or classification, since it is fairly easy to condition a Gaussian field on data and produce a posterior distribution over functions. They provide a nonparametric method of inferring regression functions, with a conveniently Bayesian interpretation and reasonably elegant learning and inference steps. I would further add that this is the crystal meth of machine learning methods, in terms of the addictiveness, and of the passion of the people who use it.

The central trick is using a clever union of
Hilbert spaces and probability
to give a probabilistic interpretation of
functional regression as a kind of
nonparametric Bayesian posterior inference via
representer theorems,
where one gets posterior distributions over functions.
Regression using Gaussian processes is common
in, e.g., spatial statistics,
where it arises as *kriging*.
Cressie (1990) traces the history of this idea via Matheron (1963a) to the work of Krige (1951).

> This web site aims to provide an overview of resources concerned with probabilistic modeling, inference and learning based on Gaussian processes. Although Gaussian processes have a long history in the field of statistics, they seem to have been employed extensively only in niche areas. With the advent of kernel machines in the machine learning community, models based on Gaussian processes have become commonplace for problems of regression (kriging) and classification as well as a host of more specialized applications.

I’ve not been enthusiastic about these in the past. It’s nice to have a principled nonparametric Bayesian formalism, but it has always seemed pointless having a formalism that is so computationally demanding that people don’t try to use more than a thousand data points, or spend most of a paper working out how to approximate this simple elegant model with a complex messy model.

However, perhaps I should be persuaded by tricks such as AutoGP (Krauth et al. 2016), which breaks some computational deadlocks by clever use of inducing variables and variational approximation to produce a compressed representation of the data with tractable inference and model selection, including kernel selection, and doing the whole thing in many dimensions simultaneously. There are other clever tricks like this one, e.g. Saatçi (2012) shows how to use a lattice structure for observations to make computation cheap.

## Quick intro

I am not the right guy to provide the canonical introduction, because it already exists. Specifically, the canonical introduction is Rasmussen and Williams (2006). But here is a quick simple special case sufficient to start from.

We work with a centred (i.e. mean-zero) process, in which case for every finite set \(\mathbf{f}:=\{f(t_k);k=1,\dots,T\}\) of realisations of that process, the joint distribution is centred Gaussian,

\[\begin{aligned}
f(t)
&\sim \operatorname{GP}\left(0, \kappa(t, t';\mathbf{\theta})\right)
\\
p(\mathbf{f}) &=(2\pi )^{-{\frac {T}{2}}}\det({\boldsymbol {\mathrm{K} }})^{-{\frac {1}{2}}}\,e^{-{\frac{1}{2}}\mathbf {f}^{\!{\mathsf {T}}}{\boldsymbol {\mathrm{K} }}^{-1}\mathbf {f}}\\
&=\mathcal{N}(\mathbf{f};\mathbf{0},\mathrm{K}),
\end{aligned}\]
where \(\mathrm{K}\) is the covariance (Gram) matrix of the process at the sample points, defined such that its
entries are given by \(\mathrm{K}_{jk}=\kappa(t_j,t_k).\)
In this case, we are specifying *only* the second moments and this is giving us
all the remaining properties of the process.
That is, the unobserved, continuous random function \(f\) generates realisations
\(\mathbf{f}\in\mathbb{R}^T\)
at discrete times \(\mathbf{t}=t_1,t_2,\dots,t_T.\)
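To make the "second moments give us everything" point concrete, here is a minimal NumPy sketch that builds the Gram matrix for a handful of sample points and draws realisations of \(\mathbf{f}\) from the centred Gaussian above. The squared-exponential kernel, its hyperparameter values, and the name `sq_exp_kernel` are my own illustrative assumptions, not anything canonical.

```python
import numpy as np

def sq_exp_kernel(s, t, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance kappa(s, t) for 1-d inputs (an assumed choice)."""
    diff = s[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 50)      # sample points t_1, ..., t_T
K = sq_exp_kernel(t, t)            # Gram matrix, K_jk = kappa(t_j, t_k)

# A small diagonal "jitter" keeps the Cholesky factorisation numerically stable.
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(t)))

# If z ~ N(0, I) then L z ~ N(0, K), so each column is one draw of the vector f.
f_samples = L @ rng.standard_normal((len(t), 3))
print(f_samples.shape)
```

Changing only the kernel changes the smoothness, amplitude and length scale of the draws, which is the sense in which the covariance function is doing all the modelling work.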

Now,

\[\begin{aligned} f(t) &\sim \operatorname{GP}\left(0, \kappa(t, t';\mathbf{\theta})\right) & \text{Prior} \\ \mathbf{y}|\mathbf{f} &\sim \prod_{k=1}^{T} p\left(y_{k} | f\left(t_{k}\right)\right) & \text{Likelihood} \end{aligned}\]

To begin with, the observation times will form a lattice \(\mathbf{t}=1,2,\dots,T.\)

We allow that the observations may be distinct from the realisations in that the realisations may be observed with some noise. The observation noise will also be Gaussian, in the sense that

\[ y_k=f(t_k)+\epsilon_k,\]

where

\[ \epsilon_k \sim \mathcal{N}\left(0, \sigma_{y}^{2}\right) \]

independently for each \(k\).

We refer to the set of observations as \(\mathbf{y}\in\mathbb{R}^T\). The data includes observations and coordinates, and is written \(\mathcal{D}:=\{(t_k, y_k)\}_{k=1,2,\dots,T}\).

The main insight is that the Gaussian prior is conjugate to the Gaussian likelihood, which means that the posterior distributions are also Gaussian (although they will no longer be centred).

We can find the posterior for the latent function values given the observations by considering the joint distribution

\[ \begin{aligned} \left(\begin{array}{c}{\mathbf{y}} \\ {\mathbf{f}}\end{array}\right) \sim \mathcal{N}\left(\mathbf{0},\left(\begin{array}{cc}{\mathbf{K}_{y}} & {\mathbf{K}} \\ {\mathbf{K}^{T}} & {\mathbf{K}_{\mathbf{f}}}\end{array}\right)\right) \end{aligned} \]
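Conditioning that joint Gaussian on \(\mathbf{y}\) gives the usual formulas, which I sketch below in NumPy. I am assuming that \(\mathbf{K}_y=\mathrm{K}+\sigma_y^2\mathbf{I}\) is the covariance of the noisy observations, that the off-diagonal block is the cross-covariance between the observation points and the points where we want \(f\), and that \(\mathbf{K}_{\mathbf{f}}\) is the prior covariance at those points; the posterior mean is then \(\mathbf{K}^{T}\mathbf{K}_y^{-1}\mathbf{y}\) and the posterior covariance is \(\mathbf{K}_{\mathbf{f}}-\mathbf{K}^{T}\mathbf{K}_y^{-1}\mathbf{K}\). The function names and the squared-exponential kernel are hypothetical choices for illustration.

```python
import numpy as np

def sq_exp_kernel(s, t, lengthscale=1.0):
    """Squared-exponential covariance for 1-d inputs (an assumed choice)."""
    diff = s[:, None] - t[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def gp_posterior(t_obs, y, t_new, noise_var, kernel=sq_exp_kernel):
    """Mean and covariance of f(t_new) given noisy observations y at t_obs."""
    K_y = kernel(t_obs, t_obs) + noise_var * np.eye(len(t_obs))  # Cov(y, y)
    K_cross = kernel(t_obs, t_new)                               # Cov(y, f)
    K_f = kernel(t_new, t_new)                                   # Cov(f, f)
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))          # K_y^{-1} y
    V = np.linalg.solve(L, K_cross)                              # L^{-1} K_cross
    mean = K_cross.T @ alpha
    cov = K_f - V.T @ V
    return mean, cov

rng = np.random.default_rng(1)
t_obs = np.arange(1.0, 11.0)                 # lattice t = 1, ..., 10
y = np.sin(t_obs) + 0.1 * rng.standard_normal(t_obs.shape)
t_new = np.linspace(1.0, 10.0, 40)
mean, cov = gp_posterior(t_obs, y, t_new, noise_var=0.1 ** 2)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))   # pointwise posterior sd
```

The Cholesky factorisation of \(\mathbf{K}_y\) is the \(\mathcal{O}(T^3)\) step that all the approximation tricks further down are trying to avoid.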

## Density estimation

Can I infer a density using these? Yes. One popular method is apparently the logistic Gaussian process (Tokdar 2007; Lenk 2003).

## Kernels

a.k.a. covariance models.

GP regression models are kernel machines. As such, covariance kernels are the parameters, more or less. One can also parameterise with a mean function, but let us ignore that detail for now because usually we do not use one.
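As a rough illustration of what "the kernel is the parameter" means, here is a sketch of three common stationary covariance functions, with a numerical check that each one, and their sum and elementwise product, yields a positive semi-definite Gram matrix. The particular kernels and length scales are my own picks for the example.

```python
import numpy as np

def sq_exp(r, lengthscale=1.0):
    """Squared-exponential: very smooth sample paths."""
    return np.exp(-0.5 * (r / lengthscale) ** 2)

def matern32(r, lengthscale=1.0):
    """Matern-3/2: once-differentiable sample paths."""
    a = np.sqrt(3.0) * np.abs(r) / lengthscale
    return (1.0 + a) * np.exp(-a)

def exponential(r, lengthscale=1.0):
    """Matern-1/2, i.e. the Ornstein-Uhlenbeck covariance: rough sample paths."""
    return np.exp(-np.abs(r) / lengthscale)

t = np.linspace(0.0, 4.0, 30)
r = t[:, None] - t[None, :]          # pairwise lags t_j - t_k

grams = {
    "sq_exp": sq_exp(r),
    "matern32": matern32(r),
    "exponential": exponential(r),
    "sum": sq_exp(r) + exponential(r),            # sums of kernels are kernels
    "product": matern32(r) * exponential(r),      # so are elementwise products
}
# A valid covariance has no (meaningfully) negative eigenvalues.
min_eigs = {name: np.linalg.eigvalsh(K).min() for name, K in grams.items()}
for name, value in min_eigs.items():
    print(f"{name:12s} min eigenvalue = {value: .2e}")
```

Closure under sums and products is what makes kernel composition, and hence automated kernel selection, possible.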

## Using state filtering

When one dimension of the input vector can be interpreted as a time dimension, we can recast GP inference as Kalman filtering, which has benefits in terms of speed.
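Here is a minimal sketch of why this helps, for the simplest case I know of: the exponential (Ornstein–Uhlenbeck, Matérn-1/2) kernel observed on a regular lattice with Gaussian noise. That GP is exactly a scalar linear-Gaussian state-space model, so a Kalman filter computes the log marginal likelihood in \(\mathcal{O}(T)\) instead of \(\mathcal{O}(T^3)\). The function names, the lattice spacing and the hyperparameter values are assumptions for the example.

```python
import numpy as np

def ou_kalman_loglik(y, dt, lengthscale, variance, noise_var):
    """Log marginal likelihood of y under an OU-kernel GP, in O(T) time."""
    a = np.exp(-dt / lengthscale)      # state transition over one lattice step
    q = variance * (1.0 - a ** 2)      # process noise that keeps Var f = variance
    m, p = 0.0, variance               # stationary prior on the first state
    loglik = 0.0
    for y_k in y:
        s = p + noise_var              # predictive variance of y_k
        loglik += -0.5 * (np.log(2.0 * np.pi * s) + (y_k - m) ** 2 / s)
        gain = p / s                   # Kalman gain
        m, p = m + gain * (y_k - m), (1.0 - gain) * p   # measurement update
        m, p = a * m, a ** 2 * p + q   # predict the next state
    return loglik

def dense_gp_loglik(y, t, lengthscale, variance, noise_var):
    """The same quantity via the O(T^3) dense Gaussian density."""
    K = variance * np.exp(-np.abs(t[:, None] - t[None, :]) / lengthscale)
    L = np.linalg.cholesky(K + noise_var * np.eye(len(t)))
    z = np.linalg.solve(L, y)
    return -0.5 * z @ z - np.log(np.diag(L)).sum() - 0.5 * len(t) * np.log(2.0 * np.pi)

rng = np.random.default_rng(2)
t = np.arange(1.0, 201.0)              # lattice t = 1, ..., 200
y = rng.standard_normal(t.shape)
fast = ou_kalman_loglik(y, dt=1.0, lengthscale=3.0, variance=1.5, noise_var=0.2)
slow = dense_gp_loglik(y, t, lengthscale=3.0, variance=1.5, noise_var=0.2)
print(fast, slow)
```

The two numbers should agree up to floating-point error; smoother kernels need a higher-dimensional state, but the linear-in-\(T\) scaling is the same idea.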

## On lattice observations

## On manifolds

I would like to read Terenin on GPs on manifolds, who also makes a suggestive connection to SDEs, which is the filtering GPs trick again.

## By variational inference

🏗

## With inducing variables

“Sparse GP”. See Quiñonero-Candela and Rasmussen (2005). 🏗
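Until this section is written, here is a sketch of about the simplest approximation covered in that review, the subset-of-regressors predictive mean, compared against the exact GP mean. With \(M\) inducing inputs the linear system shrinks from \(N\times N\) to \(M\times M\). The kernel, the evenly spaced inducing inputs and all function names are assumptions of mine for illustration.

```python
import numpy as np

def sq_exp_kernel(s, t, lengthscale=1.0):
    """Squared-exponential covariance for 1-d inputs (an assumed choice)."""
    diff = s[:, None] - t[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def exact_mean(t_obs, y, t_new, noise_var):
    """Exact GP posterior mean: solves an N x N system, O(N^3)."""
    K_y = sq_exp_kernel(t_obs, t_obs) + noise_var * np.eye(len(t_obs))
    return sq_exp_kernel(t_new, t_obs) @ np.linalg.solve(K_y, y)

def sor_mean(t_obs, y, t_new, t_ind, noise_var):
    """Subset-of-regressors predictive mean with M inducing inputs t_ind."""
    K_uu = sq_exp_kernel(t_ind, t_ind)
    K_uf = sq_exp_kernel(t_ind, t_obs)
    K_su = sq_exp_kernel(t_new, t_ind)
    A = noise_var * K_uu + K_uf @ K_uf.T     # M x M system, O(N M^2) to build
    return K_su @ np.linalg.solve(A, K_uf @ y)

rng = np.random.default_rng(3)
t_obs = np.sort(rng.uniform(0.0, 10.0, 500))          # N = 500 observations
y = np.sin(t_obs) + 0.1 * rng.standard_normal(t_obs.shape)
t_new = np.linspace(0.0, 10.0, 50)
t_ind = np.linspace(0.0, 10.0, 20)                    # M = 20 inducing inputs

exact = exact_mean(t_obs, y, t_new, noise_var=0.01)
sparse = sor_mean(t_obs, y, t_new, t_ind, noise_var=0.01)
max_gap = np.max(np.abs(exact - sparse))
print(max_gap)
```

Subset-of-regressors is known to give overconfident predictive variances away from the inducing inputs, which is part of the motivation for the variational treatments in the next section.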

## Variational/sparse

See GP factoring.

## Approximation with dropout

Famously, Gal and Ghahramani (2015) show that training a certain class of networks stochastically using dropout approximates Gaussian processes. Papers like Kasim et al. (2020) level that up, building massive networks that try to do cheap approximation using dropout. They claim to get remarkably good results by basically doing the simple and obvious things.

## As dimension reduction

e.g. GP-LVM (N. Lawrence 2005). 🏗

## Readings

This lecture by the late David MacKay is probably good; the man could talk.

There is also a well-illustrated and elementary introduction by Yuge Shi.

## Implementations

Bayes workhorse Stan can do Gaussian Process regression just like almost everything else; see Michael Betancourt’s blog posts 1, 2 and 3.

### python

The current scikit-learn has basic Gaussian processes, and an introduction.

> Gaussian Processes (GP) are a generic supervised learning method designed to solve regression and probabilistic classification problems.
>
> The advantages of Gaussian processes are:
>
> - The prediction interpolates the observations (at least for regular kernels).
> - The prediction is probabilistic (Gaussian) so that one can compute empirical confidence intervals and decide based on those if one should refit (online fitting, adaptive fitting) the prediction in some region of interest.
> - Versatile: different kernels can be specified. Common kernels are provided, but it is also possible to specify custom kernels.
>
> The disadvantages of Gaussian processes include:
>
> - They are not sparse, i.e., they use the whole samples/features information to perform the prediction.
> - They lose efficiency in high dimensional spaces – namely when the number of features exceeds a few dozens.

This description disposes me against using this implementation, because it has at best imprecision and at worst mistakes in the disadvantages section.

The first point is *strictly* correct, but not useful, in that sparse approximate GPs are a whole industry, and this is a statement about their implementation missing a commonly-used approximate inference method rather than a problem with the family of methods per se.
The second is just wrong, unless I have misunderstood something.
Cost scaling should be linear in the dimension of the feature space, as with all other kernel methods.
As such, the scaling cost due to the dimensionality of the features is negligible compared to the scaling cost of the number of data points, AFAICT.
Naïve inference is \(\mathcal{O}(DN^2+N^3)\) for \(N\) observations and \(D\) features: \(\mathcal{O}(DN^2)\) to build the Gram matrix and \(\mathcal{O}(N^3)\) to factorise it.
Dimensionality cost is no worse than linear regression for prediction and superior for training, although other models that have a linear complexity in sample dimension escape without such a warning.
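For reference, the scikit-learn interface under discussion looks roughly like this in use. This is a minimal sketch on synthetic data of my own invention; the kernel composition and initial hyperparameter values are arbitrary starting points that `fit` then tunes by maximising the log marginal likelihood.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(4)
X = rng.uniform(0.0, 10.0, size=(40, 1))            # N = 40 inputs, D = 1 feature
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

# Signal kernel plus a white-noise term for the observation noise.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, random_state=0)
gpr.fit(X, y)                                       # fits kernel hyperparameters

X_new = np.linspace(0.0, 10.0, 25)[:, None]
mean, std = gpr.predict(X_new, return_std=True)     # posterior mean and sd
print(gpr.kernel_)                                  # the fitted kernel
```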

There are fancier Gaussian process tool sets than this one, with less worrisome caveats. Chris Fonnesbeck mentions GPflow (using tensorflow), autogp (also using tensorflow) and GPyTorch (using pytorch).

The GPflow docs include the following clarification of the genealogy of these toolkits.

> GPflow has origins in GPy…, and much of the interface is intentionally similar for continuity (though some parts of the interface may diverge in future). GPflow has a rather different remit from GPy though:
>
> - GPflow leverages TensorFlow for faster/bigger computation
> - GPflow has much less code than GPy, mostly because all gradient computation is handled by TensorFlow.
> - GPflow focusses on variational inference and MCMC – there is no expectation propagation or Laplace approximation.
> - GPflow does not have any plotting functionality.

There are additionally several GP models in the pytorch-based pyro, and in PyMC3. I think that GPy is a common default choice and GPflow, for example, has attempted to follow its API. ladax is a jax-based one. George is another python GP regression library that claims to handle big data at the cost of lots of C++.

### julia

Stheno seems to be popular for Julia and also comes in an alternative flavour, python stheno. AugmentedGaussianProcesses.jl by Théo Galy-Fajou looks nice and has sparse approximation plus some nice variational approximation tricks.

There is the similar-looking GaussianProcesses.jl, although that, last time I looked (2019?), seemed to conflate model training and inference in an inconvenient way, so I have not used it.

A comparison of some options is done by Théo Galy-Fajou.

So… It’s easy enough to be bikeshedded, is the message I’m getting here. There is an attempt at unifying the julia ecosystem in this area via AbstractGPs and other tools in the JuliaGaussianProcesses organisation, e.g. KernelFunctions.jl.

### MATLAB

Should mention the various matlab/scilab options.

GPStuff is the one for MATLAB/Octave that I have seen around the place.

## References

*Canadian Journal of Statistics* 26 (1): 127–37. https://doi.org/10.2307/3315678.

*Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence*, 2–9. UAI ’04. Arlington, Virginia, United States: AUAI Press. http://arxiv.org/abs/1207.4131.

*Probability Theory and Related Fields* 138 (1-2): 33–73. https://doi.org/10.1007/s00440-006-0011-8.

*Proceedings of the 20th International Conference on Neural Information Processing Systems*, 153–60. NIPS’07. USA: Curran Associates Inc. http://dl.acm.org/citation.cfm?id=2981562.2981582.

*Journal of Machine Learning Research* 20 (117): 1–63. http://arxiv.org/abs/1609.00577.

*Journal of Machine Learning Research* 21 (131): 1–63. http://jmlr.org/papers/v21/19-1015.html.

*2016 International Joint Conference on Neural Networks (IJCNN)*, 3338–45. Vancouver, BC, Canada: IEEE. https://doi.org/10.1109/IJCNN.2016.7727626.

*Mathematical Geology* 22 (3): 239–52. https://doi.org/10.1007/BF00889887.

*Statistics for Spatio-Temporal Data*. Wiley Series in Probability and Statistics 2.0. John Wiley and Sons. http://books.google.com?id=4L_dCgAAQBAJ.

*Neural Computation* 14 (3): 641–68. https://doi.org/10.1162/089976602317250933.

*Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic*, 657–63. NIPS’01. Cambridge, MA, USA: MIT Press. http://papers.nips.cc/paper/2027-tap-gibbs-free-energy-belief-propagation-and-sparsity.pdf.

*Proceedings of the 25th International Conference on Machine Learning*, 192–99. ICML ’08. New York, NY, USA: ACM Press. https://doi.org/10.1145/1390156.1390181.

*PMLR*. http://proceedings.mlr.press/v70/cutajar17a.html.

*Data Analytics for Renewable Energy Integration: Informing the Generation and Distribution of Renewable Energy*, edited by Wei Lee Woon, Zeyar Aung, Oliver Kramer, and Stuart Madnick, 94–106. Lecture Notes in Computer Science. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-71643-5_9.

*Artificial Intelligence and Statistics*, 207–15. http://proceedings.mlr.press/v31/damianou13a.html.

*Advances in Neural Information Processing Systems 24*, edited by J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, 2510–18. Curran Associates, Inc. http://papers.nips.cc/paper/4330-variational-gaussian-process-dynamical-systems.pdf.

*Advances in Neural Information Processing Systems 28*, 1414–22. NIPS’15. Cambridge, MA, USA: MIT Press. http://dl.acm.org/citation.cfm?id=2969239.2969397.

*Journal of Machine Learning Research* 19 (1): 2100–2145. http://jmlr.org/papers/v19/18-015.html.

*Proceedings of the 30th International Conference on Machine Learning (ICML-13)*, 1166–74. http://machinelearning.wustl.edu/mlpapers/papers/icml2013_duvenaud13.

*Advances in Neural Information Processing Systems 30*, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 5309–19. Curran Associates, Inc. http://papers.nips.cc/paper/7115-identification-of-gaussian-process-state-space-models.pdf.

*Mathematical Geology* 39 (6): 607–23. https://doi.org/10.1007/s11004-007-9112-x.

*Journal of Machine Learning Research* 6: 615–37. http://www.jmlr.org/papers/v6/evgeniou05a.html.

*The Annals of Statistics* 1 (2): 209–30. https://doi.org/10.1214/aos/1176342360.

*Advances in Neural Information Processing Systems 27*, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 3680–88. Curran Associates, Inc. http://papers.nips.cc/paper/5375-variational-gaussian-process-state-space-models.pdf.

*Advances in Neural Information Processing Systems 26*, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 3156–64. Curran Associates, Inc. http://papers.nips.cc/paper/5085-bayesian-inference-and-learning-in-gaussian-process-state-space-models-with-particle-mcmc.pdf.

*Proceedings of the 33rd International Conference on Machine Learning (ICML-16)*. http://arxiv.org/abs/1506.02142.

*Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences* 371 (1984): 20110553. https://doi.org/10.1098/rsta.2011.0553.

*IEEE Transactions on Pattern Analysis and Machine Intelligence* 37 (2): 424–36. https://doi.org/10.1109/TPAMI.2013.192.

*Proceedings of the 22nd International Conference on Machine Learning - ICML ’05*, 241–48. Bonn, Germany: ACM Press. https://doi.org/10.1145/1102351.1102382.

*Handbook of Uncertainty Quantification*, edited by Roger Ghanem, David Higdon, and Houman Owhadi, 1–37. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-11259-6_38-1.

*Proceedings of the Conference on Uncertainty in Artificial Intelligence*. http://arxiv.org/abs/1210.4856.

*2010 IEEE International Workshop on Machine Learning for Signal Processing*, 379–84. Kittila, Finland: IEEE. https://doi.org/10.1109/MLSP.2010.5589113.

*Uncertainty in Artificial Intelligence*, 282. Citeseer.

*Pattern Recognition Letters* 45 (August): 85–91. https://doi.org/10.1016/j.patrec.2014.03.004.

*Conference on Uncertainty in Artificial Intelligence*, 789–98. PMLR. http://proceedings.mlr.press/v124/jankowiak20a.html.

*Learning in Graphical Models*. Cambridge, Mass.: MIT Press.

*2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP)*, 1–6. Vietri sul Mare, Salerno, Italy: IEEE. https://doi.org/10.1109/MLSP.2016.7738812.

*ICLR 2014 Conference*. http://arxiv.org/abs/1312.6114.

*Autonomous Robots* 27 (1): 75–90. https://doi.org/10.1007/s10514-009-9119-x.

*Mathematical and Computer Modelling of Dynamical Systems* 11 (4): 411–24. https://doi.org/10.1080/13873950500068567.

*Uai17*. http://arxiv.org/abs/1610.05392.

*Journal of the Southern African Institute of Mining and Metallurgy* 52 (6): 119–39. https://journals.co.za/content/saimm/52/6/AJA0038223X_4792.

*Journal of Machine Learning Research* 6 (Nov): 1783–1816. http://www.jmlr.org/papers/v6/lawrence05a.html.

*Proceedings of the 26th Annual International Conference on Machine Learning*, 601–8. ICML ’09. New York, NY, USA: ACM. https://doi.org/10.1145/1553374.1553452.

*Proceedings of the 16th Annual Conference on Neural Information Processing Systems*, 609–16. http://papers.nips.cc/paper/2240-fast-sparse-gaussian-process-methods-the-informative-vector-machine.

*Journal of Machine Learning Research* 11: 1865–81. http://www.jmlr.org/papers/v11/lazaro-gredilla10a.

*Journal of Computational and Graphical Statistics* 12 (3): 548–65. https://doi.org/10.1198/1061860032021.

*Journal of the Royal Statistical Society: Series B (Statistical Methodology)* 73 (4): 423–98. https://doi.org/10.1111/j.1467-9868.2011.00777.x.

*IEEE Transactions on Signal Processing* 59 (7): 3155–67. https://doi.org/10.1109/TSP.2011.2119315.

*Twenty-Eighth AAAI Conference on Artificial Intelligence*. http://arxiv.org/abs/1402.4304.

*NATO ASI Series. Series F: Computer and System Sciences*, 133–65. http://www.inference.phy.cam.ac.uk/mackay/gpB.pdf.

*Information Theory, Inference & Learning Algorithms*, Chapter 45. Cambridge University Press. http://www.inference.phy.cam.ac.uk/mackay/itprnn/ps/534.548.pdf.

*Traité de géostatistique Appliquée. 2. Le Krigeage*. Editions Technip.

*Economic Geology* 58 (8): 1246–66. https://doi.org/10.2113/gsecongeo.58.8.1246.

*Journal of Process Control*, DYCOPS-CAB 2016, 60 (December): 82–94. https://doi.org/10.1016/j.jprocont.2017.06.010.

*Proceedings of ICLR*. http://arxiv.org/abs/1511.06644.

*Journal of Machine Learning Research* 6 (Jul): 1099–1125. http://www.jmlr.org/papers/v6/micchelli05a.html.

*Neural Computation* 17 (1): 177–204. https://doi.org/10.1162/0899766052530802.

*SSRN Electronic Journal*. https://doi.org/10.2139/ssrn.3159687.

*International Conference on Machine Learning*, 3789–98. http://proceedings.mlr.press/v80/nickisch18a.html.

*Journal of the Royal Statistical Society: Series B (Methodological)* 40 (1): 1–24. https://doi.org/10.1111/j.2517-6161.1978.tb01643.x.

*Biometrika* 99 (3): 511–31. https://doi.org/10.1093/biomet/ass034.

*Journal of Machine Learning Research* 6 (December): 1939–59. http://jmlr.org/papers/volume6/quinonero-candela05a/quinonero-candela05a.pdf.

*Gaussian Processes for Machine Learning*. Adaptive Computation and Machine Learning. Cambridge, Mass: MIT Press. http://www.gaussianprocess.org/gpml/.

*2010 13th International Conference on Information Fusion*, 1–9. https://doi.org/10.1109/ICIF.2010.5711863.

*Proceedings of the 27th International Conference on International Conference on Machine Learning*, 927–34. ICML’10. Madison, WI, USA: Omnipress. http://publications.eng.cam.ac.uk/345777/.

*Advances In Neural Information Processing Systems*. http://arxiv.org/abs/1705.08933.

*International Conference on Artificial Intelligence and Statistics*, 689–97. http://arxiv.org/abs/1803.09151.

*Artificial Neural Networks and Machine Learning – ICANN 2011*, edited by Timo Honkela, Włodzisław Duch, Mark Girolami, and Samuel Kaski, 6792:151–58. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-642-21738-8_20.

*Bayesian Filtering and Smoothing*. Institute of Mathematical Statistics Textbooks 3. Cambridge, U.K. ; New York: Cambridge University Press. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.461.4042&rep=rep1&type=pdf.

*Artificial Intelligence and Statistics*. http://www.jmlr.org/proceedings/papers/v22/sarkka12.html.

*IEEE Signal Processing Magazine* 30 (4): 51–61. https://doi.org/10.1109/MSP.2013.2246292.

*Proceedings of the 31st International Conference on Neural Information Processing Systems*, 1696–706. NIPS’17. Red Hook, NY, USA: Curran Associates Inc. http://papers.nips.cc/paper/6767-reliable-decision-support-using-counterfactual-models.pdf.

*Artificial Intelligence and Statistics*, 877–85. PMLR. http://proceedings.mlr.press/v33/shah14.html.

*Advances in Neural Information Processing Systems*, 1257–64. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2005_543.pdf.

*Statistics and Computing* 30 (2): 419–46. https://doi.org/10.1007/s11222-019-09886-w.

*International Conference on Artificial Intelligence and Statistics*, 567–74. PMLR. http://proceedings.mlr.press/v5/titsias09a.html.

*Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics*, 844–51. http://proceedings.mlr.press/v9/titsias10a.html.

*Journal of Computational and Graphical Statistics* 16 (3): 633–55. https://doi.org/10.1198/106186007X210206.

*IEEE Transactions on Signal Processing* 62 (23): 6171–83. https://doi.org/10.1109/TSP.2014.2362100.

*Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics*, 868–75. http://proceedings.mlr.press/v9/turner10a.html.

*Journal of Machine Learning Research* 14 (April): 1175–1179. http://jmlr.csail.mit.edu/papers/v14/vanhatalo13a.html.

*Computer Graphics Forum* 25 (3): 635–44. https://doi.org/10.1111/j.1467-8659.2006.00983.x.

*Proceedings of the 25th International Conference on Machine Learning*, 1112–19. ICML ’08. New York, NY, USA: ACM. https://doi.org/10.1145/1390156.1390296.

*Spatio-Temporal Statistics with R*.

*NIPS 2014 Workshop on Advances in Variational Inference*.

*Advances in Neural Information Processing Systems*, 682–88. http://papers.nips.cc/paper/1866-using-the-nystrom-method-to-speed-up-kernel-machines.

*Advances in Neural Information Processing Systems 21*, edited by D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, 265–72. Curran Associates, Inc. http://papers.nips.cc/paper/3385-multi-task-gaussian-process-learning-of-robot-inverse-dynamics.pdf.

*International Conference on Machine Learning*. http://arxiv.org/abs/1302.4245.

*Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence*, 736–44. UAI’11. Arlington, Virginia, United States: AUAI Press. http://dl.acm.org/citation.cfm?id=3020548.3020633.

*Machine Learning and Knowledge Discovery in Databases*, edited by Peter A. Flach, Tijl De Bie, and Nello Cristianini, 858–61. Lecture Notes in Computer Science. Springer Berlin Heidelberg.

*Proceedings of the 29th International Coference on International Conference on Machine Learning*, 1139–46. ICML’12. Madison, WI, USA: Omnipress. http://arxiv.org/abs/1110.4411.

*Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37*, 1775–84. ICML’15. Lille, France: JMLR.org. http://proceedings.mlr.press/v37/wilson15.html.