Covariance kernel zoo


What follows are some useful kernels to have in my toolkit, mostly over \(\mathbb{R}^n\) or at least some space with a metric. There are many more than I could fit here, over many more spaces than I need. (Real vectors, strings, other kernels, probability distributions etc.)

For these I have freely raided David Duvenaud’s crib notes, which became a thesis chapter (D. Duvenaud 2014). Also Wikipedia and (Abrahamsen 1997; Genton 2001).

Stationary

A popular assumption, more or less implying that no region of the process is special. In this case the kernel may be written as a function purely of the distance between observations, i.e.

\[K(s,t)=K(\|s-t\|)\] for some distance \(\|\cdot\|\) between the observation coordinates.

Wiener process kernel

From the naming, we might suspect that a Gaussian process would also describe a standard Wiener process, which is after all a process with Gaussian increments, which is a certain type of dependence structure. It is over a boring index space, time \(t\geq 0\), but there is indeed nothing stopping us.

We can read this right off the Wiener process Wikipedia page. For a Gaussian process \(\{W_t\}_{t\geq 0},\)

\[ \operatorname{cov}(W_{s},W_{t})=s \wedge t \]

Here \(s \wedge t\) means “the minimum of \(s\) and \(t\)”.

That result is standard. From it we can immediately construct the kernel \(K(s,t):=s \wedge t\).
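As a sanity check, here is a minimal numpy sketch (mine, not from any particular library) that builds the Gram matrix of this kernel on a grid of strictly positive times and draws a few approximate Brownian paths from it:

```python
import numpy as np

def wiener_kernel(s, t):
    """Wiener process covariance: K(s, t) = min(s, t)."""
    return np.minimum.outer(s, t)

ts = np.linspace(0.01, 1.0, 200)                      # strictly positive times
K = wiener_kernel(ts, ts)                             # (200, 200) Gram matrix
L = np.linalg.cholesky(K + 1e-9 * np.eye(len(ts)))    # tiny jitter for numerical stability
paths = L @ np.random.randn(len(ts), 3)               # three approximate Brownian sample paths
```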

Causal kernels

Time-indexed processes are more general than a standard Wiener process. 🏗

What constraints make a covariance kernel causal?

Markov kernels

How can we know from inspecting a kernel whether it implies an independence structure of some kind? The Wiener process and causal kernels clearly imply certain independences. Any kernel of the form \(k(s,t)=g(s\wedge t)\) is clearly Markov. Are there more general ones? TODO: relate to kernels of bounded support. 🏗

Squared exponential

A.k.a. the exponentiated quadratic. Often “radial basis function” refers to this kernel too.

The classic default: analytically convenient because it is proportional to the Gaussian density and therefore cancels out with it at opportune times.

\[k_{\textrm{SE}}(x, x') = \sigma^2\exp\left(-\frac{(x - x')^2}{2\ell^2}\right)\]
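In code this is a one-liner. A minimal sketch for scalar inputs (the parameter names \(\sigma\) and \(\ell\) follow the formula above; nothing here is tied to any particular GP library):

```python
import numpy as np

def k_se(x, xp, sigma=1.0, ell=1.0):
    """Squared exponential kernel for 1-d inputs, returned as a Gram matrix."""
    r2 = np.subtract.outer(x, xp) ** 2          # pairwise squared distances
    return sigma**2 * np.exp(-r2 / (2 * ell**2))
```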

Rational Quadratic

Duvenaud reckons this is everywhere but TBH I have not seen it. Included for completeness.

\[k_{\textrm{RQ}}(x, x') = \sigma^2 \left( 1 + \frac{(x - x')^2}{2 \alpha \ell^2} \right)^{-\alpha}\]

Note that \(\lim_{\alpha\to\infty} k_{\textrm{RQ}}= k_{\textrm{SE}}\).
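A sketch in the same style, plus a crude numerical illustration of that limit (with \(\sigma=\ell=1\)):

```python
import numpy as np

def k_rq(x, xp, sigma=1.0, ell=1.0, alpha=1.0):
    """Rational quadratic kernel for 1-d inputs."""
    r2 = np.subtract.outer(x, xp) ** 2
    return sigma**2 * (1 + r2 / (2 * alpha * ell**2)) ** (-alpha)

x = np.linspace(-3, 3, 50)
k_se_limit = np.exp(-np.subtract.outer(x, x) ** 2 / 2)        # k_SE with sigma = ell = 1
gap = np.abs(k_rq(x, x, alpha=1e6) - k_se_limit).max()        # shrinks towards 0 as alpha grows
```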

Matérn

The Matérn is a stationary (and, in the Euclidean case, isotropic) covariance function. See Carl Edward Rasmussen’s Gaussian Process lecture notes for a readable explanation, or chapter 4 of his textbook (Rasmussen and Williams 2006).

\[ k_{\textrm{Mat}}(x, x')=\sigma^{2} \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2 \nu} \frac{|x - x'|}{\rho}\right)^{\nu} K_{\nu}\left(\sqrt{2 \nu} \frac{|x - x'|}{\rho}\right) \]

where \(\Gamma\) is the gamma function, \(K_{\nu}\) is the modified Bessel function of the second kind, and \(\rho,\nu > 0\).

AFAICT you use this for covariances hypothesised to be less smooth than the squared exponential covariance. And other things?
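Here is a generic sketch using SciPy’s modified Bessel function; the popular \(\nu\in\{1/2,3/2,5/2\}\) cases have simpler closed forms, but this follows the general formula above:

```python
import numpy as np
from scipy.special import gamma, kv

def k_matern(x, xp, sigma=1.0, rho=1.0, nu=1.5):
    """Matérn kernel for 1-d inputs via the modified Bessel function K_nu."""
    r = np.abs(np.subtract.outer(x, xp))
    scaled = np.sqrt(2 * nu) * r / rho
    K = sigma**2 * (2 ** (1 - nu) / gamma(nu)) * scaled**nu * kv(nu, scaled)
    K[r == 0] = sigma**2    # kv diverges at 0 (numpy warns about 0 * inf); the limit is sigma^2
    return K
```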

Periodic

\[ k_{\textrm{Per}}(x, x') = \sigma^2\exp\left(-\frac{2\sin^2(\pi|x - x'|/p)}{\ell^2}\right) \]
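Here \(p\) is the period and \(\ell\) the lengthscale. A sketch:

```python
import numpy as np

def k_per(x, xp, sigma=1.0, ell=1.0, p=1.0):
    """Periodic kernel for 1-d inputs."""
    d = np.abs(np.subtract.outer(x, xp))
    return sigma**2 * np.exp(-2 * np.sin(np.pi * d / p) ** 2 / ell**2)
```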

Locally periodic

This is an example of a composed kernel, explained below.

\[\begin{aligned} k_{\textrm{LocPer}}(x, x') &= k_{\textrm{Per}}(x, x')k_{\textrm{SE}}(x, x') \\ &= \sigma^2\exp\left(-\frac{2\sin^2(\pi|x - x'|/p)}{\ell^2}\right) \exp\left(-\frac{(x - x')^2}{2\ell^2}\right) \end{aligned}\]

Obviously there are other possible localisations of a periodic kernel; this is just one locally periodic kernel among many.
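In code the composition is literal pointwise multiplication (assuming the `k_per` and `k_se` sketches above; I fold the \(\sigma^2\) into the periodic factor so that it appears only once, as in the formula):

```python
def k_locper(x, xp, sigma=1.0, ell=1.0, p=1.0):
    """Locally periodic kernel: periodic correlation that decays with distance."""
    return k_per(x, xp, sigma=sigma, ell=ell, p=p) * k_se(x, xp, sigma=1.0, ell=ell)
```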

“Integral” kernel

I just noticed the ambiguously named Integral kernel:

I’ve called the kernel the ‘integral kernel’ as we use it when we know observations of the integrals of a function, and want to estimate the function itself.

Examples include:

  • Knowing how far a robot has travelled after 2, 4, 6 and 8 seconds, but wanting an estimate of its speed after 5 seconds…
  • Wanting to know an estimate of the density of people aged 23, when we only have the total count for binned age ranges…

I would argue that all kernels are naturally defined in terms of integrals, but the author seems to mean something particular. I suspect I would call this a sampling kernel, but that name is also overloaded. Anyway, what is actually going on here? Where is it introduced? Possibly one of (Smith, Alvarez, and Lawrence 2018; O’Callaghan and Ramos 2011; Murray-Smith and Pearlmutter 2005).

Composing kernels

A sum or product (or outer sum, or tensor product) of kernels is still a kernel. For other transforms YMMV.

For example, in the case of Gaussian processes, suppose that, independently,

\[\begin{aligned} f_{1} &\sim \mathcal{GP}\left(\mu_{1}, k_{1}\right)\\ f_{2} &\sim \mathcal{GP}\left(\mu_{2}, k_{2}\right) \end{aligned}\] then

\[ f_{1}+f_{2} \sim \mathcal{GP} \left(\mu_{1}+\mu_{2}, k_{1}+k_{2}\right) \] so \(k_{1}+k_{2}\) is also a kernel.

More generally, if \(k_{1}\) and \(k_{2}\) are two kernels and \(c_{1}\) and \(c_{2}\) are two positive real numbers, then:

\[ K(x, x')=c_{1} k_{1}(x, x')+c_{2} k_{2}(x, x') \] is again a kernel. Since products of kernels are also kernels, it follows that every polynomial of kernels with positive coefficients is in turn a kernel (Genton 2001).
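A quick numerical sanity check of these closure properties, reusing the `k_se` and `k_per` sketches from above (an eigenvalue check on one random design, not a proof):

```python
import numpy as np

x = np.sort(np.random.randn(40))
K_sum = 2.0 * k_se(x, x) + 0.5 * k_per(x, x)        # positive-weighted sum of kernels
K_prod = k_se(x, x) * k_per(x, x)                   # elementwise (Schur) product
assert np.linalg.eigvalsh(K_sum).min() > -1e-8      # still positive semi-definite
assert np.linalg.eigvalsh(K_prod).min() > -1e-8
```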

Note that the additivity in terms of kernels is not the same as additivity in terms of induced feature spaces. The induced feature map of \(k_{1}+k_{2}\) is their concatenation rather than their sum. Suppose \(\phi_{1}(x)\) gives us the feature map of \(k_{1}\) for \(x\) and likewise \(\phi_{2}(x)\).

\[\begin{aligned} k_{1}(x, x') &=\phi_{1}(x)^{\top} \phi_{1}(x') \\ k_{2}(x, x') &=\phi_{2}(x)^{\top} \phi_{2}(x')\\ k_{1}(x, x')+k_{2}(x, x') &=\phi_{1}(x)^{\top} \phi_{1}(x')+\phi_{2}(x)^{\top} \phi_{2}(x')\\ &=\left[\begin{array}{c}{\phi_{1}(x)} \\ {\phi_{2}(x)}\end{array}\right]^{\top} \left[\begin{array}{c}{\phi_{1}(x')} \\ {\phi_{2}(x')}\end{array}\right] \end{aligned}\]

If \(k:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}\) is a kernel and \(\psi: \mathcal{X}\to\mathcal{Y}\) is any map, then the following is also a kernel:

\[\begin{aligned} k_{\psi}:&\mathcal{X}\times\mathcal{X}\to\mathbb{R}\\ & (x,x')\mapsto k(\psi(x), \psi(x')) \end{aligned}\]

which apparently is now called a deep kernel when \(\psi\) is a neural network. If \(k\) is stationary and \(\psi\) is invertible then this is a stationary reducible kernel.
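For concreteness, a tiny sketch of the warping construction, with a toy \(\psi\) of my own invention standing in for the neural network one would use in the deep-kernel setting (again reusing the `k_se` sketch above):

```python
import numpy as np

def psi(x):
    """An arbitrary illustrative warp of the input space."""
    x = np.asarray(x)
    return np.sign(x) * np.log1p(np.abs(x))

def k_warped(x, xp, **kernel_args):
    """k_psi(x, x') = k_SE(psi(x), psi(x')); any base kernel would do."""
    return k_se(psi(x), psi(xp), **kernel_args)
```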

Also, if \(A\) is a positive definite operator then of course it defines a kernel \(k_A(x,x'):=x^{\top}Ax'\).

(Genton 2001) uses the properties of covariance to construct some other nifty ones (see also the Genton kernels section below):

Let \(h:\mathcal{X}\to\mathbb{R}^{+}\) have minimum at 0. Then, using the identity for RVs

\[ \mathop{\textrm{Cov}}\left(Y_{1}, Y_{2}\right)=\left[\mathop{\textrm{Var}}\left(Y_{1}+Y_{2}\right)-\mathop{\textrm{Var}}\left(Y_{1}-Y_{2}\right)\right] / 4 \]

we find that the following is a kernel

\[ K(x, x')=\frac{1}{4}[h(x+x')-h(x-x')] \]

All these lead to various cunning combination strategies, which I may return to discuss. 🏗 Some of them are in the references. For example, (D. Duvenaud et al. 2013) position their work in the wider field:

There is a large body of work attempting to construct a rich kernel through a weighted sum of base kernels (e.g. (Bach 2008; Christoudias, Urtasun, and Darrell 2009)). While these approaches find the optimal solution in polynomial time, speed comes at a cost: the component kernels, as well as their hyperparameters, must be specified in advance […]

(Hinton and Salakhutdinov 2008) use a deep neural network to learn an embedding; this is a flexible approach to kernel learning but relies upon finding structure in the input density, p(x). Instead we focus on domains where most of the interesting structure is in f(x).

(Wilson and Adams 2013) derive kernels of the form \(SE \times \cos(x - x_0)\), forming a basis for stationary kernels. These kernels share similarities with \(SE \times Per\) but can express negative prior correlation, and could usefully be included in our grammar.

See (Grosse et al. 2012) for a mind-melting compositional matrix factorization diagram, constructing a search over hierarchical kernel decompositions.

Figure from Grosse et al. (2012), “Exploiting compositionality to explore a large space of model structures”: examples of existing machine learning models which fall under their framework. Arrows represent models reachable using a single production rule. Only a small fraction of the 2496 models reachable within 3 steps are shown, and not all possible arrows are shown.

Stationary spectral kernels

(Sun et al. 2018; Bochner 1959; Kom Samo and Roberts 2015; Yaglom 1987) construct spectral kernels in the sense that they use the spectral representation to design the kernel and guarantee it is positive definite and stationary. See Bochner’s theorem.
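For a concrete instance, here is a sketch (parameter names mine) of a one-dimensional spectral-mixture-style kernel in the spirit of (Wilson and Adams 2013): positive Gaussian bumps in the spectral domain transform back into cosines damped by squared exponentials, which keeps the kernel stationary and positive definite by construction.

```python
import numpy as np

def k_spectral_mixture(x, xp, weights, means, variances):
    """Sum of SE-damped cosines; each term is one Gaussian spectral component."""
    tau = np.subtract.outer(x, xp)
    K = np.zeros_like(tau, dtype=float)
    for w, mu, v in zip(weights, means, variances):
        K += w * np.exp(-2 * np.pi**2 * tau**2 * v) * np.cos(2 * np.pi * mu * tau)
    return K

x = np.linspace(0, 5, 100)
K = k_spectral_mixture(x, x, weights=[1.0, 0.5], means=[0.5, 2.0], variances=[0.1, 0.3])
```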

Non-stationary spectral kernels

(Sun et al. 2018; Remes, Heinonen, and Kaski 2017; Kom Samo and Roberts 2015) use a generalised Bochner Theorem (Yaglom 1987) often called Yaglom’s Theorem, which does not presume stationarity. See Yaglom’s theorem.

It is not immediately clear how to use this; spectral representations are not an intuitive way of constructing things.

Locally stationary kernels

(Genton 2001) defines these as kernels with a particular structure, specifically kernels that factor into a stationary kernel \(K_2\) and a non-negative function \(K_1\) in the following way:

\[ K(\mathbf{s}, \mathbf{t})=K_{1}\left(\frac{\mathbf{s}+\mathbf{t}}{2}\right) K_{2}(\mathbf{s}-\mathbf{t}) \]

Global structure then depends on the mean location \(\frac{\mathbf{s}+\mathbf{t}}{2}\). (Genton 2001) describes some nifty spectral properties of these kernels.
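As an illustration (my own, so treat with suspicion): take a squared-exponential stationary part \(K_2\) modulated by a Gaussian envelope \(K_1\) of the midpoint. Not every envelope/lengthscale combination yields a valid covariance, so the sketch ends with a numerical positive-semi-definiteness check for these particular parameter values:

```python
import numpy as np

def k_locstat(s, t, ell=1.0, tau=5.0):
    """K1((s + t) / 2) * K2(s - t): Gaussian envelope times an SE stationary part."""
    mid = np.add.outer(s, t) / 2.0
    diff = np.subtract.outer(s, t)
    K1 = np.exp(-mid**2 / (2 * tau**2))     # non-negative envelope of the midpoint
    K2 = np.exp(-diff**2 / (2 * ell**2))    # stationary squared-exponential part
    return K1 * K2

s = np.linspace(-3, 3, 60)
assert np.linalg.eigvalsh(k_locstat(s, s)).min() > -1e-8   # PSD for this parameter choice
```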

Other constructions might vie for the title of “locally stationary”. To check. 🏗

Compactly supported

We usually think about compactly supported kernels in the stationary isotropic case, where we mean kernels that vanish whenever the distance between two observations \(s,t\) is larger than a certain cut-off distance \(L,\) i.e. \(\|s-t\|>L\Rightarrow K(s,t)=0\). These are great because they make the Gram matrix sparse (for example, if the cut-off is much smaller than the diameter of the observations, most observations have few covariance neighbours) and so can lead to computational efficiency even for exact inference without any special tricks. They don’t seem to be popular, though. Statisticians are generally nervous about inferring the support of a parameter, or about assigning zero weight to any region of a prior without good reason, so maybe that is why.

(Genton 2001) mentions

\[ \max \left\{\left(1-\frac{\|\mathbf{s}-\mathbf{t}\|}{\tilde{\theta}}\right)^{\tilde{\nu}}, 0\right\} \]

and handballs us to (Gneiting 2002b) for a bigger smörgåsbord of stationary compactly supported kernels. The Gneiting (2002b) article has a couple of methods designed to produce certain smoothness properties at the boundary and at the origin, but mostly concerns producing compactly supported kernels via clever integral transforms.
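A sketch of that truncated kernel in one dimension, showing the sparsity we get when the cut-off \(\tilde\theta\) is small relative to the span of the inputs (the exponent \(\tilde\nu\) must be large enough for positive definiteness in the given dimension, per Gneiting (2002b); \(\tilde\nu=2\) suffices on \(\mathbb{R}\)):

```python
import numpy as np

def k_truncated_power(s, t, theta=0.5, nu=2.0):
    """Compactly supported kernel: max(1 - |s - t| / theta, 0)^nu."""
    d = np.abs(np.subtract.outer(s, t))
    return np.maximum(1.0 - d / theta, 0.0) ** nu

x = np.linspace(0, 10, 500)
K = k_truncated_power(x, x)
sparsity = (K == 0).mean()      # most of the Gram matrix is exactly zero here
```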

Genton kernels

That’s my name for them because they seem to originate in (Genton 2001).

For any non-negative function \(h:\mathcal{T}\to\mathbb{R}^+\) with \(h(\mathbf{0})=0,\) the following is a kernel:

\[ K(\mathbf{s}, \mathbf{t})=\frac{1}{4}[h(\mathbf{s}+\mathbf{t})-h(\mathbf{s}-\mathbf{t})] \] Genton gives the example of \(h:\mathbf{x}\mapsto \mathbf{x}^\top \mathbf{x},\) which recovers the dot-product kernel \(K(\mathbf{s}, \mathbf{t})=\mathbf{s}^{\top}\mathbf{t}.\)

The motivation is the identity

\[ \mathop{\textrm{Cov}}\left(Y_{1}, Y_{2}\right)=\left[\mathop{\textrm{Var}}\left(Y_{1}+Y_{2}\right)-\mathop{\textrm{Var}}\left(Y_{1}-Y_{2}\right)\right] / 4. \]
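A one-line numerical check (my own sketch) that the construction recovers the dot-product kernel when \(h:\mathbf{x}\mapsto \mathbf{x}^\top\mathbf{x}\):

```python
import numpy as np

def genton_kernel(h, s, t):
    """K(s, t) = [h(s + t) - h(s - t)] / 4 for a non-negative h with h(0) = 0."""
    return (h(s + t) - h(s - t)) / 4.0

h = lambda x: x @ x
s, t = np.random.randn(5), np.random.randn(5)
assert np.isclose(genton_kernel(h, s, t), s @ t)   # recovers the dot-product kernel
```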

Kernels with desired symmetry

(D. Duvenaud 2014, chap. 2) summarises Ginsbourger et al.’s work on kernels with desired symmetries / invariances. 🏗 This produces, for example, the periodic kernel above, but also such cute tricks as priors over Möbius strips.

Stationary reducible kernels

See kernel warping.

References

Abrahamsen, Petter. 1997. “A Review of Gaussian Random Fields and Correlation Functions.” http://publications.nr.no/publications.nr.no/directdownload/publications.nr.no/rask/old/917_Rapport.pdf.
Agarwal, Arvind, and Hal Daumé Iii. 2011. “Generative Kernels for Exponential Families.” In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 85–92. http://proceedings.mlr.press/v15/agarwal11b.html.
Alexanderian, Alen. 2015. “A Brief Note on the Karhunen-Loève Expansion.” October 26, 2015. http://arxiv.org/abs/1509.07526.
Aronszajn, N. 1950. “Theory of Reproducing Kernels.” Transactions of the American Mathematical Society 68 (3): 337–404. https://doi.org/10.2307/1990404.
Álvarez, Mauricio A., Lorenzo Rosasco, and Neil D. Lawrence. 2012. “Kernels for Vector-Valued Functions: A Review.” Foundations and Trends® in Machine Learning 4 (3): 195–266. https://doi.org/10.1561/2200000036.
Bach, Francis. 2008. “Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning.” In Proceedings of the 21st International Conference on Neural Information Processing Systems, 105–12. NIPS’08. USA: Curran Associates Inc. http://papers.nips.cc/paper/3418-exploring-large-feature-spaces-with-hierarchical-multiple-kernel-learning.pdf.
Bacry, Emmanuel, and Jean-François Muzy. 2016. “First- and Second-Order Statistics Characterization of Hawkes Processes and Non-Parametric Estimation.” IEEE Transactions on Information Theory 62 (4): 2184–2202. https://doi.org/10.1109/TIT.2016.2533397.
Balog, Matej, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M. Roy, and Yee Whye Teh. 2016. “The Mondrian Kernel.” June 16, 2016. http://arxiv.org/abs/1606.05241.
Bochner, Salomon. 1959. Lectures on Fourier Integrals. Princeton University Press. http://books.google.com?id=MWCYDwAAQBAJ.
Bohn, Bastian, Michael Griebel, and Christian Rieger. 2018. “A Representer Theorem for Deep Kernel Learning.” June 7, 2018. http://arxiv.org/abs/1709.10441.
Bonilla, Edwin V., Karl Krauth, and Amir Dezfouli. 2019. “Generic Inference in Latent Gaussian Process Models.” Journal of Machine Learning Research 20 (117): 1–63. http://arxiv.org/abs/1609.00577.
Christoudias, Mario, Raquel Urtasun, and Trevor Darrell. 2009. “Bayesian Localized Multiple Kernel Learning.” UCB/EECS-2009-96. EECS Department, University of California, Berkeley. http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-96.html.
Cortes, Corinna, Patrick Haffner, and Mehryar Mohri. 2004. “Rational Kernels: Theory and Algorithms.” Journal of Machine Learning Research 5 (December): 1035–62. http://dl.acm.org/citation.cfm?id=1005332.1016793.
Cressie, Noel, and Hsin-Cheng Huang. 1999. “Classes of Nonseparable, Spatio-Temporal Stationary Covariance Functions.” Journal of the American Statistical Association 94 (448): 1330–39. https://doi.org/10.1080/01621459.1999.10473885.
Delft, Anne van, and Michael Eichler. 2016. “Locally Stationary Functional Time Series.” February 16, 2016. http://arxiv.org/abs/1602.05125.
Duvenaud, David. 2014. “Automatic Model Construction with Gaussian Processes.” PhD Thesis, University of Cambridge. https://doi.org/10.17863/CAM.14087.
Duvenaud, David K., Hannes Nickisch, and Carl E. Rasmussen. 2011. “Additive Gaussian Processes.” In Advances in Neural Information Processing Systems, 226–34. http://papers.nips.cc/paper/4221-additive-gaussian-processes.pdf.
Duvenaud, David, James Lloyd, Roger Grosse, Joshua Tenenbaum, and Ghahramani Zoubin. 2013. “Structure Discovery in Nonparametric Regression Through Compositional Kernel Search.” In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 1166–74. http://machinelearning.wustl.edu/mlpapers/papers/icml2013_duvenaud13.
Gaspari, Gregory, and Stephen E. Cohn. 1999. “Construction of Correlation Functions in Two and Three Dimensions.” Quarterly Journal of the Royal Meteorological Society 125 (554): 723–57. https://doi.org/10.1002/qj.49712555417.
Genton, Marc G. 2001. “Classes of Kernels for Machine Learning: A Statistics Perspective.” Journal of Machine Learning Research 2 (December): 299–312. http://jmlr.org/papers/volume2/genton01a/genton01a.pdf.
Genton, Marc G., and Olivier Perrin. 2004. “On a Time Deformation Reducing Nonstationary Stochastic Processes to Local Stationarity.” Journal of Applied Probability 41 (1): 236–49. https://doi.org/10.1239/jap/1077134681.
Girolami, Mark, and Simon Rogers. 2005. “Hierarchic Bayesian Models for Kernel Learning.” In Proceedings of the 22nd International Conference on Machine Learning - ICML ’05, 241–48. Bonn, Germany: ACM Press. https://doi.org/10.1145/1102351.1102382.
Gneiting, Tilmann. 2002a. “Nonseparable, Stationary Covariance Functions for Space–Time Data.” Journal of the American Statistical Association 97 (458): 590–600. https://doi.org/10.1198/016214502760047113.
———. 2002b. “Compactly Supported Correlation Functions.” Journal of Multivariate Analysis 83 (2): 493–508. https://doi.org/10.1006/jmva.2001.2056.
Gneiting, Tilmann, William Kleiber, and Martin Schlather. 2010. “Matérn Cross-Covariance Functions for Multivariate Random Fields.” Journal of the American Statistical Association 105 (491): 1167–77. https://doi.org/10.1198/jasa.2010.tm09420.
Grosse, Roger, Ruslan R. Salakhutdinov, William T. Freeman, and Joshua B. Tenenbaum. 2012. “Exploiting Compositionality to Explore a Large Space of Model Structures.” In Proceedings of the Conference on Uncertainty in Artificial Intelligence. http://arxiv.org/abs/1210.4856.
Hartikainen, J., and S. Särkkä. 2010. “Kalman Filtering and Smoothing Solutions to Temporal Gaussian Process Regression Models.” In 2010 IEEE International Workshop on Machine Learning for Signal Processing, 379–84. Kittila, Finland: IEEE. https://doi.org/10.1109/MLSP.2010.5589113.
Hawkes, Alan G. 1971. “Spectra of Some Self-Exciting and Mutually Exciting Point Processes.” Biometrika 58 (1): 83–90. https://doi.org/10.1093/biomet/58.1.83.
Hinton, Geoffrey E, and Ruslan R Salakhutdinov. 2008. “Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes.” In Advances in Neural Information Processing Systems 20, edited by J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, 1249–56. Curran Associates, Inc. http://papers.nips.cc/paper/3211-using-deep-belief-nets-to-learn-covariance-kernels-for-gaussian-processes.pdf.
Hofmann, Thomas, Bernhard Schölkopf, and Alexander J. Smola. 2008. “Kernel Methods in Machine Learning.” The Annals of Statistics 36 (3): 1171–1220. https://doi.org/10.1214/009053607000000677.
Kom Samo, Yves-Laurent, and Stephen Roberts. 2015. “Generalized Spectral Kernels.” June 7, 2015. http://arxiv.org/abs/1506.02236.
Kondor, Risi, and Tony Jebara. 2006. “Gaussian and Wishart Hyperkernels.” In Proceedings of the 19th International Conference on Neural Information Processing Systems, 729–36. NIPS’06. Cambridge, MA, USA: MIT Press. http://dl.acm.org/citation.cfm?id=2976456.2976548.
Krauth, Karl, Edwin V. Bonilla, Kurt Cutajar, and Maurizio Filippone. 2016. “AutoGP: Exploring the Capabilities and Limitations of Gaussian Process Models.” In UAI 2017. http://arxiv.org/abs/1610.05392.
Lawrence, Neil. 2005. “Probabilistic Non-Linear Principal Component Analysis with Gaussian Process Latent Variable Models.” Journal of Machine Learning Research 6 (Nov): 1783–1816. http://www.jmlr.org/papers/v6/lawrence05a.html.
Lloyd, James Robert, David Duvenaud, Roger Grosse, Joshua Tenenbaum, and Zoubin Ghahramani. 2014. “Automatic Construction and Natural-Language Description of Nonparametric Regression Models.” In Twenty-Eighth AAAI Conference on Artificial Intelligence. http://arxiv.org/abs/1402.4304.
Mercer, J. 1909. “Functions of Positive and Negative Type, and Their Connection with the Theory of Integral Equations.” Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 209 (441–458): 415–46. https://doi.org/10.1098/rsta.1909.0016.
Micchelli, Charles A., and Massimiliano Pontil. 2005a. “Learning the Kernel Function via Regularization.” Journal of Machine Learning Research 6 (Jul): 1099–1125. http://www.jmlr.org/papers/v6/micchelli05a.html.
———. 2005b. “On Learning Vector-Valued Functions.” Neural Computation 17 (1): 177–204. https://doi.org/10.1162/0899766052530802.
Minasny, Budiman, and Alex. B. McBratney. 2005. “The Matérn Function as a General Model for Soil Variograms.” Geoderma, Pedometrics 2003, 128 (3–4): 192–207. https://doi.org/10.1016/j.geoderma.2005.04.003.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. 1 edition. Adaptive Computation and Machine Learning Series. Cambridge, MA: MIT Press.
Murray-Smith, Roderick, and Barak A. Pearlmutter. 2005. “Transformations of Gaussian Process Priors.” In Deterministic and Statistical Methods in Machine Learning, edited by Joab Winkler, Mahesan Niranjan, and Neil Lawrence, 110–23. Lecture Notes in Computer Science. Springer Berlin Heidelberg. http://bcl.hamilton.ie/~barak/papers/MLW-Jul-2005.pdf.
O’Callaghan, Simon Timothy, and Fabio T. Ramos. 2011. “Continuous Occupancy Mapping with Integral Kernels.” In Twenty-Fifth AAAI Conference on Artificial Intelligence. https://www.aaai.org/ocs/index.php/AAAI/AAAI11/paper/view/3784.
Ong, Cheng Soon, Xavier Mary, Stéphane Canu, and Alexander J. Smola. 2004. “Learning with Non-Positive Kernels.” In Twenty-First International Conference on Machine Learning - ICML ’04, 81. Banff, Alberta, Canada: ACM Press. https://doi.org/10.1145/1015330.1015443.
Ong, Cheng Soon, and Alexander J. Smola. 2003. “Machine Learning Using Hyperkernels.” In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, 568–75. ICML’03. Washington, DC, USA: AAAI Press. http://dl.acm.org/citation.cfm?id=3041838.3041910.
Ong, Cheng Soon, Alexander J. Smola, and Robert C. Williamson. 2002. “Hyperkernels.” In Proceedings of the 15th International Conference on Neural Information Processing Systems, 495–502. NIPS’02. Cambridge, MA, USA: MIT Press. http://dl.acm.org/citation.cfm?id=2968618.2968680.
———. 2005. “Learning the Kernel with Hyperkernels.” Journal of Machine Learning Research 6 (Jul): 1043–71. http://www.jmlr.org/papers/v6/ong05a.html.
Perrin, Olivier, and Rachid Senoussi. 1999. “Reducing Non-Stationary Stochastic Processes to Stationarity by a Time Deformation.” Statistics & Probability Letters 43 (4): 393–97. https://doi.org/10.1016/S0167-7152(98)00278-8.
———. 2000. “Reducing Non-Stationary Random Fields to Stationarity and Isotropy Using a Space Deformation.” Statistics & Probability Letters 48 (1): 23–32. https://doi.org/10.1016/S0167-7152(99)00188-1.
Pérez-Abreu, Víctor, and Alfonso Rocha-Arteaga. 2005. “Covariance-Parameter Lévy Processes in the Space of Trace-Class Operators.” Infinite Dimensional Analysis, Quantum Probability and Related Topics 08 (01): 33–54. https://doi.org/10.1142/S0219025705001846.
Pfaffel, Oliver. 2012. “Wishart Processes.” January 16, 2012. http://arxiv.org/abs/1201.3256.
Rakotomamonjy, Alain, Francis R. Bach, Stéphane Canu, and Yves Grandvalet. 2008. “SimpleMKL.” Journal of Machine Learning Research 9: 2491–521. http://www.jmlr.org/papers/v9/rakotomamonjy08a.html.
Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. Cambridge, Mass: MIT Press. http://www.gaussianprocess.org/gpml/.
Remes, Sami, Markus Heinonen, and Samuel Kaski. 2017. “Non-Stationary Spectral Kernels.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 4642–51. Curran Associates, Inc. http://papers.nips.cc/paper/7050-non-stationary-spectral-kernels.pdf.
———. 2018. “Neural Non-Stationary Spectral Kernel.” November 27, 2018. http://arxiv.org/abs/1811.10978.
Sampson, Paul D., and Peter Guttorp. 1992. “Nonparametric Estimation of Nonstationary Spatial Covariance Structure.” Journal of the American Statistical Association 87 (417): 108–19. https://doi.org/10.1080/01621459.1992.10475181.
Särkkä, Simo, and Jouni Hartikainen. 2012. “Infinite-Dimensional Kalman Filtering Approach to Spatio-Temporal Gaussian Process Regression.” In Artificial Intelligence and Statistics. http://www.jmlr.org/proceedings/papers/v22/sarkka12.html.
Särkkä, Simo, A. Solin, and J. Hartikainen. 2013. “Spatiotemporal Learning via Infinite-Dimensional Bayesian Filtering and Smoothing: A Look at Gaussian Process Regression Through Kalman Filtering.” IEEE Signal Processing Magazine 30 (4): 51–61. https://doi.org/10.1109/MSP.2013.2246292.
Schölkopf, Bernhard, Ralf Herbrich, and Alex J. Smola. 2001. “A Generalized Representer Theorem.” In Computational Learning Theory, edited by David Helmbold and Bob Williamson, 416–26. Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-44581-1.
Schölkopf, Bernhard, and Alexander J. Smola. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
———. 2003. “A Short Introduction to Learning with Kernels.” In Advanced Lectures on Machine Learning, edited by Shahar Mendelson and Alexander J. Smola, 41–64. Lecture Notes in Computer Science 2600. Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-36434-X_2.
Sinha, Aman, and John C Duchi. 2016. “Learning Kernels with Random Features.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 1298–1306. Curran Associates, Inc. http://papers.nips.cc/paper/6180-learning-kernels-with-random-features.pdf.
Smith, Michael Thomas, Mauricio A. Alvarez, and Neil D. Lawrence. 2018. “Gaussian Process Regression for Binned Data.” September 6, 2018. http://arxiv.org/abs/1809.02010.
Stein, Michael L. 2005. “Space-Time Covariance Functions.” Journal of the American Statistical Association 100 (469): 310–21. https://doi.org/10.1198/016214504000000854.
Sun, Shengyang, Guodong Zhang, Chaoqi Wang, Wenyuan Zeng, Jiaman Li, and Roger Grosse. 2018. “Differentiable Compositional Kernel Learning for Gaussian Processes.” 2018. http://arxiv.org/abs/1806.04326.
Székely, Gábor J., and Maria L. Rizzo. 2009. “Brownian Distance Covariance.” The Annals of Applied Statistics 3 (4): 1236–65. https://doi.org/10.1214/09-AOAS312.
Uziel, Guy. 2020. “Nonparametric Sequential Prediction While Deep Learning the Kernel.” In International Conference on Artificial Intelligence and Statistics, 111–21. PMLR. http://proceedings.mlr.press/v108/uziel20b.html.
Vedaldi, A., and A. Zisserman. 2012. “Efficient Additive Kernels via Explicit Feature Maps.” IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (3): 480–92. https://doi.org/10.1109/TPAMI.2011.153.
Vert, Jean-Philippe, Koji Tsuda, and Bernhard Schölkopf. 2004. “A Primer on Kernel Methods.” In Kernel Methods in Computational Biology. MIT Press. http://kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/pdfs/pdf2549.pdf.
Vishwanathan, S. V. N., Nicol N. Schraudolph, Risi Kondor, and Karsten M. Borgwardt. 2010. “Graph Kernels.” Journal of Machine Learning Research 11 (August): 1201–42. http://authors.library.caltech.edu/20528/1/Vishwanathan2010p11646J_Mach_Learn_Res.pdf.
Wilk, Mark van der, Andrew G. Wilson, and Carl E. Rasmussen. 2014. “Variational Inference for Latent Variable Modelling of Correlation Structure.” In NIPS 2014 Workshop on Advances in Variational Inference.
Wilson, Andrew Gordon, and Ryan Prescott Adams. 2013. “Gaussian Process Kernels for Pattern Discovery and Extrapolation.” In International Conference on Machine Learning. http://arxiv.org/abs/1302.4245.
Wilson, Andrew Gordon, Christoph Dann, Christopher G. Lucas, and Eric P. Xing. 2015. “The Human Kernel.” October 26, 2015. http://arxiv.org/abs/1510.07389.
Wilson, Andrew Gordon, and Zoubin Ghahramani. 2011. “Generalised Wishart Processes.” In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, 736–44. UAI’11. Arlington, Virginia, United States: AUAI Press. http://dl.acm.org/citation.cfm?id=3020548.3020633.
———. 2012. “Modelling Input Varying Correlations Between Multiple Responses.” In Machine Learning and Knowledge Discovery in Databases, edited by Peter A. Flach, Tijl De Bie, and Nello Cristianini, 858–61. Lecture Notes in Computer Science. Springer Berlin Heidelberg.
Wilson, Andrew Gordon, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. 2016. “Deep Kernel Learning.” In Artificial Intelligence and Statistics, 370–78. PMLR. http://proceedings.mlr.press/v51/wilson16.html.
Wu, Zongmin. 1995. “Compactly Supported Positive Definite Radial Functions.” Advances in Computational Mathematics 4 (1): 283–92. https://doi.org/10.1007/BF03177517.
Yaglom, A. M. 1987. Correlation Theory of Stationary and Related Random Functions. Volume II: Supplementary Notes and References. Springer Series in Statistics. New York, NY: Springer Science & Business Media.
Yu, Yaoliang, Hao Cheng, Dale Schuurmans, and Csaba Szepesvári. 2013. “Characterizing the Representer Theorem.” In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 570–78. http://www.jmlr.org/proceedings/papers/v28/yu13.pdf.
Zhang, Aonan, and John Paisley. 2019. “Random Function Priors for Correlation Modeling.” In International Conference on Machine Learning, 7424–33. http://proceedings.mlr.press/v97/zhang19k.html.