Kernel zoo

September 16, 2019 — March 30, 2021

Hilbert space
kernel tricks
metrics
signal processing
statistics
stochastic processes

What follows are some useful kernels to have in my toolkit, mostly over \(\mathbb{R}^n\) or at least some space with a metric. There are many more than I could fit here, of course. And kernels are defined over many spaces: real vectors, strings, other kernels, probability distributions, etc.

For these I have freely raided David Duvenaud’s crib notes, which became a thesis chapter (D. Duvenaud 2014), as well as Wikipedia and (Abrahamsen 1997; Genton 2001).

TODO: kernel Venn diagram.

1 Stationary

A popular assumption, more or less implying that no region of the process is special. In this case the kernel may be written as a function purely of the distance between the observation coordinates, \[K(\mathbf{x},\mathbf{y})=K(\|\mathbf{x}-\mathbf{y}\|)\] for some distance \(\|\cdot\|\). (Strictly, stationarity only requires dependence on the difference \(\mathbf{x}-\mathbf{y}\); depending on the distance alone makes the kernel isotropic as well.) This kind of translation-invariant kernel is the default. Such kernels are conveniently analysed in terms of the Wiener-Khinchine theorem.
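To make the translation invariance concrete, here is a throwaway numerical sketch (my own, not from any of the cited references): the Gram matrix of a stationary isotropic kernel is a function of pairwise distances alone, so translating every observation by the same vector leaves it unchanged. The exponential kernel used here is just a convenient example.

```python
import numpy as np
from scipy.spatial.distance import cdist

def stationary_gram(k_of_r, X, Y=None):
    """Gram matrix of a stationary, isotropic kernel: k only ever sees ||x - y||."""
    Y = X if Y is None else Y
    return k_of_r(cdist(X, Y))

k_exp = lambda r: np.exp(-r)              # exponential (a.k.a. Matern-1/2) kernel, say
X = np.random.default_rng(0).normal(size=(5, 3))
K1 = stationary_gram(k_exp, X)
K2 = stationary_gram(k_exp, X + 10.0)     # translate every observation by the same vector
assert np.allclose(K1, K2)                # translation invariance: the Gram matrix is unchanged
```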

A rather general family of stationary kernels is the Hida-Matérn class (Dowling, Sokół, and Park 2021).

2 Dot-product

The kernel is a function of the inner product/dot product of the input coordinates, and we may overload notation to write \[K(\mathbf{x},\mathbf{y})=K(\mathbf{x}\cdot \mathbf{y}).\] Hard to google for, because of the confounding fact that kernels already define inner products in some other space, so this is a dot product defined in terms of a dot product.

Such kernels are rotation invariant but not stationary (see the sketch below). Instead of the Fourier relationships that stationary kernels have, these can be described in Legendre bases for radial functions. Smola, Óvári, and Williamson (2000) manufacture some theorems which tell you whether your chosen function of the inner product does in fact define a positive-definite kernel. Unfortunately, the basis so defined is long and ugly and not especially tractable to work with, except maybe by computer algebra systems.
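A toy illustration of those invariance claims (my own sketch; `k_poly` is a hypothetical helper, not anyone’s API): the inhomogeneous polynomial kernel \((\mathbf{x}\cdot\mathbf{y}+c)^d\) is unchanged by rotating the inputs but not by translating them.

```python
import numpy as np
from scipy.stats import ortho_group

def k_poly(X, Y, c=1.0, d=3):
    """A dot-product kernel: K(x, y) = (x . y + c)^d, evaluated as a Gram matrix."""
    return (X @ Y.T + c) ** d

X = np.random.default_rng(1).normal(size=(6, 4))
R = ortho_group.rvs(4, random_state=2)                         # a random rotation/reflection
assert np.allclose(k_poly(X, X), k_poly(X @ R.T, X @ R.T))     # rotation invariant
print(np.allclose(k_poly(X, X), k_poly(X + 1.0, X + 1.0)))     # False: not translation invariant
```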

3 NN kernels

Infinite-width random NN kernels are nearly dot product kernels. They depend on several dot products, \(\mathbf{x}\cdot \mathbf{x}\), \(\mathbf{x}\cdot \mathbf{y}\) and \(\mathbf{y}\cdot \mathbf{y}\). See NN kernels.
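For concreteness, here is a sketch of one such kernel, which I believe is the degree-1 arc-cosine kernel of Cho and Saul (2009), arising as the infinite-width limit of a one-hidden-layer ReLU network; note that it needs all three dot products, not just \(\mathbf{x}\cdot\mathbf{y}\).

```python
import numpy as np

def arccos_kernel_1(x, y, eps=1e-12):
    """Degree-1 arc-cosine kernel (Cho and Saul 2009), the infinite-width limit of a
    one-hidden-layer ReLU network. Note it uses x.x, x.y and y.y, not x.y alone."""
    xx, yy, xy = x @ x, y @ y, x @ y
    cos_theta = np.clip(xy / np.sqrt(xx * yy + eps), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return np.sqrt(xx * yy) / np.pi * (np.sin(theta) + (np.pi - theta) * cos_theta)
```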

4 Causal kernels

Kernels for time-indexed processes more general than the standard Wiener process. 🏗

What constraints make a covariance kernel causal? This is not always easily expressed in terms of the covariance kernel; you want something like the inverse covariance/precision.

4.1 Wiener process kernel

The covariance kernel possessed by a standard Wiener process, i.e. a process with independent Gaussian increments, which is indeed a certain type of dependence structure. It is defined over a boring index space, time \(t\geq 0\). We can read it right off the Wiener process Wikipedia page: for a Gaussian process \(\{W_t\}_{t\geq 0},\)

\[ \operatorname{cov}(W_{s},W_{t})=s \wedge t \]

Here \(s \wedge t\) means “the minimum of \(s\) and \(t\)”. From it we can immediately construct the kernel \(K(s,t):=s \wedge t\).
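A quick sketch (mine, not from the text) of using that kernel: build the Gram matrix \(K(s,t)=s\wedge t\) on a time grid and draw joint Gaussian samples, which gives approximate Wiener-process paths.

```python
import numpy as np

t = np.linspace(1e-3, 1.0, 200)              # avoid t = 0, where the variance is exactly zero
K = np.minimum.outer(t, t)                   # K(s, t) = s ∧ t
L = np.linalg.cholesky(K + 1e-9 * np.eye(len(t)))
paths = L @ np.random.default_rng(3).normal(size=(len(t), 5))   # 5 approximate Wiener paths
```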

5 Squared exponential

A.k.a. the exponentiated quadratic. “Radial basis function” often also means this kernel, although not always.

The classic default, analytically convenient because it is proportional to the Gaussian density and therefore cancels out with it at opportune times.

\[K_{\textrm{SE}}(\mathbf{x}, \mathbf{x}') = \sigma^2\exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2}\right)\]
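A minimal implementation sketch of that formula, plus a sanity check that the resulting Gram matrix is positive semi-definite; function names are mine.

```python
import numpy as np
from scipy.spatial.distance import cdist

def k_se(r, sigma=1.0, ell=1.0):
    """Squared-exponential kernel as a function of the distance r = ||x - x'||."""
    return sigma**2 * np.exp(-r**2 / (2 * ell**2))

X = np.random.default_rng(4).normal(size=(8, 2))      # rows are observations
K = k_se(cdist(X, X))                                 # Gram matrix
assert np.all(np.linalg.eigvalsh(K) > -1e-9)          # positive semi-definite, up to round-off
```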

6 Rational Quadratic

Duvenaud reckons this is everywhere but TBH I have not seen it. Included for completeness.

\[K_{\textrm{RQ}}(\mathbf{x}, \mathbf{x}') = \sigma^2 \left( 1 + \frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2 \alpha \ell^2} \right)^{-\alpha}\]

Note that \(\lim_{\alpha\to\infty} K_{\textrm{RQ}}= K_{\textrm{SE}}\).
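A throwaway numerical check of that limit (my own sketch), written in terms of the distance \(r=\|\mathbf{x}-\mathbf{x}'\|\) with \(\sigma=1\):

```python
import numpy as np

k_se = lambda r, ell=1.0: np.exp(-r**2 / (2 * ell**2))
k_rq = lambda r, ell=1.0, alpha=1.0: (1 + r**2 / (2 * alpha * ell**2)) ** (-alpha)

r = np.linspace(0.0, 3.0, 7)
for alpha in (1.0, 10.0, 1000.0):
    print(alpha, np.abs(k_rq(r, alpha=alpha) - k_se(r)).max())   # the gap shrinks as alpha grows
```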

7 Matérn

The Matérn stationary (and in the Euclidean case, isotropic) covariance function is a surprisingly convenient model for covariance. See Carl Edward Rasmussen’s Gaussian Process lecture notes for a readable explanation, or chapter 4 of his textbook (Rasmussen and Williams 2006).

\[ K_{\textrm{Mat}}(\mathbf{x}, \mathbf{x}')=\sigma^{2} \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2 \nu} \frac{\|\mathbf{x} - \mathbf{x}'\|}{\rho}\right)^{\nu} K_{\nu}\left(\sqrt{2 \nu} \frac{\|\mathbf{x} - \mathbf{x}'\|}{\rho}\right) \]

where \(\Gamma\) is the gamma function, \(K_{\nu}\) is the modified Bessel function of the second kind, and \(\rho,\nu> 0\). We may then read the differentiability of the sample paths directly off the parameter \(\nu\).
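A sketch implementation of that formula via SciPy’s Bessel functions (my own toy version; scikit-learn ships the same family as sklearn.gaussian_process.kernels.Matern, which is probably what you want in practice):

```python
import numpy as np
from scipy.special import gamma, kv

def k_matern(r, sigma=1.0, rho=1.0, nu=1.5):
    """Matern kernel as a function of the distance r = ||x - x'||."""
    r = np.asarray(r, dtype=float)
    scaled = np.sqrt(2 * nu) * np.maximum(r, 1e-12) / rho    # dodge the removable singularity at 0
    out = sigma**2 * 2 ** (1 - nu) / gamma(nu) * scaled**nu * kv(nu, scaled)
    return np.where(r == 0.0, sigma**2, out)                 # K_nu blows up at 0; the limit is sigma^2
```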

8 Periodic

\[ K_{\textrm{Per}}(\mathbf{x}, \mathbf{x}') = \sigma^2\exp\left(-\frac{2\sin^2(\pi|\mathbf{x} - \mathbf{x}'|/p)}{\ell^2}\right) \]
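The same, as a short sketch in terms of \(r=|x-x'|\) (names mine):

```python
import numpy as np

def k_per(r, sigma=1.0, ell=1.0, p=1.0):
    """Periodic kernel as a function of r = |x - x'|."""
    return sigma**2 * np.exp(-2 * np.sin(np.pi * r / p) ** 2 / ell**2)
```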


9 Locally periodic

This is an example of a composed kernel.

\[\begin{aligned} K_{\textrm{LocPer}}(\mathbf{x}, \mathbf{x}') &= K_{\textrm{Per}}(\mathbf{x}, \mathbf{x}')K_{\textrm{SE}}(\mathbf{x}, \mathbf{x}') \\ &= \sigma^2\exp\left(-\frac{2\sin^2(\pi|\mathbf{x} - \mathbf{x}'|/p)}{\ell^2}\right) \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2}\right) \end{aligned}\]

Obviously there are other possible localisations of a periodic kernel; this is just one locally periodic kernel. NB it is not local in the sense of Genton’s local stationarity, just local in the sense that one kernel is ‘enveloped’ by another.
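A sketch of that product, inlining the two factors with a shared lengthscale as in the formula above (names mine):

```python
import numpy as np

def k_locper(r, sigma=1.0, ell=1.0, p=1.0):
    """Locally periodic kernel: the periodic kernel 'enveloped' by a squared exponential,
    i.e. the pointwise product of the two factors, with a shared lengthscale as above."""
    per = np.exp(-2 * np.sin(np.pi * r / p) ** 2 / ell**2)
    se = np.exp(-r**2 / (2 * ell**2))
    return sigma**2 * per * se
```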

10 “Integral” kernel

I just noticed the ambiguously named Integral kernel:

I’ve called the kernel the ‘integral kernel’ as we use it when we know observations of the integrals of a function, and want to estimate the function itself.

Examples include:

  • Knowing how far a robot has travelled after 2, 4, 6 and 8 seconds, but wanting an estimate of its speed after 5 seconds…
  • Wanting to know an estimate of the density of people aged 23, when we only have the total count for binned age ranges…

I argue that all kernels are naturally defined in terms of integrals, but the author seems to mean something particular. I suspect I would call this a sampling kernel, but that name is also overloaded. Anyway, what is actually going on here? Where is it introduced? Possibly one of (Smith, Alvarez, and Lawrence 2018; O’Callaghan and Ramos 2011; Murray-Smith and Pearlmutter 2005).

11 Composed kernels

See composing kernels.

12 Stationary spectral kernels

(Sun et al. 2018; Bochner 1959; Kom Samo and Roberts 2015; Yaglom 1987b) construct spectral kernels in the sense that they use the spectral representation to design the kernel and guarantee it is positive definite and stationary. You could think of this as a kind of limiting case of composing kernels with a Fourier basis. See Bochner’s theorem.
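As a concrete instance of designing in the spectral domain, here is a sketch in the style of the spectral mixture kernel of Wilson and Adams (2013) (my own toy version, not their reference implementation): each component places a symmetric pair of Gaussian bumps at \(\pm\mu\) in the spectral density, which Bochner’s theorem turns into a cosine-modulated squared exponential in the lag domain.

```python
import numpy as np

def k_spectral_mixture_1d(tau, weights, means, variances):
    """Sum over components: a Gaussian bump (mirrored at -mu) in the spectral density
    becomes a cosine-modulated squared exponential in the lag domain, via Bochner."""
    tau = np.asarray(tau, dtype=float)
    k = np.zeros_like(tau)
    for w, mu, v in zip(weights, means, variances):
        k += w * np.exp(-2 * np.pi**2 * tau**2 * v) * np.cos(2 * np.pi * tau * mu)
    return k

k = k_spectral_mixture_1d(np.linspace(0, 5, 100),
                          weights=[1.0, 0.5], means=[0.5, 2.0], variances=[0.1, 0.3])
```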

13 Nonstationary spectral kernels

(Sun et al. 2018; Remes, Heinonen, and Kaski 2017; Kom Samo and Roberts 2015) use a generalised Bochner Theorem (Yaglom 1987b) often called Yaglom’s Theorem, which does not presume stationarity. See Yaglom’s theorem.

It is not immediately clear how to use this; spectral representations are not an intuitive way of constructing things.

14 Compactly supported

We usually think about compactly supported kernels in the stationary isotropic case, where we mean kernels that vanish whenever the distance between two observations \(\mathbf{x},\mathbf{y}\) is larger than a certain cut-off distance \(L,\) i.e. \(\|\mathbf{x}-\mathbf{y}\|>L\Rightarrow K(\mathbf{x},\mathbf{y})=0\). These are great because they make the Gram matrix sparse (for example, when the cut-off is much smaller than the diameter of the observations, so that most observations have few covariance neighbours) and so can lead to computational efficiency even for exact inference, without any special tricks. They don’t seem to be popular, though. Statisticians are generally nervous about inferring the support of a parameter, or about assigning zero weight to any region of a prior without good reason, so maybe that is why?

Genton (2001) mentions \[ \max \left\{\left(1-\frac{\|\mathbf{x}-\mathbf{y}\|}{\tilde{\theta}}\right)^{\tilde{\nu}}, 0\right\} \] and handballs us to Gneiting (2002b) for a bigger smörgåsbord of stationary compactly supported kernels. Gneiting (2002b) has a couple of methods designed to produce certain smoothness properties at boundary and origin, but mostly concerns producing compactly supported kernels via clever integral transforms.
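A quick sketch of that truncated kernel and the sparsity it buys (my own toy code; as I understand it, positive definiteness of this family depends on \(\tilde{\nu}\) and the input dimension, which is part of what Gneiting (2002b) addresses):

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix

def k_truncated(r, theta=0.3, nu=2.0):
    """The compactly supported kernel quoted above: max{(1 - r/theta)^nu, 0}.
    It vanishes exactly for r > theta."""
    return np.maximum(1.0 - r / theta, 0.0) ** nu

X = np.random.default_rng(5).uniform(size=(500, 2))
K = csr_matrix(k_truncated(cdist(X, X)))       # most entries are exactly zero
print(f"Gram matrix fill: {K.nnz / 500**2:.1%}")
```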

For inner product kernels, this can be diabolical. The Schaback and Wu method describes some operations that preserve positive-definiteness.

NB if you are trying specifically to enforce sparsity here, it might be worth considering the kernel induced by a stochastic convolution, which is kind of a precision parameterisation.

15 Markov kernels

How can we know from inspecting a kernel whether it implies an independence structure of some kind? The Wiener process and causal kernels clearly imply certain independences. Any kernel of the form \(K(s,t)=f(s\wedge t)\) is clearly Markov. Are there more general ones? 🚧TODO🚧 clarify 🏗

16 Genton kernels

That’s my name for them because they seem to originate in (Genton 2001).

For any non-negative function \(h:\mathcal{T}\to\mathbb{R}^+\) with \(h(\mathbf{0})=0,\) the following is a kernel:

\[ K(\mathbf{x}, \mathbf{y})=\frac{h(\mathbf{x}+\mathbf{y})-h(\mathbf{x}-\mathbf{y})}{4}. \]

For instance, consider the function \(h:\mathbf{x}\mapsto \|\mathbf{x}\|_2^2=\mathbf{x}^{\top} \mathbf{x}.\) From this we obtain the kernel

\[ K(\mathbf{x}, \mathbf{z})=\frac{1}{4}\left[(\mathbf{x}+\mathbf{z})^{\top}(\mathbf{x}+\mathbf{z})-(\mathbf{x}-\mathbf{z})^{\top}(\mathbf{x}-\mathbf{z})\right]=\mathbf{x}^{\top} \mathbf{z}. \]

The motivation is the identity

\[ \operatorname{Cov}\left(Y_{1}, Y_{2}\right)= \frac{\operatorname{Var}\left(Y_{1}+Y_{2}\right)-\operatorname{Var}\left(Y_{1}-Y_{2}\right)}{4}. \]
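A two-line numerical check of the construction (my own sketch): with \(h(\mathbf{x})=\|\mathbf{x}\|_2^2\) it recovers the linear kernel, as claimed.

```python
import numpy as np

def genton_kernel(h, x, y):
    """Kernel built from a function h as above: [h(x + y) - h(x - y)] / 4."""
    return (h(x + y) - h(x - y)) / 4.0

h = lambda x: np.dot(x, x)                           # h(x) = ||x||^2
x, y = np.random.default_rng(6).normal(size=(2, 3))
assert np.isclose(genton_kernel(h, x, y), x @ y)     # recovers the linear kernel x . y
```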

17 Kernels with desired symmetry

(D. Duvenaud 2014, chap. 2) summarises Ginsbourger et al.’s work on kernels with desired symmetries/invariances. 🏗 This produces, for example, the periodic kernel above, but also such cute tricks as priors over Möbius strips.

18 Stationary reducible kernels

See kernel warping.

19 Other nonstationary kernels

See nonstationary kernels.

20 References

Abrahamsen. 1997. “A Review of Gaussian Random Fields and Correlation Functions.”
Agarwal, and III. 2011. “Generative Kernels for Exponential Families.” In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.
Alexanderian. 2015. “A Brief Note on the Karhunen-Loève Expansion.” arXiv:1509.07526 [Math].
Álvarez, Rosasco, and Lawrence. 2012. “Kernels for Vector-Valued Functions: A Review.” Foundations and Trends® in Machine Learning.
Aronszajn. 1950. “Theory of Reproducing Kernels.” Transactions of the American Mathematical Society.
Bach. 2008. “Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning.” In Proceedings of the 21st International Conference on Neural Information Processing Systems. NIPS’08.
Bacry, and Muzy. 2016. “First- and Second-Order Statistics Characterization of Hawkes Processes and Non-Parametric Estimation.” IEEE Transactions on Information Theory.
Balog, Lakshminarayanan, Ghahramani, et al. 2016. “The Mondrian Kernel.” arXiv:1606.05241 [Stat].
Bochner. 1959. Lectures on Fourier Integrals.
Bohn, Griebel, and Rieger. 2018. “A Representer Theorem for Deep Kernel Learning.” arXiv:1709.10441 [Cs, Math].
Bonilla, Krauth, and Dezfouli. 2019. “Generic Inference in Latent Gaussian Process Models.” Journal of Machine Learning Research.
Chen, Genton, and Sun. 2021. “Space-Time Covariance Structures and Models.” Annual Review of Statistics and Its Application.
Cho, and Saul. 2009. “Kernel Methods for Deep Learning.” In Proceedings of the 22nd International Conference on Neural Information Processing Systems. NIPS’09.
Christoudias, Urtasun, and Darrell. 2009. “Bayesian Localized Multiple Kernel Learning.” UCB/EECS-2009-96.
Cortes, Haffner, and Mohri. 2004. “Rational Kernels: Theory and Algorithms.” Journal of Machine Learning Research.
Cressie, and Huang. 1999. “Classes of Nonseparable, Spatio-Temporal Stationary Covariance Functions.” Journal of the American Statistical Association.
Dowling, Sokół, and Park. 2021. “Hida-Matérn Kernel.”
Duvenaud, David. 2014. “Automatic Model Construction with Gaussian Processes.”
Duvenaud, David, Lloyd, Grosse, et al. 2013. “Structure Discovery in Nonparametric Regression Through Compositional Kernel Search.” In Proceedings of the 30th International Conference on Machine Learning (ICML-13).
Duvenaud, David K., Nickisch, and Rasmussen. 2011. “Additive Gaussian Processes.” In Advances in Neural Information Processing Systems.
Gaspari, and Cohn. 1999. “Construction of Correlation Functions in Two and Three Dimensions.” Quarterly Journal of the Royal Meteorological Society.
Genton. 2001. “Classes of Kernels for Machine Learning: A Statistics Perspective.” Journal of Machine Learning Research.
Genton, and Perrin. 2004. “On a Time Deformation Reducing Nonstationary Stochastic Processes to Local Stationarity.” Journal of Applied Probability.
Girolami, and Rogers. 2005. “Hierarchic Bayesian Models for Kernel Learning.” In Proceedings of the 22nd International Conference on Machine Learning - ICML ’05.
Gneiting. 2002a. “Nonseparable, Stationary Covariance Functions for Space–Time Data.” Journal of the American Statistical Association.
———. 2002b. “Compactly Supported Correlation Functions.” Journal of Multivariate Analysis.
Gneiting, Kleiber, and Schlather. 2010. “Matérn Cross-Covariance Functions for Multivariate Random Fields.” Journal of the American Statistical Association.
Grosse, Salakhutdinov, Freeman, et al. 2012. “Exploiting Compositionality to Explore a Large Space of Model Structures.” In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Hartikainen, and Särkkä. 2010. “Kalman Filtering and Smoothing Solutions to Temporal Gaussian Process Regression Models.” In 2010 IEEE International Workshop on Machine Learning for Signal Processing.
Hawkes. 1971. “Spectra of Some Self-Exciting and Mutually Exciting Point Processes.” Biometrika.
Hinton, and Salakhutdinov. 2008. “Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes.” In Advances in Neural Information Processing Systems 20.
Hofmann, Schölkopf, and Smola. 2008. “Kernel Methods in Machine Learning.” The Annals of Statistics.
Kar, and Karnick. 2012. “Random Feature Maps for Dot Product Kernels.” In Artificial Intelligence and Statistics.
Kom Samo, and Roberts. 2015. “Generalized Spectral Kernels.” arXiv:1506.02236 [Stat].
Kondor, and Jebara. 2006. “Gaussian and Wishart Hyperkernels.” In Proceedings of the 19th International Conference on Neural Information Processing Systems. NIPS’06.
Krauth, Bonilla, Cutajar, et al. 2016. “AutoGP: Exploring the Capabilities and Limitations of Gaussian Process Models.” In UAI17.
Lawrence. 2005. “Probabilistic Non-Linear Principal Component Analysis with Gaussian Process Latent Variable Models.” Journal of Machine Learning Research.
Lloyd, Duvenaud, Grosse, et al. 2014. “Automatic Construction and Natural-Language Description of Nonparametric Regression Models.” In Twenty-Eighth AAAI Conference on Artificial Intelligence.
Mercer. 1909. “Functions of Positive and Negative Type, and Their Connection with the Theory of Integral Equations.” Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character.
Micchelli, and Pontil. 2005a. “Learning the Kernel Function via Regularization.” Journal of Machine Learning Research.
———. 2005b. “On Learning Vector-Valued Functions.” Neural Computation.
Minasny, and McBratney. 2005. “The Matérn Function as a General Model for Soil Variograms.” Geoderma, Pedometrics 2003.
Murphy. 2012. Machine learning: a probabilistic perspective. Adaptive computation and machine learning series.
Murray-Smith, and Pearlmutter. 2005. “Transformations of Gaussian Process Priors.” In Deterministic and Statistical Methods in Machine Learning. Lecture Notes in Computer Science.
O’Callaghan, and Ramos. 2011. “Continuous Occupancy Mapping with Integral Kernels.” In Twenty-Fifth AAAI Conference on Artificial Intelligence.
Ong, Mary, Canu, et al. 2004. “Learning with Non-Positive Kernels.” In Twenty-First International Conference on Machine Learning - ICML ’04.
Ong, and Smola. 2003. “Machine Learning Using Hyperkernels.” In Proceedings of the Twentieth International Conference on International Conference on Machine Learning. ICML’03.
Ong, Smola, and Williamson. 2002. “Hyperkernels.” In Proceedings of the 15th International Conference on Neural Information Processing Systems. NIPS’02.
———. 2005. “Learning the Kernel with Hyperkernels.” Journal of Machine Learning Research.
Paciorek, and Schervish. 2003. “Nonstationary Covariance Functions for Gaussian Process Regression.” In Proceedings of the 16th International Conference on Neural Information Processing Systems. NIPS’03.
Pérez-Abreu, and Rocha-Arteaga. 2005. “Covariance-Parameter Lévy Processes in the Space of Trace-Class Operators.” Infinite Dimensional Analysis, Quantum Probability and Related Topics.
Perrin, and Senoussi. 1999. “Reducing Non-Stationary Stochastic Processes to Stationarity by a Time Deformation.” Statistics & Probability Letters.
———. 2000. “Reducing Non-Stationary Random Fields to Stationarity and Isotropy Using a Space Deformation.” Statistics & Probability Letters.
Pfaffel. 2012. “Wishart Processes.” arXiv:1201.3256 [Math].
Rakotomamonjy, Bach, Canu, et al. 2008. “SimpleMKL.” Journal of Machine Learning Research.
Rasmussen, and Williams. 2006. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning.
Remes, Heinonen, and Kaski. 2017. “Non-Stationary Spectral Kernels.” In Advances in Neural Information Processing Systems 30.
———. 2018. “Neural Non-Stationary Spectral Kernel.” arXiv:1811.10978 [Cs, Stat].
Sampson, and Guttorp. 1992. “Nonparametric Estimation of Nonstationary Spatial Covariance Structure.” Journal of the American Statistical Association.
Särkkä, and Hartikainen. 2012. “Infinite-Dimensional Kalman Filtering Approach to Spatio-Temporal Gaussian Process Regression.” In Artificial Intelligence and Statistics.
Särkkä, Solin, and Hartikainen. 2013. “Spatiotemporal Learning via Infinite-Dimensional Bayesian Filtering and Smoothing: A Look at Gaussian Process Regression Through Kalman Filtering.” IEEE Signal Processing Magazine.
Schölkopf, Herbrich, and Smola. 2001. “A Generalized Representer Theorem.” In Computational Learning Theory. Lecture Notes in Computer Science.
Schölkopf, and Smola. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond.
———. 2003. “A Short Introduction to Learning with Kernels.” In Advanced Lectures on Machine Learning. Lecture Notes in Computer Science 2600.
Sinha, and Duchi. 2016. “Learning Kernels with Random Features.” In Advances in Neural Information Processing Systems 29.
Smith, Alvarez, and Lawrence. 2018. “Gaussian Process Regression for Binned Data.” arXiv:1809.02010 [Cs, Stat].
Smola, Óvári, and Williamson. 2000. “Regularization with Dot-Product Kernels.” In Proceedings of the 13th International Conference on Neural Information Processing Systems. NIPS’00.
Stein. 2005. “Space-Time Covariance Functions.” Journal of the American Statistical Association.
Sun, Zhang, Wang, et al. 2018. “Differentiable Compositional Kernel Learning for Gaussian Processes.” arXiv Preprint arXiv:1806.04326.
Székely, and Rizzo. 2009. “Brownian Distance Covariance.” The Annals of Applied Statistics.
Tsuchida, Roosta, and Gallagher. 2018. “Invariance of Weight Distributions in Rectified MLPs.” In International Conference on Machine Learning.
———. 2019. “Richer Priors for Infinitely Wide Multi-Layer Perceptrons.” arXiv:1911.12927 [Cs, Stat].
Uziel. 2020. “Nonparametric Sequential Prediction While Deep Learning the Kernel.” In International Conference on Artificial Intelligence and Statistics.
van Delft, and Eichler. 2016. “Locally Stationary Functional Time Series.” arXiv:1602.05125 [Math, Stat].
van der Wilk, Wilson, and Rasmussen. 2014. “Variational Inference for Latent Variable Modelling of Correlation Structure.” In NIPS 2014 Workshop on Advances in Variational Inference.
Vedaldi, and Zisserman. 2012. “Efficient Additive Kernels via Explicit Feature Maps.” IEEE Transactions on Pattern Analysis and Machine Intelligence.
Vert, Tsuda, and Schölkopf. 2004. “A Primer on Kernel Methods.” In Kernel Methods in Computational Biology.
Vishwanathan, Schraudolph, Kondor, et al. 2010. “Graph Kernels.” Journal of Machine Learning Research.
Williams. 1996. “Computing with Infinite Networks.” In Proceedings of the 9th International Conference on Neural Information Processing Systems. NIPS’96.
Wilson, and Adams. 2013. “Gaussian Process Kernels for Pattern Discovery and Extrapolation.” In International Conference on Machine Learning.
Wilson, Dann, Lucas, et al. 2015. “The Human Kernel.” arXiv:1510.07389 [Cs, Stat].
Wilson, and Ghahramani. 2011. “Generalised Wishart Processes.” In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence. UAI’11.
———. 2012. “Modelling Input Varying Correlations Between Multiple Responses.” In Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science.
Wilson, Hu, Salakhutdinov, et al. 2016. “Deep Kernel Learning.” In Artificial Intelligence and Statistics.
Wu. 1995. “Compactly Supported Positive Definite Radial Functions.” Advances in Computational Mathematics.
Yaglom. 1987a. Correlation Theory of Stationary and Related Random Functions. Volume II: Supplementary Notes and References. Springer Series in Statistics.
———. 1987b. Correlation Theory of Stationary and Related Random Functions Volume I.
Yu, Cheng, Schuurmans, et al. 2013. “Characterizing the Representer Theorem.” In Proceedings of the 30th International Conference on Machine Learning (ICML-13).
Zhang, and Paisley. 2019. “Random Function Priors for Correlation Modeling.” In International Conference on Machine Learning.