# Kernel zoo

What follows are some useful kernels to have in my toolkit, mostly over $$\mathbb{R}^n$$ or at least some space with a metric. There are many more than I could fit here, of course. And kernels are defined over many spaces: real vectors, strings, other kernels, probability distributions etc.

For these I have freely raided David Duvenaud’s crib notes, which became a thesis chapter (Duvenaud 2014). Also Wikipedia.

TODO: kernel venn diagram.

## Stationary

A popular assumption, more or less implying that no region of the process is special. In this case the kernel may be written as a function purely of the distance between the observation coordinates, $K(\mathbf{x},\mathbf{y})=K(\|\mathbf{x}-\mathbf{y}\|)$ for some norm $$\|\cdot\|$$. This kind of translation-invariant kernel is the default. Stationary kernels are conveniently analysed in terms of the Wiener–Khinchin theorem.

A general stationary kernel is the Hida–Matérn kernel (Dowling, Sokół, and Park 2021).

## Dot-product

The kernel is a function of the inner product/dot product of the input coordinates, and we may overload notation to write $K(\mathbf{x},\mathbf{y})=K(\mathbf{x}\cdot \mathbf{y}).$ These are hard to google for, thanks to the confounding fact that kernels already define inner products in some other space, so this is a dot product defined in terms of a dot product.

Such kernels are rotation invariant but not stationary. Instead of the Fourier relationships that stationary kernels enjoy, these can be described in Legendre bases for radial functions. Smola, Óvári, and Williamson (2000) manufacture some theorems which tell you whether your choice of function can in fact define an inner product kernel. Unfortunately the bases so defined are long and ugly and not especially tractable to work with, except maybe by computer algebra systems.

## NN kernels

Infinite-width random NN kernels are nearly dot product kernels. They depend on several dot products, $$\mathbf{x}\cdot \mathbf{x}$$, $$\mathbf{x}\cdot \mathbf{y}$$ and $$\mathbf{y}\cdot \mathbf{y}$$. See NN kernels.

## Causal kernels

Kernels for time-indexed processes more general than the standard Wiener process. 🏗

What constraints make a covariance kernel causal? This is not always easily expressed in terms of the covariance kernel; you want something like the inverse covariance/precision.

### Wiener process kernel

The covariance kernel possessed by a standard Wiener process, i.e. a process with independent Gaussian increments, which encodes a certain type of dependence. It is over a boring index space, time $$t\in \mathbb{R}$$. We can read this right off the Wiener process Wikipedia page: for a Gaussian process $$\{W_t\}_{t\in\mathbb{R}},$$

$\operatorname{cov}(W_{s},W_{t})=s \wedge t$

Here $$s \wedge t$$ means “the minimum of $$s$$ and $$t$$”. From it we can immediately construct the kernel $$K(s,t):=s \wedge t$$.
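As a sanity check, here is a minimal numpy sketch (variable names mine) that builds the Gram matrix of this kernel on a time grid and draws a Brownian-motion-like sample path from it:

```python
import numpy as np

def wiener_kernel(s, t):
    """Wiener process covariance: K(s, t) = min(s, t)."""
    return np.minimum(s, t)

# Gram matrix on a time grid; sampling from the corresponding Gaussian
# gives Brownian-motion-like paths.
ts = np.linspace(0.01, 1.0, 100)
K = wiener_kernel(ts[:, None], ts[None, :])
rng = np.random.default_rng(42)
path = rng.multivariate_normal(np.zeros(len(ts)), K)
```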

## Squared exponential

A.k.a. exponentiated quadratic. Often “radial basis function” means this too, although not always.

The classic default: analytically convenient because it is proportional to the Gaussian density and therefore cancels out with it at opportune times.

$K_{\textrm{SE}}(\mathbf{x}, \mathbf{x}') = \sigma^2\exp\left(-\frac{(\mathbf{x} - \mathbf{x}')^2}{2\ell^2}\right)$
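A minimal numpy sketch of this kernel (the hyperparameter names `sigma` and `ell` follow the formula above):

```python
import numpy as np

def k_se(x, y, sigma=1.0, ell=1.0):
    """Squared-exponential kernel between points (or vectors) x and y."""
    r2 = np.sum((np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2)
    return sigma**2 * np.exp(-r2 / (2 * ell**2))
```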

## Rational quadratic

Duvenaud reckons this is everywhere, but TBH I have not seen it used much. Included for completeness.

$K_{\textrm{RQ}}(\mathbf{x}, \mathbf{x}') = \sigma^2 \left( 1 + \frac{(\mathbf{x} - \mathbf{x}')^2}{2 \alpha \ell^2} \right)^{-\alpha}$

Note that $$\lim_{\alpha\to\infty} K_{\textrm{RQ}}= K_{\textrm{SE}}$$.
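We can check that limit numerically; a small sketch, with hyperparameter names as in the formula above:

```python
import numpy as np

def k_rq(r2, alpha, ell=1.0, sigma=1.0):
    """Rational quadratic kernel as a function of squared distance r2."""
    return sigma**2 * (1 + r2 / (2 * alpha * ell**2)) ** (-alpha)

r2 = 1.5
se_val = np.exp(-r2 / 2)        # SE kernel at the same squared distance
rq_val = k_rq(r2, alpha=1e6)    # large alpha should be close to SE
```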

## Matérn

The Matérn stationary (and, in the Euclidean case, isotropic) covariance function is a surprisingly convenient model for covariance. See Carl Edward Rasmussen’s Gaussian Process lecture notes for a readable explanation, or chapter 4 of his textbook (Rasmussen and Williams 2006).

$K_{\textrm{Mat}}(\mathbf{x}, \mathbf{x}')=\sigma^{2} \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2 \nu} \frac{\|\mathbf{x} - \mathbf{x}'\|}{\rho}\right)^{\nu} K_{\nu}\left(\sqrt{2 \nu} \frac{\|\mathbf{x} - \mathbf{x}'\|}{\rho}\right)$

where $$\Gamma$$ is the gamma function, $$K_{\nu}$$ is the modified Bessel function of the second kind, and $$\rho,\nu\geq 0$$. We may then read the differentiability of the process directly off the parameterization.
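Rather than evaluating the Bessel function, a sketch can use the well-known closed forms for the half-integer orders $$\nu\in\{1/2,3/2,5/2\}$$, which are the cases mostly used in practice (function and parameter names are mine):

```python
import numpy as np

def k_matern(d, nu=1.5, rho=1.0, sigma=1.0):
    """Matérn kernel at distance d, for the common half-integer orders."""
    s = np.abs(np.asarray(d, dtype=float)) / rho
    if nu == 0.5:    # exponential kernel: continuous, nowhere-differentiable paths
        return sigma**2 * np.exp(-s)
    if nu == 1.5:    # once mean-square differentiable
        return sigma**2 * (1 + np.sqrt(3) * s) * np.exp(-np.sqrt(3) * s)
    if nu == 2.5:    # twice mean-square differentiable
        return sigma**2 * (1 + np.sqrt(5) * s + 5 * s**2 / 3) * np.exp(-np.sqrt(5) * s)
    raise ValueError("only nu in {0.5, 1.5, 2.5} implemented in this sketch")
```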

## Periodic

$K_{\textrm{Per}}(\mathbf{x}, \mathbf{x}') = \sigma^2\exp\left(-\frac{2\sin^2(\pi|\mathbf{x} - \mathbf{x}'|/p)}{\ell^2}\right)$

## Locally periodic

This is an example of a composed kernel.

\begin{aligned} K_{\textrm{LocPer}}(\mathbf{x}, \mathbf{x}') &= K_{\textrm{Per}}(\mathbf{x}, \mathbf{x}')K_{\textrm{SE}}(\mathbf{x}, \mathbf{x}') \\ &= \sigma^2\exp\left(-\frac{2\sin^2(\pi|\mathbf{x} - \mathbf{x}'|/p)}{\ell^2}\right) \exp\left(-\frac{(\mathbf{x} - \mathbf{x}')^2}{2\ell^2}\right) \end{aligned}

Obviously there are other possible localisations of a periodic kernel; this is just one. NB it is not local in the sense of Genton’s local stationarity, just local in the sense that one kernel is ‘enveloped’ by another.
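A minimal numpy sketch of this composition, with hyperparameters defaulting to 1 for brevity (function names mine):

```python
import numpy as np

def k_per(x, y, sigma=1.0, ell=1.0, p=1.0):
    """Periodic kernel with period p."""
    return sigma**2 * np.exp(-2 * np.sin(np.pi * np.abs(x - y) / p) ** 2 / ell**2)

def k_se(x, y, ell=1.0):
    """Squared-exponential envelope."""
    return np.exp(-((x - y) ** 2) / (2 * ell**2))

def k_locper(x, y):
    """Product of kernels is a kernel: periodic correlations that decay with distance."""
    return k_per(x, y) * k_se(x, y)
```

Note that at a distance of five periods the plain periodic kernel is back at full correlation, while the locally periodic one has decayed almost to zero.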

## “Integral” kernel

I just noticed the ambiguously named Integral kernel:

> I’ve called the kernel the ‘integral kernel’ as we use it when we know observations of the integrals of a function, and want to estimate the function itself.

Examples include:

• Knowing how far a robot has travelled after 2, 4, 6 and 8 seconds, but wanting an estimate of its speed after 5 seconds…
• Wanting to know an estimate of the density of people aged 23, when we only have the total count for binned age ranges…

I argue that all kernels are naturally defined in terms of integrals, but the author seems to mean something particular. I suspect I would call this a sampling kernel, but that name is also overloaded. Anyway, what is actually going on here? Where is it introduced? Possibly one of .

## Stationary spectral kernels

Wilson and Adams (2013) construct spectral kernels, in the sense that they use the spectral representation to design the kernel and guarantee it is positive definite and stationary. You could think of this as a kind of limiting case of composing kernels with a Fourier basis. See Bochner’s theorem.

## Nonstationary spectral kernels

Kom Samo and Roberts (2015) and Remes, Heinonen, and Kaski (2017) use a generalised Bochner theorem, often called Yaglom’s theorem, which does not presume stationarity. See Yaglom’s theorem.

It is not immediately clear how to use this; spectral representations are not an intuitive way of constructing things.

## Compactly supported

We usually think about compactly supported kernels in the stationary isotropic case, where we mean kernels that vanish whenever the distance between two observations $$\mathbf{x},\mathbf{y}$$ is larger than a certain cut-off distance $$L,$$ i.e. $$\|\mathbf{x}-\mathbf{y}\|>L\Rightarrow K(\mathbf{x},\mathbf{y})=0$$. These are great because they make the Gram matrix sparse (for example, if the cut-off is much smaller than the diameter of the observations, most observations have few covariance neighbours) and so can lead to computational efficiency even for exact inference without any special tricks. They don’t seem to be popular, though. Statisticians are generally nervous about inferring the support of a parameter, or assigning zero weight to any region of a prior without good reason, so maybe that is why.

Genton (2001) mentions $\max \left\{\left(1-\frac{\|\mathbf{x}-\mathbf{y}\|}{\tilde{\theta}}\right)^{\tilde{\nu}}, 0\right\}$ and handballs us to Gneiting (2002b) for a bigger smörgåsbord of stationary compactly supported kernels. Gneiting (2002b) has a couple of methods designed to produce certain smoothness properties at boundary and origin, but mostly concerns producing compactly supported kernels via clever integral transforms.
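A sketch of Genton’s truncated-power example above, showing the sparse Gram matrix it induces. Caveat: whether this function is actually positive definite depends on $$\tilde{\nu}$$ being large enough relative to the input dimension; parameter names are mine:

```python
import numpy as np

def k_compact(x, y, theta=1.0, nu=2.0):
    """Truncated-power kernel: vanishes whenever |x - y| > theta."""
    r = np.abs(x - y)
    return np.maximum(1 - r / theta, 0.0) ** nu

xs = np.linspace(0, 10, 50)
K = k_compact(xs[:, None], xs[None, :])
sparsity = np.mean(K == 0)   # most entries vanish: cut-off << data diameter
```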

For inner product kernels, this can be diabolical. Schaback and Wu discuss some operations that preserve positive-definiteness.

NB if you are trying specifically to enforce sparsity here, it might be worth considering the kernel induced by a stochastic convolution, which is kind of a precision parameterisation.

## Markov kernels

How can we know from inspecting a kernel whether it implies an independence structure of some kind? The Wiener process and causal kernels clearly imply certain independences. Any kernel $$K(s,t)=K(s\wedge t)$$ is clearly Markov. Are there more general ones? TODO: relate to kernels of bounded support. 🏗

## Genton kernels

That’s my name for them, because they seem to originate in Genton (2001).

For any non-negative function $$h:\mathcal{T}\to\mathbb{R}^+$$ with $$h(\mathbf{0})=0,$$ the following is a kernel:

$K(\mathbf{x}, \mathbf{y})=\frac{h(\mathbf{x}+\mathbf{y})-h(\mathbf{x}-\mathbf{y})}{4}.$ Genton gives the example of $$h:\mathbf{x}\mapsto \|\mathbf{x}\|_2^2=\mathbf{x}^{\top} \mathbf{x}.$$ From this we obtain the kernel $K(\mathbf{x}, \mathbf{z})=\frac{1}{4}\left[(\mathbf{x}+\mathbf{z})^{\top}(\mathbf{x}+\mathbf{z})-(\mathbf{x}-\mathbf{z})^{\top}(\mathbf{x}-\mathbf{z})\right]=\mathbf{x}^{\top} \mathbf{z}.$ The motivation is the identity

$\operatorname{Cov}\left(Y_{1}, Y_{2}\right)= \frac{\operatorname{Var}\left(Y_{1}+Y_{2}\right)-\operatorname{Var}\left(Y_{1}-Y_{2}\right)}{4}.$
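A quick numerical check (names mine) that the construction with $$h(\mathbf{x})=\mathbf{x}^{\top}\mathbf{x}$$ recovers the linear kernel:

```python
import numpy as np

def genton_kernel(h, x, y):
    """K(x, y) = (h(x + y) - h(x - y)) / 4, for nonnegative h with h(0) = 0."""
    return (h(x + y) - h(x - y)) / 4

h = lambda v: np.dot(v, v)      # h(x) = ||x||^2
x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
k_val = genton_kernel(h, x, y)  # should equal the linear kernel x·y
```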

## Kernels with desired symmetry

summarises Ginsbourger et al.’s work on kernels with desired symmetries/invariances. 🏗 This produces, for example, the periodic kernel above, but also such cute tricks as priors over Möbius strips.

## Stationary reducible kernels

See kernel warping.

## References

Abrahamsen, Petter. 1997.
Agarwal, Arvind, and Hal Daumé Iii. 2011. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 85–92.
Alexanderian, Alen. 2015. arXiv:1509.07526 [Math], October.
Álvarez, Mauricio A., Lorenzo Rosasco, and Neil D. Lawrence. 2012. Foundations and Trends® in Machine Learning 4 (3): 195–266.
Aronszajn, N. 1950. Transactions of the American Mathematical Society 68 (3): 337–404.
Bach, Francis. 2008. In Proceedings of the 21st International Conference on Neural Information Processing Systems, 105–12. NIPS’08. USA: Curran Associates Inc.
Bacry, Emmanuel, and Jean-François Muzy. 2016. IEEE Transactions on Information Theory 62 (4): 2184–2202.
Balog, Matej, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M. Roy, and Yee Whye Teh. 2016. arXiv:1606.05241 [Stat], June.
Bochner, Salomon. 1959. Lectures on Fourier Integrals. Princeton University Press.
Bohn, Bastian, Michael Griebel, and Christian Rieger. 2018. arXiv:1709.10441 [Cs, Math], June.
Bonilla, Edwin V., Karl Krauth, and Amir Dezfouli. 2019. Journal of Machine Learning Research 20 (117): 1–63.
Chen, Wanfang, Marc G. Genton, and Ying Sun. 2021. Annual Review of Statistics and Its Application 8 (1): 191–215.
Cho, Youngmin, and Lawrence K. Saul. 2009. In Proceedings of the 22nd International Conference on Neural Information Processing Systems, 22:342–50. NIPS’09. Red Hook, NY, USA: Curran Associates Inc.
Christoudias, Mario, Raquel Urtasun, and Trevor Darrell. 2009. UCB/EECS-2009-96. EECS Department, University of California, Berkeley.
Cortes, Corinna, Patrick Haffner, and Mehryar Mohri. 2004. Journal of Machine Learning Research 5 (December): 1035–62.
Cressie, Noel, and Hsin-Cheng Huang. 1999. Journal of the American Statistical Association 94 (448): 1330–39.
Delft, Anne van, and Michael Eichler. 2016. arXiv:1602.05125 [Math, Stat], February.
Dowling, Matthew, Piotr Sokół, and Il Memming Park. 2021. arXiv.
Duvenaud, David. 2014. PhD Thesis, University of Cambridge.
Duvenaud, David K., Hannes Nickisch, and Carl E. Rasmussen. 2011. In Advances in Neural Information Processing Systems, 226–34.
Duvenaud, David, James Lloyd, Roger Grosse, Joshua Tenenbaum, and Ghahramani Zoubin. 2013. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 1166–74.
Gaspari, Gregory, and Stephen E. Cohn. 1999. Quarterly Journal of the Royal Meteorological Society 125 (554): 723–57.
Genton, Marc G. 2001. Journal of Machine Learning Research 2 (December): 299–312.
Genton, Marc G., and Olivier Perrin. 2004. Journal of Applied Probability 41 (1): 236–49.
Girolami, Mark, and Simon Rogers. 2005. In Proceedings of the 22nd International Conference on Machine Learning - ICML ’05, 241–48. Bonn, Germany: ACM Press.
Gneiting, Tilmann. 2002a. Journal of the American Statistical Association 97 (458): 590–600.
———. 2002b. Journal of Multivariate Analysis 83 (2): 493–508.
Gneiting, Tilmann, William Kleiber, and Martin Schlather. 2010. Journal of the American Statistical Association 105 (491): 1167–77.
Grosse, Roger, Ruslan R. Salakhutdinov, William T. Freeman, and Joshua B. Tenenbaum. 2012. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Hartikainen, J., and S. Särkkä. 2010. In 2010 IEEE International Workshop on Machine Learning for Signal Processing, 379–84. Kittila, Finland: IEEE.
Hawkes, Alan G. 1971. Biometrika 58 (1): 83–90.
Hinton, Geoffrey E, and Ruslan R Salakhutdinov. 2008. In Advances in Neural Information Processing Systems 20, edited by J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, 1249–56. Curran Associates, Inc.
Hofmann, Thomas, Bernhard Schölkopf, and Alexander J. Smola. 2008. The Annals of Statistics 36 (3): 1171–1220.
Kar, Purushottam, and Harish Karnick. 2012. In Artificial Intelligence and Statistics, 583–91. PMLR.
Kom Samo, Yves-Laurent, and Stephen Roberts. 2015. arXiv:1506.02236 [Stat], June.
Kondor, Risi, and Tony Jebara. 2006. In Proceedings of the 19th International Conference on Neural Information Processing Systems, 729–36. NIPS’06. Cambridge, MA, USA: MIT Press.
Krauth, Karl, Edwin V. Bonilla, Kurt Cutajar, and Maurizio Filippone. 2016. In UAI17.
Lawrence, Neil. 2005. Journal of Machine Learning Research 6 (Nov): 1783–1816.
Lloyd, James Robert, David Duvenaud, Roger Grosse, Joshua Tenenbaum, and Zoubin Ghahramani. 2014. In Twenty-Eighth AAAI Conference on Artificial Intelligence.
Mercer, J. 1909. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 209 (441-458): 415–46.
Micchelli, Charles A., and Massimiliano Pontil. 2005a. Journal of Machine Learning Research 6 (Jul): 1099–1125.
———. 2005b. Neural Computation 17 (1): 177–204.
Minasny, Budiman, and Alex. B. McBratney. 2005. Geoderma, Pedometrics 2003, 128 (3–4): 192–207.
Murphy, Kevin P. 2012. Machine learning: a probabilistic perspective. 1 edition. Adaptive computation and machine learning series. Cambridge, MA: MIT Press.
Murray-Smith, Roderick, and Barak A. Pearlmutter. 2005. In Deterministic and Statistical Methods in Machine Learning, edited by Joab Winkler, Mahesan Niranjan, and Neil Lawrence, 110–23. Lecture Notes in Computer Science. Springer Berlin Heidelberg.
O’Callaghan, Simon Timothy, and Fabio T. Ramos. 2011. In Twenty-Fifth AAAI Conference on Artificial Intelligence.
Ong, Cheng Soon, Xavier Mary, Stéphane Canu, and Alexander J. Smola. 2004. In Twenty-First International Conference on Machine Learning - ICML ’04, 81. Banff, Alberta, Canada: ACM Press.
Ong, Cheng Soon, and Alexander J. Smola. 2003. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, 568–75. ICML’03. Washington, DC, USA: AAAI Press.
Ong, Cheng Soon, Alexander J. Smola, and Robert C. Williamson. 2002. Hyperkernels.” In Proceedings of the 15th International Conference on Neural Information Processing Systems, 495–502. NIPS’02. Cambridge, MA, USA: MIT Press.
———. 2005. Journal of Machine Learning Research 6 (Jul): 1043–71.
Paciorek, Christopher J., and Mark J. Schervish. 2003. In Proceedings of the 16th International Conference on Neural Information Processing Systems, 16:273–80. NIPS’03. Cambridge, MA, USA: MIT Press.
Pérez-Abreu, Víctor, and Alfonso Rocha-Arteaga. 2005. Infinite Dimensional Analysis, Quantum Probability and Related Topics 08 (01): 33–54.
Perrin, Olivier, and Rachid Senoussi. 1999. Statistics & Probability Letters 43 (4): 393–97.
———. 2000. Statistics & Probability Letters 48 (1): 23–32.
Pfaffel, Oliver. 2012. arXiv:1201.3256 [Math], January.
Rakotomamonjy, Alain, Francis R. Bach, Stéphane Canu, and Yves Grandvalet. 2008. SimpleMKL.” Journal of Machine Learning Research 9 (Nov): 2491–521.
Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. Cambridge, Mass: MIT Press.
Remes, Sami, Markus Heinonen, and Samuel Kaski. 2017. In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 4642–51. Curran Associates, Inc.
———. 2018. arXiv:1811.10978 [Cs, Stat], November.
Sampson, Paul D., and Peter Guttorp. 1992. Journal of the American Statistical Association 87 (417): 108–19.
Särkkä, Simo, and Jouni Hartikainen. 2012. In Artificial Intelligence and Statistics.
Särkkä, Simo, A. Solin, and J. Hartikainen. 2013. IEEE Signal Processing Magazine 30 (4): 51–61.
Schölkopf, Bernhard, Ralf Herbrich, and Alex J. Smola. 2001. In Computational Learning Theory, edited by David Helmbold and Bob Williamson, 416–26. Lecture Notes in Computer Science. Springer Berlin Heidelberg.
Schölkopf, Bernhard, and Alexander J. Smola. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
———. 2003. In Advanced Lectures on Machine Learning, edited by Shahar Mendelson and Alexander J. Smola, 41–64. Lecture Notes in Computer Science 2600. Springer Berlin Heidelberg.
Sinha, Aman, and John C Duchi. 2016. In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 1298–1306. Curran Associates, Inc.
Smith, Michael Thomas, Mauricio A. Alvarez, and Neil D. Lawrence. 2018. arXiv:1809.02010 [Cs, Stat], September.
Smola, Alex J., Zoltán L. Óvári, and Robert C. Williamson. 2000. In Proceedings of the 13th International Conference on Neural Information Processing Systems, 290–96. NIPS’00. Cambridge, MA, USA: MIT Press.
Stein, Michael L. 2005. Journal of the American Statistical Association 100 (469): 310–21.
Sun, Shengyang, Guodong Zhang, Chaoqi Wang, Wenyuan Zeng, Jiaman Li, and Roger Grosse. 2018. “Differentiable Compositional Kernel Learning for Gaussian Processes.” arXiv Preprint arXiv:1806.04326.
Székely, Gábor J., and Maria L. Rizzo. 2009. The Annals of Applied Statistics 3 (4): 1236–65.
Tsuchida, Russell, Fred Roosta, and Marcus Gallagher. 2018. In International Conference on Machine Learning, 4995–5004. PMLR.
———. 2019. arXiv:1911.12927 [Cs, Stat], November.
Uziel, Guy. 2020. In International Conference on Artificial Intelligence and Statistics, 111–21. PMLR.
Vedaldi, A., and A. Zisserman. 2012. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (3): 480–92.
Vert, Jean-Philippe, Koji Tsuda, and Bernhard Schölkopf. 2004. In Kernel Methods in Computational Biology. MIT Press.
Vishwanathan, S. V. N., Nicol N. Schraudolph, Risi Kondor, and Karsten M. Borgwardt. 2010. Journal of Machine Learning Research 11 (August): 1201–42.
Wilk, Mark van der, Andrew G. Wilson, and Carl E. Rasmussen. 2014. “Variational Inference for Latent Variable Modelling of Correlation Structure.” In NIPS 2014 Workshop on Advances in Variational Inference.
Williams, Christopher K. I. 1996. In Proceedings of the 9th International Conference on Neural Information Processing Systems, 295–301. NIPS’96. Cambridge, MA, USA: MIT Press.
Wilson, Andrew Gordon, and Ryan Prescott Adams. 2013. In International Conference on Machine Learning.
Wilson, Andrew Gordon, Christoph Dann, Christopher G. Lucas, and Eric P. Xing. 2015. arXiv:1510.07389 [Cs, Stat], October.
Wilson, Andrew Gordon, and Zoubin Ghahramani. 2011. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, 736–44. UAI’11. Arlington, Virginia, United States: AUAI Press.
———. 2012. “Modelling Input Varying Correlations Between Multiple Responses.” In Machine Learning and Knowledge Discovery in Databases, edited by Peter A. Flach, Tijl De Bie, and Nello Cristianini, 858–61. Lecture Notes in Computer Science. Springer Berlin Heidelberg.
Wilson, Andrew Gordon, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. 2016. In Artificial Intelligence and Statistics, 370–78. PMLR.
Wu, Zongmin. 1995. Advances in Computational Mathematics 4 (1): 283–92.
Yaglom, A. M. 1987. Correlation Theory of Stationary and Related Random Functions. Volume II: Supplementary Notes and References. Springer Series in Statistics. New York, NY: Springer Science & Business Media.
Yu, Yaoliang, Hao Cheng, Dale Schuurmans, and Csaba Szepesvári. 2013. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 570–78.
Zhang, Aonan, and John Paisley. 2019. In International Conference on Machine Learning, 7424–33.
