Suppose we have a real-valued stochastic process

\[\{\mathsf{x}(t)\}_{t\in \mathcal{T}}. \] For now we may as well take the index set to be \(\mathcal{T}\subseteq\mathbb{R}^D\), or at least a nice metric space.

The covariance kernel of \(\mathsf{x}\) is a function

\[\begin{aligned}\kappa:&\mathcal{T}\times \mathcal{T}&\to& \mathbb{R}\\ &s,t&\mapsto&\operatorname{Cov}(\mathsf{x}(s),\mathsf{x}(t)). \end{aligned}\]

This is covariance in the usual sense, to wit,

\[\begin{aligned} \operatorname{Cov}(\mathsf{x}(s),\mathsf{x}(t)) &:=\mathbb{E}\bigl[(\mathsf{x}(s)-\mathbb{E}[\mathsf{x}(s)])(\mathsf{x}(t)-\mathbb{E}[\mathsf{x}(t)])\bigr]\\ &=\mathbb{E}[\mathsf{x}(s)\mathsf{x}(t)]- \mathbb{E}[\mathsf{x}(s)]\mathbb{E}[\mathsf{x}(t)]. \end{aligned}\]
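To make this concrete, here is a quick empirical sanity check (a sketch assuming NumPy, using a made-up rank-one process \(\mathsf{x}(t)=\mathsf{z}\,f(t)\) whose kernel we know in closed form, \(\kappa(s,t)=f(s)f(t)\)):

```python
import numpy as np

rng = np.random.default_rng(0)
ts = np.linspace(0.0, 1.0, 5)          # a small grid of index points
f = np.sin(2 * np.pi * ts)             # a fixed shape function

# Process x(t) = z * f(t) with z ~ N(0, 1), so Cov(x(s), x(t)) = f(s) f(t).
z = rng.standard_normal(100_000)
paths = z[:, None] * f[None, :]        # one sample path per row

K_hat = np.cov(paths, rowvar=False)    # empirical covariance kernel on the grid
K_true = np.outer(f, f)
print(np.max(np.abs(K_hat - K_true)))  # small, and shrinking with sample size
```

The empirical Gram matrix converges to the closed-form kernel entrywise at the usual Monte Carlo rate.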

These are useful objects. In spatial statistics, Gaussian processes, kernel machines and covariance estimation we are concerned with such covariances between values of stochastic processes at different values of their indices. The Karhunen–Loève transform decomposes stochastic processes into a basis of eigenfunctions of the covariance kernel operator.

Any process with finite second moments has a covariance function. They are especially renowned for Gaussian process methods, since Gaussian processes are uniquely specified by their mean function and covariance kernels, and also have the usual convenient algebraic properties by virtue of being Gaussian.

TODO: relate to representer theorems.

## Covariance kernels of some example processes

### A simple Markov chain

Consider a homogeneous continuous time Markov process taking values in \(\{0,1\}\). Suppose it has transition rate matrix

\[\left[\begin{array}{cc} -\lambda & \lambda\\ \lambda & -\lambda \end{array}\right] \] (rows of a transition rate matrix sum to zero) and moreover, that we start the chain from the stationary distribution, \([\frac 1 2\; \frac 1 2]^\top,\) which implies that \(\operatorname{Cov}(0, t)=\operatorname{Cov}(s, s+t)\) for all \(s\), and further, that \(\mathbb{E}[\mathsf{x}(t)]=\frac 1 2 \,\forall t\). So we know that \(\operatorname{Cov}(s,s+t)=\mathbb{E}[\mathsf{x}(0)\mathsf{x}(t)]- \frac 1 4.\) What is \(\mathbb{E}[\mathsf{x}(0)\mathsf{x}(t)]\)?

\[\begin{aligned} \mathbb{E}[\mathsf{x}(0)\mathsf{x}(t)] &=\mathbb{P}[\{\mathsf{x}(0)=1\}\cap\{\mathsf{x}(t)=1\}]\\ &=\mathbb{P}[\mathsf{x}(0)=1]\,\mathbb{P}[\text{number of jumps on \([0,t]\) is even}]\\ &=\frac 1 2 \mathbb{P}[\mathsf{z}\text{ is even}]\text{ where } \mathsf{z}\sim\operatorname{Poisson}(\lambda t)\\ &=\frac 1 2\sum_{k=0}^{\infty}\frac{(\lambda t)^{2k} \exp(-\lambda t)}{(2k)!}\\ &= \frac{\exp(-\lambda t)}{2} \sum_{k=0}^{\infty}\frac{(\lambda t)^{2k}}{(2k)!}\\ &= \frac{\exp(-\lambda t)}{2} \cdot \frac{\exp(\lambda t) + \exp(-\lambda t)}{2} &\text{Taylor expansion of \(\cosh\)}\\ &= \frac{\exp(-2\lambda t)}{4} + \frac{1}{4} \end{aligned}\]

From this we deduce that \(\operatorname{Cov}(s,s+t)=\frac{\exp(-2\lambda t)}{4}.\) As a sanity check, at \(t=0\) this gives \(\frac 1 4\), the variance of a \(\operatorname{Bernoulli}(\frac 1 2)\) variable, and it decays to \(0\) as \(t\to\infty\).
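We can verify this numerically (a sketch assuming NumPy; since the generator \(Q\) is symmetric, \(P(t)=e^{Qt}\) is computed by eigendecomposition rather than a general matrix exponential):

```python
import numpy as np

lam, t = 1.3, 0.7
Q = np.array([[-lam, lam],
              [lam, -lam]])               # generator: rows sum to zero

# P(t) = expm(Q t); Q is symmetric, so diagonalise it.
w, V = np.linalg.eigh(Q)
Pt = V @ np.diag(np.exp(w * t)) @ V.T     # transition probabilities over time t

# Stationary start: E[x(0) x(t)] = P[x(0) = 1] * P[x(t) = 1 | x(0) = 1].
Ext = 0.5 * Pt[1, 1]
cov = Ext - 0.25
print(cov - np.exp(-2 * lam * t) / 4)     # ~0: matches the closed form
```

The eigenvalues of \(Q\) are \(0\) and \(-2\lambda\), which is where the \(\exp(-2\lambda t)\) decay comes from.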

Question: what functions are admissible as covariance kernels for Markov chains?

### The Hawkes process

Covariance kernels are also important in various point processes. Notably, the Hawkes process was introduced in terms of its covariance. 🚧

### Gaussian processes

Handling certain processes in terms of their covariances is particularly convenient, and Gaussian processes are the prime example: such processes are specified entirely by their mean function and covariance kernel.

## General real covariance kernels

A function \(K:\mathcal{T}\times\mathcal{T}\to\mathbb{R}\) can be a covariance kernel if

- It is symmetric in its arguments: \(K(s,t)=K(t,s)\). (More generally, conjugate symmetric, \(K(s,t)=K^*(t,s)\), but my life will be simpler if I ignore the complex case for the moment.)
- It is positive semidefinite.

Positive semidefiniteness here means that for arbitrary real numbers \(c_1,\dots,c_k\) and arbitrary indices \(t_1,\dots,t_k\),

\[ \sum_{i=1}^{k} \sum_{j=1}^{k} c_{i} c_{j} K(t_{i}, t_{j}) \geq 0 \]

The interpretation here is that, since we need the covariances induced by the finite-dimensional distributions of this process to be consistent, it is necessary that

\[ \operatorname{Var}\left\{c_{1} X_{\mathbf{t}_{1}}+\cdots+c_{k} X_{\mathbf{t}_{k}}\right\}= \sum_{i=1}^{k} \sum_{j=1}^{k} c_{i} c_{j} K\left(\mathbf{t}_{i}, \mathbf{t}_{j}\right) \geq 0 \]

This arises from the constraint on the covariance matrix of any random vector \(\mathbf X\in \mathbb{R}^d\), which requires that for all \(\mathbf {b}\in \mathbb{R}^d\)

\[ \operatorname {var} (\mathbf {b} ^{\top}\mathbf {X} ) =\mathbf {b} ^{\top}\operatorname {var} (\mathbf {X} )\mathbf {b} \]

Question: What can we say about this covariance if every element of \(\mathbf X\) is non-negative?

Amazingly (to me), this necessary condition is also sufficient to make something a covariance kernel. In practice, designing covariance functions directly from the positive-definiteness condition is tricky, since the constraint is only implicitly defined. What we normally do is find a family of functions that guarantees positive definiteness and riffle through that. Most of the rest of this notebook is devoted to such classes.
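For example (a sketch assuming NumPy, using the squared exponential kernel, one of the standard classes with guaranteed positive semidefiniteness):

```python
import numpy as np

# A member of a class known to be positive semidefinite: the squared
# exponential kernel K(s, t) = exp(-|s - t|^2 / (2 * ell^2)).
def k_se(s, t, ell=0.5):
    return np.exp(-((s - t) ** 2) / (2 * ell ** 2))

ts = np.linspace(0.0, 3.0, 50)
K = k_se(ts[:, None], ts[None, :])   # Gram matrix on a grid of indices

# Positive semidefiniteness of the kernel implies no negative eigenvalues
# (up to floating-point noise) for every such Gram matrix.
print(np.linalg.eigvalsh(K).min())
```

Any finite set of indices we choose produces a Gram matrix whose minimum eigenvalue is non-negative up to numerical precision.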

## Bonus: complex covariance kernels

I talked in terms of real kernels above because I generally observe real measurements of processes. But complex covariances often arise in a natural way too.

A function \(K:\mathcal{T}\times\mathcal{T}\to\mathbb{C}\) can be a covariance kernel if

- It is *conjugate* symmetric in its arguments: \(K(s,t)=K^*(t,s)\).
- It is positive semidefinite.

Positive semidefiniteness now means that for arbitrary complex numbers \(c_1,\dots,c_k\) and arbitrary indices \(t_1,\dots,t_k\),

\[ \sum_{i=1}^{k} \sum_{j=1}^{k} c_{i} \overline{c_{j}} K(t_{i}, t_{j}) \geq 0. \]
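A minimal example (a sketch assuming NumPy): the complex exponential kernel \(K(s,t)=e^{i\omega(s-t)}\) is conjugate symmetric, and the quadratic form above collapses to \(\bigl|\sum_i c_i e^{i\omega t_i}\bigr|^2\geq 0\), which we can confirm numerically:

```python
import numpy as np

omega = 2.0
ts = np.linspace(0.0, 1.0, 40)
K = np.exp(1j * omega * (ts[:, None] - ts[None, :]))  # K(s,t) = e^{i w (s-t)}

# Conjugate symmetry: K equals its conjugate transpose.
print(np.allclose(K, K.conj().T))

# Hermitian psd: no (numerically) negative eigenvalues.
print(np.linalg.eigvalsh(K).min())
```

This particular kernel is rank one (a single complex sinusoid), so all but one eigenvalue of the Gram matrix are numerically zero.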

Every analytic real kernel should also be a complex kernel, right? 🏗

## Kernel zoo

For some examples of covariance kernels, see the covariance kernel zoo.

## Learning kernels

This is usually done in the context of Gaussian processes, where everything can work out nicely if you are lucky, but other kernel machines are OK too. The goal for most of these is to maximise the marginal likelihood, a.k.a. model evidence, which will be familiar from every Bayesian ML method ever.

### Learning kernel hyperparameters

🏗
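In the meantime, a minimal sketch (assuming NumPy, a squared exponential kernel, fixed observation noise, and a crude grid search standing in for gradient-based optimisation) of choosing a lengthscale by maximising the log marginal likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
ts = np.linspace(0.0, 5.0, 40)
y = np.sin(ts) + 0.1 * rng.standard_normal(ts.size)  # toy observations

def neg_log_marginal_likelihood(ell, noise=0.1):
    # GP with a squared exponential kernel; the hyperparameter is the lengthscale.
    K = np.exp(-((ts[:, None] - ts[None, :]) ** 2) / (2 * ell ** 2))
    K += noise ** 2 * np.eye(ts.size)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -log p(y) = 0.5 y^T K^{-1} y + sum(log diag(L)) + const
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum()

ells = np.exp(np.linspace(-3.0, 3.0, 61))            # crude grid search
best = min(ells, key=neg_log_marginal_likelihood)
print(best)  # the lengthscale with the highest evidence on this grid
```

In practice one would optimise all hyperparameters jointly by gradient descent on this same objective, via the Cholesky-based expression above.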

### Learning kernel composition

Automating kernel design by some composition of simpler atomic kernels. AFAICT this started from summaries like (Genton 2001) and went via Duvenaud’s aforementioned notes to become a small industry (Lloyd et al. 2014; D. K. Duvenaud, Nickisch, and Rasmussen 2011; D. Duvenaud et al. 2013; Grosse et al. 2012). A prominent example was the Automated Statistician project by David Duvenaud, James Robert Lloyd, Roger Grosse and colleagues, which works by greedy combinatorial search over possible compositions.

More fashionable, presumably, are the differentiable search methods. For example, the AutoGP system (Krauth et al. 2016; Bonilla, Krauth, and Dezfouli 2019) incorporates tricks like these to use gradient descent to design kernels for Gaussian processes. (Sun et al. 2018) construct deep networks of composed kernels. I imagine the Deep Gaussian Process literature is also of this kind, but have not read it.

### Hyperkernels

Kernels on kernels, for learning kernels. 🏗 (Ong, Smola, and Williamson 2005, 2002; Ong and Smola 2003; Kondor and Jebara 2006)

## Non-positive kernels

As in, kernels which are not positive-definite. (Ong et al. 2004) 🏗

## References

*Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics*, 85–92. http://proceedings.mlr.press/v15/agarwal11b.html.

*Transactions of the American Mathematical Society* 68 (3): 337–404. https://doi.org/10.2307/1990404.

*Foundations and Trends® in Machine Learning* 4 (3): 195–266. https://doi.org/10.1561/2200000036.

*Proceedings of the 21st International Conference on Neural Information Processing Systems*, 105–12. NIPS’08. USA: Curran Associates Inc. http://papers.nips.cc/paper/3418-exploring-large-feature-spaces-with-hierarchical-multiple-kernel-learning.pdf.

*IEEE Transactions on Information Theory* 62 (4): 2184–2202. https://doi.org/10.1109/TIT.2016.2533397.

*Lectures on Fourier Integrals*. Princeton University Press. http://books.google.com?id=MWCYDwAAQBAJ.

*Journal of Machine Learning Research* 20 (117): 1–63. http://arxiv.org/abs/1609.00577.

*Journal of Machine Learning Research* 5 (December): 1035–62. http://dl.acm.org/citation.cfm?id=1005332.1016793.

*Journal of the American Statistical Association* 94 (448): 1330–39. https://doi.org/10.1080/01621459.1999.10473885.

*Advances in Neural Information Processing Systems*, 226–34. http://papers.nips.cc/paper/4221-additive-gaussian-processes.pdf.

*Proceedings of the 30th International Conference on Machine Learning (ICML-13)*, 1166–74. http://machinelearning.wustl.edu/mlpapers/papers/icml2013_duvenaud13.

*Quarterly Journal of the Royal Meteorological Society* 125 (554): 723–57. https://doi.org/10.1002/qj.49712555417.

*Journal of Machine Learning Research* 2 (December): 299–312. http://jmlr.org/papers/volume2/genton01a/genton01a.pdf.

*Journal of Applied Probability* 41 (1): 236–49. https://doi.org/10.1239/jap/1077134681.

*Proceedings of the 22nd International Conference on Machine Learning - ICML ’05*, 241–48. Bonn, Germany: ACM Press. https://doi.org/10.1145/1102351.1102382.

*Journal of the American Statistical Association* 97 (458): 590–600. https://doi.org/10.1198/016214502760047113.

*Journal of Multivariate Analysis* 83 (2): 493–508. https://doi.org/10.1006/jmva.2001.2056.

*Journal of the American Statistical Association* 105 (491): 1167–77. https://doi.org/10.1198/jasa.2010.tm09420.

*Proceedings of the Conference on Uncertainty in Artificial Intelligence*. http://arxiv.org/abs/1210.4856.

*2010 IEEE International Workshop on Machine Learning for Signal Processing*, 379–84. Kittila, Finland: IEEE. https://doi.org/10.1109/MLSP.2010.5589113.

*Biometrika* 58 (1): 83–90. https://doi.org/10.1093/biomet/58.1.83.

*Advances in Neural Information Processing Systems 20*, edited by J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, 1249–56. Curran Associates, Inc. http://papers.nips.cc/paper/3211-using-deep-belief-nets-to-learn-covariance-kernels-for-gaussian-processes.pdf.

*The Annals of Statistics* 36 (3): 1171–1220. https://doi.org/10.1214/009053607000000677.

*Proceedings of the 19th International Conference on Neural Information Processing Systems*, 729–36. NIPS’06. Cambridge, MA, USA: MIT Press. http://dl.acm.org/citation.cfm?id=2976456.2976548.

*Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI 2017)*. http://arxiv.org/abs/1610.05392.

*Journal of Machine Learning Research* 6 (Nov): 1783–1816. http://www.jmlr.org/papers/v6/lawrence05a.html.

*Twenty-Eighth AAAI Conference on Artificial Intelligence*. http://arxiv.org/abs/1402.4304.

*Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character* 209 (441–458): 415–46. https://doi.org/10.1098/rsta.1909.0016.

*Journal of Machine Learning Research* 6 (Jul): 1099–1125. http://www.jmlr.org/papers/v6/micchelli05a.html.

*Neural Computation* 17 (1): 177–204. https://doi.org/10.1162/0899766052530802.

*Geoderma*, Pedometrics 2003, 128 (3–4): 192–207. https://doi.org/10.1016/j.geoderma.2005.04.003.

*Machine Learning: A Probabilistic Perspective*. 1st edition. Adaptive Computation and Machine Learning Series. Cambridge, MA: MIT Press.

*Deterministic and Statistical Methods in Machine Learning*, edited by Joab Winkler, Mahesan Niranjan, and Neil Lawrence, 110–23. Lecture Notes in Computer Science. Springer Berlin Heidelberg. http://bcl.hamilton.ie/~barak/papers/MLW-Jul-2005.pdf.

*Twenty-Fifth AAAI Conference on Artificial Intelligence*. https://www.aaai.org/ocs/index.php/AAAI/AAAI11/paper/view/3784.

*Twenty-First International Conference on Machine Learning - ICML ’04*, 81. Banff, Alberta, Canada: ACM Press. https://doi.org/10.1145/1015330.1015443.

*Proceedings of the Twentieth International Conference on International Conference on Machine Learning*, 568–75. ICML’03. Washington, DC, USA: AAAI Press. http://dl.acm.org/citation.cfm?id=3041838.3041910.

*Proceedings of the 15th International Conference on Neural Information Processing Systems*, 495–502. NIPS’02. Cambridge, MA, USA: MIT Press. http://dl.acm.org/citation.cfm?id=2968618.2968680.

*Journal of Machine Learning Research* 6 (Jul): 1043–71. http://www.jmlr.org/papers/v6/ong05a.html.

*Statistics & Probability Letters* 43 (4): 393–97. https://doi.org/10.1016/S0167-7152(98)00278-8.

*Statistics & Probability Letters* 48 (1): 23–32. https://doi.org/10.1016/S0167-7152(99)00188-1.

*Infinite Dimensional Analysis, Quantum Probability and Related Topics* 8 (1): 33–54. https://doi.org/10.1142/S0219025705001846.

*Journal of Machine Learning Research* 9: 2491–521. http://www.jmlr.org/papers/v9/rakotomamonjy08a.html.

*Gaussian Processes for Machine Learning*. Adaptive Computation and Machine Learning. Cambridge, Mass: MIT Press. http://www.gaussianprocess.org/gpml/.

*Advances in Neural Information Processing Systems 30*, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 4642–51. Curran Associates, Inc. http://papers.nips.cc/paper/7050-non-stationary-spectral-kernels.pdf.

*Journal of the American Statistical Association* 87 (417): 108–19. https://doi.org/10.1080/01621459.1992.10475181.

*Artificial Intelligence and Statistics*. http://www.jmlr.org/proceedings/papers/v22/sarkka12.html.

*IEEE Signal Processing Magazine* 30 (4): 51–61. https://doi.org/10.1109/MSP.2013.2246292.

*Computational Learning Theory*, edited by David Helmbold and Bob Williamson, 416–26. Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-44581-1.

*Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond*. MIT Press.

*Advanced Lectures on Machine Learning*, edited by Shahar Mendelson and Alexander J. Smola, 41–64. Lecture Notes in Computer Science 2600. Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-36434-X_2.

*Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 1298–1306. Curran Associates, Inc. http://papers.nips.cc/paper/6180-learning-kernels-with-random-features.pdf.

*Journal of the American Statistical Association* 100 (469): 310–21. https://doi.org/10.1198/016214504000000854.

*The Annals of Applied Statistics* 3 (4): 1236–65. https://doi.org/10.1214/09-AOAS312.

*International Conference on Artificial Intelligence and Statistics*, 111–21. PMLR. http://proceedings.mlr.press/v108/uziel20b.html.

*IEEE Transactions on Pattern Analysis and Machine Intelligence* 34 (3): 480–92. https://doi.org/10.1109/TPAMI.2011.153.

*Kernel Methods in Computational Biology*. MIT Press. http://kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/pdfs/pdf2549.pdf.

*Journal of Machine Learning Research* 11 (August): 1201–42. http://authors.library.caltech.edu/20528/1/Vishwanathan2010p11646J_Mach_Learn_Res.pdf.

*NIPS 2014 Workshop on Advances in Variational Inference*.

*International Conference on Machine Learning*. http://arxiv.org/abs/1302.4245.

*Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence*, 736–44. UAI’11. Arlington, Virginia, United States: AUAI Press. http://dl.acm.org/citation.cfm?id=3020548.3020633.

*Machine Learning and Knowledge Discovery in Databases*, edited by Peter A. Flach, Tijl De Bie, and Nello Cristianini, 858–61. Lecture Notes in Computer Science. Springer Berlin Heidelberg.

*Artificial Intelligence and Statistics*, 370–78. PMLR. http://proceedings.mlr.press/v51/wilson16.html.

*Advances in Computational Mathematics* 4 (1): 283–92. https://doi.org/10.1007/BF03177517.

*Correlation Theory of Stationary and Related Random Functions. Volume II: Supplementary Notes and References*. Springer Series in Statistics. New York, NY: Springer Science & Business Media.

*Proceedings of the 30th International Conference on Machine Learning (ICML-13)*, 570–78. http://www.jmlr.org/proceedings/papers/v28/yu13.pdf.

*International Conference on Machine Learning*, 7424–33. http://proceedings.mlr.press/v97/zhang19k.html.