# Covariance functions

## Variograms, Mercer kernels, positive definite operators, spare reproducing kernels for that Hilbert space I bought on eBay real cheap

A realisation of a nonstationary rough covariance process (partially observed)

On the interpretation of kernels as the covariance functions of stochastic processes, which is one way to define stochastic processes.

Suppose we have a real-valued stochastic process

$\{\mathsf{x}(t)\}_{t\in \mathcal{T}}.$ For now we may as well take the index set to be $$\mathcal{T}\subseteq\mathbb{R}^D$$, or at least a nice metric space.

The covariance kernel of $$\mathsf{x}$$ is a function

\begin{aligned}\kappa\colon{} &\mathcal{T}\times \mathcal{T}\to \mathbb{R}\\ &(s,t)\mapsto\operatorname{Cov}(\mathsf{x}(s),\mathsf{x}(t)). \end{aligned}

This is covariance in the usual sense, to wit,

\begin{aligned} \operatorname{Cov}(\mathsf{x}(s),\mathsf{x}(t)) &:=\mathbb{E}\left[(\mathsf{x}(s)-\mathbb{E}[\mathsf{x}(s)])(\mathsf{x}(t)-\mathbb{E}[\mathsf{x}(t)])\right]\\ &=\mathbb{E}[\mathsf{x}(s)\mathsf{x}(t)]- \mathbb{E}[\mathsf{x}(s)]\mathbb{E}[\mathsf{x}(t)]. \end{aligned}

These are useful objects. In spatial statistics, Gaussian processes, kernel machines and covariance estimation we are concerned with such covariances between values of stochastic processes at different values of their indices. The Karhunen–Loève transform decomposes stochastic processes into a basis of eigenfunctions of the covariance kernel operator. We might consider them in terms of processes defined through convolution instead.
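The Karhunen–Loève idea can be sketched numerically: discretise the covariance kernel on a grid and eigendecompose the resulting Gram matrix. A minimal sketch in numpy, assuming for concreteness the Brownian-motion kernel $$\kappa(s,t)=\min(s,t)$$, whose KL eigenvalues are known in closed form:

```python
import numpy as np

# Discretise the covariance kernel of Brownian motion, kappa(s, t) = min(s, t),
# on a regular grid over (0, 1].
n = 500
t = np.arange(1, n + 1) / n
K = np.minimum.outer(t, t)

# Eigendecomposition of the Gram matrix (scaled by the grid spacing 1/n)
# approximates the eigenvalues/eigenfunctions of the covariance *operator*.
evals, evecs = np.linalg.eigh(K / n)

# Analytic Karhunen-Loeve eigenvalues for Brownian motion on [0, 1]:
# lambda_k = 1 / ((k - 1/2) pi)^2, k = 1, 2, ...
k = np.arange(1, 6)
analytic = 1.0 / (((k - 0.5) * np.pi) ** 2)
print(evals[::-1][:5])  # numerical estimates, largest first
print(analytic)
```

The columns of `evecs` corresponding to the largest eigenvalues approximate the leading eigenfunctions $$\sin((k-\tfrac12)\pi t)$$, up to normalisation.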

Any process with finite second moments has a covariance function. Covariance functions are especially renowned in Gaussian process methods, since Gaussian processes are uniquely specified by their mean function and covariance kernel, and have the usual convenient algebraic properties by virtue of being Gaussian.

TODO: relate to representer theorems.

Han Lin Shang's slides, relating this to PCA in functional data.

## Covariance kernels of some example processes

### A simple Markov chain

Consider a homogeneous continuous time Markov process taking values in $$\{0,1\}$$. Suppose it has transition rate matrix

$\left[\begin{array}{cc} -\lambda & \lambda\\ \lambda & -\lambda \end{array}\right]$ and moreover, that we start the chain from the stationary distribution, $$[\frac 1 2\; \frac 1 2]^\top,$$ which implies that $$\operatorname{Cov}(0, t)=\operatorname{Cov}(s, s+t)$$ for all $$s$$, and further, that $$\mathbb{E}[\mathsf{x}(t)]=\frac 1 2 \,\forall t$$. So we know that $$\operatorname{Cov}(s,s+t)=\mathbb{E}[\mathsf{x}(0)\mathsf{x}(t)]- \frac 1 4.$$ What is $$\mathbb{E}[\mathsf{x}(0)\mathsf{x}(t)]$$?

\begin{aligned} \mathbb{E}[\mathsf{x}(0)\mathsf{x}(t)] &=\mathbb{P}[\{\mathsf{x}(0)=1\}\cap\{\mathsf{x}(t)=1\}]\\ &=\mathbb{P}[\mathsf{x}(0)=1]\,\mathbb{P}[\text{number of jumps on $$[0,t]$$ is even}]\\ &=\tfrac 1 2\,\mathbb{P}[\mathsf{z}\text{ is even}]\text{ where } \mathsf{z}\sim\operatorname{Poisson}(\lambda t)\\ &=\tfrac 1 2\sum_{k=0}^{\infty}\frac{(\lambda t)^{2k} \exp(-\lambda t)}{(2k)!}\\ &= \tfrac 1 2\exp(-\lambda t) \sum_{k=0}^{\infty}\frac{(\lambda t)^{2k}}{(2k)!}\\ &= \tfrac 1 2\exp(-\lambda t)\cdot\frac{\exp(\lambda t) + \exp(-\lambda t)}{2} &\text{Taylor expansion of $$\cosh$$}\\ &= \frac{\exp(-2\lambda t)}{4} + \frac{1}{4} \end{aligned}

From this we deduce that $$\operatorname{Cov}(s,s+t)=\frac{\exp(-2\lambda t)}{4}.$$
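This deduction is easy to spot-check by simulation (a sketch assuming numpy): starting from the stationary distribution, $$\mathsf{x}(t)$$ differs from $$\mathsf{x}(0)$$ exactly when the Poisson number of jumps on $$[0,t]$$ is odd, and the empirical covariance comes out near $$\exp(-2\lambda t)/4$$.

```python
import numpy as np

rng = np.random.default_rng(42)
lam, t, n = 1.0, 0.5, 200_000

# Start from the stationary distribution [1/2, 1/2].
x0 = rng.integers(0, 2, size=n)

# Every state has total exit rate lambda, so the jump count on [0, t] is
# Poisson(lambda * t); the chain ends in the other state iff it is odd.
jumps = rng.poisson(lam * t, size=n)
xt = x0 ^ (jumps % 2)

empirical = np.cov(x0, xt)[0, 1]
theoretical = np.exp(-2 * lam * t) / 4
print(empirical, theoretical)  # agree to a couple of decimal places
```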

Question: what functions are admissible as covariance kernels for Markov chains?

### The Hawkes process

Covariance kernels are also important in various point processes. Notably, the Hawkes process was introduced in terms of its covariance. 🚧

### Gaussian processes

Handling random functions in terms of their covariances is particularly convenient in the Gaussian process setting, since, as noted above, a Gaussian process is completely determined by its mean function and covariance kernel. 🚧

## General real covariance kernels

A function $$K:\mathcal{T}\times\mathcal{T}\to\mathbb{R}$$ can be a covariance kernel if

1. It is symmetric in its arguments $$K(s,t)=K(t,s)$$ (more generally, conjugate symmetric — $$K(s,t)=K^*(t,s)$$, but I think maybe my life will be simpler if I ignore the complex case for the moment.)
2. It is positive semidefinite.

Positive semidefiniteness means that for arbitrary real numbers $$c_1,\dots,c_k$$ and arbitrary indices $$t_1,\dots,t_k$$,

$\sum_{i=1}^{k} \sum_{j=1}^{k} c_{i} c_{j} K(t_{i}, t_{j}) \geq 0.$

The interpretation here is that the covariances induced by the finite-dimensional distributions of this process must be consistent with nonnegative variances, i.e. it is necessary that

$\operatorname{Var}\left\{c_{1} X_{\mathbf{t}_{1}}+\cdots+c_{k} X_{\mathbf{t}_{k}}\right\}= \sum_{i=1}^{k} \sum_{j=1}^{k} c_{i} c_{j} K\left(\mathbf{t}_{i}, \mathbf{t}_{j}\right) \geq 0$

This arises from the constraint on the covariance of any random vector $$\mathbf X\in \mathbb{R}^d$$, which requires that for all $$\mathbf {b}\in \mathbb{R}^d$$

$\operatorname{Var}(\mathbf {b}^{\top}\mathbf {X}) =\mathbf {b}^{\top}\operatorname{Var}(\mathbf {X})\mathbf {b}\geq 0.$

Question: What can we say about this covariance if every element of $$\mathbf X$$ is non-negative?

Amazingly (to me), this necessary condition is also sufficient for something to be a covariance kernel. In practice, designing covariance functions directly from the positive-definiteness condition is tricky, since the space of positive-definite kernels is only implicitly defined. What we normally do is find a fun class of functions that guarantees positive definiteness and riffle through that. Most of the rest of this notebook is devoted to such classes.
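Finite Gram matrices give a cheap numerical spot-check of the positive-semidefiniteness condition: on any finite set of indices, the eigenvalues of the kernel matrix must be nonnegative (up to floating-point error). A sketch assuming the squared-exponential kernel:

```python
import numpy as np

rng = np.random.default_rng(0)

def squared_exponential(s, t, ell=1.0):
    """k(s, t) = exp(-(s - t)^2 / (2 ell^2)), a classic positive definite kernel."""
    return np.exp(-np.subtract.outer(s, t) ** 2 / (2 * ell**2))

# The finite-dimensional version of the quadratic-form condition: for any
# points t_1, ..., t_k, the Gram matrix [K(t_i, t_j)] must be PSD.
pts = rng.normal(size=50)
gram = squared_exponential(pts, pts)
eigenvalues = np.linalg.eigvalsh(gram)
print(eigenvalues.min())  # ~0; tiny negative values are floating-point error
```

This check can only falsify, not verify, positive definiteness, but it catches many broken hand-designed kernels quickly.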

## Bonus: complex covariance kernels

I talked in terms of real kernels above because I generally observe real measurements of processes. But complex covariances often arise in a natural way too.

A function $$K:\mathcal{T}\times\mathcal{T}\to\mathbb{C}$$ can be a covariance kernel if

1. It is conjugate symmetric in its arguments — $$K(s,t)=K^*(t,s)$$,
2. It is positive semidefinite.

Positive semidefiniteness now means that for arbitrary complex numbers $$c_1,\dots,c_k$$ and arbitrary indices $$t_1,\dots,t_k$$,

$\sum_{i=1}^{k} \sum_{j=1}^{k} c_{i} \overline{c_{j}} K(t_{i}, t_{j}) \geq 0.$

Every analytic real kernel should also be a complex kernel, right? 🏗
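For a concrete complex example (a sketch, assuming numpy): $$K(s,t)=\exp(i\omega(s-t))$$ is the covariance of the random-phase oscillator $$\mathsf{x}(t)=\xi e^{i\omega t}$$ with $$\mathbb{E}|\xi|^2=1$$. Its Gram matrix is a rank-one Hermitian outer product, so the quadratic form above is a squared modulus and hence nonnegative.

```python
import numpy as np

rng = np.random.default_rng(1)
omega, k = 2.0, 30

# K(s, t) = exp(i omega (s - t)), the covariance of x(t) = xi exp(i omega t)
# where xi is a zero-mean complex amplitude with E|xi|^2 = 1.
t = rng.normal(size=k)
K = np.exp(1j * omega * np.subtract.outer(t, t))

# Conjugate symmetry: K(s, t) = conj(K(t, s)).
assert np.allclose(K, K.conj().T)

# The complex quadratic form sum_ij conj(c_i) K_ij c_j is real and nonnegative:
# here K = v v^H with v_i = exp(i omega t_i), so the form equals |c^H v|^2.
c = rng.normal(size=k) + 1j * rng.normal(size=k)
quad = c.conj() @ K @ c
print(quad.real, quad.imag)  # real part >= 0, imaginary part ~ 0
```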

## Kernel zoo

For some examples of covariance kernels, see the kernel zoo.

## Learning kernels

See learning kernels.

## Non-positive kernels

As in, kernels which are not positive-definite. For example, [Ong et al. (2004);SahaLearning2020] 🏗

## References

Abrahamsen, Petter. 1997.
Agarwal, Arvind, and Hal Daumé Iii. 2011. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 85–92.
Alexanderian, Alen. 2015. arXiv:1509.07526 [Math], October.
Álvarez, Mauricio A., Lorenzo Rosasco, and Neil D. Lawrence. 2012. Foundations and Trends® in Machine Learning 4 (3): 195–266.
Aronszajn, N. 1950. Transactions of the American Mathematical Society 68 (3): 337–404.
Bach, Francis. 2008. In Proceedings of the 21st International Conference on Neural Information Processing Systems, 105–12. NIPS’08. USA: Curran Associates Inc.
Bacry, Emmanuel, and Jean-François Muzy. 2016. IEEE Transactions on Information Theory 62 (4): 2184–2202.
Balog, Matej, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M. Roy, and Yee Whye Teh. 2016. arXiv:1606.05241 [Stat], June.
Bochner, Salomon. 1959. Lectures on Fourier Integrals. Princeton University Press.
Bohn, Bastian, Michael Griebel, and Christian Rieger. 2018. arXiv:1709.10441 [Cs, Math], June.
Bonilla, Edwin V., Karl Krauth, and Amir Dezfouli. 2019. Journal of Machine Learning Research 20 (117): 1–63.
Christoudias, Mario, Raquel Urtasun, and Trevor Darrell. 2009. UCB/EECS-2009-96. EECS Department, University of California, Berkeley.
Cortes, Corinna, Patrick Haffner, and Mehryar Mohri. 2004. Journal of Machine Learning Research 5 (December): 1035–62.
Cressie, Noel, and Hsin-Cheng Huang. 1999. Journal of the American Statistical Association 94 (448): 1330–39.
Delft, Anne van, and Michael Eichler. 2016. arXiv:1602.05125 [Math, Stat], February.
Duvenaud, David. 2014. PhD Thesis, University of Cambridge.
Duvenaud, David K., Hannes Nickisch, and Carl E. Rasmussen. 2011. In Advances in Neural Information Processing Systems, 226–34.
Duvenaud, David, James Lloyd, Roger Grosse, Joshua Tenenbaum, and Ghahramani Zoubin. 2013. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 1166–74.
Gaspari, Gregory, and Stephen E. Cohn. 1999. Quarterly Journal of the Royal Meteorological Society 125 (554): 723–57.
Genton, Marc G. 2001. Journal of Machine Learning Research 2 (December): 299–312.
Genton, Marc G., and Olivier Perrin. 2004. Journal of Applied Probability 41 (1): 236–49.
Girolami, Mark, and Simon Rogers. 2005. In Proceedings of the 22nd International Conference on Machine Learning - ICML ’05, 241–48. Bonn, Germany: ACM Press.
Gneiting, Tilmann. 2002a. Journal of the American Statistical Association 97 (458): 590–600.
———. 2002b. Journal of Multivariate Analysis 83 (2): 493–508.
Gneiting, Tilmann, William Kleiber, and Martin Schlather. 2010. Journal of the American Statistical Association 105 (491): 1167–77.
Grosse, Roger, Ruslan R. Salakhutdinov, William T. Freeman, and Joshua B. Tenenbaum. 2012. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Hartikainen, J., and S. Särkkä. 2010. In 2010 IEEE International Workshop on Machine Learning for Signal Processing, 379–84. Kittila, Finland: IEEE.
Hawkes, Alan G. 1971. Biometrika 58 (1): 83–90.
Hinton, Geoffrey E, and Ruslan R Salakhutdinov. 2008. In Advances in Neural Information Processing Systems 20, edited by J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, 1249–56. Curran Associates, Inc.
Hofmann, Thomas, Bernhard Schölkopf, and Alexander J. Smola. 2008. The Annals of Statistics 36 (3): 1171–1220.
Kom Samo, Yves-Laurent, and Stephen Roberts. 2015. arXiv:1506.02236 [Stat], June.
Kondor, Risi, and Tony Jebara. 2006. In Proceedings of the 19th International Conference on Neural Information Processing Systems, 729–36. NIPS’06. Cambridge, MA, USA: MIT Press.
Krauth, Karl, Edwin V. Bonilla, Kurt Cutajar, and Maurizio Filippone. 2016. In Uai17.
Lawrence, Neil. 2005. Journal of Machine Learning Research 6 (Nov): 1783–1816.
Lloyd, James Robert, David Duvenaud, Roger Grosse, Joshua Tenenbaum, and Zoubin Ghahramani. 2014. In Twenty-Eighth AAAI Conference on Artificial Intelligence.
Mercer, J. 1909. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 209 (441-458): 415–46.
Micchelli, Charles A., and Massimiliano Pontil. 2005a. Journal of Machine Learning Research 6 (Jul): 1099–1125.
———. 2005b. Neural Computation 17 (1): 177–204.
Minasny, Budiman, and Alex. B. McBratney. 2005. Geoderma, Pedometrics 2003, 128 (3–4): 192–207.
Murphy, Kevin P. 2012. Machine learning: a probabilistic perspective. 1 edition. Adaptive computation and machine learning series. Cambridge, MA: MIT Press.
Murray-Smith, Roderick, and Barak A. Pearlmutter. 2005. In Deterministic and Statistical Methods in Machine Learning, edited by Joab Winkler, Mahesan Niranjan, and Neil Lawrence, 110–23. Lecture Notes in Computer Science. Springer Berlin Heidelberg.
O’Callaghan, Simon Timothy, and Fabio T. Ramos. 2011. In Twenty-Fifth AAAI Conference on Artificial Intelligence.
Ong, Cheng Soon, Xavier Mary, Stéphane Canu, and Alexander J. Smola. 2004. In Twenty-First International Conference on Machine Learning - ICML ’04, 81. Banff, Alberta, Canada: ACM Press.
Ong, Cheng Soon, and Alexander J. Smola. 2003. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, 568–75. ICML’03. Washington, DC, USA: AAAI Press.
Ong, Cheng Soon, Alexander J. Smola, and Robert C. Williamson. 2002. “Hyperkernels.” In Proceedings of the 15th International Conference on Neural Information Processing Systems, 495–502. NIPS’02. Cambridge, MA, USA: MIT Press.
———. 2005. Journal of Machine Learning Research 6 (Jul): 1043–71.
Pérez-Abreu, Víctor, and Alfonso Rocha-Arteaga. 2005. Infinite Dimensional Analysis, Quantum Probability and Related Topics 08 (01): 33–54.
Perrin, Olivier, and Rachid Senoussi. 1999. Statistics & Probability Letters 43 (4): 393–97.
———. 2000. Statistics & Probability Letters 48 (1): 23–32.
Pfaffel, Oliver. 2012. arXiv:1201.3256 [Math], January.
Rakotomamonjy, Alain, Francis R. Bach, Stéphane Canu, and Yves Grandvalet. 2008. “SimpleMKL.” Journal of Machine Learning Research 9 (Nov): 2491–521.
Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. Cambridge, Mass: MIT Press.
Remes, Sami, Markus Heinonen, and Samuel Kaski. 2017. In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 4642–51. Curran Associates, Inc.
———. 2018. arXiv:1811.10978 [Cs, Stat], November.
Sampson, Paul D., and Peter Guttorp. 1992. Journal of the American Statistical Association 87 (417): 108–19.
Särkkä, Simo, and Jouni Hartikainen. 2012. In Artificial Intelligence and Statistics.
Särkkä, Simo, A. Solin, and J. Hartikainen. 2013. IEEE Signal Processing Magazine 30 (4): 51–61.
Schölkopf, Bernhard, Ralf Herbrich, and Alex J. Smola. 2001. In Computational Learning Theory, edited by David Helmbold and Bob Williamson, 416–26. Lecture Notes in Computer Science. Springer Berlin Heidelberg.
Schölkopf, Bernhard, and Alexander J. Smola. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
———. 2003. In Advanced Lectures on Machine Learning, edited by Shahar Mendelson and Alexander J. Smola, 41–64. Lecture Notes in Computer Science 2600. Springer Berlin Heidelberg.
Sinha, Aman, and John C Duchi. 2016. In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 1298–1306. Curran Associates, Inc.
Smith, Michael Thomas, Mauricio A. Alvarez, and Neil D. Lawrence. 2018. arXiv:1809.02010 [Cs, Stat], September.
Stein, Michael L. 2005. Journal of the American Statistical Association 100 (469): 310–21.
Sun, Shengyang, Guodong Zhang, Chaoqi Wang, Wenyuan Zeng, Jiaman Li, and Roger Grosse. 2018. “Differentiable Compositional Kernel Learning for Gaussian Processes.” arXiv Preprint arXiv:1806.04326.
Székely, Gábor J., and Maria L. Rizzo. 2009. The Annals of Applied Statistics 3 (4): 1236–65.
Uziel, Guy. 2020. In International Conference on Artificial Intelligence and Statistics, 111–21. PMLR.
Vedaldi, A., and A. Zisserman. 2012. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (3): 480–92.
Vert, Jean-Philippe, Koji Tsuda, and Bernhard Schölkopf. 2004. In Kernel Methods in Computational Biology. MIT Press.
Vishwanathan, S. V. N., Nicol N. Schraudolph, Risi Kondor, and Karsten M. Borgwardt. 2010. Journal of Machine Learning Research 11 (August): 1201–42.
Wilk, Mark van der, Andrew G. Wilson, and Carl E. Rasmussen. 2014. “Variational Inference for Latent Variable Modelling of Correlation Structure.” In NIPS 2014 Workshop on Advances in Variational Inference.
Wilson, Andrew Gordon, and Ryan Prescott Adams. 2013. In International Conference on Machine Learning.
Wilson, Andrew Gordon, Christoph Dann, Christopher G. Lucas, and Eric P. Xing. 2015. arXiv:1510.07389 [Cs, Stat], October.
Wilson, Andrew Gordon, and Zoubin Ghahramani. 2011. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, 736–44. UAI’11. Arlington, Virginia, United States: AUAI Press.
———. 2012. “Modelling Input Varying Correlations Between Multiple Responses.” In Machine Learning and Knowledge Discovery in Databases, edited by Peter A. Flach, Tijl De Bie, and Nello Cristianini, 858–61. Lecture Notes in Computer Science. Springer Berlin Heidelberg.
Wilson, Andrew Gordon, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. 2016. In Artificial Intelligence and Statistics, 370–78. PMLR.
Wu, Zongmin. 1995. Advances in Computational Mathematics 4 (1): 283–92.
Yaglom, A. M. 1961. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Contributions to Probability Theory, January, 593–622.
———. 1987. Correlation Theory of Stationary and Related Random Functions. Volume II: Supplementary Notes and References. Springer Series in Statistics. New York, NY: Springer Science & Business Media.
Yu, Yaoliang, Hao Cheng, Dale Schuurmans, and Csaba Szepesvári. 2013. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 570–78.
Zhang, Aonan, and John Paisley. 2019. In International Conference on Machine Learning, 7424–33.
