- Covariance kernels of some example processes
- General real covariance kernels
- Bonus: complex covariance kernels
- Kernel zoo
- Stationary
- Wiener process kernel
- Causal kernels
- Markov kernels
- Squared exponential
- Rational Quadratic
- Matérn
- Periodic
- Locally periodic
- With atoms
- “Integral” kernel
- Composing kernels
- Stationary spectral kernels
- Non-stationary spectral kernels
- Locally stationary
- Genton kernels
- Compactly supported
- Kernels with desired symmetry

- Learning kernels
- Non-positive kernels

Suppose we have a real-valued stochastic process

\[\{\mathsf{x}(t)\}_{t\in \mathcal{T}} \] indexed by some index set \(\mathcal{T}\). For now we may as well take \(\mathcal{T}\subseteq\mathbb{R}^D\), or at least assume it is a nice metric space.

The covariance kernel of \(\mathsf{x}\) is a function

\[\begin{aligned}\kappa:&\mathcal{T}\times \mathcal{T}&\to& \mathbb{R}\\ &s,t&\mapsto&\operatorname{Cov}(\mathsf{x}(s),\mathsf{x}(t)). \end{aligned}\]

This is covariance in the usual sense, to wit,

\[\begin{aligned} \operatorname{Cov}(\mathsf{x}(s),\mathsf{x}(t)) &:=\mathbb{E}\left[\left(\mathsf{x}(s)-\mathbb{E}[\mathsf{x}(s)]\right) \left(\mathsf{x}(t)-\mathbb{E}[\mathsf{x}(t)]\right)\right]\\ &=\mathbb{E}[\mathsf{x}(s)\mathsf{x}(t)]- \mathbb{E}[\mathsf{x}(s)]\mathbb{E}[\mathsf{x}(t)]\\ \end{aligned}\]

These are useful objects. In spatial statistics, Gaussian processes, kernel machines and covariance estimation we are concerned with such covariances between values of stochastic processes at different values of their indices. The Karhunen–Loève transform decomposes stochastic processes into a basis of eigenfunctions of the covariance kernel operator.

Any process with finite second moments has a covariance function. They are especially renowned for Gaussian process methods, since Gaussian processes are uniquely specified by their mean function and covariance kernels, and also have the usual convenient algebraic properties by virtue of being Gaussian.

TODO: contextualise with representer theorems.

## Covariance kernels of some example processes

### A simple Markov chain

Consider a homogeneous continuous time Markov process taking values in \(\{0,1\}\). Suppose it has transition rate matrix

\[\left[\begin{array}{cc} -\lambda & \lambda\\ \lambda & -\lambda \end{array}\right] \] and moreover, that we start the chain from the stationary distribution, \([\frac 1 2\; \frac 1 2]^\top,\) which implies that \(\operatorname{Cov}(\mathsf{x}(0), \mathsf{x}(t))=\operatorname{Cov}(\mathsf{x}(s), \mathsf{x}(s+t))\) for all \(s\), and further, that \(\mathbb{E}[\mathsf{x}(t)]=\frac 1 2 \,\forall t\). So we know that \(\operatorname{Cov}(\mathsf{x}(s),\mathsf{x}(s+t))=\mathbb{E}[\mathsf{x}(0)\mathsf{x}(t)]- \frac 1 4.\) What is \(\mathbb{E}[\mathsf{x}(0)\mathsf{x}(t)]\)?

\[\begin{aligned} \mathbb{E}[\mathsf{x}(0)\mathsf{x}(t)] &=\mathbb{P}[\{\mathsf{x}(0)=1\}\cap\{\mathsf{x}(t)=1\}]\\ &=\mathbb{P}[\mathsf{x}(0)=1]\,\mathbb{P}[\text{number of jumps on \([0,t]\) is even}]\\ &=\frac{1}{2}\mathbb{P}[\mathsf{z}\text{ is even}]\text{ where } \mathsf{z}\sim\operatorname{Poisson}(\lambda t)\\ &=\frac{1}{2}\sum_{k=0}^{\infty}\frac{(\lambda t)^{2k} \exp(-\lambda t)}{(2k)!}\\ &= \frac{1}{2}\exp(-\lambda t) \sum_{k=0}^{\infty}\frac{(\lambda t)^{2k}}{(2k)!}\\ &= \frac{1}{2}\exp(-\lambda t)(\exp(-\lambda t) + \exp(\lambda t))/2 &\text{Taylor expansion of }\cosh\\ &= \frac{\exp(-2\lambda t)}{4} + \frac{1}{4} \end{aligned}\]

From this we deduce that \(\operatorname{Cov}(\mathsf{x}(s),\mathsf{x}(s+t))=\frac{\exp(-2\lambda t)}{4}\) for \(t\geq 0.\) As a sanity check, at \(t=0\) this is \(\frac 1 4\), the variance of a \(\operatorname{Bernoulli}(\frac 1 2)\) variable, and it decays to zero as \(t\to\infty\).
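A quick numerical cross-check can catch slips in calculations like this one. The following is a minimal sketch, assuming NumPy and assuming the rate matrix is read as a generator with \(-\lambda\) on the diagonal (rows summing to zero). The function name `stationary_cov` is made up for illustration. It computes the lagged covariance directly from the matrix exponential of the generator, so any closed form can be compared against it.

```python
import numpy as np

def stationary_cov(lam, t):
    """Cov(x(0), x(t)) for the symmetric two-state chain started at stationarity."""
    # Generator (assumed convention: off-diagonal jump rates, rows summing to zero).
    Q = np.array([[-lam, lam],
                  [lam, -lam]])
    # Q is symmetric, so exp(Q t) follows from its eigendecomposition.
    w, V = np.linalg.eigh(Q)
    P_t = V @ np.diag(np.exp(w * t)) @ V.T   # transition probabilities over lag t
    pi = np.array([0.5, 0.5])                # stationary distribution
    # x(0) x(t) equals 1 only when both states equal 1.
    second_moment = pi[1] * P_t[1, 1]
    return second_moment - pi[1] ** 2

lags = np.array([0.0, 0.1, 0.5, 1.0, 5.0])
covs = np.array([stationary_cov(lam=1.5, t=t) for t in lags])
print(covs)
```

The value at lag zero should equal the variance of a fair coin flip, and the values should decay towards zero as the lag grows.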

### The Hawkes process

Covariance kernels are also important in various point processes. Notably, the Hawkes process was introduced in terms of its covariance. 🚧

### Gaussian processes

The big one, that I will be assuming henceforth, is covered elsewhere.

## General real covariance kernels

A function \(K:\mathcal{T}\times\mathcal{T}\to\mathbb{R}\) can be a covariance kernel if

- It is symmetric in its arguments \(K(s,t)=K(t,s)\) (more generally, conjugate symmetric – \(K(s,t)=K^*(t,s)\), but I think maybe my life will be simpler if I ignore the complex case for the moment.)
- It is positive semidefinite.

That positive semidefiniteness means that for arbitrary real numbers \(c_1,\dots,c_k\) and arbitrary indices \(t_1,\dots,t_k\)

\[ \sum_{i=1}^{k} \sum_{j=1}^{k} c_{i} c_{j} K(t_{i}, t_{j}) \geq 0 \]

The interpretation here is that this is the same as the covariance of different random variables induced by the process, and we need them to be consistent, which implies that it is necessary that

\[ \operatorname{Var}\left\{c_{1} X_{\mathbf{t}_{1}}+\cdots+c_{k} X_{\mathbf{t}_{k}}\right\}= \sum_{i=1}^{k} \sum_{j=1}^{k} c_{i} c_{j} K\left(\mathbf{t}_{i}, \mathbf{t}_{j}\right) \geq 0 \]

This arises from the constraint on the covariance matrix \(\operatorname{Var}(\mathbf X)\) of a random vector \(\mathbf X\in \mathbb{R}^d\), which requires that for all \(\mathbf {b}\in \mathbb{R}^d\)

\[ \operatorname {var} (\mathbf {b} ^{\top}\mathbf {X} ) =\mathbf {b} ^{\top}\operatorname {var} (\mathbf {X} )\mathbf {b} \geq 0. \]

What can we say about this covariance if every element of \(\mathbf X\) is non-negative?

Amazingly (to me), this necessary condition will also be sufficient to make something a covariance kernel. In practice designing covariance functions using positive definiteness is tricky; the space of positive definite kernels is implicit. What we normally do is find a fun class that guarantees positive definiteness and riffle through that. Most of the rest of this notebook is devoted to such classes.
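In practice one often probes positive semidefiniteness empirically, by evaluating a candidate on a finite set of index points and inspecting the eigenvalues of the resulting Gram matrix. A minimal sketch, assuming NumPy, follows. The helper names `gram` and `min_eig` are made up for illustration. A negative eigenvalue disproves the kernel property, whereas non-negative eigenvalues on one point set prove nothing in general.

```python
import numpy as np

def gram(kernel, ts):
    """Gram matrix K[i, j] = kernel(ts[i], ts[j]) for 1-D index points."""
    return kernel(ts[:, None], ts[None, :])

def min_eig(kernel, ts):
    """Smallest eigenvalue of the symmetrised Gram matrix."""
    K = gram(kernel, ts)
    return np.linalg.eigvalsh((K + K.T) / 2).min()

rng = np.random.default_rng(0)
ts = rng.uniform(0.0, 5.0, size=40)

good = lambda s, t: np.exp(-0.5 * (s - t) ** 2)   # squared exponential: a valid kernel
bad = lambda s, t: -np.abs(s - t)                 # symmetric, but not positive semidefinite

print(min_eig(good, ts))   # non-negative up to rounding error
print(min_eig(bad, ts))    # clearly negative
```

The second candidate has a Gram matrix with zero trace, so its eigenvalues sum to zero and at least one must be negative.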

## Bonus: complex covariance kernels

I talked in terms of real kernels above because I generally observe real measurements of processes. But often complex covariances arise in a natural way too.

A function \(K:\mathcal{T}\times\mathcal{T}\to\mathbb{C}\) can be a covariance kernel if

- It is *conjugate* symmetric in its arguments, \(K(s,t)=K^*(t,s)\).
- It is positive semidefinite.

That positive semidefiniteness means that for arbitrary complex numbers \(c_1,\dots,c_k\) and arbitrary indices \(t_1,\dots,t_k\)

\[ \sum_{i=1}^{k} \sum_{j=1}^{k} c_{i} \overline{c_{j}} K(t_{i}, t_{j}) \geq 0. \]
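To make the complex case concrete, here is a minimal sketch, assuming NumPy, that uses a kernel built from complex exponentials, \(K(s,t)=\sum_m w_m e^{i\omega_m (s-t)}\) with non-negative weights. The frequencies and weights are arbitrary illustrative choices. It checks conjugate symmetry and evaluates the quadratic form above for a random complex vector.

```python
import numpy as np

def complex_kernel(s, t, omegas=(1.0, 2.5), weights=(0.7, 0.3)):
    """K(s, t) = sum_m weights[m] * exp(1j * omegas[m] * (s - t))."""
    return sum(w * np.exp(1j * om * (s - t)) for om, w in zip(omegas, weights))

ts = np.linspace(0.0, 3.0, 25)
K = complex_kernel(ts[:, None], ts[None, :])

is_hermitian = np.allclose(K, K.conj().T)   # conjugate symmetry
eigs = np.linalg.eigvalsh(K)                # eigvalsh accepts Hermitian matrices

rng = np.random.default_rng(1)
c = rng.normal(size=25) + 1j * rng.normal(size=25)
quad_form = c @ K @ c.conj()                # sum_ij c_i conj(c_j) K(t_i, t_j)

print(is_hermitian, eigs.min(), quad_form)
```

The quadratic form comes out real and non-negative up to rounding, as the definition requires.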

Every analytic real kernel should also be a complex kernel, right? 🏗.

## Kernel zoo

What follows are some useful kernels to have in my toolkit, mostly over \(\mathbb{R}^n\) or at least some space with a metric. There are many more than I could fit here, over many more spaces than I need. (Real vectors, strings, other kernels, probability distributions etc.)

For these I have freely raided David Duvenaud’s crib notes which became a thesis chapter (Duvenaud 2014). Also wikipedia and (Abrahamsen 1997; Genton 2002).

### Stationary

A popular assumption, more or less implying that no region of the process is *special*.
In this case the kernel may be written as a function purely of the displacement between observations and, if it is moreover isotropic, purely of the distance between them, i.e.

\[K(s,t)=K(\|s-t\|)\] for some distance \(\|\cdot\|\) between the observation coordinates.

### Wiener process kernel

From the naming, we might suspect that a Gaussian process would also describe a
standard Wiener process,
which after all is a process with Gaussian *increments*,
which is a certain type of dependence.
It is over a boring index space, time
\(t\geq 0\), but
there is indeed nothing stopping us.

We can read this right off the Wiener process Wikipedia page. For a standard Wiener process \(\{W_t\}_{t\geq 0},\)

\[ {\displaystyle \operatorname {cov} (W_{s},W_{t})=s \wedge t} \]

Here \(s \wedge t\) means “the minimum of \(s\) and \(t\)”.

That result is standard. From it we can immediately construct the kernel \(K(s,t):=s \wedge t\).
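As a sanity check on \(K(s,t)=s\wedge t\), we can sample Gaussian vectors with this Gram matrix and confirm that the increments over disjoint intervals come out uncorrelated, with variance equal to the interval length. This is a minimal sketch, assuming NumPy. The time grid and sample count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
ts = np.linspace(0.05, 1.0, 20)                 # strictly positive times, so K is nonsingular
K = np.minimum(ts[:, None], ts[None, :])        # K(s, t) = s ∧ t

# Draw zero-mean Gaussian vectors with covariance K via a Cholesky factor.
L = np.linalg.cholesky(K)
paths = L @ rng.standard_normal((len(ts), 50_000))   # each column is one sample path

emp_cov = np.cov(paths)                         # rows are time points
increments = np.diff(paths, axis=0)
inc_cov = np.cov(increments)

cov_err = np.abs(emp_cov - K).max()
inc_err = np.abs(inc_cov - np.diag(np.diff(ts))).max()
print(cov_err, inc_err)                         # both small, up to Monte Carlo error
```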

### Causal kernels

Time-indexed processes more general than a standard Wiener process. We can construct kernels that are more general than this, right? 🏗

TODO: pd in this context? stationary in this context? homogenous in this context?

### Markov kernels

How can we know from inspecting a kernel whether it implies an independence structure of some kind? The Wiener process and causal kernels clearly imply certain independences. Any kernel \(k(s,t)=k(s\wedge t)\) is clearly Markov. Are there more general ones? TODO: relate to kernels of bounded support. 🏗

TODO: pd in this context? stationary in this context? homogenous in this context?

### Squared exponential

The classic, default, analytically convenient, because it is proportional to the Gaussian density and therefore cancels out with it at opportune times.

\[k_{\textrm{SE}}(x, x') = \sigma^2\exp\left(-\frac{(x - x')^2}{2\ell^2}\right)\]

### Rational Quadratic

Duvenaud reckons this is everywhere but TBH I have not seen it. Included for completeness.

\[k_{\textrm{RQ}}(x, x') = \sigma^2 \left( 1 + \frac{(x - x')^2}{2 \alpha \ell^2} \right)^{-\alpha}\]

Note that \(\lim_{\alpha\to\infty} k_{\textrm{RQ}}= k_{\textrm{SE}}\).
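The limit is easy to see numerically. Here is a minimal sketch, assuming NumPy, with both kernels written exactly as in the formulas above. The function names `k_se` and `k_rq` are made up for illustration. It records the largest gap between the two kernels on a grid for increasing \(\alpha\).

```python
import numpy as np

def k_se(x, xp, sigma=1.0, ell=1.0):
    """Squared exponential kernel."""
    return sigma**2 * np.exp(-(x - xp) ** 2 / (2 * ell**2))

def k_rq(x, xp, sigma=1.0, ell=1.0, alpha=1.0):
    """Rational quadratic kernel."""
    return sigma**2 * (1 + (x - xp) ** 2 / (2 * alpha * ell**2)) ** (-alpha)

x = np.linspace(-3, 3, 61)
alphas = (1.0, 10.0, 100.0, 1e4)
# Largest absolute difference between RQ and SE over the grid, per alpha.
gaps = {alpha: np.abs(k_rq(x, 0.0, alpha=alpha) - k_se(x, 0.0)).max()
        for alpha in alphas}
print(gaps)
```

The gap shrinks roughly like \(1/\alpha\), which is consistent with reading the rational quadratic as a scale mixture of squared exponentials that concentrates on a single length scale as \(\alpha\) grows.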

### Matérn

The Matérn stationary (and in the Euclidean case, isotropic) covariance function is one model for covariance. See Carl Edward Rasmussen’s Gaussian Process lecture notes for a readable explanation, or chapter 4 of his textbook (Rasmussen and Williams 2006).

\[ k_{\textrm{Mat}}(x, x')=\sigma^{2} \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2 \nu} \frac{|x - x'|}{\rho}\right)^{\nu} K_{\nu}\left(\sqrt{2 \nu} \frac{|x - x'|}{\rho}\right) \]

where \(\Gamma\) is the gamma function, \(K_{\nu }\) is the modified Bessel function of the second kind, and \(\rho,\nu> 0\).

AFAICT you use this for covariances hypothesised to be less smooth than the squared exponential covariance. And other things?
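For half-integer \(\nu\) the Bessel function collapses to an exponential times a polynomial, which avoids special functions entirely. The sketch below, assuming NumPy, implements only the three standard cases \(\nu\in\{1/2, 3/2, 5/2\}\). The function name `matern` is made up for illustration. It also shows how quickly the kernel drops away from the origin, which is what controls the roughness of the sample paths.

```python
import numpy as np

def matern(r, nu, sigma=1.0, rho=1.0):
    """Matérn kernel at distance r for nu in {0.5, 1.5, 2.5} (closed forms)."""
    r = np.abs(np.asarray(r, dtype=float))
    a = np.sqrt(2 * nu) * r / rho
    if nu == 0.5:
        poly = 1.0                      # exponential (Ornstein-Uhlenbeck) kernel
    elif nu == 1.5:
        poly = 1.0 + a
    elif nu == 2.5:
        poly = 1.0 + a + a**2 / 3.0
    else:
        raise ValueError("only the half-integer cases 0.5, 1.5, 2.5 are implemented")
    return sigma**2 * poly * np.exp(-a)

h = 1e-3
# 1 - k(h) near the origin: linear in h for nu = 1/2, quadratic for smoother cases.
drop = {nu: float(1.0 - matern(h, nu)) for nu in (0.5, 1.5, 2.5)}
print(drop)
```

The \(\nu=1/2\) case loses correlation linearly in the lag, giving rough, non-differentiable paths, while the larger values lose it only quadratically.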

### Periodic

\[ k_{\textrm{Per}}(x, x') = \sigma^2\exp\left(-\frac{2\sin^2(\pi|x - x'|/p)}{\ell^2}\right) \]

### Locally periodic

This is an example of a composed kernel, explained below.

\[\begin{aligned} k_{\textrm{LocPer}}(x, x') &= k_{\textrm{Per}}(x, x')k_{\textrm{SE}}(x, x') \\ &= \sigma^2\exp\left(-\frac{2\sin^2(\pi|x - x'|/p)}{\ell^2}\right) \exp\left(-\frac{(x - x')^2}{2\ell^2}\right) \end{aligned}\]

Obviously there are other possible localisations of a
periodic kernel. This is *a* Locally Periodic kernel.
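A minimal sketch of the periodic kernel and this particular localisation, assuming NumPy. Giving the periodic and squared exponential factors separate length scales (`ell_per`, `ell_se`) is an assumption made for the sketch, since the formula above shares one \(\ell\).

```python
import numpy as np

def k_per(x, xp, sigma=1.0, ell=1.0, p=2.0):
    """Periodic kernel with period p."""
    return sigma**2 * np.exp(-2 * np.sin(np.pi * np.abs(x - xp) / p) ** 2 / ell**2)

def k_se(x, xp, sigma=1.0, ell=1.0):
    """Squared exponential kernel."""
    return sigma**2 * np.exp(-(x - xp) ** 2 / (2 * ell**2))

def k_loc_per(x, xp, sigma=1.0, ell_per=1.0, ell_se=3.0, p=2.0):
    # Separate length scales are an assumption made for this sketch.
    return k_per(x, xp, sigma=sigma, ell=ell_per, p=p) * k_se(x, xp, sigma=1.0, ell=ell_se)

x = np.linspace(0.0, 10.0, 80)
K_per = k_per(x[:, None], x[None, :])
K_loc = k_loc_per(x[:, None], x[None, :])

one_period = k_loc_per(0.0, 2.0)     # correlation one period apart
four_periods = k_loc_per(0.0, 8.0)   # correlation four periods apart
print(k_per(0.3, 2.3), one_period, four_periods)
print(np.linalg.eigvalsh(K_per).min(), np.linalg.eigvalsh(K_loc).min())
```

The purely periodic kernel is perfectly correlated at lags equal to the period, while the localised version lets that correlation fade, so the waveform can drift over time.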

### With atoms

🏗 Is this feasible? What are the constructions that allow discontinuity in the process and what difficulties do they engender?

### “Integral” kernel

I just noticed the ambiguously named Integral kernel:

I’ve called the kernel the ‘integral kernel’ as we use it when we know observations of the integrals of a function, and want to estimate the function itself.

Examples include:

- Knowing how far a robot has travelled after 2, 4, 6 and 8 seconds, but wanting an estimate of its speed after 5 seconds…
- Wanting to know an estimate of the density of people aged 23, when we only have the total count for binned age ranges…

I would argue that *all* kernels are naturally defined in terms of integrals,
but the author seems to mean something particular.
I suspect I would call this a sampling
kernel, but that name is also overloaded.
Anyway, what is *actually* going on here?
Where is it introduced? Possibly one of
(Smith, Alvarez, and Lawrence 2018; O’Callaghan and Ramos 2011; Murray-Smith and Pearlmutter 2005).

### Composing kernels

A sum or product (or outer sum, or tensor product) of kernels is still a kernel. For other transforms YMMV.

For example, in the case of Gaussian processes, suppose that, independently,

\[\begin{aligned} f_{1} &\sim \mathcal{GP}\left(\mu_{1}, k_{1}\right)\\ f_{2} &\sim \mathcal{GP}\left(\mu_{2}, k_{2}\right) \end{aligned}\] then

\[ f_{1}+f_{2} \sim \mathcal{GP} \left(\mu_{1}+\mu_{2}, k_{1}+k_{2}\right) \] so \(k_{1}+k_{2}\) is also a kernel.

More generally, if \(k_{1}\) and \(k_{2}\) are two kernels, and \(c_{1}\) and \(c_{2}\) are two positive real numbers, then:

\[ K(x, x')=c_{1} k_{1}(x, x')+c_{2} k_{2}(x, x') \] is again a kernel. What with the multiplication as well, we note that all polynomials of kernels where the coefficients are positive are in turn kernels. (Genton 2002)

Note that the additivity in terms of kernels is not the same as additivity
in terms of induced feature spaces.
The induced feature map of \(k_{1}+k_{2}\) is their *concatenation* rather
than their sum.
Suppose \(\phi_{1}(x)\) gives us the feature map of \(k_{1}\) for \(x\) and likewise
\(\phi_{2}(x)\).

\[\begin{aligned} k_{1}(x, x') &=\phi_{1}(x)^{\top} \phi_{1}(x') \\ k_{2}(x, x') &=\phi_{2}(x)^{\top} \phi_{2}(x')\\ k_{1}(x, x')+k_{2}(x, x') &=\phi_{1}(x)^{\top} \phi_{1}(x')+\phi_{2}(x)^{\top} \phi_{2}(x')\\ &=\left[\begin{array}{c}{\phi_{1}(x)} \\ {\phi_{2}(x)}\end{array}\right]^{\top} \left[\begin{array}{c}{\phi_{1}(x')} \\ {\phi_{2}(x')}\end{array}\right] \end{aligned}\]
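The concatenation identity can be checked with explicit finite-dimensional feature maps. This is a minimal sketch, assuming NumPy, with two toy feature maps chosen purely for illustration. It also shows the analogous fact for products, where the feature map of \(k_1 k_2\) is the Kronecker (tensor) product of the two feature maps.

```python
import numpy as np

def phi1(x):
    """Feature map of k1(x, x') = 1 + x x' (toy choice)."""
    return np.array([1.0, x])

def phi2(x):
    """Feature map of k2(x, x') = (x x')**2 (toy choice)."""
    return np.array([x**2])

def k1(x, xp):
    return phi1(x) @ phi1(xp)

def k2(x, xp):
    return phi2(x) @ phi2(xp)

def phi_sum(x):
    """Features of k1 + k2: concatenate the two feature vectors."""
    return np.concatenate([phi1(x), phi2(x)])

def phi_prod(x):
    """Features of k1 * k2: Kronecker product of the two feature vectors."""
    return np.kron(phi1(x), phi2(x))

x, xp = 0.7, -1.3
print(phi_sum(x) @ phi_sum(xp), k1(x, xp) + k2(x, xp))
print(phi_prod(x) @ phi_prod(xp), k1(x, xp) * k2(x, xp))
```

Summing kernels adds feature dimensions, while multiplying kernels multiplies them, which is one way to see why products yield much richer feature spaces.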

If \(k_{y}:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}\) is a kernel and \(\psi: \mathcal{X}\to\mathcal{Y}\) is any function, then the following is also a kernel:

\[\begin{aligned} k_{x}:&\mathcal{X}\times\mathcal{X}\to\mathbb{R}\\ & (x,x')\mapsto k_{y}(\psi(x), \psi(x')) \end{aligned}\]

which apparently is now called a *deep kernel*.
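A classic hand-built instance of this warping construction is to map the real line onto the unit circle and apply a squared exponential kernel there. The sketch below, assuming NumPy, checks numerically that this reproduces the periodic kernel from the zoo above. The function names and the choice of \(\psi\) are illustrative assumptions, not something the references prescribe.

```python
import numpy as np

def k_se_vec(u, v, ell=1.0):
    """Squared exponential kernel on vectors stored along the last axis."""
    return np.exp(-np.sum((u - v) ** 2, axis=-1) / (2 * ell**2))

def psi(x, p=2.0):
    """Warp the line onto the unit circle, once around per period p."""
    ang = 2 * np.pi * np.asarray(x) / p
    return np.stack([np.cos(ang), np.sin(ang)], axis=-1)

def k_warped(x, xp, ell=1.0, p=2.0):
    """k_x(x, x') = k_y(psi(x), psi(x'))."""
    return k_se_vec(psi(x, p), psi(xp, p), ell)

def k_per(x, xp, ell=1.0, p=2.0):
    """Periodic kernel written directly."""
    return np.exp(-2 * np.sin(np.pi * np.abs(x - xp) / p) ** 2 / ell**2)

x = np.linspace(0.0, 5.0, 30)
A = k_warped(x[:, None], x[None, :])
B = k_per(x[:, None], x[None, :])
print(np.abs(A - B).max())   # agreement up to rounding error
```

The agreement follows from \(\|\psi(x)-\psi(x')\|^2 = 4\sin^2(\pi(x-x')/p)\). A learned \(\psi\), for example a neural network, slots into the same place.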

Also if \(A\) is a positive definite operator, then of course it defines a kernel \(k_A(x,x'):=x^{\top}Ax'\)

(Genton 2002) uses the properties of covariance to construct some other nifty ones:

Let \(h:\mathcal{X}\to\mathbb{R}^{+}\) have minimum at 0. Then, using the identity for RVs

\[ \mathop{\textrm{Cov}}\left(Y_{1}, Y_{2}\right)=\left[\mathop{\textrm{Var}}\left(Y_{1}+Y_{2}\right)-\mathop{\textrm{Var}}\left(Y_{1}-Y_{2}\right)\right] / 4 \]

we find that the following is a kernel

\[ K(x, x')=\frac{1}{4}[h(x+x')-h(x-x')] \]

All these lend themselves to various cunning combination strategies, which I will likely return to discuss. 🏗 Some of them are in the references. For example, (Duvenaud et al. 2013) position their work in the wider field:

There is a large body of work attempting to construct a rich kernel through a weighted sum of base kernels (e.g. (Bach 2008; Christoudias, Urtasun, and Darrell 2009)). While these approaches find the optimal solution in polynomial time, speed comes at a cost: the component kernels, as well as their hyperparameters, must be specified in advance

…(Hinton and Salakhutdinov 2008) use a deep neural network to learn an embedding; this is a flexible approach to kernel learning but relies upon finding structure in the input density, p(x). Instead we focus on domains where most of the interesting structure is in f(x).

(Wilson and Adams 2013) derive kernels of the form \(\mathrm{SE} \times \cos(x - x_0)\), forming a basis for stationary kernels. These kernels share similarities with \(\mathrm{SE} \times \mathrm{Per}\) but can express negative prior correlation, and could usefully be included in our grammar.

See (Grosse et al. 2012) for a mind-melting compositional matrix factorization diagram, constructing a search over hierarchical kernel decompositions.

*Figure caption, from (Grosse et al. 2012):* Examples of existing machine learning models which fall under our framework. Arrows represent models reachable using a single production rule. Only a small fraction of the 2496 models reachable within 3 steps are shown, and not all possible arrows are shown.

### Stationary spectral kernels

(Sun et al. 2018; Bochner 1959; Kom Samo and Roberts 2015; Yaglom 1987) construct spectral kernels in the sense that they use the spectral representation to design the kernel, and guarantee that it is positive definite and stationary using Bochner’s theorem.

Bochner’s theorem: A complex-valued function \(K\) on \(\mathbb{R}^{d}\) is the covariance function of a weakly stationary mean square continuous complex-valued random process on \(\mathbb{R}^{d}\) if and only if it can be represented as

\[ K(\boldsymbol{\tau})=\int_{\mathbb{R}^{d}} \exp \left(2 \pi i \boldsymbol{w}^{\top} \boldsymbol{\tau}\right) \psi(\mathrm{d} \boldsymbol{w}) \] where \(\psi\) is a positive and finite measure. If \(\psi\) has a density \(S(\boldsymbol{w})\), then \(S\) is called the spectral density or power spectrum of \(K\), i.e. \(S\) and \(K\) are Fourier duals.
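To see the Fourier duality in action, the following minimal sketch, assuming NumPy, takes a Gaussian spectral density in one dimension and integrates it numerically against \(\exp(2\pi i w\tau)\). Under the frequency convention used in the theorem, a Gaussian density with standard deviation \(1/(2\pi\ell)\) should give back the unit-variance squared exponential kernel with length scale \(\ell\). The grid sizes are arbitrary illustrative choices.

```python
import numpy as np

ell = 0.7
# Assumed spectral density: zero-mean Gaussian with standard deviation 1 / (2 pi ell).
s_w = 1.0 / (2 * np.pi * ell)
w = np.linspace(-10 * s_w, 10 * s_w, 4001)       # frequency grid covering +-10 std devs
dw = w[1] - w[0]
S = np.exp(-w**2 / (2 * s_w**2)) / (np.sqrt(2 * np.pi) * s_w)

tau = np.linspace(-3.0, 3.0, 13)
# Riemann-sum approximation of K(tau) = integral of exp(2 pi i w tau) S(w) dw.
K_num = (np.exp(2j * np.pi * np.outer(tau, w)) * S).sum(axis=1) * dw
K_se = np.exp(-tau**2 / (2 * ell**2))            # squared exponential kernel

print(np.abs(K_num - K_se).max())                # small discretisation error
print(np.abs(K_num.imag).max())                  # a symmetric density gives a real kernel
```

Any other positive finite density in place of `S` yields a valid stationary kernel by the same recipe, which is what spectral kernel constructions exploit.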

### Non-stationary spectral kernels

(Sun et al. 2018; Remes, Heinonen, and Kaski 2017; Kom Samo and Roberts 2015) use a generalised Bochner Theorem (Yaglom 1987) which does not presume anything about stationarity:

A complex-valued bounded continuous function \(K\) on \(\mathbb{R}^{d}\times\mathbb{R}^{d}\) is the covariance function of a mean square continuous complex-valued random process on \(\mathbb{R}^{d}\) if and only if it can be represented as

\[ K(\boldsymbol{s}, \boldsymbol{t})=\int_{\mathbb{R}^{d} \times \mathbb{R}^{d}} e^{2 \pi i\left(\boldsymbol{w}_{1}^{\top} \boldsymbol{s}-\boldsymbol{w}_{2}^{\top} \boldsymbol{t}\right)} \psi\left(\mathrm{d} \boldsymbol{w}_{\mathbf{1}}, \mathrm{d} \boldsymbol{w}_{\mathbf{2}}\right) \]

This is clearly more general, but it is not immediately clear how to use this extra potential; spectral representations are not an intuitive way of constructing things.

### Locally stationary

(Genton 2002) defines these as kernels that have a particular structure, specifically, a kernel that can be factored into a stationary kernel \(K_2\) and a non-negative function \(K_1\) in the following way:

\[ K(\mathbf{s}, \mathbf{t})=K_{1}\left(\frac{\mathbf{s}+\mathbf{t}}{2}\right) K_{2}(\mathbf{s}-\mathbf{t}) \]

Global structure then depends on the mean location \(\frac{\mathbf{s}+\mathbf{t}}{2}\). (Genton 2002) describes some nifty spectral properties of these kernels.
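A small worked instance may help. Taking \(K_1(m)=e^{-2am^2}\) and \(K_2(d)=e^{-ad^2/2}\) gives \(K(s,t)=e^{-a(s^2+t^2)}\), which is a rank-one kernel and therefore easy to verify. The sketch below, assuming NumPy, checks this numerically. The choice of \(K_1\), \(K_2\) and \(a\) is an illustrative assumption, and the sketch makes no claim about arbitrary non-negative \(K_1\).

```python
import numpy as np

a = 0.5

def K1(m):
    """Non-negative function of the midpoint (s + t) / 2."""
    return np.exp(-2 * a * m**2)

def K2(d):
    """Stationary kernel of the lag s - t."""
    return np.exp(-a * d**2 / 2)

def k_loc_stat(s, t):
    return K1((s + t) / 2) * K2(s - t)

ts = np.linspace(-2.0, 2.0, 30)
K = k_loc_stat(ts[:, None], ts[None, :])
K_rank_one = np.exp(-a * ts[:, None] ** 2) * np.exp(-a * ts[None, :] ** 2)

print(np.abs(K - K_rank_one).max())      # the two expressions agree
print(np.diag(K)[[0, 15, 29]])           # the variance depends on location
```

The variance \(K(s,s)=K_1(s)K_2(0)\) changes with location, which is the sense in which \(K_1\) carries the global structure.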

Other constructions might vie for the title of “locally stationary”. To check. 🏗

### Genton kernels

That’s my name for them because they seem to originate in (Genton 2002).

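For a suitable non-negative function \(h:\mathcal{T}\to\mathbb{R}^+\) with \(h(\mathbf{0})=0,\) the following is a kernel:

\[ K(\mathbf{s}, \mathbf{t})=\frac{1}{4}[h(\mathbf{s}+\mathbf{t})-h(\mathbf{s}-\mathbf{t})] \] Genton gives the example of \(h:\mathbf{x}\mapsto \mathbf{x}^\top \mathbf{x},\) which recovers the linear kernel \(K(\mathbf{s},\mathbf{t})=\mathbf{s}^\top\mathbf{t}.\) Non-negativity alone is not enough, though: \(h\) has to behave like a variance, as in the identity below. For instance \(h(x)=x^4\) on \(\mathbb{R}\) gives \(K(s,t)=2st(s^2+t^2)\), which is not positive semidefinite (try \(s=1, t=2\)).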

The motivation is the identity

\[ \operatorname { Covariance }\left(Y_{1}, Y_{2}\right)= \left[\operatorname { Variance }\left(Y_{1}+Y_{2}\right)-\operatorname { Variance }\left(Y_{1}-Y_{2}\right)\right] / 4. \]
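As a concrete check of the construction, here is a minimal sketch, assuming NumPy, with the illustrative choice \(h(\mathbf{x})=\mathbf{x}^\top\mathbf{x}\). By the polarisation identity the resulting kernel should coincide with the plain inner product. The function names are made up for illustration.

```python
import numpy as np

def h(x):
    """Illustrative choice h(x) = x^T x: non-negative, with h(0) = 0."""
    return np.sum(x * x, axis=-1)

def k_genton(s, t):
    """K(s, t) = [h(s + t) - h(s - t)] / 4."""
    return 0.25 * (h(s + t) - h(s - t))

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 3))                       # 15 points in R^3
K = k_genton(X[:, None, :], X[None, :, :])         # 15 x 15 Gram matrix

print(np.abs(K - X @ X.T).max())                   # matches the linear kernel
print(np.linalg.eigvalsh(K).min())                 # non-negative up to rounding
```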

### Compactly supported

We usually think about these in the stationary isotropic case, where we mean kernels that vanish whenever the distance between two observations \(s,t\) is larger than a certain cut-off distance \(L,\) i.e. \(\|s-t\|>L\Rightarrow K(s,t)=0\). These are great because they make the Gram matrix sparse (for example, if the cut-off is much smaller than the diameter of the observations and most observations have few covariance neighbours) and so can lead to computational efficiency even for exact inference without any special tricks. They don’t seem to be popular? Statisticians are generally nervous around inferring the support of a parameter, or assigning zero weight to any region of a prior without very good reason, and I imagine this carries over to the analogous problem of covariance kernel support?

Despite feeling that qualm intuitively myself, I think there are good cases for this kind of kernel; for example, in hierarchical kernels you can get correlation between kernels of disjoint support.

(Genton 2002) mentions

\[ \max \left\{\left(1-\frac{\|\mathbf{s}-\mathbf{t}\|}{\tilde{\theta}}\right)^{\tilde{\nu}}, 0\right\} \] and handballs us to (Gneiting 2002) for a bigger smörgåsbord of stationary compactly supported kernels. Gneiting has a couple of methods designed to produce certain smoothness properties at boundary and origin, but mostly about producing compactly supported kernels via clever integral transforms.
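To illustrate the sparsity argument, here is a minimal sketch, assuming NumPy, that builds the Gram matrix of the truncated power kernel above for scattered one-dimensional locations. The parameter values \(\tilde\theta=0.5\) and \(\tilde\nu=2\) are illustrative assumptions. As I understand it, \(\tilde\nu\) must be large enough relative to the dimension for positive definiteness, and in one dimension \(\tilde\nu\geq 1\) suffices.

```python
import numpy as np

def k_compact(s, t, theta=0.5, nu=2.0):
    """max{(1 - |s - t| / theta)^nu, 0} for scalar locations."""
    return np.maximum(1.0 - np.abs(s - t) / theta, 0.0) ** nu

rng = np.random.default_rng(0)
xs = np.sort(rng.uniform(0.0, 10.0, size=200))   # sorting makes the matrix banded
K = k_compact(xs[:, None], xs[None, :])

sparsity = np.mean(K == 0.0)                     # fraction of exactly-zero entries
min_eig = np.linalg.eigvalsh(K).min()
print(sparsity, min_eig)
```

With the cut-off at a twentieth of the domain width, roughly nine entries in ten are exactly zero, so a sparse Cholesky factorisation would have far less work to do than a dense one.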

### Kernels with desired symmetry

(Duvenaud 2014, chap. 2) summarises Ginsbourger et al’s work on kernels with desired symmetries / invariances. 🏗 This produces for example, the periodic kernel above, but also such cute tricks as priors over Möbius strips.

## Learning kernels

This is usually in the context of Gaussian processes where everything can work out nicely if you are lucky. The goal for all these seems to be to maximise the marginal likelihood, a.k.a. model evidence, which will be familiar from every Bayesian ML method ever.
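For Gaussian process regression with Gaussian noise, the evidence is available in closed form, \(\log p(\mathbf{y})=-\frac12\mathbf{y}^\top K^{-1}\mathbf{y}-\frac12\log|K|-\frac n2\log 2\pi\), where \(K\) is the Gram matrix plus the noise variance on the diagonal. The sketch below, assuming NumPy, evaluates it via a Cholesky factor and picks a length scale by crude grid search. The data are synthetic and made up for the demo, the noise level is assumed known, and a real implementation would use gradients rather than a grid.

```python
import numpy as np

def k_se(x, xp, sigma=1.0, ell=1.0):
    """Squared exponential kernel."""
    return sigma**2 * np.exp(-(x - xp) ** 2 / (2 * ell**2))

def log_marginal_likelihood(x, y, ell, noise=0.1):
    """log p(y | x, ell) for a zero-mean GP with SE kernel plus iid Gaussian noise."""
    n = len(x)
    K = k_se(x[:, None], x[None, :], ell=ell) + noise**2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y
    # sum(log(diag(L))) equals half the log-determinant of K.
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 40)
y = np.sin(2.0 * x) + 0.1 * rng.standard_normal(x.size)   # synthetic data for the demo

ells = np.logspace(-1.5, 1.0, 30)                         # candidate length scales
lmls = np.array([log_marginal_likelihood(x, y, ell) for ell in ells])
best_ell = ells[lmls.argmax()]
print(best_ell, lmls.max())
```

Very short length scales are penalised for explaining the data as noise-like wiggles, and very long ones for failing to fit, so the evidence peaks somewhere in between.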

### Learning kernel hyperparameters

🏗

### Learning kernel composition

Automating kernel design by some composition of simpler atomic kernels. AFAICT this started from summaries like (Genton 2002) and went via Duvenaud’s aforementioned notes to become a small industry (Lloyd et al. 2014; Duvenaud, Nickisch, and Rasmussen 2011; Duvenaud et al. 2013; Grosse et al. 2012). A prominent example was the Automated statistician project by David Duvenaud, James Robert Lloyd, Roger Grosse and colleagues, which works by greedy combinatorial search over possible compositions.

More fashionable, presumably, are the differentiable search methods. For example, the AutoGP system (Krauth et al. 2016; Bonilla, Krauth, and Dezfouli 2016) incorporates tricks like these to use gradient descent to design kernels for Gaussian processes. (Sun et al. 2018) construct deep networks of composed kernels. I imagine the Deep Gaussian Process literature is also of this kind, but have not read it.

## Non-positive kernels

As in, kernels which are not positive-definite. (Ong et al. 2004) 🏗

Abrahamsen, Petter. 1997. “A Review of Gaussian Random Fields and Correlation Functions.” http://publications.nr.no/publications.nr.no/directdownload/publications.nr.no/rask/old/917_Rapport.pdf.

Agarwal, Arvind, and Hal Daumé III. 2011. “Generative Kernels for Exponential Families.” In *Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics*, 85–92. http://proceedings.mlr.press/v15/agarwal11b.html.

Aronszajn, N. 1950. “Theory of Reproducing Kernels.” *Transactions of the American Mathematical Society* 68 (3): 337–404. https://doi.org/10.2307/1990404.

Álvarez, Mauricio A., Lorenzo Rosasco, and Neil D. Lawrence. 2012. “Kernels for Vector-Valued Functions: A Review.” *Foundations and Trends® in Machine Learning* 4 (3): 195–266. https://doi.org/10.1561/2200000036.

Bach, Francis. 2008. “Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning.” In *Proceedings of the 21st International Conference on Neural Information Processing Systems*, 105–12. NIPS’08. USA: Curran Associates Inc. http://papers.nips.cc/paper/3418-exploring-large-feature-spaces-with-hierarchical-multiple-kernel-learning.pdf.

Bacry, Emmanuel, and Jean-François Muzy. 2016. “First- and Second-Order Statistics Characterization of Hawkes Processes and Non-Parametric Estimation.” *IEEE Transactions on Information Theory* 62 (4): 2184–2202. https://doi.org/10.1109/TIT.2016.2533397.

Balog, Matej, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M. Roy, and Yee Whye Teh. 2016. “The Mondrian Kernel,” June. http://arxiv.org/abs/1606.05241.

Bochner, Salomon. 1959. *Lectures on Fourier Integrals*. Princeton University Press. http://books.google.com?id=MWCYDwAAQBAJ.

Bonilla, Edwin V., Karl Krauth, and Amir Dezfouli. 2016. “Generic Inference in Latent Gaussian Process Models,” September. http://arxiv.org/abs/1609.00577.

Christoudias, Mario, Raquel Urtasun, and Trevor Darrell. 2009. “Bayesian Localized Multiple Kernel Learning.” UCB/EECS-2009-96. EECS Department, University of California, Berkeley. http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-96.html.

Cortes, Corinna, Patrick Haffner, and Mehryar Mohri. 2004. “Rational Kernels: Theory and Algorithms.” *Journal of Machine Learning Research* 5 (December): 1035–62. http://dl.acm.org/citation.cfm?id=1005332.1016793.

Cressie, Noel, and Hsin-Cheng Huang. 1999. “Classes of Nonseparable, Spatio-Temporal Stationary Covariance Functions.” *Journal of the American Statistical Association* 94 (448): 1330–9. https://doi.org/10.1080/01621459.1999.10473885.

Delft, Anne van, and Michael Eichler. 2016. “Locally Stationary Functional Time Series,” February. http://arxiv.org/abs/1602.05125.

Duvenaud, David. 2014. “Automatic Model Construction with Gaussian Processes.” PhD Thesis, University of Cambridge. https://github.com/duvenaud/phd-thesis.

Duvenaud, David K., Hannes Nickisch, and Carl E. Rasmussen. 2011. “Additive Gaussian Processes.” In *Advances in Neural Information Processing Systems*, 226–34. http://papers.nips.cc/paper/4221-additive-gaussian-processes.pdf.

Duvenaud, David, James Lloyd, Roger Grosse, Joshua Tenenbaum, and Zoubin Ghahramani. 2013. “Structure Discovery in Nonparametric Regression Through Compositional Kernel Search.” In *Proceedings of the 30th International Conference on Machine Learning (ICML-13)*, 1166–74. http://machinelearning.wustl.edu/mlpapers/papers/icml2013_duvenaud13.

Genton, Marc G. 2002. “Classes of Kernels for Machine Learning: A Statistics Perspective.” *Journal of Machine Learning Research* 2 (March): 299–312. http://jmlr.org/papers/volume2/genton01a/genton01a.pdf.

Girolami, Mark, and Simon Rogers. 2005. “Hierarchic Bayesian Models for Kernel Learning.” In *Proceedings of the 22nd International Conference on Machine Learning - ICML ’05*, 241–48. Bonn, Germany: ACM Press. https://doi.org/10.1145/1102351.1102382.

Gneiting, Tilmann. 2002. “Compactly Supported Correlation Functions.” *Journal of Multivariate Analysis* 83 (2): 493–508. https://doi.org/10.1006/jmva.2001.2056.

Gneiting, Tilmann, William Kleiber, and Martin Schlather. 2010. “Matérn Cross-Covariance Functions for Multivariate Random Fields.” *Journal of the American Statistical Association* 105 (491): 1167–77. https://doi.org/10.1198/jasa.2010.tm09420.

Grosse, Roger, Ruslan R. Salakhutdinov, William T. Freeman, and Joshua B. Tenenbaum. 2012. “Exploiting Compositionality to Explore a Large Space of Model Structures.” In *Proceedings of the Conference on Uncertainty in Artificial Intelligence*. http://arxiv.org/abs/1210.4856.

Hartikainen, J., and S. Särkkä. 2010. “Kalman Filtering and Smoothing Solutions to Temporal Gaussian Process Regression Models.” In *2010 IEEE International Workshop on Machine Learning for Signal Processing*, 379–84. Kittila, Finland: IEEE. https://doi.org/10.1109/MLSP.2010.5589113.

Hawkes, Alan G. 1971. “Spectra of Some Self-Exciting and Mutually Exciting Point Processes.” *Biometrika* 58 (1): 83–90. https://doi.org/10.1093/biomet/58.1.83.

Hinton, Geoffrey E, and Ruslan R Salakhutdinov. 2008. “Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes.” In *Advances in Neural Information Processing Systems 20*, edited by J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, 1249–56. Curran Associates, Inc. http://papers.nips.cc/paper/3211-using-deep-belief-nets-to-learn-covariance-kernels-for-gaussian-processes.pdf.

Hofmann, Thomas, Bernhard Schölkopf, and Alexander J. Smola. 2008. “Kernel Methods in Machine Learning.” *The Annals of Statistics* 36 (3): 1171–1220. https://doi.org/10.1214/009053607000000677.

Kom Samo, Yves-Laurent, and Stephen Roberts. 2015. “Generalized Spectral Kernels,” June. http://arxiv.org/abs/1506.02236.

Kondor, Risi, and Tony Jebara. 2006. “Gaussian and Wishart Hyperkernels.” In *Proceedings of the 19th International Conference on Neural Information Processing Systems*, 729–36. NIPS’06. Canada: MIT Press. http://dl.acm.org/citation.cfm?id=2976456.2976548.

Krauth, Karl, Edwin V. Bonilla, Kurt Cutajar, and Maurizio Filippone. 2016. “AutoGP: Exploring the Capabilities and Limitations of Gaussian Process Models.” In *UAI17*. http://arxiv.org/abs/1610.05392.

Lawrence, Neil. 2005. “Probabilistic Non-Linear Principal Component Analysis with Gaussian Process Latent Variable Models.” *Journal of Machine Learning Research* 6 (Nov): 1783–1816. http://www.jmlr.org/papers/v6/lawrence05a.html.

Lloyd, James Robert, David Duvenaud, Roger Grosse, Joshua Tenenbaum, and Zoubin Ghahramani. 2014. “Automatic Construction and Natural-Language Description of Nonparametric Regression Models.” In *Twenty-Eighth AAAI Conference on Artificial Intelligence*. http://arxiv.org/abs/1402.4304.

Mercer, J. 1909. “Functions of Positive and Negative Type, and Their Connection with the Theory of Integral Equations.” *Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character* 209 (441-458): 415–46. https://doi.org/10.1098/rsta.1909.0016.

Micchelli, Charles A., and Massimiliano Pontil. 2005a. “Learning the Kernel Function via Regularization.” *Journal of Machine Learning Research* 6 (Jul): 1099–1125. http://www.jmlr.org/papers/v6/micchelli05a.html.

———. 2005b. “On Learning Vector-Valued Functions.” *Neural Computation* 17 (1): 177–204. https://doi.org/10.1162/0899766052530802.

Minasny, Budiman, and Alex. B. McBratney. 2005. “The Matérn Function as a General Model for Soil Variograms.” *Geoderma*, Pedometrics 2003, 128 (3–4): 192–207. https://doi.org/10.1016/j.geoderma.2005.04.003.

Murphy, Kevin P. 2012. *Machine Learning: A Probabilistic Perspective*. 1 edition. Cambridge, MA: The MIT Press.

Murray-Smith, Roderick, and Barak A. Pearlmutter. 2005. “Transformations of Gaussian Process Priors.” In *Deterministic and Statistical Methods in Machine Learning*, edited by Joab Winkler, Mahesan Niranjan, and Neil Lawrence, 110–23. Lecture Notes in Computer Science. Springer Berlin Heidelberg. http://bcl.hamilton.ie/~barak/papers/MLW-Jul-2005.pdf.

O’Callaghan, Simon Timothy, and Fabio T. Ramos. 2011. “Continuous Occupancy Mapping with Integral Kernels.” In *Twenty-Fifth AAAI Conference on Artificial Intelligence*. https://www.aaai.org/ocs/index.php/AAAI/AAAI11/paper/view/3784.

Ong, Cheng Soon, Xavier Mary, Stéphane Canu, and Alexander J. Smola. 2004. “Learning with Non-Positive Kernels.” In *Twenty-First International Conference on Machine Learning - ICML ’04*, 81. Banff, Alberta, Canada: ACM Press. https://doi.org/10.1145/1015330.1015443.

Ong, Cheng Soon, and Alexander J. Smola. 2003. “Machine Learning Using Hyperkernels.” In *Proceedings of the Twentieth International Conference on International Conference on Machine Learning*, 568–75. ICML’03. Washington, DC, USA: AAAI Press. http://dl.acm.org/citation.cfm?id=3041838.3041910.

Ong, Cheng Soon, Alexander J. Smola, and Robert C. Williamson. 2002. “Hyperkernels.” In *Proceedings of the 15th International Conference on Neural Information Processing Systems*, 495–502. NIPS’02. Cambridge, MA, USA: MIT Press. http://dl.acm.org/citation.cfm?id=2968618.2968680.

———. 2005. “Learning the Kernel with Hyperkernels.” *Journal of Machine Learning Research* 6 (Jul): 1043–71. http://www.jmlr.org/papers/v6/ong05a.html.

Pérez-Abreu, Víctor, and Alfonso Rocha-Arteaga. 2005. “Covariance-Parameter Lévy Processes in the Space of Trace-Class Operators.” *Infinite Dimensional Analysis, Quantum Probability and Related Topics* 08 (01): 33–54. https://doi.org/10.1142/S0219025705001846.

Pfaffel, Oliver. 2012. “Wishart Processes,” January. http://arxiv.org/abs/1201.3256.

Rakotomamonjy, Alain, Francis R. Bach, Stéphane Canu, and Yves Grandvalet. 2008. “SimpleMKL.” *Journal of Machine Learning Research* 9 (Nov): 2491–2521. http://www.jmlr.org/papers/v9/rakotomamonjy08a.html.

Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. *Gaussian Processes for Machine Learning*. Adaptive Computation and Machine Learning. Cambridge, Mass: MIT Press. http://www.gaussianprocess.org/gpml/.

Remes, Sami, Markus Heinonen, and Samuel Kaski. 2017. “Non-Stationary Spectral Kernels.” In *Advances in Neural Information Processing Systems 30*, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 4642–51. Curran Associates, Inc. http://papers.nips.cc/paper/7050-non-stationary-spectral-kernels.pdf.

Särkkä, Simo, and Jouni Hartikainen. 2012. “Infinite-Dimensional Kalman Filtering Approach to Spatio-Temporal Gaussian Process Regression.” In *Artificial Intelligence and Statistics*. http://www.jmlr.org/proceedings/papers/v22/sarkka12.html.

Särkkä, Simo, A. Solin, and J. Hartikainen. 2013. “Spatiotemporal Learning via Infinite-Dimensional Bayesian Filtering and Smoothing: A Look at Gaussian Process Regression Through Kalman Filtering.” *IEEE Signal Processing Magazine* 30 (4): 51–61. https://doi.org/10.1109/MSP.2013.2246292.

Schölkopf, Bernhard, Ralf Herbrich, and Alex J. Smola. 2001. “A Generalized Representer Theorem.” In *Computational Learning Theory*, edited by David Helmbold and Bob Williamson, 416–26. Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-44581-1.

Schölkopf, Bernhard, and Alexander J. Smola. 2002. *Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond*. MIT Press.

———. 2003. “A Short Introduction to Learning with Kernels.” In *Advanced Lectures on Machine Learning*, edited by Shahar Mendelson and Alexander J. Smola, 41–64. Lecture Notes in Computer Science 2600. Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-36434-X_2.

Sinha, Aman, and John C. Duchi. 2016. “Learning Kernels with Random Features.” In *Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 1298–1306. Curran Associates, Inc. http://papers.nips.cc/paper/6180-learning-kernels-with-random-features.pdf.

Smith, Michael Thomas, Mauricio A. Alvarez, and Neil D. Lawrence. 2018. “Gaussian Process Regression for Binned Data,” September. http://arxiv.org/abs/1809.02010.

Stein, Michael L. 2005. “Space-Time Covariance Functions.” *Journal of the American Statistical Association* 100 (469): 310–21. https://doi.org/10.1198/016214504000000854.

Sun, Shengyang, Guodong Zhang, Chaoqi Wang, Wenyuan Zeng, Jiaman Li, and Roger Grosse. 2018. “Differentiable Compositional Kernel Learning for Gaussian Processes,” June. http://arxiv.org/abs/1806.04326.

Székely, Gábor J., and Maria L. Rizzo. 2009. “Brownian Distance Covariance.” *The Annals of Applied Statistics* 3 (4): 1236–65. https://doi.org/10.1214/09-AOAS312.

Vedaldi, Andrea, and Andrew Zisserman. 2012. “Efficient Additive Kernels via Explicit Feature Maps.” *IEEE Transactions on Pattern Analysis and Machine Intelligence* 34 (3): 480–92. https://doi.org/10.1109/TPAMI.2011.153.

Vert, Jean-Philippe, Koji Tsuda, and Bernhard Schölkopf. 2004. “A Primer on Kernel Methods.” In *Kernel Methods in Computational Biology*. MIT Press. http://kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/pdfs/pdf2549.pdf.

Vishwanathan, S. V. N., Nicol N. Schraudolph, Risi Kondor, and Karsten M. Borgwardt. 2010. “Graph Kernels.” *Journal of Machine Learning Research* 11 (August): 1201–42. http://authors.library.caltech.edu/20528/1/Vishwanathan2010p11646J_Mach_Learn_Res.pdf.

Wilk, Mark van der, Andrew G. Wilson, and Carl E. Rasmussen. 2014. “Variational Inference for Latent Variable Modelling of Correlation Structure.” In *NIPS 2014 Workshop on Advances in Variational Inference*.

Wilson, Andrew Gordon, and Ryan Prescott Adams. 2013. “Gaussian Process Kernels for Pattern Discovery and Extrapolation.” In *International Conference on Machine Learning*. http://arxiv.org/abs/1302.4245.

Wilson, Andrew Gordon, Christoph Dann, Christopher G. Lucas, and Eric P. Xing. 2015. “The Human Kernel,” October. http://arxiv.org/abs/1510.07389.

Wilson, Andrew Gordon, and Zoubin Ghahramani. 2011. “Generalised Wishart Processes.” In *Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence*, 736–44. UAI’11. Barcelona, Spain: AUAI Press. http://dl.acm.org/citation.cfm?id=3020548.3020633.

———. 2012. “Modelling Input Varying Correlations Between Multiple Responses.” In *Machine Learning and Knowledge Discovery in Databases*, edited by Peter A. Flach, Tijl De Bie, and Nello Cristianini, 858–61. Lecture Notes in Computer Science. Springer Berlin Heidelberg.

Wu, Zongmin. 1995. “Compactly Supported Positive Definite Radial Functions.” *Advances in Computational Mathematics* 4 (1): 283–92. https://doi.org/10.1007/BF03177517.

Yaglom, A. M. 1987. *Correlation Theory of Stationary and Related Random Functions: Supplementary Notes and References*. Springer Series in Statistics. New York, NY: Springer Science & Business Media.

Yu, Yaoliang, Hao Cheng, Dale Schuurmans, and Csaba Szepesvári. 2013. “Characterizing the Representer Theorem.” In *Proceedings of the 30th International Conference on Machine Learning (ICML-13)*, 570–78. http://www.jmlr.org/proceedings/papers/v28/yu13.pdf.

Zhang, Aonan, and John Paisley. 2019. “Random Function Priors for Correlation Modeling.” In *International Conference on Machine Learning*, 7424–33. http://proceedings.mlr.press/v97/zhang19k.html.