Combining kernels



A sum or product (or outer sum, or tensor product) of Mercer kernels is still a kernel. For other transforms YMMV.

\[ \begin{aligned} K(x, y) &=K_{1}(x, y)+K_{2}(x, y) \\ K(x, y) &=c K_{1}(x, y)+K_{2}(x, y) \text { for } c \in \mathbb{R}_{+} \\ K(x, y) &=K_{1}(x, y)+c \text { for } c \in \mathbb{R}_{+} \\ K(x, y) &=K_{1}(x, y) K_{2}(x, y) \\ K(x, y) &=f(x) f(y) \text { for } f: \mathcal{X} \longrightarrow \mathbb{R} \\ K(x, y) &=\left(K_{1}(x, y)+c\right)^{d} \text { for } \theta_{1} \in \mathbb{R}_{+} \text {and } d \in \mathbb{N} \\ K(x, y) &=\exp \left(K_{1}(x, y) / \sigma^{2}\right) \text { for } \sigma \in \mathbb{R} \\ K(x, y) &=\exp \left(-\left(K_{1}(x, x)-2 K_{1}(x, y)+K_{1}(y, y)\right) / 2 \sigma^{2}\right) \\ K(x, y) &=K_{1}(x, y) / \sqrt{K_{1}(x, x) K_{1}(y, y)} \end{aligned} \]

For example, in the case of Gaussian processes, suppose that, independently,

\[\begin{aligned} f_{1} &\sim \mathcal{GP}\left(\mu_{1}, k_{1}\right)\\ f_{2} &\sim \mathcal{GP}\left(\mu_{2}, k_{2}\right) \end{aligned}\] then

\[ f_{1}+f_{2} \sim \mathcal{GP} \left(\mu_{1}+\mu_{2}, k_{1}+k_{2}\right) \] so \(k_{1}+k_{2}\) is also a kernel.

More generally, if \(k_{1}\) and \(k_{2}\) are two kernels, and \(c_{1}\), and \(c_{2}\) are two positive real numbers, then:

\[ K(x, x')=c_{1} k_{1}(x, x')+c_{2} k_{2}(x, x') \] is again a kernel. What with the multiplication as well, we note that all polynomials of kernels where the coefficients are positive are in turn kernels. (Genton 2001)

Note that the additivity in terms of kernels is not the same as additivity in terms of induced feature spaces. The induced feature map of \(k_{1}+k_{2}\) is their concatenation rather than their sum. Suppose \(\phi_{1}(x)\) gives us the feature map of \(k_{1}\) for \(x\) and likewise \(\phi_{2}(x)\).

\[\begin{aligned} k_{1}(x, x') &=\phi_{1}(x)^{\top} \phi_{1}(x') \\ k_{2}(x, x') &=\phi_{2}(x)^{\top} \phi_{2}(x')\\ k_{1}(x, x')+k_{2}(x, x') &=\phi_{1}(x)^{\top} \phi_{1}(x')+\phi_{2}(x)^{\top} \phi_{2}(x')\\ &=\left[\begin{array}{c}{\phi_{1}(x)} \\ {\phi_{2}(x)}\end{array}\right]^{\top} \left[\begin{array}{c}{\phi_{1}(x')} \\ {\phi_{2}(x')}\end{array}\right] \end{aligned}\]

If \(k:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}\) is a kernel and \(\psi: \mathcal{X}\to\mathcal{Y}\) this is also a kernel

\[\begin{aligned} k_{\psi}:&\mathcal{X}\times\mathcal{X}\to\mathbb{R}\\ & (x,x')\mapsto k_{y}(\psi(x), \psi(x')) \end{aligned}\]

which apparently is now called a deep kernel. if \(k\) is stationary and \(\psi\) is invertible then this is a stationary reducible kernel.

Also if \(A\) is a positive definite operator, then of course it defines a kernel \(k_A(x,x'):=x^{\top}Ax'\)

(Genton 2001) uses the properties of covariance to construct some other nifty ones:

Let \(h:\mathcal{X}\to\mathbb{R}^{+}\) have minimum at 0. Then, using the identity for RVs

\[ \mathop{\textrm{Cov}}\left(Y_{1}, Y_{2}\right)=\left[\mathop{\textrm{Var}}\left(Y_{1}+Y_{2}\right)-\mathop{\textrm{Var}}\left(Y_{1}-Y_{2}\right)\right] / 4 \]

we find that the following is a kernel

\[ K(x, x')=\frac{1}{4}[h(x+x')-h(x-x')] \]

All these to various cunning combination strategies, which I may return to discuss. 🏗 Some of them are in the references. For example (Duvenaud et al. 2013) position their work in the wider field:

There is a large body of work attempting to construct a rich kernel through a weighted sum of base kernels (e.g. (Bach 2008; Christoudias, Urtasun, and Darrell 2009)). While these approaches find the optimal solution in polynomial time, speed comes at a cost: the component kernels, as well as their hyperparameters, must be specified in advance […]

(Hinton and Salakhutdinov 2008) use a deep neural network to learn an embedding; this is a flexible approach to kernel learning but relies upon finding structure in the input density, p(x). Instead we focus on domains where most of the interesting structure is in f(x).

(Wilson and Adams 2013) derive kernels of the form \(SE × \cos(x − x_0\)), forming a basis for stationary kernels. These kernels share similarities with \(SE × Per\) but can express negative prior correlation, and could usefully be included in our grammar.

See (Grosse et al. 2012) for a mind-melting compositional matrix factorization diagram, constructing a search over hierarchical kernel decompositions.

Exploiting compositionality to explore a large space of model structures

Examples of existing machine learning models which fall under our framework. Arrows represent models reachable using a single production rule. Only a small fraction of the 2496 models reachable within 3 steps are shown, and not all possible arrows are shown.

Locally stationary kernels

(Genton 2001) defines these as kernels that have a particular structure, specifically, a kernel that can be factored into a stationary kernel \(K_2\) and a non negative function \(K_1\) in the following way:

\[ K(\mathbf{s}, \mathbf{t})=K_{1}\left(\frac{\mathbf{s}+\mathbf{t}}{2}\right) K_{2}(\mathbf{s}-\mathbf{t}) \]

Global structure then depends on the mean location \(\frac{\mathbf{s}+\mathbf{t}}{2}\). (Genton 2001) describes some nifty spectral properties of these kernels.

Other constructions might vie for the title of “locally stationary”. To check. 🏗

Stationary reducible kernels

See kernel warping.

Other nonstationary kernels

See nonstationary kernels.

References

Agarwal, Arvind, and Hal Daumé Iii. 2011. “Generative Kernels for Exponential Families.” In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 85–92. http://proceedings.mlr.press/v15/agarwal11b.html.
Bach, Francis. 2008. “Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning.” In Proceedings of the 21st International Conference on Neural Information Processing Systems, 105–12. NIPS’08. USA: Curran Associates Inc. http://papers.nips.cc/paper/3418-exploring-large-feature-spaces-with-hierarchical-multiple-kernel-learning.pdf.
Christoudias, Mario, Raquel Urtasun, and Trevor Darrell. 2009. “Bayesian Localized Multiple Kernel Learning.” UCB/EECS-2009-96. EECS Department, University of California, Berkeley. http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-96.html.
Duvenaud, David, James Lloyd, Roger Grosse, Joshua Tenenbaum, and Ghahramani Zoubin. 2013. “Structure Discovery in Nonparametric Regression Through Compositional Kernel Search.” In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 1166–74. http://machinelearning.wustl.edu/mlpapers/papers/icml2013_duvenaud13.
Genton, Marc G. 2001. “Classes of Kernels for Machine Learning: A Statistics Perspective.” Journal of Machine Learning Research 2 (December): 299–312. http://jmlr.org/papers/volume2/genton01a/genton01a.pdf.
Genton, Marc G., and Olivier Perrin. 2004. “On a Time Deformation Reducing Nonstationary Stochastic Processes to Local Stationarity.” Journal of Applied Probability 41 (1): 236–49. https://doi.org/10.1239/jap/1077134681.
Grosse, Roger, Ruslan R. Salakhutdinov, William T. Freeman, and Joshua B. Tenenbaum. 2012. “Exploiting Compositionality to Explore a Large Space of Model Structures.” In Proceedings of the Conference on Uncertainty in Artificial Intelligence. http://arxiv.org/abs/1210.4856.
Hinton, Geoffrey E, and Ruslan R Salakhutdinov. 2008. “Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes.” In Advances in Neural Information Processing Systems 20, edited by J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, 1249–56. Curran Associates, Inc. http://papers.nips.cc/paper/3211-using-deep-belief-nets-to-learn-covariance-kernels-for-gaussian-processes.pdf.
Ikeda, Masahiro, Isao Ishikawa, and Yoshihiro Sawano. 2021. “Composition Operators on Reproducing Kernel Hilbert Spaces with Analytic Positive Definite Functions.” arXiv:1911.11992 [math, Stat], March. http://arxiv.org/abs/1911.11992.
Paciorek, Christopher J., and Mark J. Schervish. 2003. “Nonstationary Covariance Functions for Gaussian Process Regression.” In Proceedings of the 16th International Conference on Neural Information Processing Systems, 16:273–80. NIPS’03. Cambridge, MA, USA: MIT Press. https://papers.nips.cc/paper/2003/hash/326a8c055c0d04f5b06544665d8bb3ea-Abstract.html.
Rakotomamonjy, Alain, Francis R. Bach, Stéphane Canu, and Yves Grandvalet. 2008. “SimpleMKL.” Journal of Machine Learning Research 9 (Nov): 2491–521. http://www.jmlr.org/papers/v9/rakotomamonjy08a.html.
Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. Cambridge, Mass: MIT Press. http://www.gaussianprocess.org/gpml/.
Remes, Sami, Markus Heinonen, and Samuel Kaski. 2017. “Non-Stationary Spectral Kernels.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 4642–51. Curran Associates, Inc. http://papers.nips.cc/paper/7050-non-stationary-spectral-kernels.pdf.
Sun, Shengyang, Guodong Zhang, Chaoqi Wang, Wenyuan Zeng, Jiaman Li, and Roger Grosse. 2018. “Differentiable Compositional Kernel Learning for Gaussian Processes.” arXiv Preprint arXiv:1806.04326.
Wilson, Andrew Gordon, and Ryan Prescott Adams. 2013. “Gaussian Process Kernels for Pattern Discovery and Extrapolation.” In International Conference on Machine Learning. http://arxiv.org/abs/1302.4245.
Wilson, Andrew Gordon, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. 2016. “Deep Kernel Learning.” In Artificial Intelligence and Statistics, 370–78. PMLR. http://proceedings.mlr.press/v51/wilson16.html.

No comments yet. Why not leave one?

GitHub-flavored Markdown & a sane subset of HTML is supported.