Learning covariance functions

Learning a family of covariances at once

September 16, 2019 — March 1, 2021

Learning a covariance function generalises covariance matrix estimation to continuous index sets. This comes up most often in the context of Gaussian processes, where everything can work out nicely if we are lucky.

1 Selecting a parametric kernel by maximising marginal likelihood

The usual goal is to choose the kernel hyperparameters that maximise the marginal likelihood, a.k.a. model evidence, as is conventional in Bayesian ML. But we could also place hyperpriors on the kernels.
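As a minimal sketch of evidence maximisation, assuming a zero-mean GP with a squared-exponential kernel, a known noise variance and a plain grid search over the lengthscale (all choices made here for illustration, not taken from any cited paper), the log marginal likelihood can be computed from a Cholesky factor and compared across candidate hyperparameters:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale, variance=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

def log_marginal_likelihood(x, y, lengthscale, noise_var=0.01):
    """log p(y | x, lengthscale) for a zero-mean GP with Gaussian noise."""
    n = len(x)
    K = rbf_kernel(x, x, lengthscale) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)                      # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    return (-0.5 * y @ alpha                       # data-fit term
            - np.sum(np.log(np.diag(L)))           # -0.5 * log det K
            - 0.5 * n * np.log(2 * np.pi))

# Toy data (hypothetical): a noisy sinusoid.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 40)
y = np.sin(2.0 * x) + 0.1 * rng.standard_normal(x.shape)

# Grid search stands in for the gradient-based optimisers used in practice.
candidates = np.logspace(-2, 1, 60)
scores = np.array([log_marginal_likelihood(x, y, ell) for ell in candidates])
best_lengthscale = candidates[np.argmax(scores)]
```

In practice one would optimise all hyperparameters jointly, including the noise variance, usually by gradient ascent; the grid search is only there to keep the sketch short.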

2 Learning kernel composition

Automating kernel design by some composition of simpler atomic kernels. AFAICT this started from summaries like (Genton 2001) and went via Duvenaud’s notes to become a small industry (Lloyd et al. 2014; D. K. Duvenaud, Nickisch, and Rasmussen 2011; D. Duvenaud et al. 2013; Grosse et al. 2012). A prominent example was the Automatic Statistician project by David Duvenaud, James Robert Lloyd, Roger Grosse and colleagues, which works by greedy combinatorial search over possible compositions.
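A toy version of greedy search over compositions might look like the following. This is a sketch under strong assumptions, not the Automatic Statistician's actual procedure: the atomic kernels (`SE`, `PER`, `LIN`) have fixed, hand-picked hyperparameters, candidates are scored by log marginal likelihood alone, and the search only appends `+` or `*` to the current best expression. The published systems also re-optimise hyperparameters at each step and penalise model complexity, both of which are omitted here.

```python
import numpy as np

def se(x):
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * d**2)

def per(x, period=1.0):
    d = np.abs(x[:, None] - x[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2)

def lin(x):
    return np.outer(x, x)

# Atomic kernels; hyperparameters are fixed for brevity (an assumption).
BASE_KERNELS = {"SE": se, "PER": per, "LIN": lin}

def log_evidence(K, y, noise_var=0.1):
    """Log marginal likelihood of y under a zero-mean GP with Gram matrix K."""
    n = len(y)
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)

def greedy_kernel_search(x, y, max_depth=3):
    grams = {name: k(x) for name, k in BASE_KERNELS.items()}
    # Start from the best single atomic kernel.
    best_expr = max(grams, key=lambda name: log_evidence(grams[name], y))
    best_K = grams[best_expr]
    best_score = log_evidence(best_K, y)
    for _ in range(max_depth - 1):
        # Sums and elementwise products of PSD Gram matrices stay PSD.
        candidates = []
        for name, K in grams.items():
            candidates.append((f"({best_expr} + {name})", best_K + K))
            candidates.append((f"({best_expr} * {name})", best_K * K))
        scored = [(log_evidence(K, y), expr, K) for expr, K in candidates]
        top_score, top_expr, top_K = max(scored, key=lambda t: t[0])
        if top_score <= best_score:
            break  # no composition improves the evidence; stop
        best_expr, best_K, best_score = top_expr, top_K, top_score
    return best_expr, best_score

# Toy data (hypothetical): a periodic signal on a linear trend.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 4.0, 50)
y = np.sin(2 * np.pi * x) + 0.5 * x + 0.1 * rng.standard_normal(x.shape)
expr, score = greedy_kernel_search(x, y)
```

The returned `expr` is a readable string such as a sum or product of atomic kernel names, which is the kind of artefact the natural-language descriptions in (Lloyd et al. 2014) are built from.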

More fashionable, presumably, are the differentiable search methods. For example, the AutoGP system (Krauth et al. 2016; Bonilla, Krauth, and Dezfouli 2019) incorporates tricks like these to use gradient descent to design kernels for Gaussian processes. Sun et al. (2018) construct deep networks of composed kernels. I imagine the Deep Gaussian Process literature is also of this kind, but I have not read it.
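To make the differentiable flavour concrete, here is a minimal sketch that relaxes the discrete choice of kernel into positive mixture weights over a few fixed base kernels and climbs the log marginal likelihood by gradient ascent. It is not the AutoGP or Sun et al. architecture; the base kernels, the log-weight parameterisation and the crude backtracking step rule are all assumptions made for illustration. The gradient uses the standard identity d/dθ log p(y) = ½ tr((αα^T − K^{-1}) dK/dθ) with α = K^{-1} y.

```python
import numpy as np

def base_grams(x):
    """Stack of Gram matrices for three fixed base kernels (hypothetical choice)."""
    d = x[:, None] - x[None, :]
    return np.stack([
        np.exp(-0.5 * d**2 / 0.3**2),                  # short-lengthscale SE
        np.exp(-0.5 * d**2 / 2.0**2),                  # long-lengthscale SE
        np.exp(-2.0 * np.sin(np.pi * np.abs(d)) ** 2), # periodic, period 1
    ])

def evidence_and_grad(log_w, grams, y, noise_var=0.1):
    """Log marginal likelihood and its gradient w.r.t. the log weights."""
    w = np.exp(log_w)
    n = len(y)
    K = np.tensordot(w, grams, axes=1) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    K_inv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(n)))
    lml = -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)
    A = np.outer(alpha, alpha) - K_inv
    # dK/d(log w_i) = w_i * K_i, so the gradient is 0.5 * w_i * tr(A K_i).
    grad = 0.5 * w * np.einsum("ij,kji->k", A, grams)
    return lml, grad

def fit_weights(grams, y, steps=200, lr=0.5):
    log_w = np.zeros(len(grams))
    lml, grad = evidence_and_grad(log_w, grams, y)
    for _ in range(steps):
        direction = grad / (np.linalg.norm(grad) + 1e-12)
        proposal = np.clip(log_w + lr * direction, -10.0, 10.0)
        new_lml, new_grad = evidence_and_grad(proposal, grams, y)
        if new_lml > lml:   # accept only uphill moves
            log_w, lml, grad = proposal, new_lml, new_grad
            lr = min(lr * 1.2, 1.0)
        else:               # otherwise shrink the step and retry
            lr *= 0.5
    return np.exp(log_w), lml

# Toy data (hypothetical): a noisy periodic signal.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 4.0, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)
grams = base_grams(x)
weights, final_lml = fit_weights(grams, y)
```

With an autodiff framework the hand-written gradient disappears and the same idea extends to kernels composed in layers, which is roughly where the network-of-kernels constructions come in.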

3 Via neural nets

🏗

4 Hyperkernels

Kernels on kernels, for kernel learning kernels 🏗 (Ong, Smola, and Williamson 2005, 2002; Ong and Smola 2003; Kondor and Jebara 2006).

5 References

Álvarez, Rosasco, and Lawrence. 2012. “Kernels for Vector-Valued Functions: A Review.” Foundations and Trends® in Machine Learning.
Bach. 2008. “Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning.” In Proceedings of the 21st International Conference on Neural Information Processing Systems. NIPS’08.
Balog, Lakshminarayanan, Ghahramani, et al. 2016. “The Mondrian Kernel.” arXiv:1606.05241 [Stat].
Bohn, Griebel, and Rieger. 2018. “A Representer Theorem for Deep Kernel Learning.” arXiv:1709.10441 [Cs, Math].
Bonilla, Krauth, and Dezfouli. 2019. “Generic Inference in Latent Gaussian Process Models.” Journal of Machine Learning Research.
Christoudias, Urtasun, and Darrell. 2009. “Bayesian Localized Multiple Kernel Learning.” UCB/EECS-2009-96.
Cortes, Haffner, and Mohri. 2004. “Rational Kernels: Theory and Algorithms.” Journal of Machine Learning Research.
Duvenaud, David, Lloyd, Grosse, et al. 2013. “Structure Discovery in Nonparametric Regression Through Compositional Kernel Search.” In Proceedings of the 30th International Conference on Machine Learning (ICML-13).
Duvenaud, David K., Nickisch, and Rasmussen. 2011. “Additive Gaussian Processes.” In Advances in Neural Information Processing Systems.
Genton. 2001. “Classes of Kernels for Machine Learning: A Statistics Perspective.” Journal of Machine Learning Research.
Girolami, and Rogers. 2005. “Hierarchic Bayesian Models for Kernel Learning.” In Proceedings of the 22nd International Conference on Machine Learning - ICML ’05.
Grosse, Salakhutdinov, Freeman, et al. 2012. “Exploiting Compositionality to Explore a Large Space of Model Structures.” In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Hartikainen, and Särkkä. 2010. “Kalman Filtering and Smoothing Solutions to Temporal Gaussian Process Regression Models.” In 2010 IEEE International Workshop on Machine Learning for Signal Processing.
Hofmann, Schölkopf, and Smola. 2008. “Kernel Methods in Machine Learning.” The Annals of Statistics.
Kom Samo, and Roberts. 2015. “Generalized Spectral Kernels.” arXiv:1506.02236 [Stat].
Kondor, and Jebara. 2006. “Gaussian and Wishart Hyperkernels.” In Proceedings of the 19th International Conference on Neural Information Processing Systems. NIPS’06.
Krauth, Bonilla, Cutajar, et al. 2016. “AutoGP: Exploring the Capabilities and Limitations of Gaussian Process Models.” In UAI17.
Lawrence. 2005. “Probabilistic Non-Linear Principal Component Analysis with Gaussian Process Latent Variable Models.” Journal of Machine Learning Research.
Lloyd, Duvenaud, Grosse, et al. 2014. “Automatic Construction and Natural-Language Description of Nonparametric Regression Models.” In Twenty-Eighth AAAI Conference on Artificial Intelligence.
Micchelli, and Pontil. 2005a. “Learning the Kernel Function via Regularization.” Journal of Machine Learning Research.
———. 2005b. “On Learning Vector-Valued Functions.” Neural Computation.
Murphy. 2012. Machine learning: a probabilistic perspective. Adaptive computation and machine learning series.
Murray-Smith, and Pearlmutter. 2005. “Transformations of Gaussian Process Priors.” In Deterministic and Statistical Methods in Machine Learning. Lecture Notes in Computer Science.
O’Callaghan, and Ramos. 2011. “Continuous Occupancy Mapping with Integral Kernels.” In Twenty-Fifth AAAI Conference on Artificial Intelligence.
Ong, Mary, Canu, et al. 2004. “Learning with Non-Positive Kernels.” In Twenty-First International Conference on Machine Learning - ICML ’04.
Ong, and Smola. 2003. “Machine Learning Using Hyperkernels.” In Proceedings of the Twentieth International Conference on International Conference on Machine Learning. ICML’03.
Ong, Smola, and Williamson. 2002. “Hyperkernels.” In Proceedings of the 15th International Conference on Neural Information Processing Systems. NIPS’02.
———. 2005. “Learning the Kernel with Hyperkernels.” Journal of Machine Learning Research.
Rakotomamonjy, Bach, Canu, et al. 2008. “SimpleMKL.” Journal of Machine Learning Research.
Rasmussen, and Williams. 2006. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning.
Remes, Heinonen, and Kaski. 2018. “Neural Non-Stationary Spectral Kernel.” arXiv:1811.10978 [Cs, Stat].
Saha, and Balamurugan. 2020. “Learning with Operator-Valued Kernels in Reproducing Kernel Krein Spaces.” In Advances in Neural Information Processing Systems.
Särkkä, Solin, and Hartikainen. 2013. “Spatiotemporal Learning via Infinite-Dimensional Bayesian Filtering and Smoothing: A Look at Gaussian Process Regression Through Kalman Filtering.” IEEE Signal Processing Magazine.
Schölkopf, Herbrich, and Smola. 2001. “A Generalized Representer Theorem.” In Computational Learning Theory. Lecture Notes in Computer Science.
Schölkopf, and Smola. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond.
———. 2003. “A Short Introduction to Learning with Kernels.” In Advanced Lectures on Machine Learning. Lecture Notes in Computer Science 2600.
Sinha, and Duchi. 2016. “Learning Kernels with Random Features.” In Advances in Neural Information Processing Systems 29.
Sun, Zhang, Wang, et al. 2018. “Differentiable Compositional Kernel Learning for Gaussian Processes.” arXiv Preprint arXiv:1806.04326.
Uziel. 2020. “Nonparametric Sequential Prediction While Deep Learning the Kernel.” In International Conference on Artificial Intelligence and Statistics.
Vert, Tsuda, and Schölkopf. 2004. “A Primer on Kernel Methods.” In Kernel Methods in Computational Biology.
Vishwanathan, Schraudolph, Kondor, et al. 2010. “Graph Kernels.” Journal of Machine Learning Research.
Wilson, and Adams. 2013. “Gaussian Process Kernels for Pattern Discovery and Extrapolation.” In International Conference on Machine Learning.
Wilson, Dann, Lucas, et al. 2015. “The Human Kernel.” arXiv:1510.07389 [Cs, Stat].
Wilson, and Ghahramani. 2012. “Modelling Input Varying Correlations Between Multiple Responses.” In Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science.
Wilson, Hu, Salakhutdinov, et al. 2016. “Deep Kernel Learning.” In Artificial Intelligence and Statistics.
Yu, Cheng, Schuurmans, et al. 2013. “Characterizing the Representer Theorem.” In Proceedings of the 30th International Conference on Machine Learning (ICML-13).
Zhang, and Paisley. 2019. “Random Function Priors for Correlation Modeling.” In International Conference on Machine Learning.