(Reproducing) kernel tricks

WARNING: This is very old. If I were to write it now, I would write it differently. I might break apart kernel tricks from kernels and I might wonder when we need a countable Mercer-style kernel decomposition and when we can do without.

Kernel in the sense of the “kernel trick”. Not to be confused with smoothing-type convolution kernels, nor the dozens of related-but-slightly-different clashing definitions of kernel; those can have their own respective pages. Corollary: If you do not know what to name something, call it a kernel.

We are concerned here with a particular flavour of kernel in Hilbert spaces, specifically reproducing or Mercer kernels (Mercer 1909). The associated function space is a reproducing kernel Hilbert space, hereafter an RKHS.
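For orientation, here is the defining property in symbols. The notation (\(\mathcal{H}\), \(\mathcal{X}\), \(k\)) is assumed for this sketch rather than fixed anywhere else on this page. A Hilbert space \(\mathcal{H}\) of functions \(f:\mathcal{X}\to\mathbb{R}\) is an RKHS with kernel \(k\) if \(k(\cdot,x)\in\mathcal{H}\) for every \(x\in\mathcal{X}\) and evaluation at a point is an inner product:

\[
f(x) = \langle f, k(\cdot, x)\rangle_{\mathcal{H}}
\quad\text{for all } f\in\mathcal{H},\ x\in\mathcal{X}.
\]

Taking \(f = k(\cdot, x')\) gives

\[
k(x, x') = \langle k(\cdot, x'), k(\cdot, x)\rangle_{\mathcal{H}},
\]

so every kernel evaluation is an inner product between the "features" \(k(\cdot,x)\) and \(k(\cdot,x')\).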

Kernel tricks comprise the application of Mercer kernels in machine learning. The “trick” part is that many machine learning algorithms operate on inner products, or can be rewritten to work that way. Such algorithms permit one to swap out a boring classic Euclidean definition of that inner product in favour of a fancy RKHS one. The classic machine learning pitch for trying such a stunt is something like “upgrade your old boring linear algebra on finite (usually low-) dimensional spaces to sexy algebra on potentially-infinite-dimensional feature spaces, which still has a low-dimensional representation.” Or, if you’d like, “apply statistical learning methods based on things with an obvious finite vector space representation (\(\mathbb{R}^n\)) to things without one (sentences, piano-rolls, \(\mathcal{C}^d_\ell\)).”
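A minimal sketch of the swap, using the homogeneous degree-2 polynomial kernel on \(\mathbb{R}^2\) as an illustrative choice (it is not singled out anywhere in the text). The function names are hypothetical. The point is that evaluating the kernel directly gives the same number as mapping both inputs into an explicit feature space and taking the ordinary inner product there.

```python
import math


def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel k(x, y) = (x . y)^2 on R^2."""
    dot = x[0] * y[0] + x[1] * y[1]
    return dot ** 2


def poly2_features(x):
    """Explicit feature map phi with <phi(x), phi(y)> = (x . y)^2."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def inner(u, v):
    """Plain Euclidean inner product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))


x, y = (1.0, 2.0), (3.0, -1.0)

# Route 1: evaluate the kernel on the raw inputs.
via_kernel = poly2_kernel(x, y)

# Route 2: map to feature space first, then take the inner product.
via_features = inner(poly2_features(x), poly2_features(y))

print(via_kernel, via_features)
```

Here the feature space is only three-dimensional, so nothing is saved. For kernels whose feature space is very high- or infinite-dimensional, route 1 is the only one that is practical, which is where the trick earns its name.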

Mini history: The oft-cited origins of all the reproducing kernel stuff are (Aronszajn 1950; Mercer 1909). It took a while to percolate into random function theory (Khintchine 1934; Yaglom 1987) as covariance functions. Thence the idea arrived in statistical inference (Parzen 1959, 1962, 1963) and signal processing (Aasnaes and Kailath 1973; Duttweiler and Kailath 1973a, 1973b; Gevers and Kailath 1973; T. Kailath and Geesey 1971, 1973; T. Kailath 1971b, 1971a, 1974; T. Kailath, Geesey, and Weinert 1972; T. Kailath and Duttweiler 1972; T. Kailath and Weinert 1975), and now it is ubiquitous.

Practically, kernel methods scale poorly to large data sets. Applied naively, such a method needs the full Gram matrix of inner products between every pair of data points; for \(N\) data points that is \(N(N+1)/2\) distinct entries of a symmetric matrix (the \(N(N-1)/2\) off-diagonal pairs plus the \(N\) diagonal entries). If you need to invert that matrix, the cost of a dense direct inverse is \(\mathcal{O}(N^3)\), which means you need fancy tricks to handle large \(N\). Fancy tricks depend on what the actual model is, but include sparse GPs, random-projection inversions, Markov approximations and presumably many more.
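To make the bookkeeping concrete, here is a small sketch that fills a Gram matrix by symmetry and counts kernel evaluations. The Gaussian (RBF) kernel, the `lengthscale` parameter and the function names are assumptions chosen for illustration. Only the upper triangle including the diagonal is evaluated, i.e. \(N(N+1)/2\) calls, of which \(N(N-1)/2\) are off-diagonal.

```python
import math


def rbf_kernel(x, y, lengthscale=1.0):
    """Gaussian (RBF) kernel exp(-||x - y||^2 / (2 * lengthscale^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * lengthscale ** 2))


def gram_matrix(points, kernel):
    """Return the dense Gram matrix and the number of kernel evaluations.

    Symmetry means only entries with j >= i need to be computed;
    the lower triangle is filled by mirroring.
    """
    n = len(points)
    gram = [[0.0] * n for _ in range(n)]
    n_evals = 0
    for i in range(n):
        for j in range(i, n):
            value = kernel(points[i], points[j])
            n_evals += 1
            gram[i][j] = value
            gram[j][i] = value
    return gram, n_evals


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
gram, n_evals = gram_matrix(points, rbf_kernel)

n = len(points)
print(n_evals, n * (n + 1) // 2)
```

Storage and evaluation count both grow quadratically in \(N\), before any matrix inversion is attempted, which is why the approximations mentioned above matter.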

I’m especially interested in the application of such tricks in

  1. Kernel regression
  2. Wide random NNs
  3. Nonparametric kernel independence tests
  4. Efficient kernel pre-image approximation
  5. Connection between kernel PCA and clustering (Schölkopf et al. 1998; Williams 2001)

Turns out not all those applications are interesting to me.


Feature space

There are many primers on Mercer kernels and their connection to ML. Kenneth Tay’s intro is punchy. See (Schölkopf and Smola 2002), which grinds out many connections with learning theory, or (Manton and Amblard 2015), which is more narrowly focussed on the Mercer-kernel part and emphasises topological and geometric properties of the spaces, or (Cheney and Light 2009) for an approximation-theory perspective which does not especially concern itself with stochastic processes. I also seem to have bookmarked the following introductions: (Vert, Tsuda, and Schölkopf 2004; Schölkopf et al. 1999; Schölkopf, Herbrich, and Smola 2001; Muller et al. 2001; Schölkopf and Smola 2003).

Alex Smola (who, with Bernhard Schölkopf, has his name on an intimidating proportion of publications in this area) also has all his publications online.

Kernel approximation

See kernel approximation.

RKHS distribution embedding

See integral probability metrics.

Specific kernels

See covariance functions.

Non-scalar-valued “kernels”

Extending the usual inner-product framing, operator-valued kernels (Micchelli and Pontil 2005a; Evgeniou, Micchelli, and Pontil 2005; Álvarez, Rosasco, and Lawrence 2012) generalise the scalar-valued kernel to \(k:\mathcal{X}\times \mathcal{X}\to \mathcal{L}(H_Y)\), where \(\mathcal{L}(H_Y)\) denotes the bounded linear operators on an output Hilbert space \(H_Y\), as seen in multi-task learning.
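One commonly used special case, given here as an illustration rather than as the construction any particular cited paper relies on, is the separable kernel. Assume \(T\) tasks, so \(H_Y=\mathbb{R}^T\), a scalar kernel \(k_0\) on \(\mathcal{X}\), and a symmetric positive semi-definite matrix \(B\in\mathbb{R}^{T\times T}\) (all three symbols are introduced for this sketch). Then

\[
k(x, x') = k_0(x, x')\, B,
\qquad
\big[k(x, x')\big]_{t,t'} = k_0(x, x')\, B_{t,t'},
\]

so \(k_0\) measures similarity between inputs while \(B_{t,t'}\) encodes how strongly tasks \(t\) and \(t'\) are coupled. With \(B = I\) the tasks decouple into \(T\) independent scalar-kernel problems.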


Aasnaes, H., and T. Kailath. 1973. “An Innovations Approach to Least-Squares Estimation–Part VII: Some Applications of Vector Autoregressive-Moving Average Models.” IEEE Transactions on Automatic Control 18 (6): 601–7. https://doi.org/10.1109/TAC.1973.1100412.
Agarwal, Arvind, and Hal Daumé III. 2011. “Generative Kernels for Exponential Families.” In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 85–92. http://proceedings.mlr.press/v15/agarwal11b.html.
Alaoui, Ahmed El, and Michael W. Mahoney. 2014. “Fast Randomized Kernel Methods With Statistical Guarantees.” arXiv:1411.0306 [cs, Stat], November. http://arxiv.org/abs/1411.0306.
Altun, Yasemin, Alex J. Smola, and Thomas Hofmann. 2004. “Exponential Families for Conditional Random Fields.” In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, 2–9. UAI ’04. Arlington, Virginia, United States: AUAI Press. http://arxiv.org/abs/1207.4131.
Aronszajn, N. 1950. “Theory of Reproducing Kernels.” Transactions of the American Mathematical Society 68 (3): 337–404. https://doi.org/10.2307/1990404.
Álvarez, Mauricio A., Lorenzo Rosasco, and Neil D. Lawrence. 2012. “Kernels for Vector-Valued Functions: A Review.” Foundations and Trends® in Machine Learning 4 (3): 195–266. https://doi.org/10.1561/2200000036.
Bach, Francis. 2008. “Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning.” In Proceedings of the 21st International Conference on Neural Information Processing Systems, 105–12. NIPS’08. USA: Curran Associates Inc. http://papers.nips.cc/paper/3418-exploring-large-feature-spaces-with-hierarchical-multiple-kernel-learning.pdf.
———. 2015. “On the Equivalence Between Kernel Quadrature Rules and Random Feature Expansions.” arXiv Preprint arXiv:1502.06800. http://arxiv.org/abs/1502.06800.
Bach, Francis R. 2013. “Sharp Analysis of Low-Rank Kernel Matrix Approximations.” In COLT, 30:185–209. http://www.jmlr.org/proceedings/papers/v30/Bach13.pdf.
Backurs, Arturs, Piotr Indyk, and Ludwig Schmidt. 2017. “On the Fine-Grained Complexity of Empirical Risk Minimization: Kernel Methods and Neural Networks.” arXiv:1704.02958 [cs, Stat], April. http://arxiv.org/abs/1704.02958.
Bakır, Gökhan H., Alexander Zien, and Koji Tsuda. 2004. “Learning to Find Graph Pre-Images.” In Pattern Recognition, edited by Carl Edward Rasmussen, Heinrich H. Bülthoff, Bernhard Schölkopf, and Martin A. Giese, 253–61. Lecture Notes in Computer Science 3175. Springer Berlin Heidelberg. http://link.springer.com/chapter/10.1007/978-3-540-28649-3_31.
Balog, Matej, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M. Roy, and Yee Whye Teh. 2016. “The Mondrian Kernel.” arXiv:1606.05241 [stat], June. http://arxiv.org/abs/1606.05241.
Ben-Hur, Asa, Cheng Soon Ong, Sören Sonnenburg, Bernhard Schölkopf, and Gunnar Rätsch. 2008. “Support Vector Machines and Kernels for Computational Biology.” PLoS Comput Biol 4 (10): e1000173. https://doi.org/10.1371/journal.pcbi.1000173.
Bosq, Denis, and Delphine Blanke. 2007. Inference and prediction in large dimensions. Wiley series in probability and statistics. Chichester, England ; Hoboken, NJ: John Wiley/Dunod.
Boyer, Claire, Antonin Chambolle, Yohann De Castro, Vincent Duval, Frédéric De Gournay, and Pierre Weiss. 2018. “On Representer Theorems and Convex Regularization.” arXiv:1806.09810 [cs, Math], June. http://arxiv.org/abs/1806.09810.
Brown, Lawrence D., and Yi Lin. 2004. “Statistical Properties of the Method of Regularization with Periodic Gaussian Reproducing Kernel.” The Annals of Statistics 32 (4): 1723–43. https://doi.org/10.1214/009053604000000454.
Burges, C. J. C. 1998. “Geometry and Invariance in Kernel Based Methods.” In Advances in Kernel Methods - Support Vector Learning, edited by Bernhard Schölkopf, Christopher JC Burges, and Alexander J Smola. Cambridge, MA: MIT Press. http://research.microsoft.com/en-us/um/people/cburges/papers/kernel_geometry_web_page.ps.gz.
Carrasco, Rafael C., Jose Oncina, and Jorge Calera-Rubio. 2001. “Stochastic Inference of Regular Tree Languages.” Machine Learning 44 (1-2): 185–97. https://doi.org/10.1023/A:1010836331703.
Chatfield, Ken, Victor Lempitsky, Andrea Vedaldi, and Andrew Zisserman. 2011. “The Devil Is in the Details: An Evaluation of Recent Feature Encoding Methods,” November. http://eprints.pascal-network.org/archive/00008315/.
Cheney, Elliott Ward, and William Allan Light. 2009. A Course in Approximation Theory. American Mathematical Soc. https://books.google.com.au/books?hl=en&lr=&id=II6DAwAAQBAJ&oi=fnd&pg=PA1&ots=ch9-LyxDg6&sig=jetWpIErExYvlnnSsup-5yHhso0.
Choromanski, Krzysztof, and Vikas Sindhwani. 2016. “Recycling Randomness with Structure for Sublinear Time Kernel Expansions.” arXiv:1605.09049 [cs, Stat], May. http://arxiv.org/abs/1605.09049.
Chwialkowski, Kacper, Heiko Strathmann, and Arthur Gretton. 2016. “A Kernel Test of Goodness of Fit.” In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, 2606–15. ICML’16. New York, NY, USA: JMLR.org. http://arxiv.org/abs/1602.02964.
Clark, Alexander, Christophe Costa Florêncio, and Chris Watkins. 2006. “Languages as Hyperplanes: Grammatical Inference with String Kernels.” In Machine Learning: ECML 2006, edited by Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou, 90–101. Lecture Notes in Computer Science 4212. Springer Berlin Heidelberg. http://link.springer.com/chapter/10.1007/11871842_13.
Clark, Alexander, Christophe Costa Florêncio, Chris Watkins, and Mariette Serayet. 2006. “Planar Languages and Learnability.” In Grammatical Inference: Algorithms and Applications, edited by Yasubumi Sakakibara, Satoshi Kobayashi, Kengo Sato, Tetsuro Nishino, and Etsuji Tomita, 148–60. Lecture Notes in Computer Science 4201. Springer Berlin Heidelberg. http://link.springer.com/chapter/10.1007/11872436_13.
Clark, Alexander, and Chris Watkins. 2008. “Some Alternatives to Parikh Matrices Using String Kernels.” Fundamenta Informaticae 84 (3): 291–303. http://iospress.metapress.com/content/J87918V884501713.
Collins, Michael, and Nigel Duffy. 2002. “Convolution Kernels for Natural Language.” In Advances in Neural Information Processing Systems 14, edited by T. G. Dietterich, S. Becker, and Z. Ghahramani, 625–32. MIT Press. http://papers.nips.cc/paper/2089-convolution-kernels-for-natural-language.pdf.
Cortes, Corinna, Patrick Haffner, and Mehryar Mohri. 2004. “Rational Kernels: Theory and Algorithms.” Journal of Machine Learning Research 5 (December): 1035–62. http://dl.acm.org/citation.cfm?id=1005332.1016793.
Cucker, Felipe, and Steve Smale. 2002. “On the Mathematical Foundations of Learning.” Bulletin of the American Mathematical Society 39 (1): 1–49. https://doi.org/10.1090/S0273-0979-01-00923-5.
Cunningham, John P., Krishna V. Shenoy, and Maneesh Sahani. 2008. “Fast Gaussian Process Methods for Point Process Intensity Estimation.” In Proceedings of the 25th International Conference on Machine Learning, 192–99. ICML ’08. New York, NY, USA: ACM Press. https://doi.org/10.1145/1390156.1390181.
Curtain, Ruth F. 1975. “Infinite-Dimensional Filtering.” SIAM Journal on Control 13 (1): 89–104. https://doi.org/10.1137/0313005.
Danafar, Somayeh, Kenji Fukumizu, and Faustino Gomez. 2014. “Kernel-Based Information Criterion.” arXiv:1408.5810 [stat], August. http://arxiv.org/abs/1408.5810.
Devroye, Luc, László Györfi, and Gábor Lugosi. 1996. A Probabilistic Theory of Pattern Recognition. New York: Springer. http://www.szit.bme.hu/~gyorfi/pbook.pdf.
Domingos, Pedro. 2020. “Every Model Learned by Gradient Descent Is Approximately a Kernel Machine.” arXiv:2012.00152 [cs, Stat], November. http://arxiv.org/abs/2012.00152.
Drineas, Petros, and Michael W. Mahoney. 2005. “On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning.” Journal of Machine Learning Research 6 (December): 2153–75. http://jmlr.org/papers/volume6/drineas05a/drineas05a.pdf.
Duttweiler, D., and T. Kailath. 1973a. “RKHS Approach to Detection and Estimation Problems–IV: Non-Gaussian Detection.” IEEE Transactions on Information Theory 19 (1): 19–28. https://doi.org/10.1109/TIT.1973.1054928.
———. 1973b. “RKHS Approach to Detection and Estimation Problems–V: Parameter Estimation.” IEEE Transactions on Information Theory 19 (1): 29–37. https://doi.org/10.1109/TIT.1973.1054949.
Duvenaud, David, James Lloyd, Roger Grosse, Joshua Tenenbaum, and Zoubin Ghahramani. 2013. “Structure Discovery in Nonparametric Regression Through Compositional Kernel Search.” In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 1166–74. http://machinelearning.wustl.edu/mlpapers/papers/icml2013_duvenaud13.
Evgeniou, Theodoros, Charles A. Micchelli, and Massimiliano Pontil. 2005. “Learning Multiple Tasks with Kernel Methods.” Journal of Machine Learning Research 6 (Apr): 615–37. http://www.jmlr.org/papers/v6/evgeniou05a.html.
Feragen, Aasa, and Søren Hauberg. n.d. “Open Problem: Kernel Methods on Manifolds and Metric Spaces,” 4.
FitzGerald, Derry, Antoine Liutkus, Zafar Rafii, Bryan Pardo, and Laurent Daudet. 2013. “Harmonic/Percussive Separation Using Kernel Additive Modelling.” In Irish Signals & Systems Conference 2014 and 2014 China-Ireland International Conference on Information and Communications Technologies (ISSC 2014/CIICT 2014). 25th IET, 35–40. IET. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6912726.
Flaxman, Seth, Yee Whye Teh, and Dino Sejdinovic. 2016. “Poisson Intensity Estimation with Reproducing Kernels.” arXiv:1610.08623 [stat], October. http://arxiv.org/abs/1610.08623.
Friedlander, B., T. Kailath, and L. Ljung. 1975. “Scattering Theory and Linear Least Squares Estimation: Part II: Discrete-Time Problems.” In 1975 IEEE Conference on Decision and Control Including the 14th Symposium on Adaptive Processes, 57–58. https://doi.org/10.1109/CDC.1975.270648.
Genton, Marc G. 2001. “Classes of Kernels for Machine Learning: A Statistics Perspective.” Journal of Machine Learning Research 2 (December): 299–312. http://jmlr.org/papers/volume2/genton01a/genton01a.pdf.
Gevers, M., and T. Kailath. 1973. “An Innovations Approach to Least-Squares Estimation–Part VI: Discrete-Time Innovations Representations and Recursive Estimation.” IEEE Transactions on Automatic Control 18 (6): 588–600. https://doi.org/10.1109/TAC.1973.1100419.
Globerson, Amir, and Roi Livni. 2016. “Learning Infinite-Layer Networks: Beyond the Kernel Trick.” arXiv:1606.05316 [cs], June. http://arxiv.org/abs/1606.05316.
Gorham, Jackson, Anant Raj, and Lester Mackey. 2020. “Stochastic Stein Discrepancies.” arXiv:2007.02857 [cs, Math, Stat], October. http://arxiv.org/abs/2007.02857.
Gottwald, Georg A., and Sebastian Reich. 2020. “Supervised Learning from Noisy Observations: Combining Machine-Learning Techniques with Data Assimilation.” arXiv:2007.07383 [physics, Stat], July. http://arxiv.org/abs/2007.07383.
Grauman, K., and T. Darrell. 2005. “The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features.” In Tenth IEEE International Conference on Computer Vision, 2005. ICCV 2005, 2:1458–1465 Vol. 2. https://doi.org/10.1109/ICCV.2005.239.
Greengard, L., and J. Strain. 1991. “The Fast Gauss Transform.” SIAM Journal on Scientific and Statistical Computing 12 (1): 79–94. https://doi.org/10.1137/0912004.
Gretton, Arthur, Kenji Fukumizu, Choon Hui Teo, Le Song, Bernhard Schölkopf, and Alexander J Smola. 2008. “A Kernel Statistical Test of Independence.” In Advances in Neural Information Processing Systems 20: Proceedings of the 2007 Conference. Cambridge, MA: MIT Press. http://eprints.pascal-network.org/archive/00004335/.
Grosse, Roger, Ruslan R. Salakhutdinov, William T. Freeman, and Joshua B. Tenenbaum. 2012. “Exploiting Compositionality to Explore a Large Space of Model Structures.” In Proceedings of the Conference on Uncertainty in Artificial Intelligence. http://arxiv.org/abs/1210.4856.
Haussler, David. 1999. “Convolution Kernels on Discrete Structures.” Technical report, UC Santa Cruz. http://ci.nii.ac.jp/naid/10015408231/.
Heinonen, Markus, and Florence d’Alché-Buc. 2014. “Learning Nonparametric Differential Equations with Operator-Valued Kernels and Gradient Matching.” arXiv:1411.5172 [cs, Stat], November. http://arxiv.org/abs/1411.5172.
Hofmann, Thomas, Bernhard Schölkopf, and Alexander J. Smola. 2008. “Kernel methods in machine learning.” The Annals of Statistics 36 (3): 1171–1220. https://doi.org/10.1214/009053607000000677.
Ishikawa, Isao, Keisuke Fujii, Masahiro Ikeda, Yuka Hashimoto, and Yoshinobu Kawahara. 2018. “Metric on Nonlinear Dynamical Systems with Perron-Frobenius Operators.” arXiv:1805.12324 [cs, Math, Stat], October. http://arxiv.org/abs/1805.12324.
Jain, Brijnesh J. 2009. “Structure Spaces.” Journal of Machine Learning Research 10.
Jung, Alexander. 2013. “An RKHS Approach to Estimation with Sparsity Constraints.” In Advances in Neural Information Processing Systems 29. http://arxiv.org/abs/1311.5768.
Kailath, T. 1971a. “RKHS Approach to Detection and Estimation Problems–I: Deterministic Signals in Gaussian Noise.” IEEE Transactions on Information Theory 17 (5): 530–49. https://doi.org/10.1109/TIT.1971.1054673.
———. 1971b. “A Note on Least-Squares Estimation by the Innovations Method.” In 1971 IEEE Conference on Decision and Control, 407–11. https://doi.org/10.1109/CDC.1971.271027.
———. 1974. “A View of Three Decades of Linear Filtering Theory.” IEEE Transactions on Information Theory 20 (2): 146–81. https://doi.org/10.1109/TIT.1974.1055174.
Kailath, T., and D. Duttweiler. 1972. “An RKHS Approach to Detection and Estimation Problems– III: Generalized Innovations Representations and a Likelihood-Ratio Formula.” IEEE Transactions on Information Theory 18 (6): 730–45. https://doi.org/10.1109/TIT.1972.1054925.
Kailath, T., and R. Geesey. 1971. “An Innovations Approach to Least Squares Estimation–Part IV: Recursive Estimation Given Lumped Covariance Functions.” IEEE Transactions on Automatic Control 16 (6): 720–27. https://doi.org/10.1109/TAC.1971.1099835.
———. 1973. “An Innovations Approach to Least-Squares Estimation–Part V: Innovations Representations and Recursive Estimation in Colored Noise.” IEEE Transactions on Automatic Control 18 (5): 435–53. https://doi.org/10.1109/TAC.1973.1100366.
Kailath, T., R. Geesey, and H. Weinert. 1972. “Some Relations Among RKHS Norms, Fredholm Equations, and Innovations Representations.” IEEE Transactions on Information Theory 18 (3): 341–48. https://doi.org/10.1109/TIT.1972.1054827.
Kailath, T., and H. Weinert. 1975. “An RKHS Approach to Detection and Estimation Problems–II: Gaussian Signal Detection.” IEEE Transactions on Information Theory 21 (1): 15–23. https://doi.org/10.1109/TIT.1975.1055328.
Kailath, Thomas. 1971. “The Structure of Radon-Nikodym Derivatives with Respect to Wiener and Related Measures.” The Annals of Mathematical Statistics 42 (3): 1054–67.
Kanagawa, Motonobu, and Kenji Fukumizu. 2014. “Recovering Distributions from Gaussian RKHS Embeddings.” In Journal of Machine Learning Research. http://www.jmlr.org/proceedings/papers/v33/kanagawa14.pdf.
Kanagawa, Motonobu, Philipp Hennig, Dino Sejdinovic, and Bharath K. Sriperumbudur. 2018. “Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences.” arXiv:1807.02582 [cs, Stat], July. http://arxiv.org/abs/1807.02582.
Katharopoulos, Angelos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. “Transformers Are RNNs: Fast Autoregressive Transformers with Linear Attention.” arXiv:2006.16236 [cs, Stat], August. http://arxiv.org/abs/2006.16236.
Kemerait, R., and D. Childers. 1972. “Signal Detection and Extraction by Cepstrum Techniques.” IEEE Transactions on Information Theory 18 (6): 745–59. https://doi.org/10.1109/TIT.1972.1054926.
Keriven, Nicolas, Anthony Bourrier, Rémi Gribonval, and Patrick Pérez. 2016. “Sketching for Large-Scale Learning of Mixture Models.” In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6190–94. https://doi.org/10.1109/ICASSP.2016.7472867.
Khintchine, A. 1934. “Korrelationstheorie der stationären stochastischen Prozesse.” Mathematische Annalen 109 (1): 604–15. https://doi.org/10.1007/BF01449156.
Kimeldorf, George S., and Grace Wahba. 1970. “A Correspondence Between Bayesian Estimation on Stochastic Processes and Smoothing by Splines.” The Annals of Mathematical Statistics 41 (2): 495–502. https://doi.org/10.1214/aoms/1177697089.
Kloft, Marius, Ulrich Rückert, and Peter L. Bartlett. 2010. “A Unifying View of Multiple Kernel Learning.” In Machine Learning and Knowledge Discovery in Databases, edited by José Luis Balcázar, Francesco Bonchi, Aristides Gionis, and Michèle Sebag, 66–81. Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-15883-4_5.
Klus, Stefan, Andreas Bittracher, Ingmar Schuster, and Christof Schütte. 2018. “A Kernel-Based Approach to Molecular Conformation Analysis.” The Journal of Chemical Physics 149 (24): 244109. https://doi.org/10.1063/1.5063533.
Kontorovich, Leonid (Aryeh), Corinna Cortes, and Mehryar Mohri. 2008. “Kernel Methods for Learning Languages.” Theoretical Computer Science, Algorithmic Learning Theory, 405 (3): 223–36. https://doi.org/10.1016/j.tcs.2008.06.037.
Kontorovich, Leonid, Corinna Cortes, and Mehryar Mohri. 2006. “Learning Linearly Separable Languages.” In Algorithmic Learning Theory, edited by José L. Balcázar, Philip M. Long, and Frank Stephan, 288–303. Lecture Notes in Computer Science 4264. Springer Berlin Heidelberg. http://link.springer.com/chapter/10.1007/11894841_24.
Koppel, Alec, Garrett Warnell, Ethan Stump, and Alejandro Ribeiro. 2016. “Parsimonious Online Learning with Kernels via Sparse Projections in Function Space.” arXiv:1612.04111 [cs, Stat], December. http://arxiv.org/abs/1612.04111.
Krauth, Karl, Edwin V. Bonilla, Kurt Cutajar, and Maurizio Filippone. 2016. “AutoGP: Exploring the Capabilities and Limitations of Gaussian Process Models.” In UAI17. http://arxiv.org/abs/1610.05392.
Kulis, Brian, and Kristen Grauman. 2012. “Kernelized Locality-Sensitive Hashing.” IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (6): 1092–1104. https://doi.org/10.1109/TPAMI.2011.219.
Lawrence, Neil, Matthias Seeger, and Ralf Herbrich. 2003. “Fast Sparse Gaussian Process Methods: The Informative Vector Machine.” In Proceedings of the 16th Annual Conference on Neural Information Processing Systems, 609–16. http://papers.nips.cc/paper/2240-fast-sparse-gaussian-process-methods-the-informative-vector-machine.
Ley, Christophe, Gesine Reinert, and Yvik Swan. 2017. “Stein’s Method for Comparison of Univariate Distributions.” Probability Surveys 14: 1–52. https://doi.org/10.1214/16-PS278.
Liu, Qiang, Jason D. Lee, and Michael I. Jordan. 2016. “A Kernelized Stein Discrepancy for Goodness-of-Fit Tests and Model Evaluation.” arXiv:1602.03253 [stat], July. http://arxiv.org/abs/1602.03253.
Liutkus, Antoine, Zafar Rafii, Bryan Pardo, Derry Fitzgerald, and Laurent Daudet. 2014. “Kernel Spectrogram Models for Source Separation.” In, 6–10. IEEE. https://doi.org/10.1109/HSCMA.2014.6843240.
Ljung, L., and T. Kailath. 1976. “Backwards Markovian Models for Second-Order Stochastic Processes (Corresp.).” IEEE Transactions on Information Theory 22 (4): 488–91. https://doi.org/10.1109/TIT.1976.1055570.
Ljung, L., T. Kailath, and B. Friedlander. 1975. “Scattering Theory and Linear Least Squares Estimation: Part I: Continuous-Time Problems.” In 1975 IEEE Conference on Decision and Control Including the 14th Symposium on Adaptive Processes, 55–56. https://doi.org/10.1109/CDC.1975.270647.
Lloyd, James Robert, David Duvenaud, Roger Grosse, Joshua Tenenbaum, and Zoubin Ghahramani. 2014. “Automatic Construction and Natural-Language Description of Nonparametric Regression Models.” In Twenty-Eighth AAAI Conference on Artificial Intelligence. http://arxiv.org/abs/1402.4304.
Lodhi, Huma, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. 2002. “Text Classification Using String Kernels.” Journal of Machine Learning Research 2 (March): 419–44. http://jmlr.org/papers/volume2/lodhi02a/lodhi02a.pdf.
Lopez-Paz, David, Robert Nishihara, Soumith Chintala, Bernhard Schölkopf, and Léon Bottou. 2016. “Discovering Causal Signals in Images.” arXiv:1605.08179 [cs, Stat], May. http://arxiv.org/abs/1605.08179.
Lu, Zhengdong, Todd K. Leen, Yonghong Huang, and Deniz Erdogmus. 2008. “A Reproducing Kernel Hilbert Space Framework for Pairwise Time Series Distances.” In Proceedings of the 25th International Conference on Machine Learning, 624–31. ICML ’08. New York, NY, USA: ACM. https://doi.org/10.1145/1390156.1390235.
Ma, Siyuan, and Mikhail Belkin. 2017. “Diving into the Shallows: A Computational Perspective on Large-Scale Shallow Learning.” arXiv:1703.10622 [cs, Stat], March. http://arxiv.org/abs/1703.10622.
Manton, Jonathan H., and Pierre-Olivier Amblard. 2015. “A Primer on Reproducing Kernel Hilbert Spaces.” Foundations and Trends® in Signal Processing 8 (1–2): 1–126. https://doi.org/10.1561/2000000050.
McFee, Brian, and Daniel PW Ellis. 2011. “Analyzing Song Structure with Spectral Clustering.” In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). http://www.ee.columbia.edu/~dpwe/pubs/McFeeE14-structure.pdf.
Meidan, R. 1980. “On the Connection Between Ordinary and Generalized Stochastic Processes.” Journal of Mathematical Analysis and Applications 76 (1): 124–33. https://doi.org/10.1016/0022-247X(80)90066-9.
Mercer, J. 1909. “Functions of Positive and Negative Type, and Their Connection with the Theory of Integral Equations.” Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 209 (441-458): 415–46. https://doi.org/10.1098/rsta.1909.0016.
Micchelli, Charles A., and Massimiliano Pontil. 2005a. “Learning the Kernel Function via Regularization.” Journal of Machine Learning Research 6 (Jul): 1099–1125. http://www.jmlr.org/papers/v6/micchelli05a.html.
———. 2005b. “On Learning Vector-Valued Functions.” Neural Computation 17 (1): 177–204. https://doi.org/10.1162/0899766052530802.
Muandet, Krikamol, Kenji Fukumizu, Bharath Sriperumbudur, Arthur Gretton, and Bernhard Schölkopf. 2014. “Kernel Mean Shrinkage Estimators.” arXiv:1405.5505 [cs, Stat], May. http://arxiv.org/abs/1405.5505.
Muandet, Krikamol, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. 2017. “Kernel Mean Embedding of Distributions: A Review and Beyond.” Foundations and Trends® in Machine Learning 10 (1-2): 1–141. https://doi.org/10.1561/2200000060.
Muller, K., S. Mika, G. Ratsch, K. Tsuda, and Bernhard Scholkopf. 2001. “An Introduction to Kernel-Based Learning Algorithms.” IEEE Transactions on Neural Networks 12 (2): 181–201. https://doi.org/10.1109/72.914517.
Parzen, Emanuel. 1959. “Statistical Inference On Time Series By Hilbert Space Methods, I.” TR23. STANFORD UNIV CA APPLIED MATHEMATICS AND STATISTICS LABS. https://apps.dtic.mil/docs/citations/AD0210363.
Parzen, Emanuel. 1962. “Extraction and Detection Problems and Reproducing Kernel Hilbert Spaces.” Journal of the Society for Industrial and Applied Mathematics Series A Control 1 (1): 35–62. https://doi.org/10.1137/0301004.
Parzen, Emanuel. 1963. “Probability Density Functionals and Reproducing Kernel Hilbert Spaces.” In Proceedings of the Symposium on Time Series Analysis, 196:155–69. Wiley, New York.
Pillonetto, Gianluigi. 2016. “The Interplay Between System Identification and Machine Learning.” arXiv:1612.09158 [cs, Stat], December. http://arxiv.org/abs/1612.09158.
Poggio, T., and F. Girosi. 1990. “Networks for Approximation and Learning.” Proceedings of the IEEE 78 (9): 1481–97. https://doi.org/10.1109/5.58326.
Rahimi, Ali, and Benjamin Recht. 2007. “Random Features for Large-Scale Kernel Machines.” In Advances in Neural Information Processing Systems, 1177–84. Curran Associates, Inc. http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.
———. 2009. “Weighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning.” In Advances in Neural Information Processing Systems, 1313–20. Curran Associates, Inc. http://papers.nips.cc/paper/3495-weighted-sums-of-random-kitchen-sinks-replacing-minimization-with-randomization-in-learning.
Ramdas, Aaditya, and Leila Wehbe. 2014. “Stein Shrinkage for Cross-Covariance Operators and Kernel Independence Testing.” arXiv:1406.1922 [stat], June. http://arxiv.org/abs/1406.1922.
Raykar, Vikas C., and Ramani Duraiswami. 2005. “The Improved Fast Gauss Transform with Applications to Machine Learning.” presented at the NIPS. http://www.umiacs.umd.edu/users/vikas/publications/IFGT_slides.pdf.
Rue, Håvard, and Leonhard Held. 2005. Gaussian Markov Random Fields: Theory and Applications. Monographs on Statistics and Applied Probability 104. Boca Raton: Chapman & Hall/CRC.
Saha, Akash, and Palaniappan Balamurugan. 2020. “Learning with Operator-Valued Kernels in Reproducing Kernel Krein Spaces.” In Advances in Neural Information Processing Systems. Vol. 33. https://proceedings.neurips.cc//paper_files/paper/2020/hash/9f319422ca17b1082ea49820353f14ab-Abstract.html.
Särkkä, Simo. 2011. “Linear Operators and Stochastic Partial Differential Equations in Gaussian Process Regression.” In Artificial Neural Networks and Machine Learning – ICANN 2011, edited by Timo Honkela, Włodzisław Duch, Mark Girolami, and Samuel Kaski, 6792:151–58. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-642-21738-8_20.
Schaback, Robert, and Holger Wendland. 2006. “Kernel Techniques: From Machine Learning to Meshless Methods.” Acta Numerica 15 (May): 543–639. https://doi.org/10.1017/S0962492906270016.
Schlegel, Kevin. 2018. “When Is There a Representer Theorem? Reflexive Banach Spaces.” arXiv:1809.10284 [cs, Math, Stat], September. http://arxiv.org/abs/1809.10284.
Schölkopf, Bernhard, Ralf Herbrich, and Alex J. Smola. 2001. “A Generalized Representer Theorem.” In Computational Learning Theory, edited by David Helmbold and Bob Williamson, 416–26. Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-44581-1.
Schölkopf, Bernhard, Phil Knirsch, Alex Smola, and Chris Burges. 1998. “Fast Approximation of Support Vector Kernel Expansions, and an Interpretation of Clustering as Approximation in Feature Spaces.” In Mustererkennung 1998, edited by Paul Levi, Michael Schanz, Rolf-Jürgen Ahlers, and Franz May, 125–32. Informatik Aktuell. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-72282-0_12.
Schölkopf, Bernhard, Sebastian Mika, Chris J. C. Burges, Philipp Knirsch, Klaus-Robert Müller, Gunnar Rätsch, and Alexander J. Smola. 1999. “Input Space Versus Feature Space in Kernel-Based Methods.” IEEE Transactions on Neural Networks 10: 1000–1017.
Schölkopf, Bernhard, Krikamol Muandet, Kenji Fukumizu, and Jonas Peters. 2015. “Computing Functions of Random Variables via Reproducing Kernel Hilbert Space Representations.” arXiv:1501.06794 [cs, Stat], January. http://arxiv.org/abs/1501.06794.
Schölkopf, Bernhard, and Alexander J. Smola. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
———. 2003. “A Short Introduction to Learning with Kernels.” In Advanced Lectures on Machine Learning, edited by Shahar Mendelson and Alexander J. Smola, 41–64. Lecture Notes in Computer Science 2600. Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-36434-X_2.
Schölkopf, Bernhard, Alexander Smola, and Klaus-Robert Müller. 1997. “Kernel Principal Component Analysis.” In Artificial Neural Networks — ICANN’97, edited by Wulfram Gerstner, Alain Germond, Martin Hasler, and Jean-Daniel Nicoud, 583–88. Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/BFb0020217.
Schuster, Ingmar, Mattes Mollenhauer, Stefan Klus, and Krikamol Muandet. 2019. “Kernel Conditional Density Operators.” arXiv:1905.11255 [cs, Math, Stat], May. http://arxiv.org/abs/1905.11255.
Schuster, Ingmar, Heiko Strathmann, Brooks Paige, and Dino Sejdinovic. 2017. “Kernel Sequential Monte Carlo.” In ECML-PKDD 2017. http://arxiv.org/abs/1510.03105.
Segall, A., M. Davis, and T. Kailath. 1975. “Nonlinear Filtering with Counting Observations.” IEEE Transactions on Information Theory 21 (2): 143–49. https://doi.org/10.1109/TIT.1975.1055360.
Segall, A., and T. Kailath. 1976. “Orthogonal Functionals of Independent-Increment Processes.” IEEE Transactions on Information Theory 22 (3): 287–98. https://doi.org/10.1109/TIT.1976.1055560.
Shen, Yanning, Brian Baingana, and Georgios B. Giannakis. 2016. “Nonlinear Structural Vector Autoregressive Models for Inferring Effective Brain Network Connectivity.” arXiv:1610.06551 [stat], October. http://arxiv.org/abs/1610.06551.
Smola, A. J., and B. Schölkopf. 1998. “On a Kernel-Based Method for Pattern Recognition, Regression, Approximation, and Operator Inversion.” Algorithmica 22 (1-2): 211–31. https://doi.org/10.1007/PL00013831.
Smola, Alex J., and Bernhard Schölkopf. 2000. “Sparse Greedy Matrix Approximation for Machine Learning.” http://www.kernel-machines.org/papers/upload_4467_kfa_long.ps.gz.
———. 2004. “A Tutorial on Support Vector Regression.” Statistics and Computing 14 (3): 199–222. https://doi.org/10.1023/B:STCO.0000035301.49549.88.
Smola, Alex J., Bernhard Schölkopf, and Klaus-Robert Müller. 1998. “The Connection Between Regularization Operators and Support Vector Kernels.” Neural Networks 11 (4): 637–49. https://doi.org/10.1016/S0893-6080(98)00032-X.
Snelson, Edward, and Zoubin Ghahramani. 2005. “Sparse Gaussian Processes Using Pseudo-Inputs.” In Advances in Neural Information Processing Systems, 1257–64. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2005_543.pdf.
Solin, Arno, and Simo Särkkä. 2020. “Hilbert Space Methods for Reduced-Rank Gaussian Process Regression.” Statistics and Computing 30 (2): 419–46. https://doi.org/10.1007/s11222-019-09886-w.
Sriperumbudur, B. K., A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. 2008. “Injective Hilbert Space Embeddings of Probability Measures.” In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008). http://eprints.pascal-network.org/archive/00004340/.
Steinwart, Ingo. 2020. “Reproducing Kernel Hilbert Spaces Cannot Contain All Continuous Functions on a Compact Metric Space.” arXiv:2002.03171 [cs, Math], March. http://arxiv.org/abs/2002.03171.
Székely, Gábor J., and Maria L. Rizzo. 2009. “Brownian Distance Covariance.” The Annals of Applied Statistics 3 (4): 1236–65. https://doi.org/10.1214/09-AOAS312.
Székely, Gábor J., Maria L. Rizzo, and Nail K. Bakirov. 2007. “Measuring and Testing Dependence by Correlation of Distances.” The Annals of Statistics 35 (6): 2769–94. https://doi.org/10.1214/009053607000000505.
Tipping, Michael E. 2001. “Sparse Kernel Principal Component Analysis.” In Advances in Neural Information Processing Systems 13, 633–39. MIT Press. http://papers.nips.cc/paper/1791-sparse-kernel-principal-component-analysis.pdf.
Tompkins, Anthony, and Fabio Ramos. 2018. “Fourier Feature Approximations for Periodic Kernels in Time-Series Modelling.” Proceedings of the AAAI Conference on Artificial Intelligence 32 (1). https://ojs.aaai.org/index.php/AAAI/article/view/11696.
Vedaldi, A., and A. Zisserman. 2012. “Efficient Additive Kernels via Explicit Feature Maps.” IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (3): 480–92. https://doi.org/10.1109/TPAMI.2011.153.
Vert, Jean-Philippe, Koji Tsuda, and Bernhard Schölkopf. 2004. “A Primer on Kernel Methods.” In Kernel Methods in Computational Biology. MIT Press. http://kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/pdfs/pdf2549.pdf.
Vishwanathan, S. V. N., Nicol N. Schraudolph, Risi Kondor, and Karsten M. Borgwardt. 2010. “Graph Kernels.” Journal of Machine Learning Research 11 (August): 1201–42. http://authors.library.caltech.edu/20528/1/Vishwanathan2010p11646J_Mach_Learn_Res.pdf.
Walder, C., B. Schölkopf, and O. Chapelle. 2006. “Implicit Surface Modelling with a Globally Regularised Basis of Compact Support.” Computer Graphics Forum 25 (3): 635–44. https://doi.org/10.1111/j.1467-8659.2006.00983.x.
Walder, Christian, Kwang In Kim, and Bernhard Schölkopf. 2008. “Sparse Multiscale Gaussian Process Regression.” In Proceedings of the 25th International Conference on Machine Learning, 1112–19. ICML ’08. New York, NY, USA: ACM. https://doi.org/10.1145/1390156.1390296.
Wang, Yu-Xiang, Alex Smola, and Ryan J. Tibshirani. 2014. “The Falling Factorial Basis and Its Statistical Applications.” In Proceedings of the 31st International Conference on Machine Learning - Volume 32, 730–38. ICML’14. Beijing, China: JMLR.org. http://arxiv.org/abs/1405.0558.
Weinert, H., and G. Sidhu. 1978. “A Stochastic Framework for Recursive Computation of Spline Functions–Part I: Interpolating Splines.” IEEE Transactions on Information Theory 24 (1): 45–50. https://doi.org/10.1109/TIT.1978.1055825.
Weinert, Howard L. 1978. “Statistical Methods in Optimal Curve Fitting.” Communications in Statistics - Simulation and Computation 7 (4): 417–35. https://doi.org/10.1080/03610917808812088.
Weinert, Howard L., and Thomas Kailath. 1974. “Stochastic Interpretations and Recursive Algorithms for Spline Functions.” The Annals of Statistics 2 (4): 787–94. https://doi.org/10.1214/aos/1176342765.
Williams, Christopher K. I. 2001. “On a Connection Between Kernel PCA and Metric Multidimensional Scaling.” In Advances in Neural Information Processing Systems 13, edited by T. K. Leen, T. G. Dietterich, and V. Tresp, 46:675–81. MIT Press. https://doi.org/10.1023/A:1012485807823.
Wilson, Andrew Gordon, and Ryan Prescott Adams. 2013. “Gaussian Process Kernels for Pattern Discovery and Extrapolation.” In International Conference on Machine Learning. http://arxiv.org/abs/1302.4245.
Wilson, Andrew Gordon, Christoph Dann, Christopher G. Lucas, and Eric P. Xing. 2015. “The Human Kernel.” arXiv:1510.07389 [cs, Stat], October. http://arxiv.org/abs/1510.07389.
Wu, Qiang, and Ding-Xuan Zhou. 2008. “Learning with Sample Dependent Hypothesis Spaces.” Computers & Mathematics with Applications 56 (11): 2896–2907. https://doi.org/10.1016/j.camwa.2008.09.014.
Xu, Jian-Wu, A.R.C. Paiva, Il Park, and J.C. Principe. 2008. “A Reproducing Kernel Hilbert Space Framework for Information-Theoretic Learning.” IEEE Transactions on Signal Processing 56 (12): 5891–5902. https://doi.org/10.1109/TSP.2008.2005085.
Xu, Wenkai, and Takeru Matsuda. 2020. “A Stein Goodness-of-Fit Test for Directional Distributions.” In International Conference on Artificial Intelligence and Statistics, 320–30. PMLR. http://arxiv.org/abs/2002.06843.
———. 2021. “Interpretable Stein Goodness-of-Fit Tests on Riemannian Manifolds.” arXiv:2103.00895 [stat], March. http://arxiv.org/abs/2103.00895.
Xu, Wenkai, and Gesine Reinert. 2021. “A Stein Goodness of Fit Test for Exponential Random Graph Models.” arXiv:2103.00580 [stat], February. http://arxiv.org/abs/2103.00580.
Yaglom, A. M. 1987. Correlation Theory of Stationary and Related Random Functions. Volume II: Supplementary Notes and References. Springer Series in Statistics. New York, NY: Springer Science & Business Media.
Yang, Changjiang, Ramani Duraiswami, and Larry S. Davis. 2004. “Efficient Kernel Machines Using the Improved Fast Gauss Transform.” In Advances in Neural Information Processing Systems, 1561–68. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2005_439.pdf.
Yang, Changjiang, Ramani Duraiswami, Nail A. Gumerov, and Larry Davis. 2003. “Improved Fast Gauss Transform and Efficient Kernel Density Estimation.” In Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2, 464–64. ICCV ’03. Washington, DC, USA: IEEE Computer Society. https://doi.org/10.1109/ICCV.2003.1238383.
Yang, Jiyan, Vikas Sindhwani, Haim Avron, and Michael Mahoney. 2014. “Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels.” arXiv:1412.8293 [cs, Math, Stat], December. http://arxiv.org/abs/1412.8293.
Yang, Tianbao, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. 2012. “Nyström Method Vs Random Fourier Features: A Theoretical and Empirical Comparison.” In Advances in Neural Information Processing Systems, 476–84. http://papers.nips.cc/paper/4588-nystrom-method-vs-random-fourier-features-a-theoretical-and-empirical-comparison.
Yu, Yaoliang, Hao Cheng, Dale Schuurmans, and Csaba Szepesvári. 2013. “Characterizing the Representer Theorem.” In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 570–78. http://www.jmlr.org/proceedings/papers/v28/yu13.pdf.
Zhang, Kun, Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. 2012. “Kernel-Based Conditional Independence Test and Application in Causal Discovery.” arXiv:1202.3775 [cs, Stat], February. http://arxiv.org/abs/1202.3775.
Zhang, Qinyi, Sarah Filippi, Arthur Gretton, and Dino Sejdinovic. 2016. “Large-Scale Kernel Methods for Independence Testing.” arXiv:1606.07892 [stat], June. http://arxiv.org/abs/1606.07892.
Zhou, Ke, Hongyuan Zha, and Le Song. 2013. “Learning Triggering Kernels for Multi-Dimensional Hawkes Processes.” In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 1301–9. http://proceedings.mlr.press/v28/zhou13.pdf.
