**WARNING**: This is very old. If I were to write it now, I would write it differently, and specifically more pedagogically.

Kernel in the sense of the “kernel trick”.
Not to be confused with smoothing-type
convolution kernels,
nor the dozens of related-but-slightly-different
clashing definitions of *kernel*;
those can have their own respective pages.
Corollary: If you do not know what to name something, call it a kernel.

We are concerned with a particular flavour of kernel in Hilbert spaces, specifically *reproducing* or *Mercer* kernels (Mercer 1909).
The associated function space is a *reproducing kernel Hilbert space*, hereafter an *RKHS*.

Kernel *tricks* are the application of Mercer kernels in machine learning.
The “trick” part is that many machine-learning algorithms operate on inner products,
or can be rewritten to do so.
Such algorithms permit one to swap out the boring classic Euclidean inner product in favour of a fancy RKHS one.
The classic machine learning pitch for trying such a stunt is something like
“upgrade your old boring linear algebra on finite (usually low-) dimensional spaces to sexy algebra on
potentially-infinite-dimensional feature spaces, which still has a low-dimensional representation.”
Or, if you’d like, “apply certain statistical learning methods based on things
with an obvious finite vector space representation (\(\mathbb{R}^n\))
to things without one (Sentences, piano-rolls, \(\mathcal{C}^d_\ell\)).”
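A minimal worked instance of the trick: the homogeneous quadratic kernel \(k(x,z)=(x^\top z)^2\) computes, without ever building it, the inner product of an explicit 9-dimensional feature map. (The kernel and dimensions here are arbitrary illustrative choices.)

```python
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)

# Explicit feature map for the homogeneous quadratic kernel:
# phi(v) lists all pairwise products v_i * v_j (9 dimensions here).
def phi(v):
    return np.outer(v, v).ravel()

lhs = phi(x) @ phi(z)   # inner product in the 9-dim feature space
rhs = (x @ z) ** 2      # same number via the kernel, no feature map built
assert np.isclose(lhs, rhs)
```

The same identity is what lets infinite-dimensional feature spaces (e.g. for the RBF kernel) remain computable: the feature map never needs to be instantiated.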

Mini history: The oft-cited origins of all the reproducing kernel stuff are (Aronszajn 1950; Mercer 1909). It took a while to percolate into random function theory (Khintchine 1934; Yaglom 1987) as covariance functions. Thence the idea arrived in statistical inference (Emanuel Parzen 1962; E. Parzen 1963, 1959) and signal processing (Aasnaes and Kailath 1973; Duttweiler and Kailath 1973a, 1973b; Gevers and Kailath 1973; T. Kailath and Geesey 1971, 1973; T. Kailath 1971b, 1971a, 1974; T. Kailath, Geesey, and Weinert 1972; T. Kailath and Duttweiler 1972; T. Kailath and Weinert 1975), and now it is ubiquitous.

Practically, kernel methods have problems with scalability to large data sets. To apply any such method you need to keep a full Gram matrix of inner products between every pair of data points, which means knowing, for \(N\) data points, the \(N(N+1)/2\) distinct entries of a symmetric matrix. If you need to invert that matrix the cost is \(\mathcal{O}(N^3)\), so large \(N\) requires fancy tricks. Which trick depends on the actual model, but options include sparse GPs, random-projection inversions, Markov approximations, and presumably many more.
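To make the bottleneck concrete, here is a naïve kernel ridge regression in NumPy: note the \(N\times N\) Gram matrix and the cubic-cost solve. (The RBF kernel, lengthscale, and regularisation weight are arbitrary illustrative choices.)

```python
import numpy as np

rng = np.random.default_rng(1)
N, lam = 200, 1e-3
X = rng.normal(size=(N, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)

# The full RBF Gram matrix: N^2 entries is the memory bottleneck.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists)

# Kernel ridge regression: solving this N x N system costs O(N^3).
alpha = np.linalg.solve(K + lam * np.eye(N), y)
pred = K @ alpha
```

Doubling the data quadruples the memory and multiplies the solve time by roughly eight, which is the scaling problem the tricks below attack.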

I’m especially interested in the application of such tricks in

- kernel regression
- wide random NNs
- Nonparametric kernel independence tests
- ~~Efficient kernel pre-image approximation~~
- ~~Connection between kernel PCA and clustering (Schölkopf et al. 1998; Williams 2001)~~

*Turns out not all those applications are interesting to me.*

## Introductions

There are many primers on Mercer kernels and their connection to ML. Kenneth Tay’s intro is punchy. See (Schölkopf and Smola 2002), which grinds out many connections with learning theory; (Manton and Amblard 2015), which focusses more narrowly on the Mercer-kernel part, emphasising topological and geometric properties of the spaces; or (Cheney and Light 2009) for an approximation-theory perspective which does not especially concern itself with stochastic processes. I also seem to have bookmarked the following introductions (Vert, Tsuda, and Schölkopf 2004; Schölkopf et al. 1999; Schölkopf, Herbrich, and Smola 2001; Muller et al. 2001; Schölkopf and Smola 2003).

Alex Smola (who, with Bernhard Schölkopf, has his name on an intimidating proportion of publications in this area) also has all his publications online.

## Kernel approximation

See kernel approximation.

## RKHS distribution embedding

## Specific kernels

See covariance functions.

## Non-scalar-valued “kernels”

Extending the usual inner-product framing,
*operator-valued kernels*
(Micchelli and Pontil 2005a; Evgeniou, Micchelli, and Pontil 2005; Álvarez, Rosasco, and Lawrence 2012) generalise to
\(k:\mathcal{X}\times \mathcal{X}\to \mathcal{L}(H_Y)\),
as seen in multi-task learning.
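A minimal concrete instance is the separable (“intrinsic coregionalization”) construction, where each scalar kernel evaluation is scaled by a fixed PSD task-coupling matrix \(B\), so the joint Gram matrix is a Kronecker product. A NumPy sketch, with kernel and coupling chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 50, 3  # data points, tasks

X = rng.normal(size=(N, 1))
k = np.exp(-0.5 * (X - X.T) ** 2)   # scalar input kernel, N x N

# Task-coupling matrix B, PSD by construction; the operator-valued
# kernel evaluates to k(x, x') * B.
A = rng.normal(size=(T, T))
B = A @ A.T + np.eye(T)

# Joint Gram matrix over all (point, task) pairs via the Kronecker product.
K_multi = np.kron(k, B)
eigs = np.linalg.eigvalsh(K_multi)
assert eigs.min() > -1e-8           # still a valid (PSD) kernel
```

The Kronecker structure is also what makes these models tractable: inverses and determinants factor over the two smaller matrices.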

## Tools

### KeOps

File under least squares, autodiff, GPs, PyTorch.

The KeOps library lets you compute reductions of large arrays whose entries are given by a mathematical formula or a neural network. It combines efficient C++ routines with an automatic differentiation engine and can be used with Python (NumPy, PyTorch), Matlab and R.

It is perfectly suited to the computation of kernel matrix-vector products, K-nearest neighbors queries, N-body interactions, point cloud convolutions and the associated gradients. Crucially, it performs well even when the corresponding kernel or distance matrices do not fit into RAM or GPU memory. Compared with a PyTorch GPU baseline, KeOps provides a 10-100× speed-up on a wide range of geometric applications, from kernel methods to geometric deep learning.
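KeOps compiles such reductions to CUDA; the underlying idea — never materialise the full Gram matrix, reduce it block by block — can be sketched in plain NumPy (the RBF kernel and chunk size here are illustrative choices, not the KeOps API):

```python
import numpy as np

def rbf_matvec(X, v, chunk=256):
    """K @ v for an RBF Gram matrix, computed block by block so the
    full N x N matrix is never stored -- the kind of lazy reduction
    KeOps compiles to fast GPU code."""
    out = np.empty(len(X))
    for i in range(0, len(X), chunk):
        sq = ((X[i:i + chunk, None, :] - X[None, :, :]) ** 2).sum(-1)
        out[i:i + chunk] = np.exp(-0.5 * sq) @ v
    return out

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 3))
v = rng.normal(size=1000)

# Agrees with the dense computation, without the O(N^2) peak memory.
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
assert np.allclose(rbf_matvec(X, v), K @ v)
```

A matrix-free matvec like this is exactly what iterative solvers (CG, Lanczos) need, which is how such reductions plug into large-scale GP inference.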

### Falkon

A Python library for large-scale kernel methods, with optional (multi-)GPU acceleration.

The library currently includes two solvers: one for approximate kernel ridge regression (Rudi, Carratino, and Rosasco 2017), which is extremely fast, and one for kernel logistic regression (Marteau-Ferey, Bach, and Rudi 2019), which trades off lower speed for better accuracy on binary classification problems.

The main features of Falkon are:

- Full multi-GPU support: all compute-intensive parts of the algorithms are multi-GPU capable.
- Extreme scalability: unlike other kernel solvers, we keep memory usage in check. We have tested the library with datasets of billions of points.
- Sparse data support.
- Scikit-learn integration: our estimators follow the scikit-learn API.
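Falkon’s actual solver uses Nyström centres plus a preconditioned conjugate-gradient iteration; as a rough illustration of why Nyström-type centre selection helps, here is a plain Nyström kernel ridge regression in NumPy (kernel, lengthscale, centre-selection rule, and jitter are all placeholder choices, not Falkon’s implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, lam = 2000, 100, 1e-2   # data points, inducing centres, ridge weight

X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)

def rbf(A, B):
    return np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

centres = X[rng.choice(N, size=M, replace=False)]
Knm = rbf(X, centres)          # N x M cross-kernel
Kmm = rbf(centres, centres)    # M x M

# Nystrom kernel ridge regression: an M x M solve, O(N M^2) overall,
# instead of the O(N^3) full-Gram solve. Small jitter for stability.
alpha = np.linalg.solve(Knm.T @ Knm + lam * Kmm + 1e-8 * np.eye(M),
                        Knm.T @ y)
pred = Knm @ alpha
```

With \(M \ll N\) this is the difference between a laptop-sized problem and an intractable one, at the price of an approximation controlled by the number and placement of centres.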

## References

*IEEE Transactions on Automatic Control*18 (6): 601–7.

*Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics*, 85–92.

*arXiv:2106.12408 [Stat]*, October.

*Proceedings of the 36th International Conference on Machine Learning*, 141–50. PMLR.

*arXiv:1411.0306 [Cs, Stat]*, November.

*Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence*, 2–9. UAI ’04. Arlington, Virginia, United States: AUAI Press.

*Foundations and Trends® in Machine Learning*4 (3): 195–266.

*Proceedings of the 33rd International Conference on Neural Information Processing Systems*, 32:6484–94. Red Hook, NY, USA: Curran Associates Inc.

*Transactions of the American Mathematical Society*68 (3): 337–404.

*Proceedings of the 21st International Conference on Neural Information Processing Systems*, 105–12. NIPS’08. USA: Curran Associates Inc.

*arXiv Preprint arXiv:1502.06800*.

*COLT*, 30:185–209.

*arXiv:1704.02958 [Cs, Stat]*, April.

*Pattern Recognition*, edited by Carl Edward Rasmussen, Heinrich H. Bülthoff, Bernhard Schölkopf, and Martin A. Giese, 253–61. Lecture Notes in Computer Science 3175. Springer Berlin Heidelberg.

*arXiv:1606.05241 [Stat]*, June.

*PLoS Comput Biol*4 (10): e1000173.

*Inference and prediction in large dimensions*. Wiley series in probability and statistics. Chichester, England ; Hoboken, NJ: John Wiley/Dunod.

*arXiv:1806.09810 [Cs, Math]*, June.

*The Annals of Statistics*32 (4): 1723–43.

*Advances in Kernel Methods - Support Vector Learning*, edited by Bernhard Schölkopf, Christopher JC Burges, and Alexander J Smola. Cambridge, MA: MIT Press.

*Neurocomputing*69 (7-9): 714–20.

*Machine Learning*44 (1-2): 185–97.

*Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005.*, 3:1425–30. Montreal, QC, Canada: IEEE.

*A Course in Approximation Theory*. American Mathematical Soc.

*arXiv:1605.09049 [Cs, Stat]*, May.

*Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48*, 2606–15. ICML’16. New York, NY, USA: JMLR.org.

*Machine Learning: ECML 2006*, edited by Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou, 90–101. Lecture Notes in Computer Science 4212. Springer Berlin Heidelberg.

*Grammatical Inference: Algorithms and Applications*, edited by Yasubumi Sakakibara, Satoshi Kobayashi, Kengo Sato, Tetsuro Nishino, and Etsuji Tomita, 148–60. Lecture Notes in Computer Science 4201. Springer Berlin Heidelberg.

*Fundamenta Informaticae*84 (3): 291–303.

*Advances in Neural Information Processing Systems 14*, edited by T. G. Dietterich, S. Becker, and Z. Ghahramani, 625–32. MIT Press.

*Journal of Machine Learning Research*5 (December): 1035–62.

*Bulletin of the American Mathematical Society*39 (1): 1–49.

*Proceedings of the 25th International Conference on Machine Learning*, 192–99. ICML ’08. New York, NY, USA: ACM Press.

*SIAM Journal on Control*13 (1): 89–104.

*arXiv:1408.5810 [Stat]*, August.

*A Probabilistic Theory of Pattern Recognition*. New York: Springer.

*arXiv:2012.00152 [Cs, Stat]*, November.

*Journal of Machine Learning Research*6 (December): 2153–75.

*IEEE Transactions on Information Theory*19 (1): 19–28.

*IEEE Transactions on Information Theory*19 (1): 29–37.

*Proceedings of the 30th International Conference on Machine Learning (ICML-13)*, 1166–74.

*Journal of Machine Learning Research*6 (Apr): 615–37.

*Conference on Learning Theory*, 1647–50. PMLR.

*Irish Signals & Systems Conference 2014 and 2014 China-Ireland International Conference on Information and Communications Technologies (ISSC 2014/CIICT 2014). 25th IET*, 35–40. IET.

*arXiv:1610.08623 [Stat]*, October.

*1975 IEEE Conference on Decision and Control Including the 14th Symposium on Adaptive Processes*, 57–58.

*Journal of Machine Learning Research*2 (December): 299–312.

*IEEE Transactions on Automatic Control*18 (6): 588–600.

*arXiv:1606.05316 [Cs]*, June.

*arXiv:2007.02857 [Cs, Math, Stat]*, October.

*arXiv:2007.07383 [Physics, Stat]*, July.

*Tenth IEEE International Conference on Computer Vision, 2005. ICCV 2005*, 2:1458–1465 Vol. 2.

*SIAM Journal on Scientific and Statistical Computing*12 (1): 79–94.

*The Journal of Machine Learning Research*13 (1): 723–73.

*Advances in Neural Information Processing Systems 20: Proceedings of the 2007 Conference*. Cambridge, MA: MIT Press.

*Proceedings of the Conference on Uncertainty in Artificial Intelligence*.

*A Distribution-Free Theory of Nonparametric Regression*. Springer Series in Statistics. New York: Springer.

*arXiv:1411.5172 [Cs, Stat]*, November.

*The Annals of Statistics*36 (3): 1171–1220.

*arXiv:1805.12324 [Cs, Math, Stat]*, October.

*Journal of Machine Learning Research*10.

*Advances in Neural Information Processing Systems 29*.

*IEEE Transactions on Information Theory*17 (5): 530–49.

*1971 IEEE Conference on Decision and Control*, 407–11.

*IEEE Transactions on Information Theory*20 (2): 146–81.

*IEEE Transactions on Information Theory*18 (6): 730–45.

*IEEE Transactions on Automatic Control*16 (6): 720–27.

*IEEE Transactions on Automatic Control*18 (5): 435–53.

*IEEE Transactions on Information Theory*18 (3): 341–48.

*The Annals of Mathematical Statistics*42 (3): 1054–67.

*IEEE Transactions on Information Theory*21 (1): 15–23.

*Journal of Machine Learning Research*.

*arXiv:1807.02582 [Cs, Stat]*, July.

*arXiv:2006.16236 [Cs, Stat]*, August.

*IEEE Transactions on Information Theory*18 (6): 745–59.

*2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, 6190–94.

*Mathematische Annalen*109 (1): 604–15.

*The Annals of Mathematical Statistics*41 (2): 495–502.

*Machine Learning and Knowledge Discovery in Databases*, edited by José Luis Balcázar, Francesco Bonchi, Aristides Gionis, and Michèle Sebag, 66–81. Lecture Notes in Computer Science. Springer Berlin Heidelberg.

*The Journal of Chemical Physics*149 (24): 244109.

*Theoretical Computer Science*, Algorithmic Learning Theory, 405 (3): 223–36.

*Algorithmic Learning Theory*, edited by José L. Balcázar, Philip M. Long, and Frank Stephan, 288–303. Lecture Notes in Computer Science 4264. Springer Berlin Heidelberg.

*arXiv:1612.04111 [Cs, Stat]*, December.

*UAI17*.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*34 (6): 1092–1104.

*Proceedings of the 16th Annual Conference on Neural Information Processing Systems*, 609–16.

*Probability Surveys*14 (none): 1–52.

*Proceedings of The 33rd International Conference on Machine Learning*, 9.

*IEEE Transactions on Information Theory*22 (4): 488–91.

*1975 IEEE Conference on Decision and Control Including the 14th Symposium on Adaptive Processes*, 55–56.

*Twenty-Eighth AAAI Conference on Artificial Intelligence*.

*Journal of Machine Learning Research*2 (March): 419–44.

*arXiv:1605.08179 [Cs, Stat]*, May.

*Proceedings of the 25th International Conference on Machine Learning*, 624–31. ICML ’08. New York, NY, USA: ACM.

*arXiv:1703.10622 [Cs, Stat]*, March.

*Foundations and Trends® in Signal Processing*8 (1–2): 1–126.

*Advances in Neural Information Processing Systems*. Vol. 32. Red Hook, NY, USA: Curran Associates, Inc.

*IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*.

*Proceedings of the 34th International Conference on Neural Information Processing Systems*, 14410–22. NIPS’20. Red Hook, NY, USA: Curran Associates Inc.

*Journal of Mathematical Analysis and Applications*76 (1): 124–33.

*Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character*209 (441-458): 415–46.

*Journal of Machine Learning Research*6 (Jul): 1099–1125.

*Neural Computation*17 (1): 177–204.

*SIAM/ASA Journal on Uncertainty Quantification*, February, 96–124.

*arXiv:1405.5505 [Cs, Stat]*, May.

*Foundations and Trends® in Machine Learning*10 (1-2): 1–141.

*IEEE Transactions on Neural Networks*12 (2): 181–201.

*The Journal of Machine Learning Research*17 (1): 6240–67.

*Proceedings of the Symposium on Time Series Analysis*, 196:155–69. Wiley, New York.

*Journal of the Society for Industrial and Applied Mathematics Series A Control*1 (1): 35–62.

*arXiv:1612.09158 [Cs, Stat]*, December.

*Proceedings of the IEEE*78 (9): 1481–97.

*Advances in Neural Information Processing Systems*, 1177–84. Curran Associates, Inc.

*Advances in Neural Information Processing Systems*, 1313–20. Curran Associates, Inc.

*arXiv:1406.1922 [Stat]*, June.

*Proceedings of the 31st International Conference on Neural Information Processing Systems*, 3891–901. NIPS’17. Red Hook, NY, USA: Curran Associates Inc.

*Gaussian Markov Random Fields: Theory and Applications*. Monographs on Statistics and Applied Probability 104. Boca Raton: Chapman & Hall/CRC.

*Stat*10 (1): e329.

*Advances in Neural Information Processing Systems*. Vol. 33.

*Artificial Neural Networks and Machine Learning – ICANN 2011*, edited by Timo Honkela, Włodzisław Duch, Mark Girolami, and Samuel Kaski, 6792:151–58. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer.

*Acta Numerica*15 (May): 543–639.

*arXiv:1809.10284 [Cs, Math, Stat]*, September.

*Computational Learning Theory*, edited by David Helmbold and Bob Williamson, 416–26. Lecture Notes in Computer Science. Springer Berlin Heidelberg.

*Mustererkennung 1998*, edited by Paul Levi, Michael Schanz, Rolf-Jürgen Ahlers, and Franz May, 125–32. Informatik Aktuell. Springer Berlin Heidelberg.

*IEEE Transactions on Neural Networks*10: 1000–1017.

*arXiv:1501.06794 [Cs, Stat]*, January.

*Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond*. MIT Press.

*Advanced Lectures on Machine Learning*, edited by Shahar Mendelson and Alexander J. Smola, 41–64. Lecture Notes in Computer Science 2600. Springer Berlin Heidelberg.

*Artificial Neural Networks — ICANN’97*, edited by Wulfram Gerstner, Alain Germond, Martin Hasler, and Jean-Daniel Nicoud, 583–88. Lecture Notes in Computer Science. Springer Berlin Heidelberg.

*arXiv:1905.11255 [Cs, Math, Stat]*, May.

*ECML-PKDD 2017*.

*IEEE Transactions on Information Theory*21 (2): 143–49.

*IEEE Transactions on Information Theory*22 (3): 287–98.

*arXiv:1610.06551 [Stat]*, October.

*Algorithmica*22 (1-2): 211–31.

*Statistics and Computing*14 (3): 199–222.

*Neural Networks*11 (4): 637–49.

*Advances in Neural Information Processing Systems*, 1257–64.

*Statistics and Computing*30 (2): 419–46.

*Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008)*.

*arXiv:2002.03171 [Cs, Math]*, March.

*The Annals of Applied Statistics*3 (4): 1236–65.

*The Annals of Statistics*35 (6): 2769–94.

*Advances in Neural Information Processing Systems 13*, 633–39. MIT Press.

*Proceedings of the AAAI Conference on Artificial Intelligence*32 (1).

*IEEE Transactions on Pattern Analysis and Machine Intelligence*34 (3): 480–92.

*Kernel Methods in Computational Biology*. MIT Press.

*Journal of Machine Learning Research*11 (August): 1201–42.

*Proceedings of the 25th International Conference on Machine Learning*, 1112–19. ICML ’08. New York, NY, USA: ACM.

*Computer Graphics Forum*25 (3): 635–44.

*Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32*, 730–38. ICML’14. Beijing, China: JMLR.org.

*Communications in Statistics - Simulation and Computation*7 (4): 417–35.

*The Annals of Statistics*2 (4): 787–94.

*IEEE Transactions on Information Theory*24 (1): 45–50.

*Advances in Neural Information Processing Systems 13*, edited by T. K. Leen, T. G. Dietterich, and V. Tresp, 46:675–81. MIT Press.

*International Conference on Machine Learning*.

*arXiv:1510.07389 [Cs, Stat]*, October.

*Computers & Mathematics with Applications*56 (11): 2896–2907.

*IEEE Transactions on Signal Processing*56 (12): 5891–5902.

*International Conference on Artificial Intelligence and Statistics*, 320–30. PMLR.

*arXiv:2103.00895 [Stat]*, March.

*arXiv:2103.00580 [Stat]*, February.

*Correlation Theory of Stationary and Related Random Functions. Volume II: Supplementary Notes and References*. Springer Series in Statistics. New York, NY: Springer Science & Business Media.

*Advances in Neural Information Processing Systems*, 1561–68.

*Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2*, 464–64. ICCV ’03. Washington, DC, USA: IEEE Computer Society.

*arXiv:1412.8293 [Cs, Math, Stat]*, December.

*Advances in Neural Information Processing Systems*, 476–84.

*Proceedings of the 30th International Conference on Machine Learning (ICML-13)*, 570–78.

*arXiv:1202.3775 [Cs, Stat]*, February.

*arXiv:1606.07892 [Stat]*, June.

*Proceedings of the 30th International Conference on Machine Learning (ICML-13)*, 1301–9.
