# (Reproducing) kernel tricks

August 18, 2014 — July 21, 2023

algebra
functional analysis
Hilbert space
kernel tricks
metrics
nonparametric

WARNING: This is very old. If I were to write it now, I would write it differently, and specifically more pedagogically.

Kernel in the sense of the “kernel trick”. Not to be confused with smoothing-type convolution kernels, nor the dozens of related-but-slightly-different clashing definitions of kernel; those can have their own respective pages. Corollary: If you do not know what to name something, call it a kernel.

We are concerned with a particular flavour of kernel on Hilbert spaces, specifically reproducing or Mercer kernels. The associated function space is a reproducing kernel Hilbert space, hereafter an RKHS.
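The defining property that gives these spaces their name can be stated compactly. A minimal statement, in standard notation rather than that of any one reference:

```latex
% Reproducing property: for an RKHS \mathcal{H} of functions on a set
% \mathcal{X} with kernel k, the function k(x,\cdot) lies in \mathcal{H}
% for every x, and point evaluation is an inner product:
\[
  f(x) \;=\; \langle f,\, k(x,\cdot)\rangle_{\mathcal{H}}
  \quad\text{for all } f \in \mathcal{H},\ x \in \mathcal{X}.
\]
% In particular, applying this to f = k(x',\cdot) gives
\[
  k(x, x') \;=\; \langle k(x,\cdot),\, k(x',\cdot)\rangle_{\mathcal{H}},
\]
% i.e. k is the Gram of the canonical feature map \phi(x) = k(x,\cdot).
```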

Kernel tricks comprise the application of Mercer kernels in machine learning. The “trick” part is that many machine learning algorithms operate on inner products, or can be rewritten to do so. Such algorithms permit one to swap out the boring classic Euclidean definition of that inner product in favour of a fancy RKHS one. The classic machine-learning pitch for trying such a stunt is something like “upgrade your old boring linear algebra on finite- (usually low-) dimensional spaces to sexy algebra on potentially-infinite-dimensional feature spaces, which still has a low-dimensional representation.” Or, if you’d like, “apply statistical learning methods designed for things with an obvious finite vector space representation ($$\mathbb{R}^n$$) to things without one (sentences, piano-rolls, $$\mathcal{C}^d_\ell$$).”
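To make the swap concrete, here is a minimal sketch of kernel ridge regression: the algorithm only ever touches the data through pairwise inner products $$k(x,x')$$, so substituting an RBF kernel for the Euclidean one buys a nonlinear fit for free. All names and parameter values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-d regression problem.
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

def rbf_kernel(A, B, lengthscale=1.0):
    """Gram matrix of the RBF (squared-exponential) kernel."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * lengthscale ** 2))

# Kernel ridge regression: everything goes through k(x, x'),
# never through the (here infinite-dimensional) explicit features.
lam = 1e-2
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_test = np.linspace(-3, 3, 100)[:, None]
y_pred = rbf_kernel(X_test, X) @ alpha
```

Replacing `rbf_kernel` with a plain dot product recovers ordinary ridge regression; that interchangeability is the whole trick.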

Mini history: the oft-cited origins of all the reproducing kernel stuff are Mercer (1909) and Aronszajn (1950). It took a while to percolate into random function theory as covariance functions. Thence the idea arrived in statistical inference and signal processing, and now it is ubiquitous.

Practically, kernel methods have problems with scalability to large data sets. Applying any such method requires a full Gram matrix of inner products between every pair of data points, which means knowing, for $$N$$ data points, $$N(N+1)/2$$ distinct entries of a symmetric matrix. If you need to invert that matrix, the cost is $$\mathcal{O}(N^3)$$, so large $$N$$ calls for fancy tricks. Which tricks depends on the actual model, but they include sparse GPs, random-projection inversions, Markov approximations and presumably many more.
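One such trick, random Fourier features in the style of Rahimi and Recht (2007), sidesteps the Gram matrix entirely: a random finite-dimensional feature map whose inner product approximates the kernel in expectation. A minimal sketch (sizes and the ridge-regression use are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, D = 2000, 5, 200              # data points, input dim, random features
X = rng.standard_normal((N, d))

# Random Fourier features for the unit-lengthscale RBF kernel:
# z(x) = sqrt(2/D) cos(W x + b) with W ~ N(0, I), b ~ U[0, 2*pi],
# gives z(x) . z(x') ~= exp(-|x - x'|^2 / 2) in expectation,
# so the N x N Gram matrix is never formed, let alone inverted.
W = rng.standard_normal((d, D))
b = rng.uniform(0, 2 * np.pi, D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)   # N x D feature map

# Ridge regression in feature space: O(N D^2 + D^3) instead of O(N^3).
y = rng.standard_normal(N)
lam = 1e-1
theta = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)
```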

I’m especially interested in the application of such tricks in

1. kernel regression
2. wide random NNs
3. Nonparametric kernel independence tests
4. Efficient kernel pre-image approximation
5. Connection between kernel PCA and clustering

Turns out not all those applications are interesting to me.

## 1 Introductions

There are many primers on Mercer kernels and their connection to ML. Kenneth Tay’s intro is punchy. Il Shan Ng’s Reproducing Kernel Hilbert Spaces & Machine Learning is good. See , which grinds out many connections with learning theory, or , which is more narrowly focussed on just the Mercer-kernel part and the topological and geometric properties of the spaces. Cheney and Light (2009) is an approximation-theory perspective which does not especially concern itself with stochastic processes. I also seem to have bookmarked the following introductions .

Alex Smola (who, with Bernhard Schölkopf, has his name on an intimidating proportion of the publications in this area) keeps all his publications online.

## 5 Non-scalar-valued “kernels”

Extending the usual inner-product framing, operator-valued kernels generalise to $$k:\mathcal{X}\times \mathcal{X}\to \mathcal{L}(H_Y)$$, as seen in multi-task learning.
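The simplest instance is the “separable” matrix-valued kernel $$K(x,x') = k(x,x')\,B$$, where $$k$$ is a scalar kernel and a PSD matrix $$B$$ couples the output tasks. A minimal sketch (all names, sizes and the particular $$B$$ are illustrative, not taken from any specific reference):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, T = 30, 2, 3                 # data points, input dim, output tasks
X = rng.standard_normal((N, d))

def rbf(A, B_):
    """Scalar RBF kernel Gram matrix."""
    sq = ((A[:, None, :] - B_[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq)

# PSD inter-task coupling matrix (illustrative values).
B = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.5],
              [0.0, 0.5, 1.0]])

# For the separable kernel, the full multi-task Gram matrix is a
# Kronecker product: one T x T block B * k(x_i, x_j) per data pair.
K_big = np.kron(rbf(X, X), B)      # (N*T) x (N*T)

# Multi-task kernel ridge regression on row-major-vectorised targets.
Y = rng.standard_normal((N, T))
lam = 1e-2
alpha = np.linalg.solve(K_big + lam * np.eye(N * T), Y.ravel())
```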

## 6 Tools

### 6.1 KeOps

File under least squares, autodiff, gps, pytorch.

The KeOps library lets you compute reductions of large arrays whose entries are given by a mathematical formula or a neural network. It combines efficient C++ routines with an automatic differentiation engine and can be used with Python (NumPy, PyTorch), Matlab and R.

It is perfectly suited to the computation of kernel matrix-vector products, K-nearest neighbors queries, N-body interactions, point cloud convolutions and the associated gradients. Crucially, it performs well even when the corresponding kernel or distance matrices do not fit into the RAM or GPU memory. Compared with a PyTorch GPU baseline, KeOps provides a 10-100× speed-up on a wide range of geometric applications, from kernel methods to geometric deep learning.
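For orientation, here is the kind of reduction KeOps accelerates, written densely in NumPy; KeOps evaluates the same Gaussian-kernel matvec tile-by-tile on the GPU so the $$M\times N$$ matrix is never materialised. Sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, d = 1000, 1200, 3
x = rng.standard_normal((M, d))
y = rng.standard_normal((N, d))
b = rng.standard_normal(N)

# The reduction a_i = sum_j exp(-|x_i - y_j|^2) * b_j, done the naive
# way: the full M x N matrix is built in memory, which is exactly what
# KeOps avoids.
D2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # M x N squared distances
a = np.exp(-D2) @ b                                   # kernel matvec

# Roughly equivalent KeOps version (left as a comment; requires pykeops):
#   from pykeops.numpy import LazyTensor
#   x_i = LazyTensor(x[:, None, :]); y_j = LazyTensor(y[None, :, :])
#   a = (-((x_i - y_j) ** 2).sum(-1)).exp() @ b   # K never held in memory
```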

### 6.2 Falkon

A Python library for large-scale kernel methods, with optional (multi-)GPU acceleration.

The library currently includes two solvers: one for approximate kernel ridge regression (Rudi, Carratino, and Rosasco 2017), which is extremely fast, and one for kernel logistic regression (Marteau-Ferey, Bach, and Rudi 2019), which trades off lower speed for better accuracy on binary classification problems.
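The core idea these solvers build on is Nyström approximation: restrict the representer expansion to $$m \ll N$$ inducing points, dropping the cost from $$\mathcal{O}(N^3)$$ to roughly $$\mathcal{O}(Nm^2)$$. A bare-bones sketch of plain Nyström kernel ridge regression (not Falkon’s actual algorithm, which adds preconditioned conjugate-gradient iterations; all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 2000, 3, 100             # data points, input dim, inducing points
X = rng.standard_normal((N, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(N)

def rbf(A, B):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq)

centers = X[rng.choice(N, m, replace=False)]   # Nystrom centres
K_nm = rbf(X, centers)                         # N x m cross-Gram
K_mm = rbf(centers, centers)                   # m x m

# Solve the m-dimensional normal equations instead of the N x N system:
#   (K_nm^T K_nm + lam * N * K_mm) alpha = K_nm^T y
lam = 1e-3
alpha = np.linalg.solve(K_nm.T @ K_nm + lam * N * K_mm
                        + 1e-8 * np.eye(m),    # jitter for stability
                        K_nm.T @ y)
y_pred = K_nm @ alpha
```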

The main features of Falkon are:

• Full multi-GPU support - All compute-intensive parts of the algorithms are multi-GPU capable.
• Extreme scalability - Unlike other kernel solvers, we keep memory usage in check. We have tested the library with datasets of billions of points.
• Sparse data support
• Scikit-learn integration - Our estimators follow the scikit-learn API

## 7 References

Aasnaes, and Kailath. 1973. IEEE Transactions on Automatic Control.
Agarwal, and Daumé III. 2011. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.
Agrawal, and Broderick. 2021. arXiv:2106.12408 [Stat].
Agrawal, Trippe, Huggins, et al. 2019. In Proceedings of the 36th International Conference on Machine Learning.
Alaoui, and Mahoney. 2014. arXiv:1411.0306 [Cs, Stat].
Altun, Smola, and Hofmann. 2004. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. UAI ’04.
Álvarez, Rosasco, and Lawrence. 2012. Foundations and Trends® in Machine Learning.
Arbel, Korba, Salim, et al. 2019. In Proceedings of the 33rd International Conference on Neural Information Processing Systems.
Aronszajn. 1950. Transactions of the American Mathematical Society.
Azangulov, Smolensky, Terenin, et al. 2022.
Bach, Francis. 2008. In Proceedings of the 21st International Conference on Neural Information Processing Systems. NIPS’08.
Bach, Francis R. 2013. In COLT.
Bach, Francis. 2015.
Backurs, Indyk, and Schmidt. 2017. arXiv:1704.02958 [Cs, Stat].
Bakır, Zien, and Tsuda. 2004. In Pattern Recognition. Lecture Notes in Computer Science 3175.
Balog, Lakshminarayanan, Ghahramani, et al. 2016. arXiv:1606.05241 [Stat].
Ben-Hur, Ong, Sonnenburg, et al. 2008. PLoS Comput Biol.
Bosq, and Blanke. 2007. Inference and prediction in large dimensions. Wiley series in probability and statistics.
Boyer, Chambolle, De Castro, et al. 2018. arXiv:1806.09810 [Cs, Math].
Brown, and Lin. 2004. The Annals of Statistics.
Burges. 1998. In Advances in Kernel Methods - Support Vector Learning.
Canu, and Smola. 2006. Neurocomputing.
Carrasco, Oncina, and Calera-Rubio. 2001. Machine Learning.
Cawley, and Talbot. 2005. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005.
Chatfield, Lempitsky, Vedaldi, et al. 2011.
Cheney, and Light. 2009. A Course in Approximation Theory.
Choromanski, and Sindhwani. 2016. arXiv:1605.09049 [Cs, Stat].
Chwialkowski, Strathmann, and Gretton. 2016. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48. ICML’16.
Clark, Florêncio, and Watkins. 2006. In Machine Learning: ECML 2006. Lecture Notes in Computer Science 4212.
Clark, Florêncio, Watkins, et al. 2006. In Grammatical Inference: Algorithms and Applications. Lecture Notes in Computer Science 4201.
Clark, and Watkins. 2008. Fundamenta Informaticae.
Collins, and Duffy. 2002. In Advances in Neural Information Processing Systems 14.
Cortes, Haffner, and Mohri. 2004. Journal of Machine Learning Research.
Cucker, and Smale. 2002. Bulletin of the American Mathematical Society.
Cunningham, Shenoy, and Sahani. 2008. In Proceedings of the 25th International Conference on Machine Learning. ICML ’08.
Curtain. 1975. SIAM Journal on Control.
Danafar, Fukumizu, and Gomez. 2014. arXiv:1408.5810 [Stat].
Devroye, Györfi, and Lugosi. 1996. A Probabilistic Theory of Pattern Recognition.
Domingos. 2020. arXiv:2012.00152 [Cs, Stat].
Drineas, and Mahoney. 2005. Journal of Machine Learning Research.
Duttweiler, and Kailath. 1973a. IEEE Transactions on Information Theory.
———. 1973b. IEEE Transactions on Information Theory.
Duvenaud, Lloyd, Grosse, et al. 2013. In Proceedings of the 30th International Conference on Machine Learning (ICML-13).
Evgeniou, Micchelli, and Pontil. 2005. Journal of Machine Learning Research.
Feragen, and Hauberg. 2016. In Conference on Learning Theory.
FitzGerald, Liukus, Rafii, et al. 2013. In Irish Signals & Systems Conference 2014 and 2014 China-Ireland International Conference on Information and Communications Technologies (ISSC 2014/CIICT 2014). 25th IET.
Flaxman, Teh, and Sejdinovic. 2016. arXiv:1610.08623 [Stat].
Friedlander, Kailath, and Ljung. 1975. In 1975 IEEE Conference on Decision and Control Including the 14th Symposium on Adaptive Processes.
Genton. 2001. Journal of Machine Learning Research.
Gevers, and Kailath. 1973. IEEE Transactions on Automatic Control.
Ghojogh, Ghodsi, Karray, et al. 2021.
Globerson, and Livni. 2016. arXiv:1606.05316 [Cs].
Gorham, Raj, and Mackey. 2020. arXiv:2007.02857 [Cs, Math, Stat].
Gori, and Martínez-Herrero. 2021. JOSA A.
Gottwald, and Reich. 2020. arXiv:2007.07383 [Physics, Stat].
Grauman, and Darrell. 2005. In Tenth IEEE International Conference on Computer Vision, 2005. ICCV 2005.
Greengard, and Strain. 1991. SIAM Journal on Scientific and Statistical Computing.
Gretton. 2019.
Gretton, Borgwardt, Rasch, et al. 2012. The Journal of Machine Learning Research.
Gretton, Fukumizu, Teo, et al. 2008. In Advances in Neural Information Processing Systems 20: Proceedings of the 2007 Conference.
Grosse, Salakhutdinov, Freeman, et al. 2012. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Györfi, ed. 2002. A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics.
Haussler. 1999.
Heinonen, and d’Alché-Buc. 2014. arXiv:1411.5172 [Cs, Stat].
Hofmann, Schölkopf, and Smola. 2008. The Annals of Statistics.
Ishikawa, Fujii, Ikeda, et al. 2018. arXiv:1805.12324 [Cs, Math, Stat].
Jain. 2009. “Structure Spaces.” Journal of Machine Learning Research.
Jung. 2013. In Advances in Neural Information Processing Systems 29.
Kailath, Thomas. 1971. “The Structure of Radon-Nikodym Derivatives with Respect to Wiener and Related Measures.” The Annals of Mathematical Statistics.
Kailath, T. 1971a. IEEE Transactions on Information Theory.
———. 1971b. In 1971 IEEE Conference on Decision and Control.
———. 1974. IEEE Transactions on Information Theory.
Kailath, T., and Duttweiler. 1972. IEEE Transactions on Information Theory.
Kailath, T., and Geesey. 1971. IEEE Transactions on Automatic Control.
———. 1973. IEEE Transactions on Automatic Control.
Kailath, T., Geesey, and Weinert. 1972. IEEE Transactions on Information Theory.
Kailath, T., and Weinert. 1975. IEEE Transactions on Information Theory.
Kanagawa, and Fukumizu. 2014. In Journal of Machine Learning Research.
Kanagawa, Hennig, Sejdinovic, et al. 2018. arXiv:1807.02582 [Cs, Stat].
Katharopoulos, Vyas, Pappas, et al. 2020. arXiv:2006.16236 [Cs, Stat].
Kemerait, and Childers. 1972. IEEE Transactions on Information Theory.
Keriven, Bourrier, Gribonval, et al. 2016. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Khintchine. 1934. Mathematische Annalen.
Kimeldorf, and Wahba. 1970. The Annals of Mathematical Statistics.
Kiraly, and Oberhauser. 2019. Journal of Machine Learning Research.
Kloft, Rückert, and Bartlett. 2010. In Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science.
Klus, Bittracher, Schuster, et al. 2018. The Journal of Chemical Physics.
Kontorovich, Leonid, Cortes, and Mohri. 2006. In Algorithmic Learning Theory. Lecture Notes in Computer Science 4264.
Kontorovich, Leonid (Aryeh), Cortes, and Mohri. 2008. Theoretical Computer Science, Algorithmic Learning Theory,.
Koppel, Warnell, Stump, et al. 2016. arXiv:1612.04111 [Cs, Stat].
Krauth, Bonilla, Cutajar, et al. 2016. In UAI17.
Kulis, and Grauman. 2012. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Lawrence, Seeger, and Herbrich. 2003. In Proceedings of the 16th Annual Conference on Neural Information Processing Systems.
Ley, Reinert, and Swan. 2017. Probability Surveys.
Liu, Qiang, Lee, and Jordan. 2016. In Proceedings of The 33rd International Conference on Machine Learning.
Liutkus, Rafii, Pardo, et al. 2014.
Liu, Xi, Zhan, and Niu. 2021. Mathematical Problems in Engineering.
Ljung, and Kailath. 1976. IEEE Transactions on Information Theory.
Ljung, Kailath, and Friedlander. 1975. In 1975 IEEE Conference on Decision and Control Including the 14th Symposium on Adaptive Processes.
Lloyd, Duvenaud, Grosse, et al. 2014. In Twenty-Eighth AAAI Conference on Artificial Intelligence.
Lodhi, Saunders, Shawe-Taylor, et al. 2002. Journal of Machine Learning Research.
Lopez-Paz, Nishihara, Chintala, et al. 2016. arXiv:1605.08179 [Cs, Stat].
Lu, Leen, Huang, et al. 2008. In Proceedings of the 25th International Conference on Machine Learning. ICML ’08.
Ma, Siyuan, and Belkin. 2017. arXiv:1703.10622 [Cs, Stat].
Ma, Wan-Duo Kurt, Lewis, and Kleijn. 2020. Proceedings of the AAAI Conference on Artificial Intelligence.
Manton, and Amblard. 2015. Foundations and Trends® in Signal Processing.
Marteau-Ferey, Bach, and Rudi. 2019. In Advances in Neural Information Processing Systems.
———. 2020. In Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS ’20.
McFee, and Ellis. 2011. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Meanti, Carratino, Rosasco, et al. 2020. In Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS’20.
Meidan. 1980. Journal of Mathematical Analysis and Applications.
Mercer. 1909. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character.
Micchelli, and Pontil. 2005a. Journal of Machine Learning Research.
———. 2005b. Neural Computation.
Minh. 2022. SIAM/ASA Journal on Uncertainty Quantification.
Muandet, Fukumizu, Sriperumbudur, et al. 2014. arXiv:1405.5505 [Cs, Stat].
Muandet, Fukumizu, Sriperumbudur, et al. 2017. Foundations and Trends® in Machine Learning.
Muller, Mika, Ratsch, et al. 2001. IEEE Transactions on Neural Networks.
Nishiyama, and Fukumizu. 2016. The Journal of Machine Learning Research.
Noack, Luo, and Risser. 2023.
Parzen, Emanuel. 1959. TR23.
Parzen, Emanuel. 1962. Journal of the Society for Industrial and Applied Mathematics Series A Control.
Parzen, Emanuel. 1963. “Probability Density Functionals and Reproducing Kernel Hilbert Spaces.” In Proceedings of the Symposium on Time Series Analysis.
Pillonetto. 2016. arXiv:1612.09158 [Cs, Stat].
Poggio, and Girosi. 1990. Proceedings of the IEEE.
Rahimi, and Recht. 2007. In Advances in Neural Information Processing Systems.
———. 2009. In Advances in Neural Information Processing Systems.
Ramdas, and Wehbe. 2014. arXiv:1406.1922 [Stat].
Raykar, and Duraiswami. 2005.
Rudi, Carratino, and Rosasco. 2017. In Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17.
Rue, and Held. 2005. Gaussian Markov Random Fields: Theory and Applications. Monographs on Statistics and Applied Probability 104.
Sachdeva, Dhaliwal, Wu, et al. 2022.
Saha, and Balamurugan. 2020. In Advances in Neural Information Processing Systems.
Salvi, Cass, Foster, et al. 2021. SIAM Journal on Mathematics of Data Science.
Salvi, Lemercier, Liu, et al. 2024. In Advances in Neural Information Processing Systems. NIPS ’21.
Särkkä. 2011. In Artificial Neural Networks and Machine Learning – ICANN 2011. Lecture Notes in Computer Science.
Schaback, and Wendland. 2006. Acta Numerica.
Schlegel. 2018. arXiv:1809.10284 [Cs, Math, Stat].
Schölkopf, Herbrich, and Smola. 2001. In Computational Learning Theory. Lecture Notes in Computer Science.
Schölkopf, Knirsch, Smola, et al. 1998. In Mustererkennung 1998. Informatik Aktuell.
Schölkopf, Mika, Burges, et al. 1999. “Input Space Versus Feature Space in Kernel-Based Methods.” IEEE Transactions on Neural Networks.
Schölkopf, Muandet, Fukumizu, et al. 2015. arXiv:1501.06794 [Cs, Stat].
Schölkopf, and Smola. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond.
———. 2003. In Advanced Lectures on Machine Learning. Lecture Notes in Computer Science 2600.
Schölkopf, Smola, and Müller. 1997. In Artificial Neural Networks — ICANN’97. Lecture Notes in Computer Science.
Schuster, Mollenhauer, Klus, et al. 2019. arXiv:1905.11255 [Cs, Math, Stat].
Schuster, Strathmann, Paige, et al. 2017. In ECML-PKDD 2017.
Segall, Davis, and Kailath. 1975. IEEE Transactions on Information Theory.
Segall, and Kailath. 1976. IEEE Transactions on Information Theory.
Shen, Baingana, and Giannakis. 2016. arXiv:1610.06551 [Stat].
Smola, A. J., and Schölkopf. 1998. Algorithmica.
Smola, Alex J., and Schölkopf. 2000.
———. 2004. Statistics and Computing.
Smola, Alex J., Schölkopf, and Müller. 1998. Neural Networks.
Snelson, and Ghahramani. 2005. In Advances in Neural Information Processing Systems.
Solin, and Särkkä. 2020. Statistics and Computing.
Song, Fukumizu, and Gretton. 2013. IEEE Signal Processing Magazine.
Song, Gretton, Bickson, et al. 2011. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.
Sriperumbudur, Gretton, Fukumizu, et al. 2008. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008).
Steinwart. 2020. arXiv:2002.03171 [Cs, Math].
Székely, and Rizzo. 2009. The Annals of Applied Statistics.
Székely, Rizzo, and Bakirov. 2007. The Annals of Statistics.
Tipping, and Nh. 2001. In Advances in Neural Information Processing Systems 13.
Tompkins, and Ramos. 2018. Proceedings of the AAAI Conference on Artificial Intelligence.
Tsuchida, Ong, and Sejdinovic. 2023.
Vedaldi, and Zisserman. 2012. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Vert, Tsuda, and Schölkopf. 2004. In Kernel Methods in Computational Biology.
Vishwanathan, Schraudolph, Kondor, et al. 2010. Journal of Machine Learning Research.
Walder, Christian, Kim, and Schölkopf. 2008. In Proceedings of the 25th International Conference on Machine Learning. ICML ’08.
Walder, C., Schölkopf, and Chapelle. 2006. Computer Graphics Forum.
Wang, Smola, and Tibshirani. 2014. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32. ICML’14.
Weinert, Howard L. 1978. Communications in Statistics - Simulation and Computation.
Weinert, Howard L., and Kailath. 1974. The Annals of Statistics.
Weinert, H., and Sidhu. 1978. IEEE Transactions on Information Theory.
Williams. 2001. In Advances in Neural Information Processing Systems 13.
Wilson, and Adams. 2013. In International Conference on Machine Learning.
Wilson, Dann, Lucas, et al. 2015. arXiv:1510.07389 [Cs, Stat].
Wu, and Zhou. 2008. Computers & Mathematics with Applications.
Xu, Wenkai, and Matsuda. 2020. In International Conference on Artificial Intelligence and Statistics.
———. 2021. arXiv:2103.00895 [Stat].
Xu, Jian-Wu, Paiva, Park, et al. 2008. IEEE Transactions on Signal Processing.
Xu, Wenkai, and Reinert. 2021. arXiv:2103.00580 [Stat].
Yaglom. 1987. Correlation Theory of Stationary and Related Random Functions. Volume II: Supplementary Notes and References. Springer Series in Statistics.
Yang, Changjiang, Duraiswami, and Davis. 2004. In Advances in Neural Information Processing Systems.
Yang, Changjiang, Duraiswami, Gumerov, et al. 2003. In Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2. ICCV ’03.
Yang, Tianbao, Li, Mahdavi, et al. 2012. In Advances in Neural Information Processing Systems.
Yang, Jiyan, Sindhwani, Avron, et al. 2014. arXiv:1412.8293 [Cs, Math, Stat].
Yu, Cheng, Schuurmans, et al. 2013. In Proceedings of the 30th International Conference on Machine Learning (ICML-13).
Zhang, Qinyi, Filippi, Gretton, et al. 2016. arXiv:1606.07892 [Stat].
Zhang, Kun, Peters, Janzing, et al. 2012. arXiv:1202.3775 [Cs, Stat].
Zhou, Zha, and Song. 2013. In Proceedings of the 30th International Conference on Machine Learning (ICML-13).