(Reproducing) kernel tricks

2014-08-18 — 2023-07-20



WARNING: This is very old. If I were to write it now, I would write it differently, and specifically more pedagogically.

Kernel in the sense of the “kernel trick”. Not to be confused with smoothing-type convolution kernels, nor the dozens of related-but-slightly-different clashing definitions of kernel; those can have their own respective pages. Corollary: If you do not know what to name something, call it a kernel.

We are concerned with a particular flavour of kernel in Hilbert spaces, specifically reproducing or Mercer kernels (Mercer 1909). The associated function space is a reproducing kernel Hilbert space, hereafter RKHS.

Kernel tricks comprise the application of Mercer kernels in machine learning. The “trick” part is that many machine learning algorithms operate on inner products, or can be rewritten to work that way. Such algorithms permit one to swap out the boring classic Euclidean definition of that inner product in favour of a fancy RKHS one. The classic machine learning pitch for trying such a stunt is something like “upgrade your old boring linear algebra on finite (usually low-) dimensional spaces to sexy algebra on potentially-infinite-dimensional feature spaces, which still has a low-dimensional representation.” Or, if you’d like, “apply certain statistical learning methods based on things with an obvious finite vector space representation ($\mathbb{R}^{n}$) to things without one (sentences, piano-rolls, $C_{ℓ}^{d}$).”
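
To make the “swap the inner product” idea concrete, here is a minimal sketch using the homogeneous degree-2 polynomial kernel on $\mathbb{R}^{2}$, which is my choice of illustration rather than anything special. The names `phi` and `poly2_kernel` are hypothetical. The point is only that evaluating the kernel in input space gives the same number as an inner product in an explicit feature space, without ever constructing the features.

```python
import numpy as np

def phi(x):
    """Explicit feature map for k(x, y) = (x . y)^2 on R^2 (illustrative choice)."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly2_kernel(x, y):
    """The same inner product, computed without building the features."""
    return float(np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(y)))  # inner product in feature space
implicit = poly2_kernel(x, y)             # kernel evaluation in input space
print(explicit, implicit)
```

For this kernel the feature space is only 3-dimensional, so nothing is saved. For kernels such as the Gaussian one the feature space is infinite-dimensional and the kernel evaluation is the only practical route.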

Mini history: The oft-cited origins of all the reproducing kernel stuff are (Aronszajn 1950; Mercer 1909). It took a while to percolate into random function theory (Khintchine 1934; Yaglom 1987b) as covariance functions. Thence the idea arrived in statistical inference (Parzen 1959, 1962, 1963) and signal processing (Aasnaes and Kailath 1973; Duttweiler and Kailath 1973a, 1973b; Gevers and Kailath 1973; T. Kailath and Geesey 1971, 1973; T. Kailath 1971b, 1971a, 1974; T. Kailath, Geesey, and Weinert 1972; T. Kailath and Duttweiler 1972; T. Kailath and Weinert 1975), and now it is ubiquitous.

Practically, kernel methods have problems with scalability to large data sets. To apply such a method naively you need to keep a full Gram matrix of inner products between every pair of data points, which means that for $N$ data points you need to know the $N (N + 1) / 2$ distinct entries of a symmetric matrix (diagonal included). If you need to invert that matrix the cost is $O (N^{3})$ in general, which means you need fancy tricks to handle large $N$. Fancy tricks depend on what the actual model is, but include sparse GPs, random-projection inversions, Markov approximations and presumably many more.
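
A minimal sketch of where those costs come from, using naive kernel ridge regression with a Gaussian kernel in NumPy. The function names, the toy data and the hyperparameter values are all assumptions for illustration. The `N x N` Gram matrix is the memory bottleneck and the dense linear solve is the cubic-time bottleneck.

```python
import numpy as np

def rbf_gram(X, Z, lengthscale=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Z."""
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def krr_fit(X, y, lam=1e-2, lengthscale=1.0):
    """Naive kernel ridge regression: O(N^2) memory, O(N^3) time."""
    K = rbf_gram(X, X, lengthscale)                      # full N x N Gram matrix
    return np.linalg.solve(K + lam * np.eye(len(X)), y)  # dense solve

def krr_predict(X_train, alpha, X_new, lengthscale=1.0):
    return rbf_gram(X_new, X_train, lengthscale) @ alpha

# Toy data (hypothetical): noisy observations of a sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

alpha = krr_fit(X, y)
X_test = np.linspace(-2.5, 2.5, 50)[:, None]
pred = krr_predict(X, alpha, X_test)
max_err = np.max(np.abs(pred - np.sin(X_test[:, 0])))
print(max_err)
```

At `N = 200` this is instant. At `N = 10**6` the Gram matrix alone would need terabytes in double precision, which is the motivation for the approximations mentioned above.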

I’m especially interested in the application of such tricks in

kernel regression
wide random NNs
Nonparametric kernel independence tests
~~Efficient kernel pre-image approximation~~
~~Connection between kernel PCA and clustering (Schölkopf et al. 1998; Williams 2001)~~

Turns out not all those applications are interesting to me.

1 Introductions

There are many primers on Mercer kernels and their connection to ML. Kenneth Tay’s intro is punchy. Il Shan Ng’s Reproducing Kernel Hilbert Spaces & Machine Learning is good. See (Schölkopf and Smola 2002), which grinds out many connections with learning theory, or (Manton and Amblard 2015), which is more narrowly focused on just the Mercer-kernel part and the topological and geometric properties of the spaces. See also (Ghojogh et al. 2021; Gori and Martínez-Herrero 2021; Gretton 2019). Cheney and Light (2009) give an approximation-theory perspective which does not especially concern itself with stochastic processes. I also seem to have bookmarked the following introductions (Vert, Tsuda, and Schölkopf 2004; Schölkopf et al. 1999; Schölkopf, Herbrich, and Smola 2001; Muller et al. 2001; Schölkopf and Smola 2003).

Alex Smola, who (with Bernhard Schölkopf) has his name on an intimidating proportion of publications in this area, also has all his publications online.

2 Kernel approximation

See kernel approximation.

3 RKHS distribution embedding

See integral probability metrics.

4 Specific kernels

See covariance functions.

5 Non-scalar-valued “kernels”

Extending the usual inner-product framing, operator-valued kernels (Micchelli and Pontil 2005a; Evgeniou, Micchelli, and Pontil 2005; Álvarez, Rosasco, and Lawrence 2012) generalize to $k : X \times X \to L (H_{Y})$, where $L (H_{Y})$ is the space of bounded linear operators on an output Hilbert space $H_{Y}$, as seen in multi-task learning.
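
As a rough illustration, assume the simplest case I know of: a finite-dimensional output space $H_{Y} = \mathbb{R}^{T}$ and a so-called separable kernel $K (x, x') = k (x, x') B$, with $k$ a scalar kernel and $B$ a positive semi-definite $T \times T$ matrix coupling the tasks. This is only one family among many, and the names below (`scalar_gram`, `separable_gram`, the random `B`) are hypothetical. The Gram matrix over $N$ inputs is then an $NT \times NT$ block matrix, which is a Kronecker product.

```python
import numpy as np

def scalar_gram(X, lengthscale=1.0):
    """Scalar Gaussian kernel matrix k(x_i, x_j) over the rows of X."""
    sq = (X**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2.0 * X @ X.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def separable_gram(X, B, lengthscale=1.0):
    """Block Gram matrix for K(x, x') = k(x, x') * B; block (i, j) is k(x_i, x_j) * B."""
    return np.kron(scalar_gram(X, lengthscale), B)

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 2))   # N = 5 inputs in R^2
A = rng.standard_normal((3, 3))
B = A @ A.T                       # PSD task-coupling matrix, T = 3 tasks

G = separable_gram(X, B)          # shape (N * T, N * T) = (15, 15)
print(G.shape, np.linalg.eigvalsh(G).min())
```

The Kronecker product of two positive semi-definite matrices is positive semi-definite, which is why `G` is a valid Gram matrix up to floating-point error.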

6 Tools

6.1 KeOps

File under least squares, autodiff, gps, pytorch.

The KeOps library lets you compute reductions of large arrays whose entries are given by a mathematical formula or a neural network. It combines efficient C++ routines with an automatic differentiation engine and can be used with Python (NumPy, PyTorch), Matlab and R.

It is perfectly suited to the computation of kernel matrix-vector products, K-nearest neighbours queries, N-body interactions, point cloud convolutions and the associated gradients. Crucially, it performs well even when the corresponding kernel or distance matrices do not fit into the RAM or GPU memory. Compared with a PyTorch GPU baseline, KeOps provides a x10-x100 speed-up on a wide range of geometric applications, from kernel methods to geometric deep learning.
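
To show what kind of computation is being described, here is a plain NumPy sketch of a Gaussian kernel matrix-vector product that never materializes the full kernel matrix. It is not the KeOps API and makes no attempt at matching its performance. It only illustrates the idea of reducing over one index in blocks. The function name, the `chunk` parameter and the kernel choice are my assumptions.

```python
import numpy as np

def gaussian_kernel_matvec(X, Y, b, lengthscale=1.0, chunk=1024):
    """Compute a_i = sum_j exp(-|x_i - y_j|^2 / (2 l^2)) b_j in row blocks.

    Only a (chunk x M) slice of the kernel matrix exists at any one time.
    """
    out = np.empty((X.shape[0],) + b.shape[1:])
    y_sq = (Y**2).sum(1)
    for start in range(0, X.shape[0], chunk):
        Xc = X[start:start + chunk]
        sq = (Xc**2).sum(1)[:, None] + y_sq[None, :] - 2.0 * Xc @ Y.T
        Kc = np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)
        out[start:start + chunk] = Kc @ b
    return out

rng = np.random.default_rng(2)
X = rng.standard_normal((300, 3))
Y = rng.standard_normal((200, 3))
b = rng.standard_normal(200)

a_chunked = gaussian_kernel_matvec(X, Y, b, chunk=64)

# Dense reference, feasible only because this example is tiny.
sq_full = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
a_dense = np.exp(-0.5 * sq_full) @ b
print(np.max(np.abs(a_chunked - a_dense)))
```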

6.2 Falkon

Falkon (Meanti et al. 2020)

A Python library for large-scale kernel methods, with optional (multi-)GPU acceleration.

The library currently includes two solvers: one for approximate kernel ridge regression (Rudi, Carratino, and Rosasco 2017), which is extremely fast, and one for kernel logistic regression (Marteau-Ferey, Bach, and Rudi 2019), which trades off lower speed for better accuracy on binary classification problems.

The main features of Falkon are:

Full multi-GPU support - All compute-intensive parts of the algorithms are multi-GPU capable.

Extreme scalability - Unlike other kernel solvers, we keep memory usage in check. We have tested the library with datasets of billions of points.

Sparse data support

Scikit-learn integration - Our estimators follow the scikit-learn API
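
For intuition about “approximate kernel ridge regression”, here is a sketch of the Nyström-subsampled estimator in NumPy. As far as I understand it, the FALKON solver combines Nyström subsampling with a preconditioned iterative solver. The code below is not Falkon's implementation or API: it solves the small system directly, and every name, the toy data and the hyperparameters are assumptions.

```python
import numpy as np

def rbf(X, Z, lengthscale=1.0):
    """Gaussian kernel matrix between the rows of X and the rows of Z."""
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def nystrom_krr_fit(X, y, m=40, lam=1e-4, lengthscale=0.7, seed=0):
    """Kernel ridge regression restricted to m randomly chosen centres.

    Solves (K_nm^T K_nm + lam * n * K_mm) alpha = K_nm^T y, an m x m system,
    so the cost is O(n m^2) time and O(n m) memory rather than O(n^3), O(n^2).
    """
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=m, replace=False)]
    K_nm = rbf(X, centres, lengthscale)
    K_mm = rbf(centres, centres, lengthscale)
    A = K_nm.T @ K_nm + lam * len(X) * K_mm
    # lstsq rather than solve: A is often badly conditioned for smooth kernels.
    alpha = np.linalg.lstsq(A, K_nm.T @ y, rcond=None)[0]
    return centres, alpha

def nystrom_krr_predict(centres, alpha, X_new, lengthscale=0.7):
    return rbf(X_new, centres, lengthscale) @ alpha

# Toy data (hypothetical): noisy sine curve, n = 2000 points, m = 40 centres.
rng = np.random.default_rng(3)
X = rng.uniform(-3.0, 3.0, size=(2000, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(2000)

centres, alpha = nystrom_krr_fit(X, y)
X_test = np.linspace(-2.5, 2.5, 50)[:, None]
pred = nystrom_krr_predict(centres, alpha, X_test)
max_err = np.max(np.abs(pred - np.sin(X_test[:, 0])))
print(max_err)
```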

7 References

Aasnaes, and Kailath. 1973. “An Innovations Approach to Least-Squares Estimation–Part VII: Some Applications of Vector Autoregressive-Moving Average Models.” IEEE Transactions on Automatic Control.

Agarwal, and Daumé III. 2011. “Generative Kernels for Exponential Families.” In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.

Agrawal, and Broderick. 2021. “The SKIM-FA Kernel: High-Dimensional Variable Selection and Nonlinear Interaction Discovery in Linear Time.” arXiv:2106.12408 [Stat].

Agrawal, Trippe, Huggins, et al. 2019. “The Kernel Interaction Trick: Fast Bayesian Discovery of Pairwise Interactions in High Dimensions.” In Proceedings of the 36th International Conference on Machine Learning.

Alaoui, and Mahoney. 2014. “Fast Randomized Kernel Methods With Statistical Guarantees.” arXiv:1411.0306 [Cs, Stat].

Altun, Smola, and Hofmann. 2004. “Exponential Families for Conditional Random Fields.” In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. UAI ’04.

Álvarez, Rosasco, and Lawrence. 2012. “Kernels for Vector-Valued Functions: A Review.” Foundations and Trends® in Machine Learning.

Arbel, Korba, Salim, et al. 2019. “Maximum Mean Discrepancy Gradient Flow.” In Proceedings of the 33rd International Conference on Neural Information Processing Systems.

Aronszajn. 1950. “Theory of Reproducing Kernels.” Transactions of the American Mathematical Society.

Azangulov, Smolensky, Terenin, et al. 2022. “Stationary Kernels and Gaussian Processes on Lie Groups and Their Homogeneous Spaces I: The Compact Case.”

Bach, Francis. 2008. “Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning.” In Proceedings of the 21st International Conference on Neural Information Processing Systems. NIPS’08.

Bach, Francis R. 2013. “Sharp Analysis of Low-Rank Kernel Matrix Approximations.” In COLT.

Bach, Francis. 2015. “On the Equivalence Between Kernel Quadrature Rules and Random Feature Expansions.”

Backurs, Indyk, and Schmidt. 2017. “On the Fine-Grained Complexity of Empirical Risk Minimization: Kernel Methods and Neural Networks.” arXiv:1704.02958 [Cs, Stat].

Bakır, Zien, and Tsuda. 2004. “Learning to Find Graph Pre-Images.” In Pattern Recognition. Lecture Notes in Computer Science 3175.

Balog, Lakshminarayanan, Ghahramani, et al. 2016. “The Mondrian Kernel.” arXiv:1606.05241 [Stat].

Ben-Hur, Ong, Sonnenburg, et al. 2008. “Support Vector Machines and Kernels for Computational Biology.” PLoS Comput Biol.

Bosq, and Blanke. 2007. Inference and prediction in large dimensions. Wiley series in probability and statistics.

Boyer, Chambolle, De Castro, et al. 2018. “On Representer Theorems and Convex Regularization.” arXiv:1806.09810 [Cs, Math].

Brown, and Lin. 2004. “Statistical Properties of the Method of Regularization with Periodic Gaussian Reproducing Kernel.” The Annals of Statistics.

Burges. 1998. “Geometry and Invariance in Kernel Based Methods.” In Advances in Kernel Methods - Support Vector Learning.

Canu, and Smola. 2006. “Kernel Methods and the Exponential Family.” Neurocomputing.

Carrasco, Oncina, and Calera-Rubio. 2001. “Stochastic Inference of Regular Tree Languages.” Machine Learning.

Cawley, and Talbot. 2005. “A Simple Trick for Constructing Bayesian Formulations of Sparse Kernel Learning Methods.” In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005.

Chatfield, Lempitsky, Vedaldi, et al. 2011. “The Devil Is in the Details: An Evaluation of Recent Feature Encoding Methods.”

Cheney, and Light. 2009. A Course in Approximation Theory.

Choromanski, and Sindhwani. 2016. “Recycling Randomness with Structure for Sublinear Time Kernel Expansions.” arXiv:1605.09049 [Cs, Stat].

Chwialkowski, Strathmann, and Gretton. 2016. “A Kernel Test of Goodness of Fit.” In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48. ICML’16.

Clark, Florêncio, and Watkins. 2006. “Languages as Hyperplanes: Grammatical Inference with String Kernels.” In Machine Learning: ECML 2006. Lecture Notes in Computer Science 4212.

Clark, Florêncio, Watkins, et al. 2006. “Planar Languages and Learnability.” In Grammatical Inference: Algorithms and Applications. Lecture Notes in Computer Science 4201.

Clark, and Watkins. 2008. “Some Alternatives to Parikh Matrices Using String Kernels.” Fundamenta Informaticae.

Collins, and Duffy. 2002. “Convolution Kernels for Natural Language.” In Advances in Neural Information Processing Systems 14.

Cortes, Haffner, and Mohri. 2004. “Rational Kernels: Theory and Algorithms.” Journal of Machine Learning Research.

Cucker, and Smale. 2002. “On the Mathematical Foundations of Learning.” Bulletin of the American Mathematical Society.

Cunningham, Shenoy, and Sahani. 2008. “Fast Gaussian Process Methods for Point Process Intensity Estimation.” In Proceedings of the 25th International Conference on Machine Learning. ICML ’08.

Curtain. 1975. “Infinite-Dimensional Filtering.” SIAM Journal on Control.

Danafar, Fukumizu, and Gomez. 2014. “Kernel-Based Information Criterion.” arXiv:1408.5810 [Stat].

Devroye, Györfi, and Lugosi. 1996. A Probabilistic Theory of Pattern Recognition.

Domingos. 2020. “Every Model Learned by Gradient Descent Is Approximately a Kernel Machine.” arXiv:2012.00152 [Cs, Stat].

Drineas, and Mahoney. 2005. “On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning.” Journal of Machine Learning Research.

Duttweiler, and Kailath. 1973a. “RKHS Approach to Detection and Estimation Problems–IV: Non-Gaussian Detection.” IEEE Transactions on Information Theory.

———. 1973b. “RKHS Approach to Detection and Estimation Problems–V: Parameter Estimation.” IEEE Transactions on Information Theory.

Duvenaud, Lloyd, Grosse, et al. 2013. “Structure Discovery in Nonparametric Regression Through Compositional Kernel Search.” In Proceedings of the 30th International Conference on Machine Learning (ICML-13).

Evgeniou, Micchelli, and Pontil. 2005. “Learning Multiple Tasks with Kernel Methods.” Journal of Machine Learning Research.

Feragen, and Hauberg. 2016. “Open Problem: Kernel Methods on Manifolds and Metric Spaces. What Is the Probability of a Positive Definite Geodesic Exponential Kernel?” In Conference on Learning Theory.

FitzGerald, Liutkus, Rafii, et al. 2013. “Harmonic/Percussive Separation Using Kernel Additive Modelling.” In Irish Signals & Systems Conference 2014 and 2014 China-Ireland International Conference on Information and Communications Technologies (ISSC 2014/CIICT 2014). 25th IET.

Flaxman, Teh, and Sejdinovic. 2016. “Poisson Intensity Estimation with Reproducing Kernels.” arXiv:1610.08623 [Stat].

Friedlander, Kailath, and Ljung. 1975. “Scattering Theory and Linear Least Squares Estimation: Part II: Discrete-Time Problems.” In 1975 IEEE Conference on Decision and Control Including the 14th Symposium on Adaptive Processes.

Genton. 2001. “Classes of Kernels for Machine Learning: A Statistics Perspective.” Journal of Machine Learning Research.

Gevers, and Kailath. 1973. “An Innovations Approach to Least-Squares Estimation–Part VI: Discrete-Time Innovations Representations and Recursive Estimation.” IEEE Transactions on Automatic Control.

Ghojogh, Ghodsi, Karray, et al. 2021. “Reproducing Kernel Hilbert Space, Mercer’s Theorem, Eigenfunctions, Nyström Method, and Use of Kernels in Machine Learning: Tutorial and Survey.”

Globerson, and Livni. 2016. “Learning Infinite-Layer Networks: Beyond the Kernel Trick.” arXiv:1606.05316 [Cs].

Gorham, Raj, and Mackey. 2020. “Stochastic Stein Discrepancies.” arXiv:2007.02857 [Cs, Math, Stat].

Gori, and Martínez-Herrero. 2021. “Reproducing Kernel Hilbert Spaces for Wave Optics: Tutorial.” JOSA A.

Gottwald, and Reich. 2020. “Supervised Learning from Noisy Observations: Combining Machine-Learning Techniques with Data Assimilation.” arXiv:2007.07383 [Physics, Stat].

Grauman, and Darrell. 2005. “The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features.” In Tenth IEEE International Conference on Computer Vision, 2005. ICCV 2005.

Greengard, and Strain. 1991. “The Fast Gauss Transform.” SIAM Journal on Scientific and Statistical Computing.

Gretton. 2019. “Introduction to RKHS, and Some Simple Kernel Algorithms.”

Gretton, Borgwardt, Rasch, et al. 2012. “A Kernel Two-Sample Test.” The Journal of Machine Learning Research.

Gretton, Fukumizu, Teo, et al. 2008. “A Kernel Statistical Test of Independence.” In Advances in Neural Information Processing Systems 20: Proceedings of the 2007 Conference.

Grosse, Salakhutdinov, Freeman, et al. 2012. “Exploiting Compositionality to Explore a Large Space of Model Structures.” In Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Grünewälder, Gretton, and Shawe-Taylor. 2013. “Smooth Operators.” In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28. ICML’13.

Györfi, ed. 2002. A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics.

Haussler. 1999. “Convolution Kernels on Discrete Structures.”

Heinonen, and d’Alché-Buc. 2014. “Learning Nonparametric Differential Equations with Operator-Valued Kernels and Gradient Matching.” arXiv:1411.5172 [Cs, Stat].

Hofmann, Schölkopf, and Smola. 2008. “Kernel Methods in Machine Learning.” The Annals of Statistics.

Ishikawa, Fujii, Ikeda, et al. 2018. “Metric on Nonlinear Dynamical Systems with Perron-Frobenius Operators.” arXiv:1805.12324 [Cs, Math, Stat].

Jain. 2009. “Structure Spaces.” Journal of Machine Learning Research.

Jung. 2013. “An RKHS Approach to Estimation with Sparsity Constraints.” In Advances in Neural Information Processing Systems 29.

Kailath, Thomas. 1971. “The Structure of Radon-Nikodym Derivatives with Respect to Wiener and Related Measures.” The Annals of Mathematical Statistics.

Kailath, T. 1971a. “RKHS Approach to Detection and Estimation Problems–I: Deterministic Signals in Gaussian Noise.” IEEE Transactions on Information Theory.

———. 1971b. “A Note on Least-Squares Estimation by the Innovations Method.” In 1971 IEEE Conference on Decision and Control.

———. 1974. “A View of Three Decades of Linear Filtering Theory.” IEEE Transactions on Information Theory.

Kailath, T., and Duttweiler. 1972. “An RKHS Approach to Detection and Estimation Problems– III: Generalized Innovations Representations and a Likelihood-Ratio Formula.” IEEE Transactions on Information Theory.

Kailath, T., and Geesey. 1971. “An Innovations Approach to Least Squares Estimation–Part IV: Recursive Estimation Given Lumped Covariance Functions.” IEEE Transactions on Automatic Control.

———. 1973. “An Innovations Approach to Least-Squares Estimation–Part V: Innovations Representations and Recursive Estimation in Colored Noise.” IEEE Transactions on Automatic Control.

Kailath, T., Geesey, and Weinert. 1972. “Some Relations Among RKHS Norms, Fredholm Equations, and Innovations Representations.” IEEE Transactions on Information Theory.

Kailath, T., and Weinert. 1975. “An RKHS Approach to Detection and Estimation Problems–II: Gaussian Signal Detection.” IEEE Transactions on Information Theory.

Kanagawa, and Fukumizu. 2014. “Recovering Distributions from Gaussian RKHS Embeddings.” In Journal of Machine Learning Research.

Kanagawa, Hennig, Sejdinovic, et al. 2018. “Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences.” arXiv:1807.02582 [Cs, Stat].

Katharopoulos, Vyas, Pappas, et al. 2020. “Transformers Are RNNs: Fast Autoregressive Transformers with Linear Attention.” arXiv:2006.16236 [Cs, Stat].

Kemerait, and Childers. 1972. “Signal Detection and Extraction by Cepstrum Techniques.” IEEE Transactions on Information Theory.

Keriven, Bourrier, Gribonval, et al. 2016. “Sketching for Large-Scale Learning of Mixture Models.” In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Khintchine. 1934. “Korrelationstheorie der stationären stochastischen Prozesse.” Mathematische Annalen.

Kimeldorf, and Wahba. 1970. “A Correspondence Between Bayesian Estimation on Stochastic Processes and Smoothing by Splines.” The Annals of Mathematical Statistics.

Kiraly, and Oberhauser. 2019. “Kernels for Sequentially Ordered Data.” Journal of Machine Learning Research.

Kloft, Rückert, and Bartlett. 2010. “A Unifying View of Multiple Kernel Learning.” In Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science.

Klus, Bittracher, Schuster, et al. 2018. “A Kernel-Based Approach to Molecular Conformation Analysis.” The Journal of Chemical Physics.

Kontorovich, Leonid, Cortes, and Mohri. 2006. “Learning Linearly Separable Languages.” In Algorithmic Learning Theory. Lecture Notes in Computer Science 4264.

Kontorovich, Leonid (Aryeh), Cortes, and Mohri. 2008. “Kernel Methods for Learning Languages.” Theoretical Computer Science, Algorithmic Learning Theory.

Koppel, Warnell, Stump, et al. 2016. “Parsimonious Online Learning with Kernels via Sparse Projections in Function Space.” arXiv:1612.04111 [Cs, Stat].

Krauth, Bonilla, Cutajar, et al. 2016. “AutoGP: Exploring the Capabilities and Limitations of Gaussian Process Models.” In UAI17.

Kulis, and Grauman. 2012. “Kernelized Locality-Sensitive Hashing.” IEEE Transactions on Pattern Analysis and Machine Intelligence.

Lawrence, Seeger, and Herbrich. 2003. “Fast Sparse Gaussian Process Methods: The Informative Vector Machine.” In Proceedings of the 16th Annual Conference on Neural Information Processing Systems.

Ley, Reinert, and Swan. 2017. “Stein’s Method for Comparison of Univariate Distributions.” Probability Surveys.

Liu, Qiang, Lee, and Jordan. 2016. “A Kernelized Stein Discrepancy for Goodness-of-Fit Tests.” In Proceedings of The 33rd International Conference on Machine Learning.

Liutkus, Rafii, Pardo, et al. 2014. “Kernel Spectrogram Models for Source Separation.” In.

Liu, Xi, Zhan, and Niu. 2021. “Hilbert–Schmidt Independence Criterion Regularization Kernel Framework on Symmetric Positive Definite Manifolds.” Mathematical Problems in Engineering.

Ljung, and Kailath. 1976. “Backwards Markovian Models for Second-Order Stochastic Processes (Corresp.).” IEEE Transactions on Information Theory.

Ljung, Kailath, and Friedlander. 1975. “Scattering Theory and Linear Least Squares Estimation: Part I: Continuous-Time Problems.” In 1975 IEEE Conference on Decision and Control Including the 14th Symposium on Adaptive Processes.

Lloyd, Duvenaud, Grosse, et al. 2014. “Automatic Construction and Natural-Language Description of Nonparametric Regression Models.” In Twenty-Eighth AAAI Conference on Artificial Intelligence.

Lodhi, Saunders, Shawe-Taylor, et al. 2002. “Text Classification Using String Kernels.” Journal of Machine Learning Research.

Lopez-Paz, Nishihara, Chintala, et al. 2016. “Discovering Causal Signals in Images.” arXiv:1605.08179 [Cs, Stat].

Lu, Leen, Huang, et al. 2008. “A Reproducing Kernel Hilbert Space Framework for Pairwise Time Series Distances.” In Proceedings of the 25th International Conference on Machine Learning. ICML ’08.

Ma, Siyuan, and Belkin. 2017. “Diving into the Shallows: A Computational Perspective on Large-Scale Shallow Learning.” arXiv:1703.10622 [Cs, Stat].

Ma, Wan-Duo Kurt, Lewis, and Kleijn. 2020. “The HSIC Bottleneck: Deep Learning Without Back-Propagation.” Proceedings of the AAAI Conference on Artificial Intelligence.

Manton, and Amblard. 2015. “A Primer on Reproducing Kernel Hilbert Spaces.” Foundations and Trends® in Signal Processing.

Marteau-Ferey, Bach, and Rudi. 2019. “Globally Convergent Newton Methods for Ill-Conditioned Generalized Self-Concordant Losses.” In Advances in Neural Information Processing Systems.

———. 2020. “Non-Parametric Models for Non-Negative Functions.” In Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS ’20.

McFee, and Ellis. 2011. “Analyzing Song Structure with Spectral Clustering.” In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Meanti, Carratino, Rosasco, et al. 2020. “Kernel Methods Through the Roof: Handling Billions of Points Efficiently.” In Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS’20.

Meidan. 1980. “On the Connection Between Ordinary and Generalized Stochastic Processes.” Journal of Mathematical Analysis and Applications.

Mercer. 1909. “Functions of Positive and Negative Type, and Their Connection with the Theory of Integral Equations.” Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character.

Micchelli, and Pontil. 2005a. “Learning the Kernel Function via Regularization.” Journal of Machine Learning Research.

———. 2005b. “On Learning Vector-Valued Functions.” Neural Computation.

Minh. 2022. “Finite Sample Approximations of Exact and Entropic Wasserstein Distances Between Covariance Operators and Gaussian Processes.” SIAM/ASA Journal on Uncertainty Quantification.

Muandet, Fukumizu, Sriperumbudur, et al. 2014. “Kernel Mean Shrinkage Estimators.” arXiv:1405.5505 [Cs, Stat].

Muandet, Fukumizu, Sriperumbudur, et al. 2017. “Kernel Mean Embedding of Distributions: A Review and Beyond.” Foundations and Trends® in Machine Learning.

Muller, Mika, Ratsch, et al. 2001. “An Introduction to Kernel-Based Learning Algorithms.” IEEE Transactions on Neural Networks.

Nishiyama, and Fukumizu. 2016. “Characteristic Kernels and Infinitely Divisible Distributions.” The Journal of Machine Learning Research.

Noack, Luo, and Risser. 2023. “A Unifying Perspective on Non-Stationary Kernels for Deeper Gaussian Processes.”

Parzen, Emanuel. 1959. “Statistical Inference On Time Series By Hilbert Space Methods, I.” TR23.

Parzen, Emanuel. 1962. “Extraction and Detection Problems and Reproducing Kernel Hilbert Spaces.” Journal of the Society for Industrial and Applied Mathematics Series A Control.

Parzen, Emanuel. 1963. “Probability Density Functionals and Reproducing Kernel Hilbert Spaces.” In Proceedings of the Symposium on Time Series Analysis.

Pillonetto. 2016. “The Interplay Between System Identification and Machine Learning.” arXiv:1612.09158 [Cs, Stat].

Poggio, and Girosi. 1990. “Networks for Approximation and Learning.” Proceedings of the IEEE.

Rahimi, and Recht. 2007. “Random Features for Large-Scale Kernel Machines.” In Advances in Neural Information Processing Systems.

———. 2009. “Weighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning.” In Advances in Neural Information Processing Systems.

Ramdas, and Wehbe. 2014. “Stein Shrinkage for Cross-Covariance Operators and Kernel Independence Testing.” arXiv:1406.1922 [Stat].

Raykar, and Duraiswami. 2005. “The Improved Fast Gauss Transform with Applications to Machine Learning.”

Rudi, Carratino, and Rosasco. 2017. “FALKON: An Optimal Large Scale Kernel Method.” In Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17.

Rue, and Held. 2005. Gaussian Markov Random Fields: Theory and Applications. Monographs on Statistics and Applied Probability 104.

Rustamov. 2021. “Closed-Form Expressions for Maximum Mean Discrepancy with Applications to Wasserstein Auto-Encoders.” Stat.

Sachdeva, Dhaliwal, Wu, et al. 2022. “Infinite Recommendation Networks: A Data-Centric Approach.”

Saha, and Balamurugan. 2020. “Learning with Operator-Valued Kernels in Reproducing Kernel Krein Spaces.” In Advances in Neural Information Processing Systems.

Salvi, Cass, Foster, et al. 2021. “The Signature Kernel Is the Solution of a Goursat PDE.” SIAM Journal on Mathematics of Data Science.

Salvi, Lemercier, Liu, et al. 2024. “Higher Order Kernel Mean Embeddings to Capture Filtrations of Stochastic Processes.” In Advances in Neural Information Processing Systems. NIPS ’21.

Särkkä. 2011. “Linear Operators and Stochastic Partial Differential Equations in Gaussian Process Regression.” In Artificial Neural Networks and Machine Learning – ICANN 2011. Lecture Notes in Computer Science.

Schaback, and Wendland. 2006. “Kernel Techniques: From Machine Learning to Meshless Methods.” Acta Numerica.

Schlegel. 2018. “When Is There a Representer Theorem? Reflexive Banach Spaces.” arXiv:1809.10284 [Cs, Math, Stat].

Schölkopf, Herbrich, and Smola. 2001. “A Generalized Representer Theorem.” In Computational Learning Theory. Lecture Notes in Computer Science.

Schölkopf, Knirsch, Smola, et al. 1998. “Fast Approximation of Support Vector Kernel Expansions, and an Interpretation of Clustering as Approximation in Feature Spaces.” In Mustererkennung 1998. Informatik Aktuell.

Schölkopf, Mika, Burges, et al. 1999. “Input Space Versus Feature Space in Kernel-Based Methods.” IEEE Transactions on Neural Networks.

Schölkopf, Muandet, Fukumizu, et al. 2015. “Computing Functions of Random Variables via Reproducing Kernel Hilbert Space Representations.” arXiv:1501.06794 [Cs, Stat].

Schölkopf, and Smola. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond.

———. 2003. “A Short Introduction to Learning with Kernels.” In Advanced Lectures on Machine Learning. Lecture Notes in Computer Science 2600.

Schölkopf, Smola, and Müller. 1997. “Kernel Principal Component Analysis.” In Artificial Neural Networks — ICANN’97. Lecture Notes in Computer Science.

Schuster, Mollenhauer, Klus, et al. 2019. “Kernel Conditional Density Operators.” arXiv:1905.11255 [Cs, Math, Stat].

Schuster, Strathmann, Paige, et al. 2017. “Kernel Sequential Monte Carlo.” In ECML-PKDD 2017.

Segall, Davis, and Kailath. 1975. “Nonlinear Filtering with Counting Observations.” IEEE Transactions on Information Theory.

Segall, and Kailath. 1976. “Orthogonal Functionals of Independent-Increment Processes.” IEEE Transactions on Information Theory.

Shen, Baingana, and Giannakis. 2016. “Nonlinear Structural Vector Autoregressive Models for Inferring Effective Brain Network Connectivity.” arXiv:1610.06551 [Stat].

Smola, A. J., and Schölkopf. 1998. “On a Kernel-Based Method for Pattern Recognition, Regression, Approximation, and Operator Inversion.” Algorithmica.

Smola, Alex J., and Schölkopf. 2000. “Sparse Greedy Matrix Approximation for Machine Learning.”

———. 2004. “A Tutorial on Support Vector Regression.” Statistics and Computing.

Smola, Alex J., Schölkopf, and Müller. 1998. “The Connection Between Regularization Operators and Support Vector Kernels.” Neural Networks.

Snelson, and Ghahramani. 2005. “Sparse Gaussian Processes Using Pseudo-Inputs.” In Advances in Neural Information Processing Systems.

Solin, and Särkkä. 2020. “Hilbert Space Methods for Reduced-Rank Gaussian Process Regression.” Statistics and Computing.

Song, Fukumizu, and Gretton. 2013. “Kernel Embeddings of Conditional Distributions: A Unified Kernel Framework for Nonparametric Inference in Graphical Models.” IEEE Signal Processing Magazine.

Song, Gretton, Bickson, et al. 2011. “Kernel Belief Propagation.” In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.

Sriperumbudur, Gretton, Fukumizu, et al. 2008. “Injective Hilbert Space Embeddings of Probability Measures.” In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008).

Steinwart. 2020. “Reproducing Kernel Hilbert Spaces Cannot Contain All Continuous Functions on a Compact Metric Space.” arXiv:2002.03171 [Cs, Math].

Székely, and Rizzo. 2009. “Brownian Distance Covariance.” The Annals of Applied Statistics.

Székely, Rizzo, and Bakirov. 2007. “Measuring and Testing Dependence by Correlation of Distances.” The Annals of Statistics.

Tipping, and Nh. 2001. “Sparse Kernel Principal Component Analysis.” In Advances in Neural Information Processing Systems 13.

Tompkins, and Ramos. 2018. “Fourier Feature Approximations for Periodic Kernels in Time-Series Modelling.” Proceedings of the AAAI Conference on Artificial Intelligence.

Tsuchida, Ong, and Sejdinovic. 2023. “Squared Neural Families: A New Class of Tractable Density Models.”

Vedaldi, and Zisserman. 2012. “Efficient Additive Kernels via Explicit Feature Maps.” IEEE Transactions on Pattern Analysis and Machine Intelligence.

Vert, Tsuda, and Schölkopf. 2004. “A Primer on Kernel Methods.” In Kernel Methods in Computational Biology.

Vishwanathan, Schraudolph, Kondor, et al. 2010. “Graph Kernels.” Journal of Machine Learning Research.

Walder, Christian, Kim, and Schölkopf. 2008. “Sparse Multiscale Gaussian Process Regression.” In Proceedings of the 25th International Conference on Machine Learning. ICML ’08.

Walder, C., Schölkopf, and Chapelle. 2006. “Implicit Surface Modelling with a Globally Regularised Basis of Compact Support.” Computer Graphics Forum.

Wang, Smola, and Tibshirani. 2014. “The Falling Factorial Basis and Its Statistical Applications.” In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32. ICML’14.

Weinert, Howard L. 1978. “Statistical Methods in Optimal Curve Fitting.” Communications in Statistics - Simulation and Computation.

Weinert, Howard L., and Kailath. 1974. “Stochastic Interpretations and Recursive Algorithms for Spline Functions.” The Annals of Statistics.

Weinert, H., and Sidhu. 1978. “A Stochastic Framework for Recursive Computation of Spline Functions–Part I: Interpolating Splines.” IEEE Transactions on Information Theory.

Williams. 2001. “On a Connection Between Kernel PCA and Metric Multidimensional Scaling.” In Advances in Neural Information Processing Systems 13.

Wilson, and Adams. 2013. “Gaussian Process Kernels for Pattern Discovery and Extrapolation.” In International Conference on Machine Learning.

Wilson, Dann, Lucas, et al. 2015. “The Human Kernel.” arXiv:1510.07389 [Cs, Stat].

Wu, and Zhou. 2008. “Learning with Sample Dependent Hypothesis Spaces.” Computers & Mathematics with Applications.

Xu, Wenkai, and Matsuda. 2020. “A Stein Goodness-of-Fit Test for Directional Distributions.” In International Conference on Artificial Intelligence and Statistics.

———. 2021. “Interpretable Stein Goodness-of-Fit Tests on Riemannian Manifolds.” arXiv:2103.00895 [Stat].

Xu, Jian-Wu, Paiva, Park, et al. 2008. “A Reproducing Kernel Hilbert Space Framework for Information-Theoretic Learning.” IEEE Transactions on Signal Processing.

Xu, Wenkai, and Reinert. 2021. “A Stein Goodness of Fit Test for Exponential Random Graph Models.” arXiv:2103.00580 [Stat].

Yaglom. 1987a. Correlation Theory of Stationary and Related Random Functions. Volume II: Supplementary Notes and References. Springer Series in Statistics.

———. 1987b. Correlation Theory of Stationary and Related Random Functions Volume I.

———. 2004. An Introduction to the Theory of Stationary Random Functions.

Yang, Changjiang, Duraiswami, and Davis. 2004. “Efficient Kernel Machines Using the Improved Fast Gauss Transform.” In Advances in Neural Information Processing Systems.

Yang, Changjiang, Duraiswami, Gumerov, et al. 2003. “Improved Fast Gauss Transform and Efficient Kernel Density Estimation.” In Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2. ICCV ’03.

Yang, Tianbao, Li, Mahdavi, et al. 2012. “Nyström Method Vs Random Fourier Features: A Theoretical and Empirical Comparison.” In Advances in Neural Information Processing Systems.

Yang, Jiyan, Sindhwani, Avron, et al. 2014. “Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels.” arXiv:1412.8293 [Cs, Math, Stat].

Yu, Cheng, Schuurmans, et al. 2013. “Characterizing the Representer Theorem.” In Proceedings of the 30th International Conference on Machine Learning (ICML-13).

Zhang, Qinyi, Filippi, Gretton, et al. 2016. “Large-Scale Kernel Methods for Independence Testing.” arXiv:1606.07892 [Stat].

Zhang, Kun, Peters, Janzing, et al. 2012. “Kernel-Based Conditional Independence Test and Application in Causal Discovery.” arXiv:1202.3775 [Cs, Stat].

Zhou, Zha, and Song. 2013. “Learning Triggering Kernels for Multi-Dimensional Hawkes Processes.” In Proceedings of the 30th International Conference on Machine Learning (ICML-13).