# Kernel approximation

A page where I document what I don’t know about kernel approximation. A page about what I do know would be empty.

What I mean is: approximating implicit Mercer kernel feature maps with explicit features; equivalently, approximating the Gram matrix, which also relates to mixture model inference and clustering. I’m especially interested in how this might be done using randomised linear algebra, via random embeddings such as random projections.

We also need to know the trade-offs of computational time/space cost versus approximation bounds, so that we can decide when to bother. When is it enough to reduce computational cost merely with support vectors, or to evaluate the kernels efficiently using, e.g., the Fast Gauss Transform and related methods, rather than coming up with alternative kernel bases? (For example, you might not want to do the fiddly coding that the Fast Gauss Transform requires.)

This is not the related but distinct problem (a terminological collision) of approximating functions from mixtures of kernels. Low-dimensional random projections, while commonly used to approximate convolution kernels, can also approximate certain Mercer kernels, but that is not what I mean here either. Fast multipole methods give you an approximation to the field generated by all the kernels and all the support points; in kernel approximation we look for approximations, in some sense, to the component kernels themselves.

## Stationary kernels

I am mostly interested in translation-invariant kernels, so assume I’m talking about those unless I say otherwise.

There is a short introduction at the scikit-learn kernel approximation page.

The approximations might be random-projection based, or random-sampling based, e.g. the Nyström method, which is reportedly effectively Monte Carlo integration, although I understand there is an optimisation step too.

I need to work out the differences between random Fourier features, random kitchen sinks, Nyström methods, and whatever Smola et al. call their special-case Gaussian approximation. I think random Fourier features are the same as random kitchen sinks, and Nyström is different. When can we exploit (data or kernel) structure to do better? Can quasi-Monte Carlo improve on plain Monte Carlo? (Update: someone already had that idea.) Or better matrix factorisations?
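To make the random-Fourier-features branch of that comparison concrete, here is a minimal numpy sketch for the Gaussian kernel, following the Rahimi-Recht construction (frequencies drawn from the kernel's spectral density; function names are mine):

```python
import numpy as np

def rff_features(X, D=500, gamma=1.0, rng=None):
    # Random Fourier features: draw frequencies from the spectral density of
    # the kernel. For k(x, y) = exp(-gamma * ||x - y||^2) that density is
    # N(0, 2 * gamma * I); random phases b make cos(.) an unbiased estimator.
    rng = np.random.default_rng(rng)
    n, d = X.shape
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Z = rff_features(X, D=2000, gamma=0.5, rng=1)
K_hat = Z @ Z.T                                              # Monte Carlo Gram
K_true = np.exp(-0.5 * ((X[:, None] - X[None, :]) ** 2).sum(-1))
err = np.abs(K_hat - K_true).max()
```

The error decays like $O(1/\sqrt{D})$ per entry, which is exactly where quasi-Monte Carlo frequency designs can improve matters.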

Vempati et al. (2010) combine the two strands:

> Recently, Maji and Berg and Vedaldi and Zisserman proposed explicit feature maps to approximate the additive kernels (intersection, $$\chi^2$$, etc.) by linear ones, thus enabling the use of fast machine learning technique in a non-linear context. An analogous technique was proposed by Rahimi and Recht for the translation invariant RBF kernels. In this paper, we complete the construction and combine the two techniques to obtain explicit feature maps for the generalized RBF kernels.
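That chaining of an additive-kernel map with an RBF random-feature map can be sketched with scikit-learn's stock approximators (an illustrative sketch using `AdditiveChi2Sampler` and `RBFSampler`, not the authors' code; parameter choices are arbitrary):

```python
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler, RBFSampler

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 8))  # additive chi^2 maps assume non-negative features

# Step 1: Vedaldi-Zisserman style explicit map for the additive chi^2 kernel.
chi2 = AdditiveChi2Sampler(sample_steps=2)
X_chi2 = chi2.fit_transform(X)

# Step 2: a Rahimi-Recht random Fourier map on top of the chi^2 features,
# which approximates a generalized (chi^2-exponentiated) RBF kernel.
rbf = RBFSampler(gamma=1.0, n_components=300, random_state=0)
X_grbf = rbf.fit_transform(X_chi2)
```

A linear model trained on `X_grbf` then behaves approximately like a kernel machine with the generalized RBF kernel.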

## Inner product kernels

See, e.g., Kar and Karnick (2012). These kernels have some elegant relations to low-dimensional random projections. How well do these approximations work with NN kernels?
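A numerical sketch of that relation: for a Gaussian matrix $R$ with i.i.d. $N(0, 1/D)$ entries, $(Rx)\cdot(Ry)$ is an unbiased estimator of $x\cdot y$, so a linear (and hence dot-product) kernel survives random dimension reduction in expectation. The dimensions below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 2000, 1000            # ambient dimension, reduced dimension
x = rng.normal(size=d); x /= np.linalg.norm(x)
y = rng.normal(size=d); y /= np.linalg.norm(y)

# Gaussian random projection with i.i.d. N(0, 1/D) entries.
R = rng.normal(scale=1.0 / np.sqrt(D), size=(D, d))

approx = (R @ x) @ (R @ y)   # inner product after projection
exact = x @ y                # inner product before projection
```

The estimator's standard deviation scales like $1/\sqrt{D}$, which is the same Monte Carlo rate as random feature maps.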

## Implementations

From the LIBASKIT project description:

> This is the LIBASKIT set of scalable machine learning and data analysis tools. Currently we provide codes for kernel sums, nearest-neighbors, kmeans clustering, kernel regression, and multiclass kernel logistic regression. All codes use OpenMP and MPI for shared memory and distributed memory parallelism.
>
> […] PNYSTR: (Parallel Nystrom method) Code for kernel summation using the Nystrom method.

## Connections

MDS, kernel PCA, and mixture models are all related in a way I should try to understand when I have a moment.

For the connection between kernel PCA and metric multidimensional scaling, see Williams (2001).
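That connection can be checked numerically: classical MDS double-centring applied to the kernel-induced distances $d^2(x,y) = k(x,x) + k(y,y) - 2k(x,y)$ recovers exactly the centred Gram matrix that kernel PCA diagonalises. A self-contained sketch, with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)                      # Gaussian kernel Gram matrix

n = len(X)
J = np.eye(n) - np.ones((n, n)) / n        # centring matrix
# Kernel-induced squared distances in feature space.
D2 = np.diag(K)[:, None] + np.diag(K)[None, :] - 2 * K
B = -0.5 * J @ D2 @ J                      # classical MDS double-centring
K_centred = J @ K @ J                      # the matrix kernel PCA eigendecomposes
gap = np.abs(B - K_centred).max()          # identical up to floating point
```

The identity holds because centring annihilates the rank-one terms $\operatorname{diag}(K)\mathbf{1}^\top$ and $\mathbf{1}\operatorname{diag}(K)^\top$ in $D^2$.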

Altschuler et al. (2019) use projection tricks to calculate Sinkhorn divergences.

## References

Alaoui, Ahmed El, and Michael W. Mahoney. 2014. “Fast Randomized Kernel Methods With Statistical Guarantees.” November 2, 2014. http://arxiv.org/abs/1411.0306.
Altschuler, Jason, Francis Bach, Alessandro Rudi, and Jonathan Niles-Weed. 2019. “Massively Scalable Sinkhorn Distances via the Nyström Method.” In Advances in Neural Information Processing Systems 32, 11. Curran Associates, Inc. https://papers.nips.cc/paper/8693-massively-scalable-sinkhorn-distances-via-the-nystrom-method.pdf.
Bach, Francis. 2015. “On the Equivalence Between Kernel Quadrature Rules and Random Feature Expansions.” 2015. http://arxiv.org/abs/1502.06800.
Bach, Francis R. 2013. “Sharp Analysis of Low-Rank Kernel Matrix Approximations.” In COLT, 30:185–209. http://www.jmlr.org/proceedings/papers/v30/Bach13.pdf.
Bakır, Gökhan H., Jason Weston, and Bernhard Schölkopf. 2004. “Learning to Find Pre-Images.” Advances in Neural Information Processing Systems 16 (7): 449–56. http://www.kyb.mpg.de/fileadmin/user_upload/files/publications/pdf2262.pdf.
Beylkin, Gregory, and Lucas Monzón. 2010. “Approximation by Exponential Sums Revisited.” Applied and Computational Harmonic Analysis 28 (2): 131–49. https://doi.org/10.1016/j.acha.2009.08.011.
Brault, Romain, Florence d’Alché-Buc, and Markus Heinonen. 2016. “Random Fourier Features for Operator-Valued Kernels.” In Proceedings of The 8th Asian Conference on Machine Learning, 110–25. http://arxiv.org/abs/1605.02536.
Brault, Romain, Néhémy Lim, and Florence d’Alché-Buc. n.d. “Scaling up Vector Autoregressive Models With Operator-Valued Random Fourier Features.” Accessed August 31, 2016. https://aaltd16.irisa.fr/files/2016/08/AALTD16_paper_11.pdf.
Cheney, Elliott Ward, and William Allan Light. 2009. A Course in Approximation Theory. American Mathematical Soc. https://books.google.com.au/books?hl=en&lr=&id=II6DAwAAQBAJ&oi=fnd&pg=PA1&ots=ch9-LyxDg6&sig=jetWpIErExYvlnnSsup-5yHhso0.
Cheng, Xiuyuan, and Amit Singer. 2013. “The Spectrum of Random Inner-Product Kernel Matrices.” Random Matrices: Theory and Applications 02 (04): 1350010. https://doi.org/10.1142/S201032631350010X.
Choromanski, Krzysztof, Mark Rowland, and Adrian Weller. 2017. “The Unreasonable Effectiveness of Random Orthogonal Embeddings.” March 2, 2017. http://arxiv.org/abs/1703.00864.
Choromanski, Krzysztof, and Vikas Sindhwani. 2016. “Recycling Randomness with Structure for Sublinear Time Kernel Expansions.” May 29, 2016. http://arxiv.org/abs/1605.09049.
Curto, Joachim, Irene Zarza, Feng Yang, Alexander J. Smola, and Luc Van Gool. 2017. “F2F: A Library For Fast Kernel Expansions.” February 27, 2017. http://arxiv.org/abs/1702.08159.
Cutajar, Kurt, Edwin V. Bonilla, Pietro Michiardi, and Maurizio Filippone. 2017. “Random Feature Expansions for Deep Gaussian Processes.” In PMLR. http://proceedings.mlr.press/v70/cutajar17a.html.
Drineas, Petros, and Michael W. Mahoney. 2005. “On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning.” Journal of Machine Learning Research 6 (December): 2153–75. http://jmlr.org/papers/volume6/drineas05a/drineas05a.pdf.
El Karoui, Noureddine. 2010. “The Spectrum of Kernel Random Matrices.” The Annals of Statistics 38 (1). https://doi.org/10.1214/08-AOS648.
Globerson, Amir, and Roi Livni. 2016. “Learning Infinite-Layer Networks: Beyond the Kernel Trick.” June 16, 2016. http://arxiv.org/abs/1606.05316.
Kar, Purushottam, and Harish Karnick. 2012. “Random Feature Maps for Dot Product Kernels.” In Artificial Intelligence and Statistics, 583–91. PMLR. http://proceedings.mlr.press/v22/kar12.html.
Koltchinskii, Vladimir, and Evarist Giné. 2000. “Random Matrix Approximation of Spectra of Integral Operators.” Bernoulli 6 (1): 113–67. https://doi.org/10.2307/3318636.
Kwok, J. T.-Y., and I. W.-H. Tsang. 2004. “The Pre-Image Problem in Kernel Methods.” IEEE Transactions on Neural Networks 15 (6): 1517–25. https://doi.org/10.1109/TNN.2004.837781.
Le, Quoc, Tamás Sarlós, and Alex Smola. 2013. “Fastfood: Approximating Kernel Expansions in Loglinear Time.” In Proceedings of the International Conference on Machine Learning. http://www.jmlr.org/proceedings/papers/v28/le13-supp.pdf.
Minh, Ha Quang, Partha Niyogi, and Yuan Yao. 2006. “Mercer’s Theorem, Feature Maps, and Smoothing.” In International Conference on Computational Learning Theory, 154–68. Lecture Notes in Computer Science. Springer. https://doi.org/10.1007/11776420_14.
Pourkamali-Anaraki, Farhad, and Stephen Becker. 2016a. “A Randomized Approach to Efficient Kernel Clustering.” August 26, 2016. http://arxiv.org/abs/1608.07597.
———. 2016b. “Randomized Clustered Nystrom for Large-Scale Kernel Machines.” December 19, 2016. http://arxiv.org/abs/1612.06470.
Rahimi, Ali, and Benjamin Recht. 2007. “Random Features for Large-Scale Kernel Machines.” In Advances in Neural Information Processing Systems, 1177–84. Curran Associates, Inc. http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.
———. 2008. “Uniform Approximation of Functions with Random Bases.” In 2008 46th Annual Allerton Conference on Communication, Control, and Computing, 555–61. https://doi.org/10.1109/ALLERTON.2008.4797607.
———. 2009. “Weighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning.” In Advances in Neural Information Processing Systems, 1313–20. Curran Associates, Inc. http://papers.nips.cc/paper/3495-weighted-sums-of-random-kitchen-sinks-replacing-minimization-with-randomization-in-learning.
Scardapane, Simone, and Dianhui Wang. 2017. “Randomness in Neural Networks: An Overview.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 7 (2). https://doi.org/10.1002/widm.1200.
Schölkopf, Bernhard, Phil Knirsch, Alex Smola, and Chris Burges. 1998. “Fast Approximation of Support Vector Kernel Expansions, and an Interpretation of Clustering as Approximation in Feature Spaces.” In Mustererkennung 1998, edited by Paul Levi, Michael Schanz, Rolf-Jürgen Ahlers, and Franz May, 125–32. Informatik Aktuell. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-72282-0_12.
Schölkopf, Bernhard, and Alexander J. Smola. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
Schölkopf, Bernhard, Alexander Smola, and Klaus-Robert Müller. 1997. “Kernel Principal Component Analysis.” In Artificial Neural Networks: ICANN’97, edited by Wulfram Gerstner, Alain Germond, Martin Hasler, and Jean-Daniel Nicoud, 583–88. Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/BFb0020217.
Sterge, Nicholas, and Bharath Sriperumbudur. 2021. “Statistical Optimality and Computational Efficiency of Nyström Kernel PCA.” May 18, 2021. http://arxiv.org/abs/2105.08875.
Strobl, Eric V., Kun Zhang, and Shyam Visweswaran. 2017. “Approximate Kernel-Based Conditional Independence Tests for Fast Non-Parametric Causal Discovery.” February 13, 2017. http://arxiv.org/abs/1702.03877.
Tompkins, Anthony, and Fabio Ramos. 2018. “Fourier Feature Approximations for Periodic Kernels in Time-Series Modelling.” Proceedings of the AAAI Conference on Artificial Intelligence 32 (1). https://ojs.aaai.org/index.php/AAAI/article/view/11696.
Vedaldi, A., and A. Zisserman. 2012. “Efficient Additive Kernels via Explicit Feature Maps.” IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (3): 480–92. https://doi.org/10.1109/TPAMI.2011.153.
Vempati, Sreekanth, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. 2010. “Generalized RBF Feature Maps for Efficient Detection.” In Procedings of the British Machine Vision Conference 2010, 2.1–11. Aberystwyth: British Machine Vision Association. https://doi.org/10.5244/C.24.2.
Williams, Christopher K. I. 2001. “On a Connection Between Kernel PCA and Metric Multidimensional Scaling.” In Advances in Neural Information Processing Systems 13, edited by T. K. Leen, T. G. Dietterich, and V. Tresp, 46:675–81. MIT Press. https://doi.org/10.1023/A:1012485807823.
Williams, Christopher KI, and Matthias Seeger. 2001. “Using the Nyström Method to Speed Up Kernel Machines.” In Advances in Neural Information Processing Systems, 682–88. http://papers.nips.cc/paper/1866-using-the-nystrom-method-to-speed-up-kernel-machines.
Yang, Jiyan, Vikas Sindhwani, Haim Avron, and Michael Mahoney. 2014. “Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels.” December 29, 2014. http://arxiv.org/abs/1412.8293.
Yang, Tianbao, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. 2012. “Nyström Method Vs Random Fourier Features: A Theoretical and Empirical Comparison.” In Advances in Neural Information Processing Systems, 476–84. http://papers.nips.cc/paper/4588-nystrom-method-vs-random-fourier-features-a-theoretical-and-empirical-comparison.
Yu, Chenhan D., William B. March, and George Biros. 2017. “An $N \log N$ Parallel Fast Direct Solver for Kernel Matrices.” In. http://arxiv.org/abs/1701.02324.
Yu, Yaoliang, Hao Cheng, Dale Schuurmans, and Csaba Szepesvári. 2013. “Characterizing the Representer Theorem.” In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 570–78. http://www.jmlr.org/proceedings/papers/v28/yu13.pdf.
Zhang, Jaslyn. 2017. “Improved Genomic Selection Using Vowpal Wabbit with Random Fourier Features.” https://dukespace.lib.duke.edu/dspace/handle/10161/14308.
Zhang, Qinyi, Sarah Filippi, Arthur Gretton, and Dino Sejdinovic. 2016. “Large-Scale Kernel Methods for Independence Testing.” June 25, 2016. http://arxiv.org/abs/1606.07892.
