A page where I document whet I *don’t* know about kernel approximation.
A page about what I *do* know would be empty.

What I mean is: approximating implicit Mercer kernel features with explicit features; Equivalently, approximating the Gram matrix, which is also related to mixture model inference and clustering. I’m especially interested in how it might be done using random linear algebra. I am mostly interested in translation-invariant kernels, so assume I’m talking about those unless I say otherwise.

*Not* the related but distinct (terminological collision)
of approximating functions from mixtures of kernels.
Also note that the Fast Gauss Transform and other related fast multipole methods,
while commonly used to approximate convolution kernels,
can also approximate certain Mercer kernels but it’s not what I mean here -
fast multipole methods gives you an approximation to the field generated
by all kernels and all the support points
In kernel approximation we look for approximations, in some sense, to
*the component kernels themselves*.

Short introduction at scikit-learn kernel approximation page.

(Drineas and Mahoney 2005; T. Yang et al. 2012, 2012; Vedaldi and Zisserman 2012; Rahimi and Recht 2007, 2009; Alaoui and Mahoney 2014; F. Bach 2015; F. R. Bach 2013), have work here.

The approximations might be random projection, or random sampling based, e.g. the Nyström method, which is reportedly effectively Monte Carlo integration, although I understand there is an optimisation step too?

I need to work out the difference between random Fourier features, random kitchen sinks, Nyström methods and whatever Smola et al (Schölkopf et al. 1998) call their special case Gaussian approximation. I think Random Fourier features are the same as random kitchen sinks, Rahimi and Recht (2007) and Nyström is different (C. K. Williams and Seeger 2001). When we can exploit (data- or kernel-) structure to do better? (say, (Le, Sarlós, and Smola 2013)) Quasi Monte Carlo can improve on random Monte Carlo? (update: someone already had that idea: (J. Yang et al. 2014)) Or better matrix factorisations?

Also I guess we need to know trade-offs of computational time/space cost versus approximation bounds, so that we can decide when to bother. When is it enough to reduce computational cost merely with support vectors, or to evaluate the kernels efficiently using, e.g. the Fast Gauss Transform and related methods methods, rather than coming up with alternative kernel bases? (e.g. you don’t want to do the fiddly coding for the Faust Gauss transform)

Does this help? Chris Ding’s Principal Component Analysis and Matrix Factorizations for Learning.

Recently, Maji and Berg and Vedaldi and Zisserman proposed explicit feature maps to approximate the additive kernels (intersection, \(\chi^2\), etc.) by linear ones, thus enabling the use of fast machine learning technique in a non-linear context. An analogous technique was proposed by Rahimi and Recht (Rahimi and Recht 2007) for the translation invariant RBF kernels. In this paper, we complete the construction and combine the two techniques to obtain explicit feature maps for the generalized RBF kernels.

## Lectures: Fastfood etc

## Implementations

This is the LIBASKIT set of scalable machine learning and data analysis tools. Currently we provide codes for kernel sums, nearest-neighbors, kmeans clustering, kernel regression, and multiclass kernel logistic regression. All codes use OpenMP and MPI for shared memory and distributed memory parallelism.

[…] PNYSTR : (Parallel Nystrom method) Code for kernel summation using the Nystrom method.

## Connections

MDS and Kernel PCA and mixture models are all related in a way I should try to understand when I have a moment.

For the connection with kernel PCA, see Schölkopf et al. (1998) and C. K. I. Williams (2001) for metric multidimensional scaling.

Altschuler et al. (2019) uses these tricks to calculate Sinkhorn divergence.

## References

*Advances in Neural Information Processing Systems 32*, 11. Curran Associates, Inc. https://papers.nips.cc/paper/8693-massively-scalable-sinkhorn-distances-via-the-nystrom-method.pdf.

*COLT*, 30:185–209. http://www.jmlr.org/proceedings/papers/v30/Bach13.pdf.

*Advances in Neural Information Processing Systems*16 (7): 449–56. http://www.kyb.mpg.de/fileadmin/user_upload/files/publications/pdf2262.pdf.

*Applied and Computational Harmonic Analysis*28 (2): 131–49. https://doi.org/10.1016/j.acha.2009.08.011.

*Proceedings of The 8th Asian Conference on Machine Learning*, 110–25. http://arxiv.org/abs/1605.02536.

*A Course in Approximation Theory*. American Mathematical Soc. https://books.google.com.au/books?hl=en&lr=&id=II6DAwAAQBAJ&oi=fnd&pg=PA1&ots=ch9-LyxDg6&sig=jetWpIErExYvlnnSsup-5yHhso0.

*Proceedings of the 25th International Conference on Machine Learning*, 192–99. ICML ’08. New York, NY, USA: ACM Press. https://doi.org/10.1145/1390156.1390181.

*PMLR*. http://proceedings.mlr.press/v70/cutajar17a.html.

*Journal of Machine Learning Research*6 (December): 2153–75. http://jmlr.org/papers/volume6/drineas05a/drineas05a.pdf.

*IEEE Transactions on Neural Networks*15 (6): 1517–25. https://doi.org/10.1109/TNN.2004.837781.

*Proceedings of the International Conference on Machine Learning*. http://www.jmlr.org/proceedings/papers/v28/le13-supp.pdf.

*International Conference on Computational Learning Theory*, 154–68. Lecture Notes in Computer Science. Springer. https://doi.org/10.1007/11776420_14.

*Advances in Neural Information Processing Systems*, 1177–84. Curran Associates, Inc. http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.

*2008 46th Annual Allerton Conference on Communication, Control, and Computing*, 555–61. https://doi.org/10.1109/ALLERTON.2008.4797607.

*Advances in Neural Information Processing Systems*, 1313–20. Curran Associates, Inc. http://papers.nips.cc/paper/3495-weighted-sums-of-random-kitchen-sinks-replacing-minimization-with-randomization-in-learning.

*Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery*7 (2). https://doi.org/10.1002/widm.1200.

*Mustererkennung 1998*, edited by Paul Levi, Michael Schanz, Rolf-Jürgen Ahlers, and Franz May, 125–32. Informatik Aktuell. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-72282-0_12.

*Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond*. MIT Press.

*Artificial Neural Networks — ICANN’97*, edited by Wulfram Gerstner, Alain Germond, Martin Hasler, and Jean-Daniel Nicoud, 583–88. Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/BFb0020217.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*34 (3, 3): 480–92. https://doi.org/10.1109/TPAMI.2011.153.

*Procedings of the British Machine Vision Conference 2010*, 2.1–11. Aberystwyth: British Machine Vision Association. https://doi.org/10.5244/C.24.2.

*Advances in Neural Information Processing Systems 13*, edited by T. K. Leen, T. G. Dietterich, and V. Tresp, 46:675–81. MIT Press. https://doi.org/10.1023/A:1012485807823.

*Advances in Neural Information Processing Systems*, 682–88. http://papers.nips.cc/paper/1866-using-the-nystrom-method-to-speed-up-kernel-machines.

*Advances in Neural Information Processing Systems*, 476–84. http://papers.nips.cc/paper/4588-nystrom-method-vs-random-fourier-features-a-theoretical-and-empirical-comparison.

*Proceedings of the 30th International Conference on Machine Learning (ICML-13)*, 570–78. http://www.jmlr.org/proceedings/papers/v28/yu13.pdf.