Kernel approximation

July 27, 2016 — March 6, 2020

algebra
functional analysis
Hilbert space
linear algebra
model selection
nonparametric
probabilistic algorithms
probability
signal processing
sparser than thou

A page where I document what I don’t know about kernel approximation. A page about what I do know would be empty.

What I mean is: approximating implicit Mercer kernel feature maps with explicit features; equivalently, approximating the Gram matrix. This is also related to mixture model inference and clustering.

I’m especially interested in how it might be done using random linear algebra via random embeddings such as random projections.
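
As a toy illustration of the random-embedding idea (my own sketch, not taken from any of the papers cited here), a Gaussian random projection already gives an unbiased low-rank approximation of the linear-kernel Gram matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 200))   # n=500 points in d=200 dimensions

# Exact linear-kernel Gram matrix, O(n^2 d) to form.
K = X @ X.T

# Gaussian random projection to k << d dimensions; E[Z @ Z.T] = K.
k = 50
Omega = rng.standard_normal((200, k)) / np.sqrt(k)
Z = X @ Omega                          # explicit low-dimensional features
K_approx = Z @ Z.T

print(np.linalg.norm(K - K_approx) / np.linalg.norm(K))
```

The kernel-approximation methods below play the same game, but for nonlinear kernels the embedding has to be cleverer than a plain linear projection.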

Also, I guess we need to know the trade-offs between computational time/space cost and approximation error, so that we can decide when to bother. When is it enough to reduce computational cost merely by using support vectors, or by evaluating the kernels efficiently using, e.g., the Fast Gauss Transform and related methods, rather than coming up with alternative kernel bases? (For example, you might not want to do the fiddly coding for the Fast Gauss Transform.)

This is not the related but distinct problem (a terminological collision) of approximating functions using convolution kernels. In kernel approximation, we look for approximations, in some sense, to the component kernels themselves.

A good introduction to what is going on is

1 Do you even need to calculate the Gram matrix?

Francis R. Bach and Jordan (2002) perform kernel ICA without calculating the whole Gram matrix (!). Does that fit in here?

2 Stationary kernels

I am mostly interested in translation-invariant (i.e. stationary, or homogeneous) kernels, so assume I’m talking about those unless I say otherwise.

Short introduction at scikit-learn kernel approximation page.

See (Drineas and Mahoney 2005; T. Yang et al. 2012; Vedaldi and Zisserman 2012; Rahimi and Recht 2007, 2009; Alaoui and Mahoney 2014; F. Bach 2015; Francis R. Bach 2013; Solin and Särkkä 2020).

The approximations might be based on random projections or on random sampling, e.g. the Nyström method, which reportedly amounts to Monte Carlo integration, although I understand there is an optimisation step too?
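
A minimal Nyström sketch (my own illustration, assuming an RBF kernel and uniform sampling of landmark points; practical implementations choose landmarks and regularise more carefully): approximate \(K \approx K_{nm} K_{mm}^{+} K_{nm}^{\top}\).

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf(A, B, gamma=0.5):
    return np.exp(-gamma * cdist(A, B, "sqeuclidean"))

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 5))

# Nyström: pick m landmark points uniformly at random, then form the
# rank-m approximation K ≈ K_nm @ pinv(K_mm) @ K_nm.T.
m = 100
idx = rng.choice(len(X), size=m, replace=False)
K_nm = rbf(X, X[idx])
K_mm = rbf(X[idx], X[idx])
K_nystrom = K_nm @ np.linalg.pinv(K_mm) @ K_nm.T

K = rbf(X, X)
print(np.linalg.norm(K - K_nystrom) / np.linalg.norm(K))
```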

I need to work out the differences between random Fourier features, random kitchen sinks, Nyström methods, and whatever Smola et al. (Schölkopf et al. 1998) call their special-case Gaussian approximation. I think random Fourier features are the same thing as random kitchen sinks (Rahimi and Recht 2007), while Nyström is different (C. K. Williams and Seeger 2001). When can we exploit (data- or kernel-) structure to do better (say, Le, Sarlós, and Smola 2013)? Can quasi-Monte Carlo improve on plain Monte Carlo? (Update: someone already had that idea: J. Yang et al. 2014.) Or better matrix factorisations?
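
To make the Rahimi and Recht (2007) construction concrete, here is a bare-bones random Fourier feature sketch for the Gaussian RBF kernel (my own toy code, not the authors'): sample frequencies from the kernel's spectral density, which for \(\exp(-\gamma \|x-y\|^2)\) is Gaussian with variance \(2\gamma\) per coordinate, and use random-phase cosine features.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 5))
gamma = 0.5
K = np.exp(-gamma * cdist(X, X, "sqeuclidean"))   # exact RBF Gram matrix

# Random Fourier features: frequencies W ~ N(0, 2*gamma*I), phases b ~ U(0, 2*pi),
# explicit features z(x) = sqrt(2/D) * cos(W'x + b), so that z(x).z(y) ≈ k(x, y).
D = 500
W = rng.standard_normal((X.shape[1], D)) * np.sqrt(2 * gamma)
b = rng.uniform(0, 2 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)

print(np.linalg.norm(K - Z @ Z.T) / np.linalg.norm(K))
```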

Vempati et al. (2010):

Recently, Maji and Berg and Vedaldi and Zisserman proposed explicit feature maps to approximate the additive kernels (intersection, \(\chi^2\), etc.) by linear ones, thus enabling the use of fast machine learning technique in a non-linear context. An analogous technique was proposed by Rahimi and Recht (Rahimi and Recht 2007) for the translation invariant RBF kernels. In this paper, we complete the construction and combine the two techniques to obtain explicit feature maps for the generalized RBF kernels.
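
The combination is easy to try with off-the-shelf scikit-learn transformers (a sketch under my own choice of toy data and parameters, not Vempati et al.'s code): chain an additive \(\chi^2\) feature map with an RBF random-feature map to approximate an exponentiated-\(\chi^2\) "generalized RBF" kernel, then fit a linear classifier in the explicit feature space.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.kernel_approximation import AdditiveChi2Sampler, RBFSampler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
X = rng.random((300, 20))                  # nonnegative histogram-like features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # arbitrary toy labels

# Additive chi^2 map, then an RBF random-feature map, then a linear SVM
# standing in for the corresponding kernelized classifier.
clf = make_pipeline(
    AdditiveChi2Sampler(sample_steps=2),
    RBFSampler(gamma=1.0, n_components=300, random_state=0),
    LinearSVC(),
)
clf.fit(X, y)
print(clf.score(X, y))
```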

3 Using bases

See GP basis methods.

4 Inner product kernels

See e.g. Koltchinskii and Giné (2000). These have some elegant relations to low-dimensional random projections. How well can we approximate NN kernels?
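
One concrete NN-kernel case I can check numerically (my own sketch; the closed form is the order-1 arc-cosine kernel, which is not among the references above): the kernel of an infinitely wide ReLU layer has both a closed form and an obvious Monte Carlo estimate via random ReLU features.

```python
import numpy as np

rng = np.random.default_rng(4)
d, D = 10, 200_000                     # input dimension, number of random features
x = rng.standard_normal(d)
y = rng.standard_normal(d)

# Closed-form order-1 arc-cosine kernel:
# k(x, y) = ||x|| ||y|| (sin(theta) + (pi - theta) cos(theta)) / pi.
theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
k_exact = (np.linalg.norm(x) * np.linalg.norm(y) / np.pi) * (
    np.sin(theta) + (np.pi - theta) * np.cos(theta)
)

# Monte Carlo estimate with random ReLU features: k ≈ (2/D) Σ_i relu(w_i.x) relu(w_i.y).
W = rng.standard_normal((D, d))
k_mc = 2.0 * np.mean(np.maximum(W @ x, 0) * np.maximum(W @ y, 0))

print(k_exact, k_mc)
```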

5 Lectures

Figure 2: InkedIcon, on inversions in Hilbert space, after Charles Stross

6 Connections

MDS and Kernel PCA and mixture models are all related in a way I should try to understand when I have a moment.

For the connection with kernel PCA, see Schölkopf et al. (1998); for the connection with metric multidimensional scaling, see C. K. I. Williams (2001).
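
For the linear-kernel special case at least, the connection is easy to verify numerically (a toy check of my own, not Williams' general argument): classical MDS double-centres the squared-distance matrix and recovers exactly the centred Gram matrix that kernel PCA diagonalises.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
X = rng.standard_normal((n, 3))
J = np.eye(n) - np.ones((n, n)) / n    # centring matrix

# Centred linear-kernel Gram matrix, as diagonalised by (linear) kernel PCA.
K_centered = J @ (X @ X.T) @ J

# Classical MDS: double-centre the squared Euclidean distance matrix.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
B = -0.5 * J @ sq_dists @ J

print(np.allclose(K_centered, B))      # the two embeddings coincide
```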

Altschuler et al. (2019) use projection tricks to calculate Sinkhorn divergence.

7 References

Alaoui, and Mahoney. 2014. “Fast Randomized Kernel Methods With Statistical Guarantees.” arXiv:1411.0306 [Cs, Stat].
Altschuler, Bach, Rudi, et al. 2019. “Massively Scalable Sinkhorn Distances via the Nyström Method.” In Advances in Neural Information Processing Systems 32.
Bach, Francis R. 2013. “Sharp Analysis of Low-Rank Kernel Matrix Approximations.” In COLT.
Bach, Francis. 2015. “On the Equivalence Between Kernel Quadrature Rules and Random Feature Expansions.”
Bach, Francis R, and Jordan. 2002. “Kernel Independent Component Analysis.” Journal of Machine Learning Research.
Bakır, Weston, and Schölkopf. 2004. “Learning to Find Pre-Images.” Advances in Neural Information Processing Systems.
Beylkin, and Monzón. 2010. “Approximation by Exponential Sums Revisited.” Applied and Computational Harmonic Analysis.
Brault, d’Alché-Buc, and Heinonen. 2016. “Random Fourier Features for Operator-Valued Kernels.” In Proceedings of The 8th Asian Conference on Machine Learning.
Brault, Lim, and d’Alché-Buc. n.d. “Scaling up Vector Autoregressive Models With Operator-Valued Random Fourier Features.”
Cheney, and Light. 2009. A Course in Approximation Theory.
Cheng, and Singer. 2013. “The Spectrum of Random Inner-Product Kernel Matrices.” Random Matrices: Theory and Applications.
Choromanski, Rowland, and Weller. 2017. “The Unreasonable Effectiveness of Random Orthogonal Embeddings.” arXiv:1703.00864 [Stat].
Choromanski, and Sindhwani. 2016. “Recycling Randomness with Structure for Sublinear Time Kernel Expansions.” arXiv:1605.09049 [Cs, Stat].
Curto, Zarza, Yang, et al. 2017. “F2F: A Library For Fast Kernel Expansions.” arXiv:1702.08159 [Cs].
Cutajar, Bonilla, Michiardi, et al. 2017. “Random Feature Expansions for Deep Gaussian Processes.” In PMLR.
Drineas, and Mahoney. 2005. “On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning.” Journal of Machine Learning Research.
El Karoui. 2010. “The Spectrum of Kernel Random Matrices.” The Annals of Statistics.
Globerson, and Livni. 2016. “Learning Infinite-Layer Networks: Beyond the Kernel Trick.” arXiv:1606.05316 [Cs].
Guhaniyogi, and Dunson. 2013. “Bayesian Compressed Regression.”
Kar, and Karnick. 2012. “Random Feature Maps for Dot Product Kernels.” In Artificial Intelligence and Statistics.
Koltchinskii, and Giné. 2000. “Random Matrix Approximation of Spectra of Integral Operators.” Bernoulli.
Kwok, and Tsang. 2004. “The Pre-Image Problem in Kernel Methods.” IEEE Transactions on Neural Networks.
Le, Sarlós, and Smola. 2013. “Fastfood-Approximating Kernel Expansions in Loglinear Time.” In Proceedings of the International Conference on Machine Learning.
Minh, Niyogi, and Yao. 2006. “Mercer’s Theorem, Feature Maps, and Smoothing.” In International Conference on Computational Learning Theory. Lecture Notes in Computer Science.
Pourkamali-Anaraki, and Becker. 2016a. “A Randomized Approach to Efficient Kernel Clustering.” arXiv:1608.07597 [Stat].
———. 2016b. “Randomized Clustered Nystrom for Large-Scale Kernel Machines.” arXiv:1612.06470 [Cs, Stat].
Rahimi, and Recht. 2007. “Random Features for Large-Scale Kernel Machines.” In Advances in Neural Information Processing Systems.
———. 2008. “Uniform Approximation of Functions with Random Bases.” In 2008 46th Annual Allerton Conference on Communication, Control, and Computing.
———. 2009. “Weighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning.” In Advances in Neural Information Processing Systems.
Scardapane, and Wang. 2017. “Randomness in Neural Networks: An Overview.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.
Schölkopf, Knirsch, Smola, et al. 1998. “Fast Approximation of Support Vector Kernel Expansions, and an Interpretation of Clustering as Approximation in Feature Spaces.” In Mustererkennung 1998. Informatik Aktuell.
Schölkopf, and Smola. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond.
Schölkopf, Smola, and Müller. 1997. “Kernel Principal Component Analysis.” In Artificial Neural Networks — ICANN’97. Lecture Notes in Computer Science.
Solin, and Särkkä. 2020. “Hilbert Space Methods for Reduced-Rank Gaussian Process Regression.” Statistics and Computing.
Sterge, and Sriperumbudur. 2021. “Statistical Optimality and Computational Efficiency of Nyström Kernel PCA.” arXiv:2105.08875 [Cs, Math, Stat].
Strobl, Zhang, and Visweswaran. 2017. “Approximate Kernel-Based Conditional Independence Tests for Fast Non-Parametric Causal Discovery.” arXiv:1702.03877 [Stat].
Sutherland, and Schneider. 2015. “On the Error of Random Fourier Features.” In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence. UAI’15.
Tompkins, and Ramos. 2018. “Fourier Feature Approximations for Periodic Kernels in Time-Series Modelling.” Proceedings of the AAAI Conference on Artificial Intelligence.
Vedaldi, and Zisserman. 2012. “Efficient Additive Kernels via Explicit Feature Maps.” IEEE Transactions on Pattern Analysis and Machine Intelligence.
Vempati, Vedaldi, Zisserman, et al. 2010. “Generalized RBF Feature Maps for Efficient Detection.” In Proceedings of the British Machine Vision Conference 2010.
Wang, Li, Khabsa, et al. 2020. “Linformer: Self-Attention with Linear Complexity.”
Williams, Christopher K. I. 2001. “On a Connection Between Kernel PCA and Metric Multidimensional Scaling.” In Advances in Neural Information Processing Systems 13.
Williams, Christopher KI, and Seeger. 2001. “Using the Nyström Method to Speed Up Kernel Machines.” In Advances in Neural Information Processing Systems.
Yang, Tianbao, Li, Mahdavi, et al. 2012. “Nyström Method Vs Random Fourier Features: A Theoretical and Empirical Comparison.” In Advances in Neural Information Processing Systems.
Yang, Jiyan, Sindhwani, Avron, et al. 2014. “Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels.” arXiv:1412.8293 [Cs, Math, Stat].
Yu, Yaoliang, Cheng, Schuurmans, et al. 2013. “Characterizing the Representer Theorem.” In Proceedings of the 30th International Conference on Machine Learning (ICML-13).
Yu, Chenhan D., March, and Biros. 2017. “An \(N \log N\) Parallel Fast Direct Solver for Kernel Matrices.” In arXiv:1701.02324 [Cs].
Zhang, Jaslyn. 2017. “Improved Genomic Selection Using Vowpal Wabbit with Random Fourier Features.”
Zhang, Qinyi, Filippi, Gretton, et al. 2016. “Large-Scale Kernel Methods for Independence Testing.” arXiv:1606.07892 [Stat].