Statistical learning theory

Eventually including structural risk minimisation, risk bounds, hopefully-uniform convergence rates, VC-dimension, generalisation-and-stability framings etc


Image by Kareem Carr

Another placeholder far from my own background.

Given some amount of noisy data, how complex a model can I learn before I’m going to be failing to generalise to new data? If I can answer this question a priori, I can fit a complex model with some messy hyperparameter and choose that hyperparameter without doing boring cross-validation. When I did statistics we talked about this in terms of regularisation and [model selection(./model_selection.html). However, the ML people started thinking about this form the other end.

Depending on your training you might think about this in terms of Rademacher complexity, Gaussian complexity, Vapnik-Chernovenkis dimension, bias-variance tradeoffs… Modern results seem to appeal to concentration inequalities. Also apparently stability of estimators gets you somewhere?

AFAICT, nice statistical learning results apply to very particular classes of models; e.g. SVMs and kernel regressions have built-in sample complexity results under the right loss function, but the ground gets rapidly shakier as you generalise.

Misc

Percy Liang’s course notes give a rapid overview: CS229T/STAT231: Statistical Learning Theory (Winter 2014).

See also function approximation, and or model selection for the statisticians’ approach, which is more about working out which model our data can support once you’ve fit it, frequently by appealing to asymptotic large-sample results. In learning theory it seem one always cares about finite sample bounds. Which is reasonable. and the attractiveness of getting them without, e.g. tedious and computationally expensive cross validation, bootstrapping is understandable.

I would like to understand the relationship between the different kinds of convergence rate results that we get here, and different learning theories.

I would like results that applied to dependent data.

VC dimension

Rademacher complexity## Stability-based

A# Stability-based Also apparently stability of estimators gets you somewhere?

PAC-learning

Probably approximately correct, you mean?

Non-I.I.D data

See learning for dependent data.

Stein’s method

I just ran into this via Reinert (Reinert 2000) and it looks handy. 🏗

Misc, to file

(Golubev and Nussbaum 1990; Hasminskii and Ibragimov 1990) give minimax convergence rates in Sobolev class regression.

This looks like fun (David, Moran, and Yehudayoff 2016):

We begin with the setting of multiclass categorization (zero/one loss). We prove that in this case learnability is equivalent to compression of logarithmic sample size, and that uniform convergence implies compression of constant size. We then consider Vapnik’s general learning setting: we show that in order to extend the compressibility-learnability equivalence to this case, it is necessary to consider an approximate variant of compression.

Abbeel, Pieter, Daphne Koller, and Andrew Y. Ng. 2006. “Learning Factor Graphs in Polynomial Time and Sample Complexity.” Journal of Machine Learning Research 7 (December): 1743–88. http://machinelearning.wustl.edu/mlpapers/paper_files/AbbeelKN06.pdf.

Backurs, Arturs, Piotr Indyk, and Ludwig Schmidt. 2017. “On the Fine-Grained Complexity of Empirical Risk Minimization: Kernel Methods and Neural Networks,” April. http://arxiv.org/abs/1704.02958.

Banerjee, Arindam, Sheng Chen, Farideh Fazayeli, and Vidyashankar Sivakumar. 2014. “Estimation with Norm Regularization.” In Advances in Neural Information Processing Systems 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 1556–64. Curran Associates, Inc. http://papers.nips.cc/paper/5465-estimation-with-norm-regularization.pdf.

Barbier, Jean, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová. 2017. “Phase Transitions, Optimal Errors and Optimality of Message-Passing in Generalized Linear Models,” August. http://arxiv.org/abs/1708.03395.

Barbier, Jean, and Nicolas Macris. 2017. “The Stochastic Interpolation Method: A Simple Scheme to Prove Replica Formulas in Bayesian Inference,” May. http://arxiv.org/abs/1705.02780.

Barbour, A. D., and Louis Hsiao Yun Chen. 2005. An Introduction to Stein’s Method. World Scientific. http://books.google.com?id=iK9sPXG7A6EC.

Barron, Andrew R., Cong Huang, Jonathan Q. Li, and Xi Luo. 2008. “MDL, Penalized Likelihood, and Statistical Risk.” In Information Theory Workshop, 2008. ITW’08. IEEE, 247–57. IEEE. https://doi.org/10.1109/ITW.2008.4578660.

Bartlett, Peter L, Michael I Jordan, and Jon D McAuliffe. 2006. “Convexity, Classification, and Risk Bounds.” Journal of the American Statistical Association 101 (473): 138–56. https://doi.org/10.1198/016214505000000907.

Bartlett, Peter L., and Shahar Mendelson. 2002. “Rademacher and Gaussian Complexities: Risk Bounds and Structural Results.” Journal of Machine Learning Research 3 (Nov): 463–82. http://www.jmlr.org/papers/v3/bartlett02a.html.

Bellec, Pierre C., and Alexandre B. Tsybakov. 2016. “Bounds on the Prediction Error of Penalized Least Squares Estimators with Convex Penalty,” September. http://arxiv.org/abs/1609.06675.

Bertin, K., E. Le Pennec, and V. Rivoirard. 2011. “Adaptive Dantzig Density Estimation.” Annales de L’Institut Henri Poincaré, Probabilités et Statistiques 47 (1): 43–74. https://doi.org/10.1214/09-AIHP351.

Birgé, Lucien. 2008. “Model Selection for Density Estimation with L2-Loss,” August. http://arxiv.org/abs/0808.1416.

Blumer, Anselm, A. Ehrenfeucht, David Haussler, and Manfred K. Warmuth. 1989. “Learnability and the Vapnik-Chervonenkis Dimension.” J. ACM 36 (4): 929–65. https://doi.org/10.1145/76359.76371.

Bousquet, Olivier, Stéphane Boucheron, and Gábor Lugosi. 2004. “Introduction to Statistical Learning Theory.” In Advanced Lectures on Machine Learning, edited by Olivier Bousquet, Ulrike von Luxburg, and Gunnar Rätsch, 169–207. Lecture Notes in Computer Science 3176. Springer Berlin Heidelberg. http://www.econ.upf.edu/~lugosi/mlss_slt.pdf.

Bunea, Florentina, Alexandre B. Tsybakov, and Marten H. Wegkamp. 2007a. “Sparse Density Estimation with ℓ1 Penalties.” In Learning Theory, edited by Nader H. Bshouty and Claudio Gentile, 530–43. Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-72927-3_38.

Bunea, Florentina, Alexandre Tsybakov, and Marten Wegkamp. 2007b. “Sparsity Oracle Inequalities for the Lasso.” Electronic Journal of Statistics 1: 169–94. https://doi.org/10.1214/07-EJS008.

Canas, Guillermo D., and Lorenzo Rosasco. 2012. “Learning Probability Measures with Respect to Optimal Transport Metrics,” September. http://arxiv.org/abs/1209.1077.

Carmi, Avishy Y. 2013. “Compressive System Identification: Sequential Methods and Entropy Bounds.” Digital Signal Processing 23 (3): 751–70. https://doi.org/10.1016/j.dsp.2012.12.006.

Chichignoud, Michaël, Johannes Lederer, and Martin Wainwright. 2014. “A Practical Scheme and Fast Algorithm to Tune the Lasso with Optimality Guarantees,” October. http://arxiv.org/abs/1410.0247.

Daneshmand, Hadi, Manuel Gomez-Rodriguez, Le Song, and Bernhard Schölkopf. 2014. “Estimating Diffusion Network Structures: Recovery Conditions, Sample Complexity & Soft-Thresholding Algorithm.” In ICML. http://arxiv.org/abs/1405.2936.

David, Ofir, Shay Moran, and Amir Yehudayoff. 2016. “On Statistical Learning via the Lens of Compression,” October. http://arxiv.org/abs/1610.03592.

Devroye, Luc, László Györfi, and Gábor Lugosi. 1996. A Probabilistic Theory of Pattern Recognition. New York: Springer. http://www.szit.bme.hu/~gyorfi/pbook.pdf.

Dinh, Vu, Lam Si Tung Ho, Duy Nguyen, and Binh T. Nguyen. 2016. “Fast Learning Rates with Heavy-Tailed Losses.” In NIPS. http://arxiv.org/abs/1609.09481.

Flammia, Steven T., David Gross, Yi-Kai Liu, and Jens Eisert. 2012. “Quantum Tomography via Compressed Sensing: Error Bounds, Sample Complexity, and Efficient Estimators.” New Journal of Physics 14 (9): 095022. https://doi.org/10.1088/1367-2630/14/9/095022.

Foygel, Rina, and Nathan Srebro. 2011. “Fast-Rate and Optimistic-Rate Error Bounds for L1-Regularized Regression,” August. http://arxiv.org/abs/1108.0373.

Geer, Sara van de. 2002. “On Hoeffdoing’s Inequality for Dependent Random Variables.” In Empirical Process Techniques for Dependent Data. Birkhhäuser.

———. 2014. “Worst Possible Sub-Directions in High-Dimensional Models.” In. Vol. 131. http://arxiv.org/abs/1403.7023.

Geer, Sara A. van de. 2008. “High-Dimensional Generalized Linear Models and the Lasso.” The Annals of Statistics 36 (2): 614–45. https://doi.org/10.1214/009053607000000929.

Gnecco, Giorgio, and Marcello Sanguineti. 2008. “Approximation Error Bounds via Rademacher’s Complexity.” Applied Mathematical Sciences 2 (4): 153–76. http://users.dibris.unige.it/gnecco/AMS1.pdf.

Golowich, Noah, Alexander Rakhlin, and Ohad Shamir. 2017. “Size-Independent Sample Complexity of Neural Networks,” December. http://arxiv.org/abs/1712.06541.

Golubev, Grigori K., and Michael Nussbaum. 1990. “A Risk Bound in Sobolev Class Regression.” The Annals of Statistics 18 (2): 758–78. https://doi.org/10.1214/aos/1176347624.

Goodfellow, Ian, Nicolas Papernot, and Patrick McDaniel. 2016. “Cleverhans V0.1: An Adversarial Machine Learning Library,” October. http://arxiv.org/abs/1610.00768.

Gribonval, Rémi, Gilles Blanchard, Nicolas Keriven, and Yann Traonmilin. 2017. “Compressive Statistical Learning with Random Feature Moments,” June. http://arxiv.org/abs/1706.07180.

Hansen, Niels Richard, Patricia Reynaud-Bouret, and Vincent Rivoirard. 2015. “Lasso and Probabilistic Inequalities for Multivariate Point Processes.” Bernoulli 21 (1): 83–143. https://doi.org/10.3150/13-BEJ562.

Hasminskii, Rafael, and Ildar Ibragimov. 1990. “On Density Estimation in the View of Kolmogorov’s Ideas in Approximation Theory.” The Annals of Statistics 18 (3): 999–1010. https://doi.org/10.1214/aos/1176347736.

Hastie, Trevor J., Rob Tibshirani, and Martin J. Wainwright. 2015. Statistical Learning with Sparsity: The Lasso and Generalizations. Boca Raton: Chapman and Hall/CRC. https://web.stanford.edu/~hastie/StatLearnSparsity/index.html.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer.

Huang, Cong, G. L. H. Cheang, and Andrew R. Barron. 2008. “Risk of Penalized Least Squares, Greedy Selection and L1 Penalization for Flexible Function Libraries.” http://www.stat.yale.edu/~arb4/publications_files/RiskGreedySelectionAndL1penalization.pdf.

Kabán, Ata. 2014. “New Bounds on Compressive Linear Least Squares Regression.” In Journal of Machine Learning Research, 448–56. http://jmlr.org/proceedings/papers/v33/kaban14.pdf.

Kloft, Marius, Ulrich Rückert, and Peter L. Bartlett. 2010. “A Unifying View of Multiple Kernel Learning.” In Machine Learning and Knowledge Discovery in Databases, edited by José Luis Balcázar, Francesco Bonchi, Aristides Gionis, and Michèle Sebag, 66–81. Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-15883-4_5.

Koltchinskii, Prof Vladimir. 2011. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Lecture Notes in Mathematics École d’Été de Probabilités de Saint-Flour 2033. Heidelberg: Springer. https://doi.org/10.1007/978-3-642-22147-7_1.

Krishnapuram, Balaji, Lawrence Carin, Mario A. T. Figueiredo, and Alexander J. Hartemink. 2005. “Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds.” IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (6): 957–68. https://doi.org/10.1109/TPAMI.2005.127.

Kuznetsov, Vitaly, and Mehryar Mohri. 2015. “Learning Theory and Algorithms for Forecasting Non-Stationary Time Series.” In Advances in Neural Information Processing Systems, 541–49. Curran Associates, Inc. http://papers.nips.cc/paper/5836-learning-theory-and-algorithms-for-forecasting-non-stationary-time-series.

Lederer, Johannes, and Sara van de Geer. 2014. “New Concentration Inequalities for Suprema of Empirical Processes.” Bernoulli 20 (4): 2020–38. https://doi.org/10.3150/13-BEJ549.

Luxburg, Ulrike von, and Bernhard Schölkopf. 2008. “Statistical Learning Theory: Models, Concepts, and Results,” October. http://arxiv.org/abs/0810.4752.

Maclaurin, Dougal, David K. Duvenaud, and Ryan P. Adams. 2015. “Gradient-Based Hyperparameter Optimization Through Reversible Learning.” In ICML, 2113–22. http://www.jmlr.org/proceedings/papers/v37/maclaurin15.pdf.

Maurya, Abhinav. 2018. “Optimal Transport in Statistical Machine Learning : Selected Review and Some Open Questions.” In.

McDonald, Daniel J, Cosma Rohilla Shalizi, and Mark Schervish. 2011. “Estimated VC Dimension for Risk Bounds.” Eprint arXiv:1111.3404. http://arxiv.org/abs/1111.3404.

McDonald, Daniel J., Cosma Rohilla Shalizi, and Mark Schervish. 2011a. “Generalization Error Bounds for Stationary Autoregressive Models,” March. http://arxiv.org/abs/1103.0942.

———. 2011b. “Risk Bounds for Time Series Without Strong Mixing,” June. http://arxiv.org/abs/1106.0730.

Mei, Song, Yu Bai, and Andrea Montanari. 2016. “The Landscape of Empirical Risk for Non-Convex Losses,” July. http://arxiv.org/abs/1607.06534.

Meng, Qi, Yue Wang, Wei Chen, Taifeng Wang, Zhi-Ming Ma, and Tie-Yan Liu. 2016. “Generalization Error Bounds for Optimization Algorithms via Stability.” In, 10:441–74. http://arxiv.org/abs/1609.08397.

Natarajan, B. K. 1989. “On Learning Sets and Functions.” Machine Learning 4 (1): 67–97. https://doi.org/10.1007/BF00114804.

Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio. 2013. “On the Difficulty of Training Recurrent Neural Networks.” In, 1310–8. http://arxiv.org/abs/1211.5063.

Reid, Mark D., and Robert C. Williamson. 2011. “Information, Divergence and Risk for Binary Experiments.” Journal of Machine Learning Research 12 (Mar): 731–817. http://www.jmlr.org/papers/v12/reid11a.html.

Reinert, Gesine. 2000. “Stein’s Method for Epidemic Processes.” In Complex Stochastic Systems, 235–75. Boca Raton: Chapman & Hall/CRC. http://www.stats.ox.ac.uk/~reinert/papers/episemrevnew.pdf.

Reynaud-Bouret, Patricia. 2003. “Adaptive Estimation of the Intensity of Inhomogeneous Poisson Processes via Concentration Inequalities.” Probability Theory and Related Fields 126 (1). https://doi.org/10.1007/s00440-003-0259-1.

Reynaud-Bouret, Patricia, and Sophie Schbath. 2010. “Adaptive Estimation for Hawkes Processes; Application to Genome Analysis.” The Annals of Statistics 38 (5): 2781–2822. https://doi.org/10.1214/10-AOS806.

Schölkopf, Bernhard, and Alexander J. Smola. 2003. “A Short Introduction to Learning with Kernels.” In Advanced Lectures on Machine Learning, edited by Shahar Mendelson and Alexander J. Smola, 41–64. Lecture Notes in Computer Science 2600. Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-36434-X_2.

———. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.

Singh, Shashank, and Barnabás Póczos. 2018. “Minimax Distribution Estimation in Wasserstein Distance,” February. http://arxiv.org/abs/1802.08855.

Şimşekli, Umut, Ozan Sener, George Deligiannidis, and Murat A. Erdogdu. 2020. “Hausdorff Dimension, Stochastic Differential Equations, and Generalization in Neural Networks,” June. http://arxiv.org/abs/2006.09313.

Tropp, Joel A. 2015. “An Introduction to Matrix Concentration Inequalities,” January. http://arxiv.org/abs/1501.01571.

Tulabandhula, Theja, and Cynthia Rudin. 2014. “Generalization Bounds for Learning with Linear, Polygonal, Quadratic and Conic Side Knowledge.” arXiv Preprint arXiv:1405.7764. http://arxiv.org/abs/1405.7764.

Vainsencher, Daniel, Shie Mannor, and Alfred M. Bruckstein. 2011. “The Sample Complexity of Dictionary Learning.” Journal of Machine Learning Research 12 (Nov): 3259–81. http://www.jmlr.org/papers/v12/vainsencher11a.html.

Vapnik, V., and A. Chervonenkis. 1971. “On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities.” Theory of Probability & Its Applications 16 (2): 264–80. https://doi.org/10.1137/1116025.

Vapnik, Vladimir. 2010. The Nature of Statistical Learning Theory. Softcover reprint of hardcover 2nd ed. 2000. Springer.

Vapnik, Vladimir, Esther Levin, and Yann Le Cun. 1994. “Measuring the VC-Dimension of a Learning Machine.” Neural Computation 6 (5): 851–76. https://doi.org/10.1162/neco.1994.6.5.851.

Wu, Qiang, and Ding-Xuan Zhou. 2008. “Learning with Sample Dependent Hypothesis Spaces.” Computers & Mathematics with Applications 56 (11): 2896–2907. https://doi.org/10.1016/j.camwa.2008.09.014.

Xu, Aolin, and Maxim Raginsky. 2017. “Information-Theoretic Analysis of Generalization Capability of Learning Algorithms.” In Advances in Neural Information Processing Systems. http://arxiv.org/abs/1705.07809.

Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017. “Understanding Deep Learning Requires Rethinking Generalization.” In Proceedings of ICLR. http://arxiv.org/abs/1611.03530.