# Statistical learning theory

Eventually including structural risk minimisation, risk bounds, hopefully-uniform convergence rates, VC dimension, generalisation-and-stability framings, etc.

July 6, 2016 — August 16, 2016

Another placeholder far from my own background.

Given some amount of noisy data, how complex a model can I learn before I’m going to fail to generalise to new data? If I can answer this question *a priori*, I can fit a complex model with some messy hyperparameter and choose that hyperparameter without doing boring cross-validation. When I did statistics we talked about this in terms of regularisation and [model selection](./model_selection.html). However, the ML people started thinking about this from the other end.
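One crude sketch of that *a priori* idea is structural risk minimisation: pick the model complexity that minimises training error plus a capacity penalty, with no held-out data at all. The toy example below does this for polynomial regression; the penalty shape `0.1 * sqrt((d + 1) / n)` mimics the form of typical capacity bounds, but the constant is my own invention, not a theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a cubic signal plus noise. All constants here are
# illustrative choices, not taken from any particular bound.
n = 200
x = rng.uniform(-1, 1, n)
y = x**3 - x + 0.1 * rng.normal(size=n)

def empirical_risk(degree):
    """Mean squared training error of a degree-`degree` polynomial fit."""
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

def srm_objective(degree):
    # Penalised risk: training error plus a complexity penalty that
    # grows with the parameter count and shrinks with sample size.
    # The 0.1 multiplier is a made-up constant for the demo.
    return empirical_risk(degree) + 0.1 * np.sqrt((degree + 1) / n)

degrees = range(1, 11)
best = min(degrees, key=srm_objective)
```

Training error alone would keep falling with degree; the penalty stops the selection near the true cubic, which is the whole point of the exercise.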

Depending on your training you might think about this in terms of Rademacher complexity, Gaussian complexity, Vapnik-Chervonenkis dimension, bias-variance tradeoffs… Modern results seem to appeal to concentration inequalities. Also apparently stability of estimators gets you somewhere?

AFAICT, *nice* statistical learning results apply to particular classes of models; e.g. SVMs and kernel regressions have built-in sample complexity results under the right loss function, but the ground gets rapidly shakier as you generalise.

## 1 VC dimension
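The headline result here: for a hypothesis class of VC dimension $d$, the true risk exceeds the empirical risk by at most a deviation term of order $\sqrt{d \log n / n}$, with high probability. A sketch of one common textbook form of the bound; the constants differ between presentations and sharper versions exist, so treat this as illustrative.

```python
import numpy as np

def vc_bound(emp_risk, n, d, delta=0.05):
    """Vapnik-style upper bound on true risk: empirical risk plus a
    deviation term, holding with probability at least 1 - delta.
    Constants follow one common textbook presentation."""
    eps = np.sqrt((d * (np.log(2.0 * n / d) + 1.0) + np.log(4.0 / delta)) / n)
    return emp_risk + eps
```

For example, linear classifiers on $\mathbb{R}^{10}$ have VC dimension 11, so `vc_bound(0.1, 100_000, 11)` is only a few percent above the training error, while the same class at a thousand samples is much looser.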

## 2 Rademacher complexity
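Empirical Rademacher complexity measures how well a function class can correlate with random signs on the sample, and unlike VC dimension it is data-dependent and directly estimable by Monte Carlo, at least for a finite class. A toy sketch (the class of random sign vectors is my own construction, purely for illustration); Massart’s lemma says the answer for a finite class of $\pm 1$-valued functions is at most $\sqrt{2 \ln |\mathcal{F}| / n}$, which the estimate should respect.

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_rademacher(F_values, n_mc=2000):
    """Monte Carlo estimate of empirical Rademacher complexity
    E_sigma[ sup_f (1/n) sum_i sigma_i f(x_i) ].

    F_values: (num_functions, n) array, row k holding f_k evaluated
    at the n sample points. Finite class, for simplicity.
    """
    num_f, n = F_values.shape
    total = 0.0
    for _ in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=n)   # Rademacher signs
        total += np.max(F_values @ sigma) / n     # sup over the class
    return total / n_mc

# Toy class: 50 random sign functions evaluated on 100 points.
F = np.sign(rng.normal(size=(50, 100)))
est = empirical_rademacher(F)
```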

## 3 Stability-based

Apparently stability of estimators (how little the fitted model changes when you perturb the training data) also gets you generalisation bounds?
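A minimal empirical sketch of the idea, assuming nothing beyond numpy: ridge regression is known to be uniformly stable with stability of order $1/(\lambda n)$, so leave-one-out refits should barely move its predictions. This demonstrates the shrinking, it proves nothing.

```python
import numpy as np

rng = np.random.default_rng(2)

# How much does a ridge prediction move when one training point is
# removed? Problem sizes and lambda are arbitrary demo choices.
n, p, lam = 200, 5, 1.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

def ridge_fit(X, y):
    """Closed-form ridge regression weights."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_full = ridge_fit(X, y)
x_test = rng.normal(size=p)

# Max prediction change at a fixed test point, over leave-one-out refits.
deltas = []
for i in range(n):
    mask = np.arange(n) != i
    w_i = ridge_fit(X[mask], y[mask])
    deltas.append(abs(x_test @ (w_full - w_i)))
max_delta = max(deltas)
```

The stability-based bounds convert exactly this kind of insensitivity into a generalisation guarantee.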

## 4 PAC-learning

Probably approximately correct, you mean?
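The baseline result: a consistent learner over a finite hypothesis class $H$ is probably (confidence $1-\delta$) approximately (error at most $\varepsilon$) correct once $m \ge (\ln|H| + \ln(1/\delta))/\varepsilon$, in the realizable case. The agnostic case needs $1/\varepsilon^2$ instead.

```python
import math

def pac_sample_size(h_size, eps, delta):
    """Sample size sufficient to PAC-learn a finite hypothesis class
    with a consistent learner, realizable case:
    m >= (ln|H| + ln(1/delta)) / eps."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)
```

E.g. a class of 2^20 hypotheses at eps = 0.01, delta = 0.05 needs `pac_sample_size(2**20, 0.01, 0.05)` = 1686 samples; note the merely logarithmic dependence on the class size.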

## 5 Non-I.I.D data

## 6 Stein’s method

See Stein’s method

## 7 Incoming

Percy Liang’s course notes give a rapid overview: CS229T/STAT231: Statistical Learning Theory (Winter 2014).

See also function approximation, and/or model selection for the statisticians’ approach, which is more about working out which model your data can support once you’ve fit it, frequently by appealing to asymptotic large-sample results. In learning theory it seems one always cares about finite-sample bounds, which is reasonable; and the attractiveness of getting them without, e.g., tedious and computationally expensive cross-validation or bootstrapping is understandable.

I would like to understand the relationship between the different kinds of convergence rate results that we get, and different learning theories.

I would like results that apply to dependent data.

(Golubev and Nussbaum 1990; Hasminskii and Ibragimov 1990) give minimax convergence rates in Sobolev class regression.
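The flavour of such results, as I understand them: over a Sobolev ball of smoothness $\beta$, the minimax mean integrated squared error for nonparametric regression scales as

$$
\inf_{\hat{f}_n}\; \sup_{f \in W^{\beta}(L)} \mathbb{E}\,\lVert \hat{f}_n - f \rVert_2^2 \;\asymp\; n^{-2\beta/(2\beta+1)},
$$

so smoother classes admit faster rates, though never the parametric $n^{-1}$. Exact constants are Pinsker-type and I have not checked them against these particular papers.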

This looks like fun (David, Moran, and Yehudayoff 2016):

> We begin with the setting of multiclass categorization (zero/one loss). We prove that in this case learnability is equivalent to compression of logarithmic sample size, and that uniform convergence implies compression of constant size. We then consider Vapnik’s general learning setting: we show that in order to extend the compressibility-learnability equivalence to this case, it is necessary to consider an approximate variant of compression.

## 8 References

*Journal of Machine Learning Research*.

*arXiv:1704.02958 [Cs, Stat]*.

*Advances in Neural Information Processing Systems 27*.

*arXiv:1708.03395 [Cond-Mat, Physics:math-Ph]*.

*arXiv:1705.02780 [Cond-Mat]*.

*An Introduction to Stein’s Method*. Lecture Notes Series / Institute for Mathematical Sciences, National University of Singapore, v. 4.

*Information Theory Workshop, 2008. ITW’08. IEEE*.

*Journal of the American Statistical Association*.

*Journal of Machine Learning Research*.

*arXiv:1609.06675 [Math, Stat]*.

*Annales de l’Institut Henri Poincaré, Probabilités Et Statistiques*.

*arXiv:0808.1416 [Math, Stat]*.

*J. ACM*.

*Advanced Lectures on Machine Learning*. Lecture Notes in Computer Science 3176.

*Electronic Journal of Statistics*.

*Learning Theory*. Lecture Notes in Computer Science.

*arXiv:1209.1077 [Cs, Stat]*.

*Digital Signal Processing*.

*arXiv:1410.0247 [Math, Stat]*.

*ICML*.

*arXiv:1610.03592 [Cs, Math]*.

*A Probabilistic Theory of Pattern Recognition*.

*NIPS*.

*New Journal of Physics*.

*arXiv:1108.0373 [Math, Stat]*.

*Applied Mathematical Sciences*.

*arXiv:1712.06541 [Cs, Stat]*.

*The Annals of Statistics*.

*arXiv:1610.00768 [Cs, Stat]*.

*arXiv:1706.07180 [Cs, Math, Stat]*.

*Bernoulli*.

*The Annals of Statistics*.

*Statistical Learning with Sparsity: The Lasso and Generalizations*.

*The Elements of Statistical Learning: Data Mining, Inference and Prediction*.

*Journal of Machine Learning Research*.

*Machine Learning and Knowledge Discovery in Databases*. Lecture Notes in Computer Science.

*Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems*. Lecture Notes in Mathematics École d’Été de Probabilités de Saint-Flour 2033.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*.

*Advances in Neural Information Processing Systems*.

*Bernoulli*.

*Proceedings of the 32nd International Conference on Machine Learning*.

*Eprint arXiv:1111.3404*.

*arXiv:1103.0942 [Cs, Stat]*.

*arXiv:1106.0730 [Cs, Stat]*.

*arXiv:1607.06534 [Stat]*.

*arXiv:1609.08397 [Stat]*.

*Machine Learning*.

*WIREs Data Mining and Knowledge Discovery*.

*arXiv:1211.5063 [Cs]*.

*Journal of Machine Learning Research*.

*Complex Stochastic Systems*.

*Probability Theory and Related Fields*.

*The Annals of Statistics*.

*Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond*.

*Advanced Lectures on Machine Learning*. Lecture Notes in Computer Science 2600.

*arXiv:1908.01755 [Cs, Stat]*.

*CoRR*.

*arXiv:1802.08855 [Cs, Math, Stat]*.

*An Introduction to Matrix Concentration Inequalities*.

*arXiv Preprint arXiv:1405.7764*.

*Journal of Machine Learning Research*.

*Empirical Process Techniques for Dependent Data*.

*The Annals of Statistics*.

*arXiv:1403.7023 [Math, Stat]*.

*The Nature of Statistical Learning Theory*.

*Theory of Probability & Its Applications*.

*Neural Computation*.

*arXiv:0810.4752 [Math, Stat]*.

*Computers & Mathematics with Applications*.

*Advances In Neural Information Processing Systems*.

*Proceedings of ICLR*.