Another placeholder far from my own background.

Given some amount of noisy data, how complex a model can I learn before I’m going to be failing to generalise to new data?
If I can answer this question *a priori*, I can fit a complex model with some messy hyperparameter and choose that hyperparameter without doing boring cross-validation.
When I did statistics we talked about this in terms of regularisation and [model selection(./model_selection.html).
However, the ML people started thinking about this form the other end.

Depending on your training you might think about this in terms of Rademacher complexity, Gaussian complexity, Vapnik-Chernovenkis dimension, bias-variance tradeoffs… Modern results seem to appeal to concentration inequalities. Also apparently stability of estimators gets you somewhere?

AFAICT,
*nice* statistical learning results apply to particular
classes of models;
e.g. SVMs and kernel regressions have built-in sample complexity results
under the right loss function,
but the ground gets rapidly shakier as you generalise.

## VC dimension

## Rademacher complexity

## Stability-based

A# Stability-based Also apparently stability of estimators gets you somewhere?

## PAC-learning

Probably approximately correct, you mean?

## Non-I.I.D data

## Stein’s method

See Stein’s method

## Incoming

Percy Liang’s course notes give a rapid overview: CS229T/STAT231: Statistical Learning Theory (Winter 2014).

See also function approximation, and or model selection for the statisticians’ approach, which is more about working out which model our data can support once you’ve fit it, frequently by appealing to asymptotic large-sample results. In learning theory it seem one always cares about finite sample bounds. Which is reasonable. and the attractiveness of getting them without, e.g. tedious and computationally expensive cross validation, bootstrapping is understandable.

I would like to understand the relationship between the different kinds of convergence rate results that we get, and different learning theories.

I would like results that applied to dependent data.

(Golubev and Nussbaum 1990; Hasminskii and Ibragimov 1990) give minimax convergence rates in Sobolev class regression.

This looks like fun (David, Moran, and Yehudayoff 2016):

We begin with the setting of multiclass categorization (zero/one loss). We prove that in this case learnability is equivalent to compression of logarithmic sample size, and that uniform convergence implies compression of constant size. We then consider Vapnik’s general learning setting: we show that in order to extend the compressibility-learnability equivalence to this case, it is necessary to consider an approximate variant of compression.

## References

*Journal of Machine Learning Research*7 (December): 1743–88.

*arXiv:1704.02958 [Cs, Stat]*, April.

*Advances in Neural Information Processing Systems 27*, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 1556–64. Curran Associates, Inc.

*arXiv:1708.03395 [Cond-Mat, Physics:math-Ph]*, August.

*arXiv:1705.02780 [Cond-Mat]*, May.

*An Introduction to Stein’s Method*. Vol. 4. Lecture Notes Series / Institute for Mathematical Sciences, National University of Singapore, v. 4. Singapore : Hackensack, N.J: Singapore University Press ; World Scientific.

*Information Theory Workshop, 2008. ITW’08. IEEE*, 247–57. IEEE.

*Journal of the American Statistical Association*101 (473): 138–56.

*Journal of Machine Learning Research*3 (Nov): 463–82.

*arXiv:1609.06675 [Math, Stat]*, September.

*Annales de l’Institut Henri Poincaré, Probabilités Et Statistiques*47 (1): 43–74.

*arXiv:0808.1416 [Math, Stat]*, August.

*J. ACM*36 (4): 929–65.

*Advanced Lectures on Machine Learning*, edited by Olivier Bousquet, Ulrike von Luxburg, and Gunnar Rätsch, 169–207. Lecture Notes in Computer Science 3176. Springer Berlin Heidelberg.

*Learning Theory*, edited by Nader H. Bshouty and Claudio Gentile, 530–43. Lecture Notes in Computer Science. Springer Berlin Heidelberg.

*Electronic Journal of Statistics*1: 169–94.

*arXiv:1209.1077 [Cs, Stat]*, September.

*Digital Signal Processing*23 (3): 751–70.

*arXiv:1410.0247 [Math, Stat]*, October.

*ICML*.

*arXiv:1610.03592 [Cs, Math]*, October.

*A Probabilistic Theory of Pattern Recognition*. New York: Springer.

*NIPS*.

*New Journal of Physics*14 (9): 095022.

*arXiv:1108.0373 [Math, Stat]*, August.

*The Annals of Statistics*36 (2): 614–45.

*Empirical Process Techniques for Dependent Data*. Birkhhäuser.

*arXiv:1403.7023 [Math, Stat]*. Vol. 131.

*Applied Mathematical Sciences*2 (4): 153–76.

*arXiv:1712.06541 [Cs, Stat]*, December.

*The Annals of Statistics*18 (2): 758–78.

*arXiv:1610.00768 [Cs, Stat]*, October.

*arXiv:1706.07180 [Cs, Math, Stat]*, June.

*Bernoulli*21 (1): 83–143.

*The Annals of Statistics*18 (3): 999–1010.

*Statistical Learning with Sparsity: The Lasso and Generalizations*. Boca Raton: Chapman and Hall/CRC.

*The Elements of Statistical Learning: Data Mining, Inference and Prediction*. Springer.

*Journal of Machine Learning Research*, 448–56.

*Machine Learning and Knowledge Discovery in Databases*, edited by José Luis Balcázar, Francesco Bonchi, Aristides Gionis, and Michèle Sebag, 66–81. Lecture Notes in Computer Science. Springer Berlin Heidelberg.

*Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems*. Lecture Notes in Mathematics École d’Été de Probabilités de Saint-Flour 2033. Heidelberg: Springer.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*27 (6): 957–68.

*Advances in Neural Information Processing Systems*, 541–49. Curran Associates, Inc.

*Bernoulli*20 (4): 2020–38.

*arXiv:0810.4752 [Math, Stat]*, October.

*Proceedings of the 32nd International Conference on Machine Learning*, 2113–22. PMLR.

*arXiv:1103.0942 [Cs, Stat]*, March.

*arXiv:1106.0730 [Cs, Stat]*, June.

*Eprint arXiv:1111.3404*.

*arXiv:1607.06534 [Stat]*, July.

*arXiv:1609.08397 [Stat]*, 10:441–74.

*Machine Learning*4 (1): 67–97.

*WIREs Data Mining and Knowledge Discovery*8 (4): e1252.

*arXiv:1211.5063 [Cs]*, 1310–18.

*Journal of Machine Learning Research*12 (Mar): 731–817.

*Complex Stochastic Systems*, 235–75. Boca Raton: Chapman & Hall/CRC.

*Probability Theory and Related Fields*126 (1).

*The Annals of Statistics*38 (5): 2781–2822.

*Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond*. MIT Press.

*Advanced Lectures on Machine Learning*, edited by Shahar Mendelson and Alexander J. Smola, 41–64. Lecture Notes in Computer Science 2600. Springer Berlin Heidelberg.

*arXiv:1908.01755 [Cs, Stat]*, April.

*CoRR*abs/2006.09313.

*arXiv:1802.08855 [Cs, Math, Stat]*, February.

*An Introduction to Matrix Concentration Inequalities*.

*arXiv Preprint arXiv:1405.7764*.

*Journal of Machine Learning Research*12 (Nov): 3259–81.

*Theory of Probability & Its Applications*16 (2): 264–80.

*The Nature of Statistical Learning Theory*. Softcover reprint of hardcover 2nd ed. 2000. Springer.

*Neural Computation*6 (5): 851–76.

*Computers & Mathematics with Applications*56 (11): 2896–2907.

*Advances In Neural Information Processing Systems*.

*Proceedings of ICLR*.

## No comments yet. Why not leave one?