# Overparameterization

## a.k.a. improper learning

Notes on the general technique of increasing the number of slack parameters you have, especially in machine learning, especially especially in neural nets.

## For smoothness

This insight is fresh; Bubeck and Sellke (2021) argue:

> Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in the current practice of deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly. Namely we show that smooth interpolation requires $$d$$ times more parameters than mere interpolation, where $$d$$ is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with polynomial size weights, and any covariate distribution verifying isoperimetry. In the case of two-layers neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck, Li and Nagaraj. We also give an interpretation of our result as an improved generalization bound for model classes consisting of smooth functions.
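
As I read the result, the headline “law of robustness” says (hiding constants and logarithmic factors) that any model with $$p$$ parameters fitting $$n$$ data points in ambient dimension $$d$$ below the noise level must have Lipschitz constant at least

$$\operatorname{Lip}(f) \gtrsim \sqrt{\frac{nd}{p}},$$

so an $$O(1)$$-Lipschitz (i.e. smooth) interpolant needs $$p \gtrsim nd$$ parameters, whereas mere interpolation needs only $$p \gtrsim n$$.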

See Wide NNs.

## For making optimisation nice

Zeyuan Allen-Zhu, Yuanzhi Li and Zhao Song argue that the combination of overparameterization and SGD is the secret to how deep learning works. Certainly, looking at how some classic optimisation problems can be lifted into convex ones, we can imagine by analogy that something similar happens here.
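
As a toy illustration (mine, not from the papers above): in an overparameterized linear least-squares problem, plain gradient descent started from zero not only interpolates the data but converges to the minimum-norm interpolant, one concrete sense in which overparameterization makes the optimisation nice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100            # fewer equations than parameters: overparameterized
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Gradient descent on the (convex) least-squares loss, started from zero.
w = np.zeros(p)
lr = 0.005
for _ in range(20_000):
    w -= lr * X.T @ (X @ w - y)

# In the overparameterized regime, GD from zero stays in the row space of X
# and so converges to the minimum-norm interpolating (pseudo-inverse) solution.
w_min_norm = np.linalg.pinv(X) @ y
interpolates = np.allclose(X @ w, y, atol=1e-6)
matches_min_norm = np.allclose(w, w_min_norm, atol=1e-4)
```

The implicit-bias part (min-norm rather than just *some* interpolant) is the detail that does not follow from convexity alone; it depends on the zero initialisation.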

RJ Lipton discusses Arno van den Essen’s incidental work on stabilisation methods for polynomials, which relates, AFAICT, to transfer-function-type stability. Does this connect to the overparameterized rational-transfer-function analysis of Hardt, Ma, and Recht (2018)? 🏗.

## Double descent

The phenomenon whereby adding data (or parameters?) can make the model worse before it gets better again. See, e.g., Deep Double Descent (Nakkiran et al. 2019).

## Lottery ticket hypothesis

The Lottery Ticket hypothesis asserts something like “there is a good compact network hidden inside the overparameterized one you have” (Frankle and Carbin 2019). Intuitively, it is computationally hard to find that hidden optimal subnetwork. I am interested in computational bounds here: how much cheaper is it to calculate with a massive network than to find the tiny network that does better? The calculus is altered by SIMD architectures such as GPUs, which change the relative cost (although not the scaling) of certain types of calculation, which is arguably how we got to the modern form of neural-net obsession.
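
A minimal sketch of the pruning step (my illustration; Frankle and Carbin’s actual procedure iterates train → prune → rewind to initialisation → retrain): one-shot magnitude pruning, which keeps only the largest-magnitude fraction of the weights as the candidate ticket.

```python
import numpy as np

def magnitude_prune(weights, keep_frac):
    """Return a {0,1} mask keeping the keep_frac largest-magnitude weights.

    One-shot magnitude pruning: the core selection step in lottery-ticket
    style experiments. Ties are broken by keeping everything at the threshold.
    """
    flat = np.abs(weights).ravel()
    k = max(1, int(round(keep_frac * flat.size)))
    threshold = np.partition(flat, -k)[-k]      # k-th largest magnitude
    return (np.abs(weights) >= threshold).astype(weights.dtype)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))                 # stand-in for a trained layer
mask = magnitude_prune(W, keep_frac=0.1)
sparse_W = W * mask                             # candidate "winning ticket"
```

Note that the mask itself is cheap; the expensive part the hypothesis gestures at is that, in practice, you only discover a good mask by training the big network first.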

## References

Allen-Zhu, Zeyuan, Yuanzhi Li, and Zhao Song. 2018. November.
Bach, Francis. 2013. arXiv:1309.3117 [Cs, Math], September.
Bahmani, Sohail, and Justin Romberg. 2014. arXiv:1501.00046 [Cs, Math, Stat], December.
———. 2016. arXiv:1610.04210 [Cs, Math, Stat], October.
Bartlett, Peter L., Andrea Montanari, and Alexander Rakhlin. 2021. March.
Bubeck, Sebastien, and Mark Sellke. 2021. In.
Dziugaite, Gintare Karolina, and Daniel M. Roy. 2017. arXiv:1703.11008 [Cs], October.
Frankle, Jonathan, and Michael Carbin. 2019. arXiv:1803.03635 [Cs], March.
Głuch, Grzegorz, and Rüdiger Urbanke. 2021. arXiv:2104.05508 [Cs, Stat], April.
Goldstein, Tom, and Christoph Studer. 2016. arXiv:1610.07531 [Cs, Math], October.
Hardt, Moritz, Tengyu Ma, and Benjamin Recht. 2018. The Journal of Machine Learning Research 19 (1): 1025–68.
Hasson, Uri, Samuel A. Nastase, and Ariel Goldstein. 2020. Neuron 105 (3): 416–34.
Hayou, Soufiane, Jean-Francois Ton, Arnaud Doucet, and Yee Whye Teh. 2020. arXiv:2002.08797 [Cs, Stat], June.
Hazan, Elad, Karan Singh, and Cyril Zhang. 2017. In NIPS.
Molchanov, Dmitry, Arsenii Ashukha, and Dmitry Vetrov. 2017. In Proceedings of ICML.
Nakkiran, Preetum, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. 2019. arXiv:1912.02292 [Cs, Stat], December.
Oliveira, Maurício C. de, and Robert E. Skelton. 2001. In Perspectives in Robust Control, 241–57. Lecture Notes in Control and Information Sciences. Springer, London.
Ran, Zhi-Yong, and Bao-Gang Hu. 2017. Neural Computation 29 (5): 1151–1203.
Semenova, Lesia, Cynthia Rudin, and Ronald Parr. 2021. arXiv:1908.01755 [Cs, Stat], April.
Tropp, J.A. 2006. IEEE Transactions on Information Theory 52 (3): 1030–51.
You, Haoran, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G. Baraniuk, Zhangyang Wang, and Yingyan Lin. 2019. In.
Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017. In Proceedings of ICLR.
———. 2021. Communications of the ACM 64 (3): 107–15.
