Overparameterization in large models

Improper learning, benign overfitting, double descent



Notes on the general weird behaviour of increasing the number of slack parameters we use, especially in machine learning, especially especially in neural nets. Most of these have far more parameters than we β€œneed” which is a problem for classical models of learning, herein we learn to fear having to many parameters.

For making optimisation nice

Certainly, looking at how some classic non-convex optimization problems can be lifted in to convex problem by adding slack variables, we can imagine that something similar happens by analogy in neural nets. Is it enough to imagine that something similar happens in NN, perhaps not lifting them into convex problems _per_se but at least into better-behaved optimisations in some sense?

The combination of overparameterization and SGD is argued to be the secret to how deep learning works, by e.g. AllenZhuConvergence2018.

RJ Lipton discusses Arno van den Essen’s incidental work on stabilisation methods of polynomials, which relates, AFAICT, to transfer-function-type stability. Does this connect to the overparameterization of rational transfer function analysis of Hardt, Ma, and Recht (2018)? πŸ—.

Double descent

When adding data (or parameters?) can make the model worse. E.g. Deep Double Descent.

Possibly this phenomenon relates to the concept of …

Data interpolation

a.k.a. benign overfitting. Bubeck and Sellke (2021) argue

Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in the current practice of deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly. Namely we show that smooth interpolation requires \(d\) times more parameters than mere interpolation, where \(d\) is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with polynomial size weights, and any covariate distribution verifying isoperimetry. In the case of two-layers neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck, Li and Nagaraj. We also give an interpretation of our result as an improved generalization bound for model classes consisting of smooth functions.

I am not sure if this is a distinct thing from other double descent phenomena. Hastie et al. (2020) suggests perhaps not?

Interpolators β€” estimators that achieve zero training error β€” have attracted growing attention in machine learning, mainly because state-of-the art neural networks appear to be models of this type. In this paper, we study minimum \(\ell_2\) norm (β€œridgeless”) interpolation in high-dimensional least squares regression. We consider two different models for the feature distribution: a linear model, where the feature vectors \(x_i \in {\mathbb R}^p\) are obtained by applying a linear transform to a vector of i.i.d.Β entries, \(x_i = \Sigma^{1/2} z_i\) (with \(z_i \in {\mathbb R}^p\)); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, \(x_i = \varphi(W z_i)\) (with \(z_i \in {\mathbb R}^d\), \(W \in {\mathbb R}^{p \times d}\) a matrix of i.i.d.Β entries, and \(\varphi\) an activation function acting componentwise on \(W z_i\)). We recover β€” in a precise quantitative way β€” several phenomena that have been observed in large-scale neural networks and kernel machines, including the β€œdouble descent” behavior of the prediction risk, and the potential benefits of overparametrization.

Lottery ticket hypothesis

The Lottery Ticket hypothesis (Frankle and Carbin 2019; Hayou et al. 2020) asserts something like β€œthere is a good compact network hidden inside the overparameterized one you have”. Intuitively it is computationally hard to find the hidden optimal network. I am interested in computational bounds for this; How much cheaper is it to calculate with a massive network than to find the tiny networks that does better? The calculus here is altered by SIMD architectures such as GPUs, which change the relative cost (although not the scaling) of certain types of calculations, which is arguably how we got to the modern form of neural net obsession.

In extremely large models

See neural nets at scale

In the wide-network limit

See Wide NNs.

References

Allen-Zhu, Zeyuan, Yuanzhi Li, and Zhao Song. 2018. β€œA Convergence Theory for Deep Learning via Over-Parameterization,” November.
Arora, Sanjeev, Nadav Cohen, and Elad Hazan. 2018. β€œOn the Optimization of Deep Networks: Implicit Acceleration by Overparameterization.” arXiv:1802.06509 [Cs], February.
Bach, Francis. 2013. β€œConvex Relaxations of Structured Matrix Factorizations.” arXiv:1309.3117 [Cs, Math], September.
Bahmani, Sohail, and Justin Romberg. 2014. β€œLifting for Blind Deconvolution in Random Mask Imaging: Identifiability and Convex Relaxation.” arXiv:1501.00046 [Cs, Math, Stat], December.
β€”β€”β€”. 2016. β€œPhase Retrieval Meets Statistical Learning Theory: A Flexible Convex Relaxation.” arXiv:1610.04210 [Cs, Math, Stat], October.
Bartlett, Peter L., Andrea Montanari, and Alexander Rakhlin. 2021. β€œDeep Learning: A Statistical Viewpoint,” March.
Bubeck, Sebastien, and Mark Sellke. 2021. β€œA Universal Law of Robustness via Isoperimetry.” In.
Dziugaite, Gintare Karolina, and Daniel M. Roy. 2017. β€œComputing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters Than Training Data.” arXiv:1703.11008 [Cs], October.
Frankle, Jonathan, and Michael Carbin. 2019. β€œThe Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks.” arXiv:1803.03635 [Cs], March.
GΕ‚uch, Grzegorz, and RΓΌdiger Urbanke. 2021. β€œNoether: The More Things Change, the More Stay the Same.” arXiv:2104.05508 [Cs, Stat], April.
Goldstein, Tom, and Christoph Studer. 2016. β€œPhaseMax: Convex Phase Retrieval via Basis Pursuit.” arXiv:1610.07531 [Cs, Math], October.
Hardt, Moritz, Tengyu Ma, and Benjamin Recht. 2018. β€œGradient Descent Learns Linear Dynamical Systems.” The Journal of Machine Learning Research 19 (1): 1025–68.
Hasson, Uri, Samuel A. Nastase, and Ariel Goldstein. 2020. β€œDirect Fit to Nature: An Evolutionary Perspective on Biological and Artificial Neural Networks.” Neuron 105 (3): 416–34.
Hastie, Trevor, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. 2020. β€œSurprises in High-Dimensional Ridgeless Least Squares Interpolation.” arXiv.
Hayou, Soufiane, Jean-Francois Ton, Arnaud Doucet, and Yee Whye Teh. 2020. β€œPruning Untrained Neural Networks: Principles and Analysis.” arXiv:2002.08797 [Cs, Stat], June.
Hazan, Elad, Karan Singh, and Cyril Zhang. 2017. β€œLearning Linear Dynamical Systems via Spectral Filtering.” In NIPS.
Molchanov, Dmitry, Arsenii Ashukha, and Dmitry Vetrov. 2017. β€œVariational Dropout Sparsifies Deep Neural Networks.” In Proceedings of ICML.
Nakkiran, Preetum, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. 2019. β€œDeep Double Descent: Where Bigger Models and More Data Hurt.” arXiv:1912.02292 [Cs, Stat], December.
Oliveira, MaurΓ­cio C. de, and Robert E. Skelton. 2001. β€œStability Tests for Constrained Linear Systems.” In Perspectives in Robust Control, 241–57. Lecture Notes in Control and Information Sciences. Springer, London.
Ran, Zhi-Yong, and Bao-Gang Hu. 2017. β€œParameter Identifiability in Statistical Machine Learning: A Review.” Neural Computation 29 (5): 1151–1203.
Semenova, Lesia, Cynthia Rudin, and Ronald Parr. 2021. β€œA Study in Rashomon Curves and Volumes: A New Perspective on Generalization and Model Simplicity in Machine Learning.” arXiv:1908.01755 [Cs, Stat], April.
Tropp, J.A. 2006. β€œJust Relax: Convex Programming Methods for Identifying Sparse Signals in Noise.” IEEE Transactions on Information Theory 52 (3): 1030–51.
You, Haoran, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G. Baraniuk, Zhangyang Wang, and Yingyan Lin. 2019. β€œDrawing Early-Bird Tickets: Toward More Efficient Training of Deep Networks.” In.
Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017. β€œUnderstanding Deep Learning Requires Rethinking Generalization.” In Proceedings of ICLR.
β€”β€”β€”. 2021. β€œUnderstanding Deep Learning (Still) Requires Rethinking Generalization.” Communications of the ACM 64 (3): 107–15.

No comments yet. Why not leave one?

GitHub-flavored Markdown & a sane subset of HTML is supported.