Overparameterization in large models

Improper learning, benign overfitting, double descent

2018-04-03 — 2026-01-06

Wherein the interplay of surplus parameters and stochastic gradient descent is examined, the computational cost of identifying hidden ‘lottery ticket’ subnetworks is considered, and occurrences of double descent are noted.

decision theory
feature construction
machine learning
model selection
neural nets
optimization
probabilistic algorithms
SDEs
statmech
stochastic processes

Notes on the generally weird behaviour that arises as we increase the number of slack parameters in our models, especially in machine learning and in neural nets. For small problems, models like these often have far more parameters than we “need,” which causes problems for classical models of learning.

I’m not sure I named this well. Modern models are massively parameterized, but not necessarily over-parameterized in the classical sense, since they are supposed to learn massive data sets. cf scaling laws.

Figure 1

1 For making optimization “nice”

Some classic non-convex optimization problems can be lifted into convex problems by adding slack variables. Is it enough to imagine that something similar happens in neural nets, perhaps not lifting them into convex problems per se, but at least making the optimization better behaved in some sense?

The combination of overparameterization and SGD is argued to be the secret to how deep learning works (e.g. Allen-Zhu, Li, and Song 2018).
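A toy illustration of the flavour of result involved (my own construction, not the cited papers' setup): plain gradient descent on low-rank matrix factorization, a non-convex problem where giving the factor more columns than the target rank is known to remove bad local minima, so descent from a small random initialization reaches a global optimum.

```python
import numpy as np

# Toy sketch (not from the cited papers): factorize a PSD matrix M as U U^T
# by gradient descent. The objective ||M - U U^T||_F^2 is non-convex in U,
# but overparameterizing (5 columns in U for a rank-2 target) is known to
# eliminate spurious local minima, so plain GD from small init succeeds.
rng = np.random.default_rng(0)

n, true_rank, overparam_rank = 10, 2, 5
B = rng.normal(size=(n, true_rank)) / np.sqrt(n)
M = B @ B.T  # rank-2 PSD target

U = 0.1 * rng.normal(size=(n, overparam_rank))  # small random init
lr = 0.05
for _ in range(20000):
    residual = U @ U.T - M
    U -= lr * 4 * residual @ U  # gradient of ||M - U U^T||_F^2

loss = np.sum((M - U @ U.T) ** 2)
print(f"final loss: {loss:.2e}")  # near zero: GD found a global optimum
```

With `overparam_rank` set equal to `true_rank` the same descent can stall at saddle-adjacent regions; the surplus columns are what make the landscape benign.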

RJ Lipton discusses Arno van den Essen’s incidental work on stabilization methods for polynomials, which, as far as I can tell, relates to transfer-function-type stability. Does this connect to the overparameterization in the rational transfer-function analysis of Hardt, Ma, and Recht (2018)? 🏗.

2 Double descent

Adding parameters, and sometimes even data, can make the model worse before it gets better again; see Deep Double Descent (Nakkiran et al. 2019).

Possibly this phenomenon relates to the concept of data interpolation, although see Resolution of misconception of overfitting: Differentiating learning curves from Occam curves.

3 Data interpolation

a.k.a. benign overfitting. See interpolation/extrapolation in NNs.

4 Lottery ticket hypothesis

Figure 2

The Lottery Ticket hypothesis (Frankle and Carbin 2019; Hayou et al. 2020) asserts something like “there is a good compact network hidden inside the overparameterized one we have.” Intuitively, finding that hidden network is computationally hard. I’m interested in computational bounds for this: how much cheaper is it to compute with a massive network than to find the tiny subnetwork that matches or outperforms it? I’m also curious whether this helps with NN interpretation.
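The magnitude-pruning-with-rewind recipe behind the hypothesis can be sketched in a few lines. This is a toy version on a linear model (my own simplification, not the paper's deep-net experiments): train dense, keep the largest-magnitude weights, rewind the survivors to their initial values, and retrain with the mask fixed.

```python
import numpy as np

# Toy sketch of magnitude pruning with rewinding (Frankle & Carbin 2019),
# on a linear model rather than a deep net so everything fits in one view:
# 1. train dense, 2. keep the top-k weights by magnitude,
# 3. rewind survivors to their initial values, 4. retrain with mask fixed.
rng = np.random.default_rng(1)
n, d, k, sigma = 200, 50, 5, 0.1

w_true = np.zeros(d)
w_true[:k] = rng.choice([-1.0, 1.0], size=k)  # sparse "winning ticket"
X = rng.normal(size=(n, d))
y = X @ w_true + sigma * rng.normal(size=n)

def train(w, mask, steps=500, lr=0.1):
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / n  # least-squares gradient
        w = (w - lr * grad) * mask        # pruned weights stay at zero
    return w

w_init = 0.01 * rng.normal(size=d)
w_dense = train(w_init.copy(), np.ones(d))               # 1. dense training
mask = np.abs(w_dense) >= np.sort(np.abs(w_dense))[-k]   # 2. top-k mask
w_ticket = train(w_init * mask, mask)                    # 3 + 4. rewind, retrain

print("kept weights:", int(mask.sum()), "of", d)
print("sparse-model train MSE:", np.mean((X @ w_ticket - y) ** 2))
```

Note the asymmetry the hypothesis turns on: the mask is cheap to apply but was only discovered *after* paying for dense training, which is exactly the computational-cost question above.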

5 Let’s model it using singular learning theory

See singular learning theory.

6 In extremely large models

See neural nets at scale.

7 In the wide-network limit

See Wide NNs.

8 Convex relaxation

See convex relaxation.

9 Weight space versus function space

See NNs in function space and also singular learning theory.

10 References

Allen-Zhu, Li, and Song. 2018. “A Convergence Theory for Deep Learning via Over-Parameterization.”
Arora, Cohen, and Hazan. 2018. “On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization.” arXiv:1802.06509 [Cs].
Bach. 2013. “Convex Relaxations of Structured Matrix Factorizations.” arXiv:1309.3117 [Cs, Math].
Bahmani, and Romberg. 2014. “Lifting for Blind Deconvolution in Random Mask Imaging: Identifiability and Convex Relaxation.” arXiv:1501.00046 [Cs, Math, Stat].
———. 2016. “Phase Retrieval Meets Statistical Learning Theory: A Flexible Convex Relaxation.” arXiv:1610.04210 [Cs, Math, Stat].
Bartlett, Montanari, and Rakhlin. 2021. “Deep Learning: A Statistical Viewpoint.” Acta Numerica.
Bubeck, and Sellke. 2021. “A Universal Law of Robustness via Isoperimetry.” In.
Dziugaite, and Roy. 2017. “Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters Than Training Data.” arXiv:1703.11008 [Cs].
Frankle, and Carbin. 2019. “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks.” arXiv:1803.03635 [Cs].
Głuch, and Urbanke. 2021. “Noether: The More Things Change, the More Stay the Same.” arXiv:2104.05508 [Cs, Stat].
Goldstein, and Studer. 2016. “PhaseMax: Convex Phase Retrieval via Basis Pursuit.” arXiv:1610.07531 [Cs, Math].
Hardt, Ma, and Recht. 2018. “Gradient Descent Learns Linear Dynamical Systems.” The Journal of Machine Learning Research.
Hasson, Nastase, and Goldstein. 2020. “Direct Fit to Nature: An Evolutionary Perspective on Biological and Artificial Neural Networks.” Neuron.
Hastie, Montanari, Rosset, et al. 2020. “Surprises in High-Dimensional Ridgeless Least Squares Interpolation.”
Hayou, Ton, Doucet, et al. 2020. “Pruning Untrained Neural Networks: Principles and Analysis.” arXiv:2002.08797 [Cs, Stat].
Hazan, Singh, and Zhang. 2017. “Learning Linear Dynamical Systems via Spectral Filtering.” In NIPS.
Hodgkinson, Heide, Salomone, et al. 2023. “The Interpolating Information Criterion for Overparameterized Models.”
Le, and Clarke. 2017. “A Bayes Interpretation of Stacking for M-Complete and M-Open Settings.” Bayesian Analysis.
Mingard, Valle-Pérez, Skalse, et al. 2021. “Is SGD a Bayesian Sampler? Well, Almost.” Journal of Machine Learning Research.
Molchanov, Ashukha, and Vetrov. 2017. “Variational Dropout Sparsifies Deep Neural Networks.” In Proceedings of ICML.
Nakkiran, Kaplun, Bansal, et al. 2019. “Deep Double Descent: Where Bigger Models and More Data Hurt.” arXiv:1912.02292 [Cs, Stat].
Oliveira, and Skelton. 2001. “Stability Tests for Constrained Linear Systems.” In Perspectives in Robust Control. Lecture Notes in Control and Information Sciences.
Ran, and Hu. 2017. “Parameter Identifiability in Statistical Machine Learning: A Review.” Neural Computation.
Semenova, Rudin, and Parr. 2021. “A Study in Rashomon Curves and Volumes: A New Perspective on Generalization and Model Simplicity in Machine Learning.” arXiv:1908.01755 [Cs, Stat].
Tropp. 2006. “Just Relax: Convex Programming Methods for Identifying Sparse Signals in Noise.” IEEE Transactions on Information Theory.
You, Li, Xu, et al. 2019. “Drawing Early-Bird Tickets: Toward More Efficient Training of Deep Networks.” In.
Zhang, Bengio, Hardt, et al. 2017. “Understanding Deep Learning Requires Rethinking Generalization.” In Proceedings of ICLR.
———. 2021. “Understanding Deep Learning (Still) Requires Rethinking Generalization.” Communications of the ACM.