Overparameterization in large models
Improper learning, benign overfitting, double descent
2018-04-03 — 2026-01-06
Wherein the interplay of surplus parameters and stochastic gradient descent is examined, the computational cost of identifying hidden ‘lottery ticket’ subnetworks is considered, and occurrences of double descent are noted.
Notes on the generally weird behaviour that arises when we increase the number of slack parameters we use, especially in machine learning and in neural nets. Even for small problems, models like these often have far more parameters than we “need,” which causes problems for classical models of learning.
1 For making optimization “nice”
Certainly, some classic non-convex optimization problems can be lifted into convex problems by adding slack variables. Is it enough to imagine that something similar happens in neural nets, perhaps not lifting them into convex problems per se, but at least making the optimization better behaved in some sense?
The combination of overparameterization and SGD is argued to be the secret to how deep learning works, e.g. Allen-Zhu, Li, and Song (2018).
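A toy linear analogue makes the flavour of that argument concrete. This is a minimal sketch, not the construction from the cited work: it uses plain full-batch gradient descent rather than SGD, a linear model rather than a deep net, and the function name `gd_least_squares` is made up for illustration. With more parameters than data points, gradient descent started from zero drives the training loss to zero and lands on the minimum-norm interpolating solution, because the iterates never leave the row space of the data matrix.

```python
import numpy as np


def gd_least_squares(X, y, steps=500):
    """Full-batch gradient descent on 0.5 * ||X w - y||^2, started from w = 0.

    Hypothetical helper for illustration. The step size 1 / sigma_max(X)^2
    is the reciprocal of the gradient's Lipschitz constant, so the
    iteration is stable.
    """
    lr = 1.0 / np.linalg.norm(X, 2) ** 2  # spectral norm squared
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * (X.T @ (X @ w - y))
    return w


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 200))  # 20 data points, 200 parameters
    y = rng.normal(size=20)
    w_gd = gd_least_squares(X, y)
    w_min_norm = np.linalg.pinv(X) @ y  # minimum-norm interpolator
    print("max training residual:", np.max(np.abs(X @ w_gd - y)))
    print("distance to min-norm solution:", np.linalg.norm(w_gd - w_min_norm))
```

Nothing here is non-convex, so it says nothing about why deep nets train well; it only shows that surplus parameters plus a gradient method can pick out one particular well-behaved interpolant without being told to.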
RJ Lipton discusses Arno van den Essen’s incidental work on stabilization methods of polynomials, which, as far as I can tell, relates to transfer-function-type stability. Does this connect to the overparameterization in the rational transfer-function analysis of Hardt, Ma, and Recht (2018)? 🏗.
2 Double descent
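Adding data (or parameters?) can sometimes make the model worse before it makes it better: test error can rise as the model approaches the point where it only just interpolates the training data, then fall again as capacity keeps growing. E.g. Deep Double Descent.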
Possibly this phenomenon relates to the concept of data interpolation, although see Resolution of misconception of overfitting: Differentiating learning curves from Occam curves.
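One standard toy setting in which to look for this curve is misspecified linear regression: the target depends on many Gaussian features, but we fit only the first p of them, using the minimum-norm least-squares solution once p exceeds the number of training points. The sketch below is an illustration under those assumptions, not a reproduction of the Deep Double Descent experiments; the function name `double_descent_curve` and all its default values are arbitrary choices of mine.

```python
import numpy as np


def double_descent_curve(n_train=20, n_test=500, dim=100,
                         widths=(2, 5, 10, 15, 18, 20, 22, 25, 40, 100),
                         noise=0.5, seed=0):
    """Return a list of (p, train_mse, test_mse) for least squares
    fitted on the first p of `dim` Gaussian features.

    Hypothetical helper for illustration; defaults are arbitrary.
    """
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=dim) / np.sqrt(dim)  # true coefficients
    X_train = rng.normal(size=(n_train, dim))
    X_test = rng.normal(size=(n_test, dim))
    y_train = X_train @ beta + noise * rng.normal(size=n_train)
    y_test = X_test @ beta  # noiseless test targets

    curve = []
    for p in widths:
        A = X_train[:, :p]
        # For p > n_train the system is underdetermined and lstsq
        # returns the minimum-norm solution among all interpolants.
        coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
        train_mse = float(np.mean((A @ coef - y_train) ** 2))
        test_mse = float(np.mean((X_test[:, :p] @ coef - y_test) ** 2))
        curve.append((p, train_mse, test_mse))
    return curve


if __name__ == "__main__":
    for p, train_mse, test_mse in double_descent_curve():
        print(f"p={p:4d}  train={train_mse:.3e}  test={test_mse:.3e}")
```

The thing to watch in the printed table is the test column around p = n_train, the interpolation threshold, compared with its values at much smaller and much larger p. Training error drops to numerical zero once p is at least n_train, so the training column alone tells us nothing about where the trouble is.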
3 Data interpolation
a.k.a. benign overfitting. See interpolation/extrapolation in NNs.
4 Lottery ticket hypothesis
The Lottery Ticket hypothesis (Frankle and Carbin 2019; Hayou et al. 2020) asserts something like “there is a good compact network hidden inside the overparameterized one we have.” Intuitively, it’s computationally hard to find the hidden optimal network. I’m interested in computational bounds for this: How much cheaper is it to compute with a massive network than to find the tiny networks that outperform the large network? I’m also curious whether this helps with NN interpretation.
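The search procedure usually associated with this hypothesis is iterative magnitude pruning: train, delete the smallest-magnitude surviving weights, rewind the survivors to their original initial values, and repeat. Here is a minimal sketch of that loop on a toy linear model, where "network" just means a weight vector; the helper names `train` and `iterative_magnitude_pruning`, the pruning fraction, and the round count are all illustrative assumptions. Note the cost structure, which is what the question above is about: each pruning round pays for a full training run of the current model.

```python
import numpy as np


def train(w_init, mask, X, y, steps=500, lr=0.05):
    """Gradient descent on mean squared error; masked-out weights stay at zero.

    Hypothetical helper for illustration.
    """
    w = w_init * mask
    n = len(y)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n
        w -= lr * grad * mask  # pruned coordinates receive no update
    return w


def iterative_magnitude_pruning(X, y, w_init, rounds=3, prune_frac=0.5):
    """Return (mask, weights) after `rounds` of train / prune / rewind."""
    mask = np.ones_like(w_init)
    for _ in range(rounds):
        # Every round restarts from the ORIGINAL initialisation (rewinding).
        w = train(w_init, mask, X, y)
        alive = np.flatnonzero(mask)
        n_prune = int(prune_frac * alive.size)
        # Drop the smallest-magnitude weights among the survivors.
        drop = alive[np.argsort(np.abs(w[alive]))[:n_prune]]
        mask[drop] = 0.0
    # Final training run of the sparse "ticket" from the original init.
    return mask, train(w_init, mask, X, y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 32))
    w_true = np.zeros(32)
    w_true[:4] = 3.0  # only 4 of the 32 weights matter
    y = X @ w_true + 0.1 * rng.normal(size=200)
    w_init = 0.1 * rng.normal(size=32)
    mask, w = iterative_magnitude_pruning(X, y, w_init)
    print("surviving indices:", np.flatnonzero(mask))
    print("sparse-model MSE:", np.mean((X @ w - y) ** 2))
```

With the defaults, 32 weights are halved three times, leaving 4 survivors after four training runs in total. In a linear model finding the ticket is easy; the interesting computational question is how this scales when the model is a deep net and each training run is expensive.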
5 Let’s model it using singular learning theory
6 In extremely large models
7 In the wide-network limit
See Wide NNs.
8 Convex relaxation
See convex relaxation.
9 Weight space versus function space
See NNs in function space and also singular learning theory.

