# Overparameterization in large models

Improper learning, benign overfitting, double descent

April 4, 2018 — October 29, 2024

Notes on the generally weird behaviour of increasing the number of slack parameters we use, especially in machine learning, especially in neural nets. Most of these have far more parameters than they "need," which is a problem for classical models of learning. Herein we learn to stop fearing having too many parameters.

## 1 For making optimisation nice

Some classic non-convex optimisation problems can be lifted into convex ones by adding slack variables. Does something similar happen in neural nets, perhaps not lifting them into convex problems *per se*, but at least into better-behaved optimisations in some sense?
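A toy sketch of that intuition, in plain NumPy: the loss $f(a,b) = (ab - 1)^2$ is non-convex jointly in $(a, b)$, but becomes the convex problem $(s-1)^2$ under the substitution $s = ab$. Plain gradient descent on the "overparameterized" $(a, b)$ form still finds the global minimum from a generic start:

```python
# f(a, b) = (a*b - 1)^2 is non-convex in (a, b), but convex in s = a*b.
# Gradient descent on the overparameterized form still reaches the
# global minimum a*b = 1 from a generic initialisation.
a, b, lr = 2.0, -0.5, 0.05
for _ in range(500):
    g = 2.0 * (a * b - 1.0)               # d/ds of (s - 1)^2 at s = a*b
    a, b = a - lr * g * b, b - lr * g * a  # chain rule through s = a*b

# a*b is now close to 1
```

This is only a cartoon of the phenomenon; the interesting question is when high-dimensional analogues of this benignity hold for actual networks.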

The combination of overparameterization and SGD is argued to be the secret to how deep learning works, by e.g. Allen-Zhu, Li, and Song (2018).

RJ Lipton discusses Arno van den Essen's incidental work on stabilisation methods for polynomials, which relates, AFAICT, to transfer-function-type stability. Does this connect to the overparameterized rational-transfer-function analysis of Hardt, Ma, and Recht (2018)? *🏗*

## 2 Double descent

When adding parameters (or data) can transiently make the model worse: test error descends, peaks near the interpolation threshold, then descends a second time as the model grows further. E.g. Deep Double Descent (Nakkiran et al. 2019).

Possibly this phenomenon relates to the concept of data interpolation, although see *Resolution of misconception of overfitting: Differentiating learning curves from Occam curves*.

## 3 Data interpolation

a.k.a. benign overfitting. See interpolation/extrapolation in NNs.

## 4 Lottery ticket hypothesis

The Lottery Ticket hypothesis (Frankle and Carbin 2019; Hayou et al. 2020) asserts something like "there is a good compact network hidden inside the overparameterized one you have." Intuitively, it is computationally hard to find that hidden subnetwork. I am interested in computational bounds for this: *how much* cheaper is it to compute with a massive network than to find the tiny networks that do better?
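A schematic of the magnitude-pruning step at the heart of the lottery-ticket procedure. The training runs are stubbed out here, and the helper name `magnitude_prune` is my own; the essential moves are prune-by-magnitude, then rewind the surviving weights to their initial values:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Return a binary mask keeping the largest-|w| fraction of weights."""
    flat = np.abs(weights).ravel()
    k = int(np.ceil((1.0 - sparsity) * flat.size))  # number of weights kept
    threshold = np.partition(flat, -k)[-k]          # k-th largest magnitude
    return (np.abs(weights) >= threshold).astype(weights.dtype)

# Lottery-ticket loop (schematic): train, prune, rewind to init, retrain.
w_init = np.random.default_rng(1).standard_normal((4, 4))
w_trained = w_init * 1.5           # stand-in for an actual training run
mask = magnitude_prune(w_trained, sparsity=0.75)
w_rewound = w_init * mask          # candidate "winning ticket" at init
```

In the full procedure this prune-and-rewind loop is iterated, retraining the masked network each round; the computational question above is precisely the cost of those repeated retrainings versus one forward pass through the big network.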

## 5 In extremely large models

## 6 In the wide-network limit

See Wide NNs.

## 7 Convex relaxation

See convex relaxation.

## 8 In weight space versus in function space

## 9 References

*arXiv:1802.06509 [Cs]*.

*arXiv:1309.3117 [Cs, Math]*.

*arXiv:1501.00046 [Cs, Math, Stat]*.

*arXiv:1610.04210 [Cs, Math, Stat]*.

*Acta Numerica*.

*arXiv:1703.11008 [Cs]*.

*arXiv:1803.03635 [Cs]*.

*arXiv:2104.05508 [Cs, Stat]*.

*arXiv:1610.07531 [Cs, Math]*.

*The Journal of Machine Learning Research*.

*Neuron*.

*arXiv:2002.08797 [Cs, Stat]*.

*NIPS*.

*Proceedings of ICML*.

*arXiv:1912.02292 [Cs, Stat]*.

*Perspectives in Robust Control*. Lecture Notes in Control and Information Sciences.

*Neural Computation*.

*arXiv:1908.01755 [Cs, Stat]*.

*IEEE Transactions on Information Theory*.

*Proceedings of ICLR*.

*Communications of the ACM*.