Scaling laws for very large neural nets

Compute/size/data tradeoffs

January 14, 2021 — February 16, 2023

bounded compute

functional analysis

machine learning

model selection

optimization

statmech

Got good behaviour from a million parameter model? Want to see if stuff gets weirder as we hit a billion parameters? Turns out it does!

Brief links on the theme of scaling in the extremely large model/large data limit and what that does to the behaviour of the models. A new front in the complexity, and/or statistical mechanics of statistics.

As to how to scale up these models in practice, see distributed gradient descent.

1 Side note: The bitter, better lesson

See optimal cleverness.

2 Big transformers

One fun result comes from Transformer language models. An interesting observation way back in 2020 was that there seemed to be an unexpected trade-off where you can go faster by training a bigger network. Indeed, there is a whole family of observations in this vein trying to identify actual scaling behaviour.

nostalgebraist summarises Henighan et al. (2020);Kaplan et al. (2020):

2.1 L(D): information

OpenAI derives a scaling law called L(D). This law is the best you could possibly do – even with arbitrarily large compute/models – if you are only allowed to train on D data points.

No matter how good your model is, there is only so much it can learn from a finite sample. L(D) quantifies this intuitive fact (if the model is an autoregressive transformer).

2.2 L(C): budgeting

OpenAI also derives another a scaling law called L(C). This is the best you can do with compute C, if you spend it optimally.

What does optimal spending look like? Remember, you can spend a unit of compute on

a bigger model (N), or

training the same model for longer (S)

…In the compute regime we are currently in, making the model bigger is way more effective than taking more steps.

Controversy! the scaling laws have been revised.

3 Incoming

Zhang et al. (2020) (how do NNs learn from language as n increases?
DeepSpeed Compression: A composable library for extreme compression and zero-cost quantization (targeting large language models)

4 References

Henighan, Kaplan, Katz, et al. 2020. “Scaling Laws for Autoregressive Generative Modeling.” arXiv:2010.14701 [Cs].

Hoffmann, Borgeaud, Mensch, et al. 2022. “Training Compute-Optimal Large Language Models.”

Hu, Song, Weinstein, et al. 2022. “Training Overparametrized Neural Networks in Sublinear Time.”

Kaplan, McCandlish, Henighan, et al. 2020. “Scaling Laws for Neural Language Models.” arXiv:2001.08361 [Cs, Stat].

Kirstain, Lewis, Riedel, et al. 2021. “A Few More Examples May Be Worth Billions of Parameters.” arXiv:2110.04374 [Cs].

Kumar, Bradbury, Young, et al. 2020. “Exploring the Limits of Concurrency in ML Training on Google TPUs.” arXiv:2011.03641 [Cs].

Sharma, and Kaplan. 2020. “A Neural Scaling Law from the Dimension of the Data Manifold.” arXiv:2004.10802 [Cs, Stat].

Sorscher, Geirhos, Shekhar, et al. 2023. “Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning.”

Togelius, and Yannakakis. 2023. “Choose Your Weapon: Survival Strategies for Depressed AI Academics.”

Zhang, Warstadt, Li, et al. 2020. “When Do You Need Billions of Words of Pretraining Data?” arXiv:2011.04946 [Cs].

1 Side note: The bitter, better lesson

2 Big transformers

2.1 L(D): information

2.2 L(​C): budgeting

3 Incoming

4 References

2.2 L(C): budgeting