Scaling laws for very large neural nets

Theory of trading off budgets for compute, model size, and data

2021-01-14 — 2025-02-24

AI safety

bounded compute

functional analysis

machine learning

Got good behaviour from a million-parameter model? Want to see if stuff gets weirder as we hit a billion parameters? Turns out it does! It even seems to do so dependably! There’s something philosophically deep here. Why does looking at more stuff seem to bring more computationally complex problems within reach? I don’t know, but I’m keen to carve out some time to solve that.

Brief links on the theme of scaling in the extremely large model/large data limit and what that does to the models’ behaviour. A new front in the complexity and/or statistical mechanics of statistics, and whether neural networks extrapolate.

For how to scale up these models in practice, see distributed gradient descent.

Content on this page hasn’t been updated as fast as the field has been moving; you should follow the references for the latest.

1 Bitter lessons in compute

See optimal cleverness.

2 Big transformers

One fun result comes from transformer language models. An interesting observation back in 2020 was that there seemed to be an unexpected trade-off where you can go faster by training a bigger network. I think this paper was ground zero for modern scaling studies, which try to identify and predict optimal trade-offs and ultimate performance under different scaling (of compute, data, parameters) regimes.

nostalgebraist summarises Henighan et al. (2020); Kaplan et al. (2020):

2.1 L(D): information

OpenAI derives a scaling law called L(D). This law is the best you could possibly do — even with arbitrarily large compute/models — if you are only allowed to train on D data points.

No matter how good your model is, there’s only so much it can learn from a finite sample. L(D) quantifies this intuitive fact (if the model is an autoregressive transformer).
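To make the data-limited law concrete, here is a minimal sketch that assumes the power-law form $L(D) = (D_c / D)^{α_D}$ used in Kaplan et al. (2020). The constants `true_d_c` and `true_alpha` are made-up illustrative values, not the published fits, and the data are synthetic. The point is only that such a law is a straight line in log-log space, so an ordinary least-squares line fit should recover it.

```python
import numpy as np

def power_law_loss(d, d_c, alpha_d):
    """Assumed data-limited loss L(D) = (D_c / D) ** alpha_D."""
    return (d_c / d) ** alpha_d

def fit_power_law(d, loss):
    """Fit log L = -alpha_D * log D + alpha_D * log D_c by least squares."""
    slope, intercept = np.polyfit(np.log(d), np.log(loss), 1)
    alpha_d = -slope
    d_c = np.exp(intercept / alpha_d)
    return d_c, alpha_d

# Hypothetical constants and noise-free synthetic data, for illustration only.
true_d_c, true_alpha = 1e13, 0.1
d = np.logspace(6, 10, 9)  # dataset sizes in tokens
loss = power_law_loss(d, true_d_c, true_alpha)

d_c_hat, alpha_hat = fit_power_law(d, loss)
print(f"alpha_D ~ {alpha_hat:.3f}, D_c ~ {d_c_hat:.3g}")
```

With real training runs the points would be noisy and the fit would only be trustworthy over the range of $D$ actually observed.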

2.2 L(C): budgeting

OpenAI also derives another scaling law called L(C). This is the best you can do with compute C if you spend it optimally.

What does optimal spending look like? Remember, you can spend a unit of compute on

- a bigger model (N), or
- training the same model for longer (S)

… In the compute regime we’re currently in, making the model bigger is way more effective than taking more steps.
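The budgeting question can be sketched numerically. The snippet below assumes the common approximation $C \approx 6 N D$ (training FLOPs as roughly six times parameters times tokens processed, where tokens processed stand in for training steps S at a fixed batch size) and a compute-optimal model size of the form $N \propto C^{a}$. The anchor values `n_ref` and `c_ref` are hypothetical, and the two exponents are approximate figures associated with Kaplan et al. (2020) and Hoffmann et al. (2022) respectively; check the papers before relying on them.

```python
def split_compute(c_flops, n_exponent, n_ref=1e8, c_ref=1e18):
    """Split a FLOP budget into parameters N and training tokens D.

    Assumes C ~ 6 * N * D and N_opt = n_ref * (C / c_ref) ** n_exponent.
    n_ref and c_ref are hypothetical anchors, not fitted constants.
    """
    n_params = n_ref * (c_flops / c_ref) ** n_exponent
    n_tokens = c_flops / (6.0 * n_params)
    return n_params, n_tokens

budget = 1e22  # hypothetical training budget in FLOPs

# Exponent ~0.73: most extra compute goes into a bigger model.
n_model_heavy, d_model_heavy = split_compute(budget, n_exponent=0.73)
# Exponent ~0.5: model size and data grow at about the same rate.
n_balanced, d_balanced = split_compute(budget, n_exponent=0.5)

print(f"model-heavy: N ~ {n_model_heavy:.2e}, D ~ {d_model_heavy:.2e}")
print(f"balanced:    N ~ {n_balanced:.2e}, D ~ {d_balanced:.2e}")
```

The same budget buys a much larger model trained on far fewer tokens under the steeper exponent, which is the sense in which "making the model bigger" dominated in that regime.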

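The scaling laws continue to be revised. Hoffmann et al. (2022), for example, argue that model size and training tokens should grow in roughly equal proportion as compute increases, which tempers the bigger-model-first advice above.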

2.3 Observational

Ruan, Maddison, and Hashimoto (2025):

Understanding how language model performance varies with scale is critical to benchmark and algorithm development. Scaling laws are one approach to building this understanding, but the requirement of training models across many different scales has limited their use. We propose an alternative, observational approach that bypasses model training and instead builds scaling laws from ~100 publicly available models. Building a single scaling law from multiple model families is challenging due to large variations in their training compute efficiencies and capabilities. However, we show that these variations are consistent with a simple, generalized scaling law where language model performance is a function of a low-dimensional capability space, and model families only vary in their efficiency in converting training compute to capabilities. Using this approach, we show the surprising predictability of complex scaling phenomena: we show that several emergent phenomena follow a smooth, sigmoidal behaviour and are predictable from small models; we show that the agent performance of models such as GPT-4 can be precisely predicted from simpler non-agentic benchmarks; and we show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.

Standard compute scaling. In compute scaling laws, there is a hypothesized power-law relationship between models’ compute measures $C_{m}$ (e.g., training FLOPs) and their errors $E_{m}$ (e.g., perplexity). Specifically, for a model $m$ within a family $f$ (e.g., Llama-2 7B, 13B, and 70B) we hypothesize

$\log (E_{m}) \approx β_{f} \log (C_{m}) + α_{f}$

and if this linear fit is sufficiently accurate, we draw inferences about the performance of a model at future compute scales $C' > C$ by extrapolating this relationship. However, fitting such a scaling law can be tricky, as each model family $f$ and downstream benchmark has its own scaling coefficients $β_{f}$ and $α_{f}$. This means that scaling experiments, especially for post-training analysis, are often fitted on very few (3-5) models sharing the same model family, and any predictions are valid only for a specific scaling strategy used within a model family. Several studies […] have generalized the functional form to analyse the scaling of LMs’ downstream performance (where $E_{m}$ is normalised to $[0, 1]$) with a sigmoidal link function $σ$:

$σ^{- 1} (E_{m}) \approx β_{f} \log (C_{m}) + α_{f}$
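The two per-family fits above are both straight lines in $\log C_m$ after applying a link function, so they can be sketched with one helper. This is a minimal illustration on synthetic data, assuming a logistic sigmoid for $σ$; the three compute values and the coefficients `true_beta` and `true_alpha` are hypothetical, chosen only to mimic the "very few (3-5) models" situation the quote describes.

```python
import numpy as np

def logit(p):
    """Inverse of the logistic sigmoid, playing the role of sigma^{-1}."""
    return np.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_family(compute, error, link=np.log):
    """Fit link(E_m) ~ beta_f * log(C_m) + alpha_f for one model family."""
    beta_f, alpha_f = np.polyfit(np.log(compute), link(error), 1)
    return beta_f, alpha_f

def predict(compute, beta_f, alpha_f, inverse_link=np.exp):
    """Extrapolate the fitted law to a new compute scale."""
    return inverse_link(beta_f * np.log(compute) + alpha_f)

# Hypothetical family of three models with errors normalised to [0, 1].
compute = np.array([1e21, 1e22, 1e23])  # training FLOPs
true_beta, true_alpha = -0.5, 25.0
error = sigmoid(true_beta * np.log(compute) + true_alpha)

beta_hat, alpha_hat = fit_family(compute, error, link=logit)
forecast = predict(1e24, beta_hat, alpha_hat, inverse_link=sigmoid)
print(f"beta_f ~ {beta_hat:.3f}, forecast error at 1e24 FLOPs ~ {forecast:.3f}")
```

Passing `link=np.log` and `inverse_link=np.exp` gives the plain power-law version instead. With only three points per family there are almost no degrees of freedom left to check the fit, which is the weakness the observational approach tries to address.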

Observational scaling. In our work, we hypothesize the existence of a low-dimensional capability measure for LMs that relate compute to more complex LM capabilities and can be extracted from observable standard LM benchmarks, as illustrated in Figure 2. Specifically, given $T$ simple benchmarks and $B_{i, m}$ the error of a model $m$ on benchmark $i \in [T]$, we hypothesize that there exists some capability vector $S_{m} \in \mathbb{R}^{K}$ such that,

$\begin{aligned} σ^{- 1} (E_{m}) & \approx β^{⊤} S_{m} + α \\ S_{m} & \approx θ_{f} \log (C_{m}) + ν_{f} \\ B_{i, m} & \approx γ_{i}^{⊤} S_{m} \end{aligned}$

for $θ_{f}, ν_{f}, β \in \mathbb{R}^{K}, α \in \mathbb{R}$, and orthonormal vectors $γ_{i} \in \mathbb{R}^{K}$.
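One way to read the first and third equations is as a principal component analysis followed by a regression: the orthonormal $γ_{i}$ look like PCA loadings, and $S_{m}$ like the component scores. The sketch below takes that reading as an assumption and runs it on synthetic data. The sizes, loadings and coefficients are all hypothetical, the benchmark matrix is mean-centred before the SVD (a choice of mine, not something the equations specify), and the per-family step $S_{m} \approx θ_{f} \log (C_{m}) + ν_{f}$ is left out for brevity.

```python
import numpy as np

def extract_capabilities(benchmarks, k):
    """PCA via SVD on a (models x benchmarks) matrix.

    Returns scores S with shape (models, k) and loadings gamma with
    shape (benchmarks, k), whose columns are orthonormal.
    """
    centered = benchmarks - benchmarks.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    gamma = vt[:k].T
    scores = centered @ gamma
    return scores, gamma

def fit_downstream(scores, error):
    """Least squares for logit(E_m) ~ beta^T S_m + alpha."""
    design = np.column_stack([scores, np.ones(len(scores))])
    target = np.log(error / (1.0 - error))
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[:-1], coef[-1]

# Synthetic stand-in for ~20 public models scored on 6 simple benchmarks.
rng = np.random.default_rng(0)
n_models, n_benchmarks, k = 20, 6, 2
true_s = rng.normal(size=(n_models, k))
loadings, _ = np.linalg.qr(rng.normal(size=(n_benchmarks, k)))
benchmarks = true_s @ loadings.T + 0.5
error = 1.0 / (1.0 + np.exp(-(true_s @ np.array([-1.0, 0.5]) + 0.2)))

scores, gamma = extract_capabilities(benchmarks, k)
beta_hat, alpha_hat = fit_downstream(scores, error)
fitted = 1.0 / (1.0 + np.exp(-(scores @ beta_hat + alpha_hat)))
print("max abs fit error:", np.abs(fitted - error).max())
```

Because the synthetic benchmarks are exactly rank $K$, the recovered scores are a rotation of the true capabilities and the downstream fit is essentially exact. Real benchmark tables are noisy and incomplete, so the fit there would be approximate at best.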

3 Incoming

- The Scaling Paradox (Toby Ord)
  - AI progress as a function of time is impressive even if AI progress as a function of resources is not.
  - The scaling laws are impressively smooth and long-lasting, but are a proof of poor but predictable scaling, rather than impressive scaling.
  - While we know that AI quality metrics scale very poorly with respect to resources, the real-world impacts may scale much better.
- Zhang et al. (2020) (how do NNs learn from language as n increases?)
- DeepSpeed Compression: A composable library for extreme compression and zero-cost quantization (targeting large language models)
- Is Deep Learning Actually Hitting a Wall?
- Scale, schlep, and systems

4 References

Biderman, Schoelkopf, Anthony, et al. 2023. “Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling.” In Proceedings of the 40th International Conference on Machine Learning.

Brill. 2024. “Neural Scaling Laws Rooted in the Data Distribution.”

Douglas, and Verstyuk. 2025. “Progress in Artificial Intelligence and Its Determinants.”

Henighan, Kaplan, Katz, et al. 2020. “Scaling Laws for Autoregressive Generative Modeling.” arXiv:2010.14701 [Cs].

Hoffmann, Borgeaud, Mensch, et al. 2022. “Training Compute-Optimal Large Language Models.”

Hu, Song, Weinstein, et al. 2022. “Training Overparametrized Neural Networks in Sublinear Time.”

Hutter. 2021. “Learning Curve Theory.”

Kaplan, McCandlish, Henighan, et al. 2020. “Scaling Laws for Neural Language Models.” arXiv:2001.08361 [Cs, Stat].

Kirstain, Lewis, Riedel, et al. 2021. “A Few More Examples May Be Worth Billions of Parameters.” arXiv:2110.04374 [Cs].

Kumar, Bradbury, Young, et al. 2020. “Exploring the Limits of Concurrency in ML Training on Google TPUs.” arXiv:2011.03641 [Cs].

Mahowald, Ivanova, Blank, et al. 2024. “Dissociating language and thought in large language models.” Trends in Cognitive Sciences.

Muennighoff, Rush, Barak, et al. 2023. “Scaling Data-Constrained Language Models.” Advances in Neural Information Processing Systems.

Naveed, Khan, Qiu, et al. 2024. “A Comprehensive Overview of Large Language Models.”

Owen. 2024. “How Predictable Is Language Model Benchmark Performance?”

Ruan, Maddison, and Hashimoto. 2025. “Observational Scaling Laws and the Predictability of Language Model Performance.” In Advances in Neural Information Processing Systems.

Schaeffer, Miranda, and Koyejo. 2023. “Are Emergent Abilities of Large Language Models a Mirage?” Advances in Neural Information Processing Systems.

Sharma, and Kaplan. 2020. “A Neural Scaling Law from the Dimension of the Data Manifold.” arXiv:2004.10802 [Cs, Stat].

Sorscher, Geirhos, Shekhar, et al. 2023. “Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning.”

Thirunavukarasu, Ting, Elangovan, et al. 2023. “Large Language Models in Medicine.” Nature Medicine.

Togelius, and Yannakakis. 2023. “Choose Your Weapon: Survival Strategies for Depressed AI Academics.”

Wei, Tay, Bommasani, et al. 2022. “Emergent Abilities of Large Language Models.”

Zhang, Warstadt, Li, et al. 2020. “When Do You Need Billions of Words of Pretraining Data?” arXiv:2011.04946 [Cs].

Zhao, Zhou, Li, et al. 2024. “A Survey of Large Language Models.”