Scaling laws for very large neural nets

Compute/size/data tradeoffs



Got good behaviour from a million-parameter model? Want to see if stuff gets weirder as we hit a billion parameters? Turns out it does!

Brief links on the theme of scaling in the extremely-large-model/large-data limit, and what that limit does to the behaviour of the models. A new front in the complexity, and/or statistical mechanics, of statistics.

As to how to scale up these models in practice, see distributed gradient descent.

Side note: The better lesson

Sutton’s famous bitter lesson:

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.

Lots of people declaim this one, e.g. On the futility of trying to be clever (the bitter lesson redux).

The better lesson is:

The biggest lesson that can be read from 70 years of AI research is that a lot of the good ideas that did not require massive compute budget have already been published by smart people who did not have GPUs, so we need to leverage our technological advantage if we want to get cited.

Research, and indeed predictive analytics, is a competitive market, and advice about relative advantage needs strategic context. But it does not sound as profound if we phrase it that way, eh?

Emin Orhan argues:

I do think that there are many interesting research directions consistent with the philosophy of the bitter lesson that can have more meaningful, longer-term impact than small algorithmic or architectural tweaks. I just want to wrap up this post by giving a couple of examples below:

  1. Probing the limits of model capabilities as a function of training data size: can we get to something close to human-level machine translation by leveraging everything multi-lingual on the web (I think we’ve learned that the answer to this is basically yes)? Can we get to something similar to human-level language understanding by scaling up the GPT-3 approach a couple of orders of magnitude (I think the answer to this is probably we don’t know yet)? Of course, smaller scale, less ambitious versions of these questions are also incredibly interesting and important.

  2. Finding out what we can learn from different kinds of data and how what we learn differs as a function of this: e.g. learning from raw video data vs. learning from multi-modal data received by an embodied agent interacting with the world; learning from pure text vs. learning from text + images or text + video.

  3. Coming up with new model architectures or training methods that can leverage data and compute more efficiently, e.g. more efficient transformers, residual networks, batch normalization, self-supervised learning algorithms that can scale to large (ideally unlimited) data (e.g. likelihood-based generative pre-training, contrastive learning).

That’s great if you have essentially unlimited data, but it is not a helpful framing for me, working in hydrology, where a single data point on my last project cost AUD 700,000, because that is what it costs to drill a thousand-metre well. Telling me that I should collect a billion more data points rather than being clever is not useful, because it would not be clever to collapse the entire global economy collecting my data points.

What we generally want is not a homily that being clever is a waste of time, so much as a trade-off curve quantifying how clever it is worth being. That would probably be the best lesson. And investigating that trade-off will hopefully happen here.

Big transformers

One fun result comes from Transformer language models. An interesting observation, way back in 2020, was that there seemed to be an unexpected trade-off whereby you can reach a given loss faster by training a bigger network. Indeed, there is a whole family of observations in this vein trying to pin down the actual scaling behaviour.

nostalgebraist summarises Henighan et al. (2020) and Kaplan et al. (2020):

L(D): information

OpenAI derives a scaling law called L(D). This law is the best you could possibly do – even with arbitrarily large compute/models – if you are only allowed to train on D data points.

No matter how good your model is, there is only so much it can learn from a finite sample. L(D) quantifies this intuitive fact (if the model is an autoregressive transformer).
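
Concretely, Kaplan et al. (2020) find that this data-limited loss is well described by a simple power law; the constants below are the approximate fitted values reported there for autoregressive transformer language models, so treat them as indicative rather than gospel:

```latex
% Data-limited loss for autoregressive transformer LMs.
% L is test cross-entropy in nats per token, D is dataset size in tokens.
% Approximate fitted constants as reported in Kaplan et al. (2020).
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D},
\qquad \alpha_D \approx 0.095,
\qquad D_c \approx 5.4 \times 10^{13}\ \text{tokens}.
```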

L(C): budgeting

OpenAI also derives a second scaling law, called L(C). This is the best you can do with compute C, if you spend it optimally.

What does optimal spending look like? Remember, you can spend a unit of compute on

  • a bigger model (N), or
  • training the same model for longer (S)

…In the compute regime we are currently in, making the model bigger is way more effective than taking more steps.
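
For reference, the compute-efficient frontier in Kaplan et al. (2020) is also approximately a power law, and their fitted allocation exponents are what justify the “make it bigger” advice: most of each extra unit of compute goes into parameters, very little into extra optimisation steps. Again, these are the rough exponents reported there:

```latex
% Compute-optimal loss and allocation of a compute budget C_min
% (approximate exponents reported in Kaplan et al. 2020).
% N: parameters, B: batch size, S: optimisation steps.
\begin{align*}
  L(C_{\min}) &\approx \left(\frac{C_c^{\min}}{C_{\min}}\right)^{0.05}, \\
  N_{\mathrm{opt}} &\propto C_{\min}^{0.73}, \qquad
  B_{\mathrm{opt}} \propto C_{\min}^{0.24}, \qquad
  S_{\mathrm{opt}} \propto C_{\min}^{0.03}.
\end{align*}
```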

Controversy! The scaling laws have since been revised (Hoffmann et al. 2022).
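
The revision (the “Chinchilla” analysis) finds that parameters and training tokens should scale roughly equally with compute, which works out to the folk rule of around 20 tokens per parameter, rather than spending almost everything on model size. Here is a minimal sketch of that revised allocation, using the standard C ≈ 6ND FLOPs approximation; the 20-tokens-per-parameter ratio is a rounded rule of thumb, not an exact constant:

```python
import math

def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Roughly compute-optimal (parameters, tokens) for a FLOPs budget,
    per the Hoffmann et al. (2022) finding that N and D should scale about
    equally, i.e. D ≈ 20 N, combined with the usual C ≈ 6 N D approximation."""
    # C ≈ 6 N D and D ≈ k N  =>  N ≈ sqrt(C / (6 k)),  D = k N
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A budget of ~5.8e23 FLOPs comes out near 70B parameters and 1.4T tokens,
# which is in the ballpark of the published Chinchilla configuration.
n, d = chinchilla_allocation(5.8e23)
print(f"params ≈ {n:.2e}, tokens ≈ {d:.2e}")
```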

Incoming

References

Henighan, Tom, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, et al. 2020. “Scaling Laws for Autoregressive Generative Modeling.” arXiv:2010.14701 [Cs], November.
Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, et al. 2022. “Training Compute-Optimal Large Language Models.” arXiv.
Hu, Hang, Zhao Song, Omri Weinstein, and Danyang Zhuo. 2022. “Training Overparametrized Neural Networks in Sublinear Time.” arXiv.
Kaplan, Jared, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. “Scaling Laws for Neural Language Models.” arXiv:2001.08361 [Cs, Stat], January.
Kirstain, Yuval, Patrick Lewis, Sebastian Riedel, and Omer Levy. 2021. “A Few More Examples May Be Worth Billions of Parameters.” arXiv:2110.04374 [Cs], October.
Kumar, Sameer, James Bradbury, Cliff Young, Yu Emma Wang, Anselm Levskaya, Blake Hechtman, Dehao Chen, et al. 2020. “Exploring the Limits of Concurrency in ML Training on Google TPUs.” arXiv:2011.03641 [Cs], November.
Sharma, Utkarsh, and Jared Kaplan. 2020. “A Neural Scaling Law from the Dimension of the Data Manifold.” arXiv:2004.10802 [Cs, Stat], April.
Sorscher, Ben, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari S. Morcos. 2023. “Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning.” arXiv.
Togelius, Julian, and Georgios N. Yannakakis. 2023. “Choose Your Weapon: Survival Strategies for Depressed AI Academics.” arXiv.
Zhang, Yian, Alex Warstadt, Haau-Sing Li, and Samuel R. Bowman. 2020. “When Do You Need Billions of Words of Pretraining Data?” arXiv:2011.04946 [Cs], November.
