ML scaling laws in the massive model limit


Brief links on the theme of scaling in the extremely large model/large data limit.

Big transformers

One fun result comes from Transformer language models, possibly a fruitful new front in the complexity of statistics. An interesting observation back in 2020 was that there seemed to be an unexpected trade-off: you can go faster by training a bigger network. Indeed, there is a whole family of observations in this vein.

nostalgebraist summarises Henighan et al. (2020); Kaplan et al. (2020):

L(D): information

OpenAI derives a scaling law called L(D). This law is the best you could possibly do – even with arbitrarily large compute/models – if you are only allowed to train on D data points.

No matter how good your model is, there is only so much it can learn from a finite sample. L(D) quantifies this intuitive fact (if the model is an autoregressive transformer).
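As a minimal sketch (not the papers’ code), the fitted form is a pure power law in the dataset size. The constants below are the approximate values Kaplan et al. (2020) report for autoregressive transformer language models, quoted here only for illustration:

```python
# Data-limited scaling law from Kaplan et al. (2020): L(D) = (D_c / D) ** alpha_D.
# alpha_D and D_c are the paper's approximate fitted constants (loss in
# nats/token, D in tokens); treat them as illustrative, not definitive.
ALPHA_D = 0.095
D_C = 5.4e13  # tokens

def loss_floor(n_tokens: float) -> float:
    """Best achievable test loss given n_tokens of training data,
    no matter how large the model or the compute budget."""
    return (D_C / n_tokens) ** ALPHA_D

for d in (1e9, 1e10, 1e11, 1e12):
    print(f"D = {d:.0e} tokens -> L(D) ~ {loss_floor(d):.2f} nats/token")
```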

L(C): budgeting

OpenAI also derives another scaling law called L(C). This is the best you can do with compute C, if you spend it optimally.

What does optimal spending look like? Remember, you can spend a unit of compute on

  • a bigger model (N), or
  • training the same model for longer (S)

…In the compute regime we are currently in, making the model bigger is way more effective than taking more steps.
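Here is a minimal sketch of what “spend it on the model” means quantitatively, assuming the approximate compute-efficient-frontier exponents reported by Kaplan et al. (2020): optimal model size growing roughly as C^0.73, batch size as C^0.24, serial steps as only C^0.03. The exponents are their empirical fits and are used here purely as an illustration of the qualitative point.

```python
# Hedged sketch of how Kaplan et al. (2020) suggest splitting a growing
# compute budget C along the compute-efficient frontier. Nearly all of the
# growth goes into model size N, very little into serial steps S; the
# exponents below are the roughly reported fits, quoted for illustration.
P_MODEL = 0.73   # optimal model size N ~ C**0.73
P_BATCH = 0.24   # batch size B ~ C**0.24
P_STEPS = 0.03   # serial steps S ~ C**0.03

def scale_up(factor: float) -> dict:
    """If compute grows by `factor`, how much should each knob grow?"""
    return {
        "model size (N)": factor ** P_MODEL,
        "batch size (B)": factor ** P_BATCH,
        "serial steps (S)": factor ** P_STEPS,
    }

# A 1000x compute increase: the model grows ~150x, steps barely move.
for knob, growth in scale_up(1000.0).items():
    print(f"{knob}: x{growth:.1f}")
```

Under these fits, a 1000× compute increase buys a roughly 150× bigger model but only about 20% more serial steps, which is the sense in which making the model bigger dominates.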

Bitter lesson

The bitter lesson of history in AI, as Rich Sutton puts it, is that “general methods that leverage computation are ultimately the most effective, and by a large margin.”

Misc

  • Exploring the Limits of Concurrency in ML Training on Google TPUs, Kumar et al. (2020) (BERT trained in 23 seconds on a TPU-4096; “We view the current competition in language understanding as a modern-day Space Race, with competing organizations assembling both giant machines and giant models in the quest for an Artificial General Intelligence breakthrough.”)
  • Zhang et al. (2020) (how do NNs learn from language as the amount of pretraining data increases?)

References

Henighan, Tom, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, et al. 2020. “Scaling Laws for Autoregressive Generative Modeling.” November 5, 2020. http://arxiv.org/abs/2010.14701.
Kaplan, Jared, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. “Scaling Laws for Neural Language Models.” January 22, 2020. http://arxiv.org/abs/2001.08361.
Kumar, Sameer, James Bradbury, Cliff Young, Yu Emma Wang, Anselm Levskaya, Blake Hechtman, Dehao Chen, et al. 2020. “Exploring the Limits of Concurrency in ML Training on Google TPUs.” November 6, 2020. http://arxiv.org/abs/2011.03641.
Sharma, Utkarsh, and Jared Kaplan. 2020. “A Neural Scaling Law from the Dimension of the Data Manifold.” April 22, 2020. http://arxiv.org/abs/2004.10802.
Zhang, Yian, Alex Warstadt, Haau-Sing Li, and Samuel R. Bowman. 2020. “When Do You Need Billions of Words of Pretraining Data?” November 10, 2020. http://arxiv.org/abs/2011.04946.
