ML scaling in the massive parameter limit

Brief links on the theme of scaling in the extremely-large-model, large-data limit, especially relevant to Transformer language models for now. A toy scaling-law fit is sketched after the links.

  • Exploring the Limits of Concurrency in ML Training on Google TPUs, Kumar et al. (2020) (BERT trained in 23 seconds on a 4096-chip TPU pod; “We view the current competition in language understanding as a modern-day Space Race, with competing organizations assembling both giant machines and giant models in the quest for an Artificial General Intelligence breakthrough.”)
  • When Do You Need Billions of Words of Pretraining Data?, Zhang et al. (2020) (how do the linguistic abilities NNs learn from text change as the amount of pretraining data grows?)
  • nostalgebraist’s summaries
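
For orientation, here is a minimal sketch of fitting the kind of power law these papers report, L(N) = L_inf + (N_c / N)^alpha, to loss-versus-parameter-count measurements. The data points, initial guesses, and variable names below are invented for illustration and are not taken from any of the papers above.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (parameter count, validation loss) pairs -- invented numbers.
N = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
loss = np.array([5.2, 4.1, 3.3, 2.7, 2.3])

def scaling_law(log_N, L_inf, log_Nc, alpha):
    # L(N) = L_inf + (N_c / N)**alpha, written in log space so the optimizer
    # never raises a negative base to a fractional power.
    return L_inf + np.exp(alpha * (log_Nc - log_N))

params, _ = curve_fit(scaling_law, np.log(N), loss, p0=[2.0, np.log(1e8), 0.1])
L_inf, log_Nc, alpha = params
print(f"irreducible loss ~ {L_inf:.2f}, N_c ~ {np.exp(log_Nc):.2e}, alpha ~ {alpha:.3f}")
```

The fitted exponent alpha is the quantity the scaling-law literature tracks; Sharma and Kaplan (2020), for example, relate it to the intrinsic dimension of the data manifold.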

References

Henighan, Tom, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, et al. 2020. “Scaling Laws for Autoregressive Generative Modeling.” November 5, 2020. http://arxiv.org/abs/2010.14701.
Kumar, Sameer, James Bradbury, Cliff Young, Yu Emma Wang, Anselm Levskaya, Blake Hechtman, Dehao Chen, et al. 2020. “Exploring the Limits of Concurrency in ML Training on Google TPUs.” November 6, 2020. http://arxiv.org/abs/2011.03641.
Sharma, Utkarsh, and Jared Kaplan. 2020. “A Neural Scaling Law from the Dimension of the Data Manifold.” April 22, 2020. http://arxiv.org/abs/2004.10802.
Zhang, Yian, Alex Warstadt, Haau-Sing Li, and Samuel R. Bowman. 2020. “When Do You Need Billions of Words of Pretraining Data?” November 10, 2020. http://arxiv.org/abs/2011.04946.