Gradient descent at scale

Little to say here for now, but I want to record some terms for later: ZeRO and other approaches to parallel/sharded gradient descent that enable training much larger models within a fixed per-GPU memory budget.
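To make the core idea concrete, here is a minimal single-process sketch of ZeRO stage-1 style sharding (optimizer-state partitioning): gradients are still averaged across all data-parallel workers, but each worker owns only a 1/n shard of the optimizer state (here, SGD momentum) and updates only its shard of the parameters, which are then re-assembled as if by all-gather. The function name and the use of NumPy to simulate workers are my own illustration, not an API from the papers cited below.

```python
import numpy as np

def zero1_step(params, grads_per_worker, momenta, lr=0.1, beta=0.9):
    """Toy simulation of one ZeRO stage-1 style update with n workers.

    Each worker holds the full parameters but only a shard of the
    momentum buffer, so optimizer-state memory per worker shrinks by
    a factor of n compared with plain data parallelism.
    """
    n = len(grads_per_worker)
    # "all-reduce": average the gradients across data-parallel workers
    grad = np.mean(grads_per_worker, axis=0)
    # partition parameter indices into one shard per worker
    shards = np.array_split(np.arange(params.size), n)
    new_params = params.copy()
    for rank, idx in enumerate(shards):
        # each rank updates momentum and parameters only for its shard
        momenta[rank] = beta * momenta[rank] + grad[idx]
        new_params[idx] -= lr * momenta[rank]
    # "all-gather": every rank ends up with the full updated parameters
    return new_params, momenta

n_workers = 4
params = np.ones(8)
momenta = [np.zeros(len(s)) for s in np.array_split(np.arange(8), n_workers)]
grads = [np.full(8, 0.5) for _ in range(n_workers)]  # identical toy gradients
params, momenta = zero1_step(params, grads, momenta)
```

The memory arithmetic is the point: with Adam-style optimizers the state is 2–3× the parameter size, so sharding it across n workers recovers most of that budget for larger models.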


Kumar, Sameer, James Bradbury, Cliff Young, Yu Emma Wang, Anselm Levskaya, Blake Hechtman, Dehao Chen, et al. 2020. “Exploring the Limits of Concurrency in ML Training on Google TPUs.” November 6, 2020.
Rajbhandari, Samyam, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.” May 13, 2020.
Rajbhandari, Samyam, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. 2021. “ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning.” April 15, 2021.
Rasley, Jeff, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. “DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters.” In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 3505–6. KDD ’20. New York, NY, USA: Association for Computing Machinery.
Ren, Jie, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. “ZeRO-Offload: Democratizing Billion-Scale Model Training.” January 17, 2021.
Tang, Hanlin, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, and Yuxiong He. 2021. “1-Bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed.” June 29, 2021.
Zhang, Minjia, and Yuxiong He. 2020. “Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping.” Advances in Neural Information Processing Systems 33: 14011–23.
