Little to say here for now, but I want to record various terms for later use: ZeRO and other methods of parallel/sharded gradient descent that enable training much larger models on a fixed GPU memory budget.
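To make the memory-budget point concrete, here is a back-of-the-envelope sketch of the ZeRO paper's accounting (Rajbhandari et al. 2020): under mixed-precision Adam, each parameter costs roughly 16 bytes of model state, and ZeRO stages 1–3 shard progressively more of that state across GPUs. The function below is my own illustrative approximation, not DeepSpeed code, and it ignores activations and communication buffers.

```python
# Model-state bytes per parameter under mixed-precision Adam (ZeRO paper):
# fp16 params (2 B) + fp16 grads (2 B) + fp32 master params, momentum,
# and variance (4 + 4 + 4 = 12 B) = 16 B per parameter.
BYTES_PER_PARAM = 2 + 2 + 12

def zero_memory_per_gpu(n_params: int, n_gpus: int, stage: int) -> float:
    """Approximate per-GPU model-state memory (GB) under ZeRO stages 0-3.

    Stage 1 shards optimizer states (12 B), stage 2 additionally shards
    gradients (2 B), stage 3 additionally shards the fp16 parameters (2 B).
    Activations and temporary buffers are not modeled.
    """
    opt, grad, param = 12.0, 2.0, 2.0
    if stage >= 1:
        opt /= n_gpus
    if stage >= 2:
        grad /= n_gpus
    if stage >= 3:
        param /= n_gpus
    return n_params * (opt + grad + param) / 1e9

# A 7.5B-parameter model needs 7.5e9 * 16 B = 120 GB of model state on one
# GPU -- far beyond any single device. Sharded with ZeRO-3 over 64 GPUs,
# the same states shrink to 7.5e9 * (16/64) B = ~1.9 GB per GPU.
print(f"baseline, 1 GPU:   {zero_memory_per_gpu(7_500_000_000, 1, 0):.0f} GB")
print(f"ZeRO-3, 64 GPUs: {zero_memory_per_gpu(7_500_000_000, 64, 3):.2f} GB")
```

This is why sharding beats plain data parallelism for a fixed memory budget: ordinary data-parallel training replicates all 16 B/param on every GPU, while ZeRO divides most of it by the world size.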
- Advanced GPU Optimized Training — PyTorch Lightning 1.4.0dev documentation
- ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters - Microsoft Research
- Fit More and Train Faster With ZeRO via DeepSpeed and FairScale
- Model Parallelism and Big Models · Issue #8771 · huggingface/transformers
- Training 10x Larger Models and Accelerating Training on a Single GPU with ZeRO-Offloading
- microsoft/DeepSpeed: DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.
- Exploring the Limits of Concurrency in ML Training on Google TPUs, Kumar et al. (2020) (BERT in 23s on a TPU-4096; “We view the current competition in language understanding as a modern-day Space Race, with competing organizations assembling both giant machines and giant models in the quest for an Artificial General Intelligence breakthrough.”)
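In practice, the ZeRO features linked above are switched on through DeepSpeed's JSON config. The sketch below shows a minimal config enabling ZeRO stage 2 plus CPU optimizer offload (the ZeRO-Offload path); the field names follow DeepSpeed's config schema, but the batch size and offload choice are illustrative placeholders, not recommendations.

```python
import json

# Minimal DeepSpeed config sketch: ZeRO stage 2 (shard optimizer states and
# gradients across ranks) with optimizer states offloaded to host CPU RAM.
ds_config = {
    "train_batch_size": 32,          # placeholder; tune for your hardware
    "fp16": {"enabled": True},       # mixed precision, as assumed by ZeRO's
                                     # 16-bytes-per-parameter accounting
    "zero_optimization": {
        "stage": 2,                  # 1 = opt states, 2 = + grads, 3 = + params
        "offload_optimizer": {"device": "cpu"},  # ZeRO-Offload to host memory
    },
}
print(json.dumps(ds_config, indent=2))
```

A config like this would typically be handed to `deepspeed.initialize(...)` (or to PyTorch Lightning's DeepSpeed integration) alongside the model; consult the DeepSpeed documentation for the exact call, since the snippet above only constructs the config dict.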
Kumar, Sameer, James Bradbury, Cliff Young, Yu Emma Wang, Anselm Levskaya, Blake Hechtman, Dehao Chen, et al. 2020. “Exploring the Limits of Concurrency in ML Training on Google TPUs.” November 6, 2020. http://arxiv.org/abs/2011.03641.
Rajbhandari, Samyam, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.” May 13, 2020. http://arxiv.org/abs/1910.02054.
Rajbhandari, Samyam, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. 2021. “ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning.” April 15, 2021. http://arxiv.org/abs/2104.07857.
Rasley, Jeff, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. “DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters.” In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 3505–6. KDD ’20. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3394486.3406703.
Ren, Jie, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. “ZeRO-Offload: Democratizing Billion-Scale Model Training.” January 17, 2021. http://arxiv.org/abs/2101.06840.
Tang, Hanlin, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, and Yuxiong He. 2021. “1-Bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed.” June 29, 2021. http://arxiv.org/abs/2102.02888.
Zhang, Minjia, and Yuxiong He. 2020. “Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping.” Advances in Neural Information Processing Systems 33: 14011–23. http://arxiv.org/abs/2010.13369.