Gradient descent, first-order, stochastic

a.k.a. SGD, as seen in deep learning



Stochastic optimization, uses noisy (possibly approximate) 1st-order gradient information to find the argument which minimises

\[ x^*=\operatorname{argmin}_{\mathbf{x}} f(x) \]

for some an objective function \(f:\mathbb{R}^n\to\mathbb{R}\).

That this works with little fuss in very high dimensions is a major pillar of deep learning.

The original version, given in terms of root finding, is (Herbert Robbins and Monro 1951) who later generalised analysis in (H. Robbins and Siegmund 1971), using martingale arguments to analyze convergence. There is some historical context in (Lai 2003) which puts it all in context. That article was written before the current craze for SGD in deep learning; after 2013 or so the problem is rather that there is so much information on the method that the challenge becomes sifting out the AI hype from the useful.

I recommend Francis Bach’s Sum of geometric series trick as an introduction to showing things advanced things about SGD using elementary tools.

🏗

Zeyuan Allen-Zhu : Faster Than SGD 1: Variance Reduction:

SGD is well-known for large-scale optimization. In my mind, there are two (and only two) fundamental improvements since the original introduction of SGD: (1) variance reduction, and (2) acceleration. In this post I’d love to conduct a survey regarding (1),

Francesco Orabana on how to prove SGD converges

to balance the universe of first-order methods, I decided to show how to easily prove the convergence of the iterates in SGD, even in unbounded domains.

Variance-reduced

TBD

Normalized

Zhiyuan Li and Sanjeev Arora argue:

You may remember our previous blog post showing that it is possible to do state-of-the-art deep learning with learning rate that increases exponentially during training. It was meant to be a dramatic illustration that what we learned in optimization classes and books isn’t always a good fit for modern deep learning, specifically, normalized nets, which is our term for nets that use any one of popular normalization schemes,e.g. BatchNorm (BN), GroupNorm (GN), WeightNorm (WN). Today’s post (based upon our paper with Kaifeng Lyu at NeurIPS20) identifies other surprising incompatibilities between normalized nets and traditional analyses.

Sundry Hacks

Yellowfin an automatic SGD momentum tuner

Mini-batch and stochastic methods for minimising loss when you have a lot of data, or a lot of parameters, and using it all at once is silly, or when you want to iteratively improve your solution as data comes in, and you have access to a gradient for your loss, ideally automatically calculated. It’s not clear at all that it should work, except by collating all your data and optimising offline, except that much of modern machine learning shows that it does.

Sometimes this apparently stupid trick it might even be fast for small-dimensional cases, so you may as well try.

Technically, “online” optimisation in bandit/RL problems might imply that you have to “minimise regret online”, which has a slightly different meaning and, e.g. involves seeing each training only as it arrives along some notional arrow of time, yet wishing to make the “best” decision at the next time, and possibly choosing your next experiment in order to trade-off exploration versus exploitation etc.

In SGD you can see your data as often as you want and in whatever order, but you only look at a bit at a time. Usually the data is given and predictions make no difference to what information is available to you.

Some of the same technology pops up in each of these notions of online optimisation, but I am really thinking about SGD here.

There are many more permutations and variations used in practice.

References

Bach, Francis R., and Eric Moulines. 2013. “Non-Strongly-Convex Smooth Stochastic Approximation with Convergence Rate O(1/n).” In, 773–81. https://arxiv.org/abs/1306.2119v1.
Bach, Francis, and Eric Moulines. 2011. “Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning.” In Advances in Neural Information Processing Systems (NIPS), –. Spain. http://hal.archives-ouvertes.fr/hal-00608041.
Benaïm, Michel. 1999. “Dynamics of Stochastic Approximation Algorithms.” In Séminaire de Probabilités de Strasbourg, 33:1–68. Lecture Notes in Math. Berlin: Springer - Lecture Notes in Mathematics. https://doi.org/10.1007/BFb0096509.
Bensoussan, Alain, Yiqun Li, Dinh Phan Cao Nguyen, Minh-Binh Tran, Sheung Chi Phillip Yam, and Xiang Zhou. 2020. “Machine Learning and Control Theory.” June 9, 2020. http://arxiv.org/abs/2006.05604.
Botev, Zdravko I., and Chris J. Lloyd. 2015. “Importance Accelerated Robbins-Monro Recursion with Applications to Parametric Confidence Limits.” Electronic Journal of Statistics 9 (2): 2058–75. https://doi.org/10.1214/15-EJS1071.
Bottou, Léon. 1991. “Stochastic Gradient Learning in Neural Networks.” In Proceedings of Neuro-Nîmes 91. Nimes, France: EC2. http://leon.bottou.org/papers/bottou-91c.
———. 1998. “Online Algorithms and Stochastic Approximations.” In Online Learning and Neural Networks, edited by David Saad, 17:142. Cambridge, UK: Cambridge University Press. http://leon.bottou.org/publications/pdf/online-1998.pdf.
———. 2010. “Large-Scale Machine Learning with Stochastic Gradient Descent.” In Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010), 177–86. Paris, France: Springer. http://leon.bottou.org/papers/bottou-2010.
Bottou, Léon, and Olivier Bousquet. 2008. “The Tradeoffs of Large Scale Learning.” In Advances in Neural Information Processing Systems, edited by J. C. Platt, D. Koller, Y. Singer, and S. Roweis, 20:161–68. NIPS Foundation (http://books.nips.cc). http://leon.bottou.org/papers/bottou-bousquet-2008.
Bottou, Léon, Frank E. Curtis, and Jorge Nocedal. 2016. “Optimization Methods for Large-Scale Machine Learning.” June 15, 2016. http://arxiv.org/abs/1606.04838.
Bottou, Léon, and Yann LeCun. 2004. “Large Scale Online Learning.” In Advances in Neural Information Processing Systems 16, edited by Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf. Cambridge, MA: MIT Press. http://leon.bottou.org/papers/bottou-lecun-2004.
Bubeck, Sébastien. 2015. Convex Optimization: Algorithms and Complexity. Vol. 8. Foundations and Trends in Machine Learning. Now Publishers. https://doi.org/10.1561/2200000050.
Cevher, Volkan, Stephen Becker, and Mark Schmidt. 2014. “Convex Optimization for Big Data.” IEEE Signal Processing Magazine 31 (5): 32–43. https://doi.org/10.1109/MSP.2014.2329397.
Chen, Xiaojun. 2012. “Smoothing Methods for Nonsmooth, Nonconvex Minimization.” Mathematical Programming 134 (1): 71–99. https://doi.org/10.1007/s10107-012-0569-0.
Domingos, Pedro. 2020. “Every Model Learned by Gradient Descent Is Approximately a Kernel Machine.” November 30, 2020. http://arxiv.org/abs/2012.00152.
Duchi, John, Elad Hazan, and Yoram Singer. 2011. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” Journal of Machine Learning Research 12 (July): 2121–59. http://www.jmlr.org/papers/v12/duchi11a.html.
Friedlander, Michael P., and Mark Schmidt. 2012. “Hybrid Deterministic-Stochastic Methods for Data Fitting.” SIAM Journal on Scientific Computing 34 (3): A1380–1405. https://doi.org/10.1137/110830629.
Ghadimi, Saeed, and Guanghui Lan. 2013a. “Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming.” SIAM Journal on Optimization 23 (4): 2341–68. https://doi.org/10.1137/120880811.
———. 2013b. “Accelerated Gradient Methods for Nonconvex Nonlinear and Stochastic Programming.” October 14, 2013. http://arxiv.org/abs/1310.3787.
Hazan, Elad, Kfir Levy, and Shai Shalev-Shwartz. 2015. “Beyond Convexity: Stochastic Quasi-Convex Optimization.” In Advances in Neural Information Processing Systems 28, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 1594–1602. Curran Associates, Inc. http://papers.nips.cc/paper/5718-beyond-convexity-stochastic-quasi-convex-optimization.pdf.
Heyde, C. C. 1974. “On Martingale Limit Theory and Strong Convergence Results for Stochastic Approximation Procedures.” Stochastic Processes and Their Applications 2 (4): 359–70. https://doi.org/10.1016/0304-4149(74)90004-0.
Hu, Chonghai, Weike Pan, and James T. Kwok. 2009. “Accelerated Gradient Methods for Stochastic Optimization and Online Learning.” In Advances in Neural Information Processing Systems, 781–89. Curran Associates, Inc. http://papers.nips.cc/paper/3817-accelerated-gradient-methods-for-stochastic-optimization-and-online-learning.
Jakovetic, D., J. M. Freitas Xavier, and J. M. F. Moura. 2014. “Convergence Rates of Distributed Nesterov-Like Gradient Methods on Random Networks.” IEEE Transactions on Signal Processing 62 (4): 868–82. https://doi.org/10.1109/TSP.2013.2291221.
Kingma, Diederik, and Jimmy Ba. 2015. “Adam: A Method for Stochastic Optimization.” Proceeding of ICLR. http://arxiv.org/abs/1412.6980.
Lai, Tze Leung. 2003. “Stochastic Approximation.” The Annals of Statistics 31 (2): 391–406. https://doi.org/10.1214/aos/1051027873.
Lee, Jason D., Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. 2017. “First-Order Methods Almost Always Avoid Saddle Points.” October 19, 2017. http://arxiv.org/abs/1710.07406.
Lee, Jason D., Max Simchowitz, Michael I. Jordan, and Benjamin Recht. 2016. “Gradient Descent Converges to Minimizers.” March 4, 2016. http://arxiv.org/abs/1602.04915.
Ma, Siyuan, and Mikhail Belkin. 2017. “Diving into the Shallows: A Computational Perspective on Large-Scale Shallow Learning.” March 30, 2017. http://arxiv.org/abs/1703.10622.
Mairal, Julien. 2013. “Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization.” In Advances in Neural Information Processing Systems, 2283–91. http://papers.nips.cc/paper/5129-stochastic-majorization-minimization-algorithms-for-large-scale-optimization.
Mandt, Stephan, Matthew D. Hoffman, and David M. Blei. 2017. “Stochastic Gradient Descent as Approximate Bayesian Inference.” JMLR, April. http://arxiv.org/abs/1704.04289.
McMahan, H. Brendan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, et al. 2013. “Ad Click Prediction: A View from the Trenches.” In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1222–30. KDD ’13. New York, NY, USA: ACM. https://doi.org/10.1145/2487575.2488200.
Mitliagkas, Ioannis, Ce Zhang, Stefan Hadjis, and Christopher Ré. 2016. “Asynchrony Begets Momentum, with an Application to Deep Learning.” May 31, 2016. http://arxiv.org/abs/1605.09774.
Neu, Gergely. 2021. “Information-Theoretic Generalization Bounds for Stochastic Gradient Descent,” February. https://arxiv.org/abs/2102.00931v1.
Nguyen, Lam M., Jie Liu, Katya Scheinberg, and Martin Takáč. 2017. “Stochastic Recursive Gradient Algorithm for Nonconvex Optimization.” May 20, 2017. http://arxiv.org/abs/1705.07261.
Patel, Vivak. 2017. “On SGD’s Failure in Practice: Characterizing and Overcoming Stalling.” February 1, 2017. http://arxiv.org/abs/1702.00317.
Polyak, B. T., and A. B. Juditsky. 1992. “Acceleration of Stochastic Approximation by Averaging.” SIAM Journal on Control and Optimization 30 (4): 838–55. https://doi.org/10.1137/0330046.
Reddi, Sashank J., Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. 2016. “Stochastic Variance Reduction for Nonconvex Optimization.” In PMLR, 1603:314–23. http://proceedings.mlr.press/v48/reddi16.html.
Robbins, H., and D. Siegmund. 1971. “A Convergence Theorem for Non Negative Almost Supermartingales and Some Applications.” In Optimizing Methods in Statistics, edited by Jagdish S. Rustagi, 233–57. Academic Press. https://doi.org/10.1016/B978-0-12-604550-5.50015-8.
Robbins, Herbert, and Sutton Monro. 1951. “A Stochastic Approximation Method.” The Annals of Mathematical Statistics 22 (3, 3): 400–407. https://doi.org/10.1214/aoms/1177729586.
Ruder, Sebastian. 2016. “An Overview of Gradient Descent Optimization Algorithms.” September 15, 2016. http://arxiv.org/abs/1609.04747.
Sagun, Levent, V. Ugur Guney, Gerard Ben Arous, and Yann LeCun. 2014. “Explorations on High Dimensional Landscapes.” December 20, 2014. http://arxiv.org/abs/1412.6615.
Salimans, Tim, and Diederik P Kingma. 2016. “Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 901–1. Curran Associates, Inc. http://papers.nips.cc/paper/6114-weight-normalization-a-simple-reparameterization-to-accelerate-training-of-deep-neural-networks.pdf.
Shalev-Shwartz, Shai, and Ambuj Tewari. 2011. “Stochastic Methods for L1-Regularized Loss Minimization.” Journal of Machine Learning Research 12 (July): 1865–92. http://www.machinelearning.org/archive/icml2009/papers/262.pdf.
Spall, J. C. 2000. “Adaptive Stochastic Approximation by the Simultaneous Perturbation Method.” IEEE Transactions on Automatic Control 45 (10): 1839–53. https://doi.org/10.1109/TAC.2000.880982.
Şimşekli, Umut, Ozan Sener, George Deligiannidis, and Murat A. Erdogdu. 2020. “Hausdorff Dimension, Stochastic Differential Equations, and Generalization in Neural Networks.” June 16, 2020. http://arxiv.org/abs/2006.09313.
Vishwanathan, S. V. N., Nicol N. Schraudolph, Mark W. Schmidt, and Kevin P. Murphy. 2006. “Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods.” In Proceedings of the 23rd International Conference on Machine Learning.
Xu, Wei. 2011. “Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent.” July 13, 2011. http://arxiv.org/abs/1107.2490.
Zhang, Xiao, Lingxiao Wang, and Quanquan Gu. 2017. “Stochastic Variance-Reduced Gradient Descent for Low-Rank Matrix Recovery from Linear Measurements.” January 2, 2017. http://arxiv.org/abs/1701.00481.
Zinkevich, Martin, Markus Weimer, Lihong Li, and Alex J. Smola. 2010. “Parallelized Stochastic Gradient Descent.” In Advances in Neural Information Processing Systems 23, edited by J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, 2595–2603. Curran Associates, Inc. http://papers.nips.cc/paper/4006-parallelized-stochastic-gradient-descent.pdf.

No comments yet. Why not leave one?

GitHub-flavored Markdown & a sane subset of HTML is supported.