a.k.a. SGD

January 30, 2020 — May 19, 2023

functional analysis
neural nets
optimization
SDEs
stochastic processes

Stochastic optimization uses noisy (possibly approximate) first-order gradient information to find the argument which minimises

$x^*=\operatorname{arg\,min}_{x} f(x)$

for some objective function $f:\mathbb{R}^n\to\mathbb{R}$.
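
In symbols, the basic iteration looks something like the following. The step size $\eta_t$ and the gradient estimate $\hat{g}_t$ are notation assumed here for illustration, and unbiasedness is the simplest case rather than a requirement.

$$x_{t+1} = x_t - \eta_t\, \hat{g}_t(x_t), \qquad \mathbb{E}\left[\hat{g}_t(x)\right] = \nabla f(x).$$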

That this works with little fuss in very high dimensions is a major pillar of deep learning.

1 Basic

The original version, in terms of root finding, is (Robbins and Monro 1951), whose analysis was later generalised in (Robbins and Siegmund 1971), using martingale arguments to analyze convergence. There is some historical background in (Lai 2003). That article was written before the current craze for SGD in deep learning; after 2013 or so the problem is rather that there is so much information on the method that the challenge becomes sifting the useful from the AI hype.
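
To make the root-finding view concrete, here is a minimal sketch of a Robbins-Monro style iteration. The function names, the toy target $h(x) = x - 2$, the unit Gaussian noise and the $1/t$ step schedule are all assumptions chosen for illustration, not anything taken from the papers above.

```python
import random


def robbins_monro(noisy_h, x0, n_steps=20000, a=1.0, seed=0):
    """Seek a root of h(x) = E[noisy_h(x)] using only noisy evaluations."""
    rng = random.Random(seed)
    x = x0
    for t in range(1, n_steps + 1):
        # Steps a/t: their sum diverges while the sum of their squares
        # converges, which is the classical step-size condition.
        step = a / t
        x = x - step * noisy_h(x, rng)
    return x


def noisy_h(x, rng):
    # Hypothetical example: h(x) = x - 2, observed with unit Gaussian noise.
    return (x - 2.0) + rng.gauss(0.0, 1.0)


root = robbins_monro(noisy_h, x0=0.0)
print(root)  # should land close to the true root at 2
```

Taking $h = \nabla f$ turns this into SGD for minimising $f$. With this particular $h$ and $a = 1$ the iterate reduces to 2 minus the running average of the noise, which is why it settles down despite never seeing a clean evaluation.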

I recommend Francis Bach’s Sum of geometric series trick as an introduction to showing advanced things about SGD using elementary tools.

Francesco Orabona on how to prove SGD converges:

to balance the universe of first-order methods, I decided to show how to easily prove the convergence of the iterates in SGD, even in unbounded domains.

3 Variance-reduced

🏗

Zeyuan Allen-Zhu : Faster Than SGD 1: Variance Reduction:

SGD is well-known for large-scale optimization. In my mind, there are two (and only two) fundamental improvements since the original introduction of SGD: (1) variance reduction, and (2) acceleration. In this post I’d love to conduct a survey regarding (1),
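
As a rough illustration of what variance reduction can look like, here is a sketch of an SVRG-style update on a small least-squares problem. SVRG is picked here as one representative variance-reduced method; the synthetic data, step size and epoch count are assumptions for the example, and the helper names are hypothetical.

```python
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def full_gradient(A, b, x):
    """Exact gradient of f(x) = (1/2n) * sum_i (a_i . x - b_i)**2."""
    n, d = len(A), len(x)
    resid = [dot(a, x) - bi for a, bi in zip(A, b)]
    return [sum(r * a[j] for r, a in zip(resid, A)) / n for j in range(d)]


def svrg_least_squares(A, b, step=0.02, n_epochs=30, seed=0):
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    x = [0.0] * d
    for _ in range(n_epochs):
        snapshot = list(x)
        # One full pass: exact gradient and residuals at the snapshot.
        full_grad = full_gradient(A, b, snapshot)
        snap_resid = [dot(a, snapshot) - bi for a, bi in zip(A, b)]
        for _ in range(n):
            i = rng.randrange(n)
            resid_x = dot(A[i], x) - b[i]
            # g_i(x) - g_i(snapshot) + full_grad is an unbiased gradient
            # estimate whose variance shrinks as x and the snapshot both
            # approach the minimiser.
            x = [xj - step * ((resid_x - snap_resid[i]) * aj + gj)
                 for xj, aj, gj in zip(x, A[i], full_grad)]
    return x


# Synthetic data, assumed purely for illustration.
data_rng = random.Random(1)
x_true = [1.0, -2.0, 0.5]
A = [[data_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
b = [dot(a, x_true) + 0.1 * data_rng.gauss(0.0, 1.0) for a in A]

x_hat = svrg_least_squares(A, b)
print(x_hat, full_gradient(A, b, x_hat))
```

The point of the correction term is that a constant step size can be kept without the iterates stalling at a noise floor, at the price of a periodic full pass over the data.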

6 Normalized

You may remember our previous blog post showing that it is possible to do state-of-the-art deep learning with learning rate that increases exponentially during training. It was meant to be a dramatic illustration that what we learned in optimization classes and books isn’t always a good fit for modern deep learning, specifically, normalized nets, which is our term for nets that use any one of popular normalization schemes, e.g. BatchNorm (BN), GroupNorm (GN), WeightNorm (WN). Today’s post (based upon our paper with Kaifeng Lyu at NeurIPS20) identifies other surprising incompatibilities between normalized nets and traditional analyses.

7 In MCMC

See SGMCMC.

In practice, what we do in neural nets is adaptively tune the GD learning rates. See Adaptive SGD.
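
For a flavour of what adaptive tuning means, here is a minimal AdaGrad-style sketch, chosen as one of the simplest adaptive schemes. The badly scaled quadratic, the function names and the hyperparameters are assumptions for illustration, and exact gradients are used so that the behaviour is easy to check; in SGD proper `grad_fn` would return a noisy estimate.

```python
import math


def adagrad(grad_fn, x0, lr=0.5, n_steps=200, eps=1e-8):
    """Per-coordinate step sizes scaled by accumulated squared gradients."""
    x = list(x0)
    g2_sum = [0.0] * len(x)
    for _ in range(n_steps):
        g = grad_fn(x)
        g2_sum = [s + gi * gi for s, gi in zip(g2_sum, g)]
        # Coordinates with historically large gradients get smaller steps.
        x = [xi - lr * gi / (math.sqrt(s) + eps)
             for xi, gi, s in zip(x, g, g2_sum)]
    return x


# Hypothetical test problem: f(x) = 0.5 * (100 * x[0]**2 + x[1]**2).
scales = [100.0, 1.0]


def grad_quadratic(x):
    return [c * xi for c, xi in zip(scales, x)]


x_min = adagrad(grad_quadratic, x0=[1.0, 1.0])
print(x_min)
```

Plain gradient descent with a single step size would need a step below 2/100 to stay stable on the steep coordinate, which makes progress on the shallow one slow; the per-coordinate rescaling sidesteps that trade-off on this toy problem.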

9 Incoming

These are mini-batch and stochastic methods for minimising loss when you have a lot of data, or a lot of parameters, and using it all at once is silly, or when you want to iteratively improve your solution as data comes in, and you have access to a gradient for your loss, ideally automatically calculated. It is not at all obvious that anything short of collating all your data and optimising offline should work, but much of modern machine learning shows that it does.

Sometimes this apparently stupid trick might even be fast for small-dimensional cases, so you may as well try.

Technically, “online” optimisation in bandit/RL problems might imply that you have to “minimise regret online”, which has a slightly different meaning: for example, you see each training example only as it arrives along some notional arrow of time, yet wish to make the “best” decision at the next time step, possibly choosing your next experiment in order to trade off exploration against exploitation.

In SGD you can see your data as often as you want and in whatever order, but you only look at a bit at a time. Usually the data is given and predictions make no difference to what information is available to you.
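
A minimal sketch of that pattern, several shuffled passes over a fixed dataset looking at one small batch at a time, might look like the following. The least-squares loss, synthetic data, batch size and decaying step schedule are all assumptions for illustration.

```python
import random


def minibatch_sgd(A, b, batch_size=10, n_epochs=50, step0=0.2, seed=0):
    """Minibatch SGD sketch for f(x) = (1/2n) * sum_i (a_i . x - b_i)**2."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    x = [0.0] * d
    order = list(range(n))
    t = 0
    for _ in range(n_epochs):
        rng.shuffle(order)  # revisit the data in a fresh order on each pass
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            # Average gradient over the batch: a noisy estimate of the
            # full-data gradient.
            grad = [0.0] * d
            for i in batch:
                resid = sum(aj * xj for aj, xj in zip(A[i], x)) - b[i]
                for j in range(d):
                    grad[j] += resid * A[i][j] / len(batch)
            t += 1
            step = step0 / t ** 0.5  # decaying step size
            x = [xj - step * gj for xj, gj in zip(x, grad)]
    return x


# Synthetic regression data, assumed purely for illustration.
data_rng = random.Random(1)
x_true = [1.0, -2.0, 0.5]
A = [[data_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
b = [sum(aj * xj for aj, xj in zip(a, x_true)) + 0.1 * data_rng.gauss(0.0, 1.0)
     for a in A]

x_hat = minibatch_sgd(A, b)
print(x_hat)  # should be near x_true, up to noise
```

Each update touches only `batch_size` rows, so the per-step cost does not depend on the size of the dataset.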

Some of the same technology pops up in each of these notions of online optimisation, but I am mostly thinking about SGD here.

There are many more permutations and variations used in practice.

10 References

Ahn, Korattikara, and Welling. 2012. In Proceedings of the 29th International Conference on Machine Learning. ICML’12.
Alexos, Boyd, and Mandt. 2022. In Proceedings of the 39th International Conference on Machine Learning.
Arya, Schauer, Schäfer, et al. 2022.
Bach, Francis, and Moulines. 2011. In Advances in Neural Information Processing Systems (NIPS).
Bach, Francis R., and Moulines. 2013. In arXiv:1306.2119 [Cs, Math, Stat].
Benaïm. 1999. In Séminaire de Probabilités de Strasbourg. Lecture Notes in Math.
Bensoussan, Li, Nguyen, et al. 2020. arXiv:2006.05604 [Cs, Math, Stat].
Botev, and Lloyd. 2015. Electronic Journal of Statistics.
Bottou. 1991. In Proceedings of Neuro-Nîmes 91.
———. 1998. In Online Learning and Neural Networks.
———. 2010. In Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010).
Bottou, and Bousquet. 2008. In Advances in Neural Information Processing Systems.
Bottou, Curtis, and Nocedal. 2016. arXiv:1606.04838 [Cs, Math, Stat].
Bottou, and LeCun. 2004. In Advances in Neural Information Processing Systems 16.
Bubeck. 2015. Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning.
Cevher, Becker, and Schmidt. 2014. IEEE Signal Processing Magazine.
Chaudhari, Choromanska, Soatto, et al. 2017.
Chen, Xiaojun. 2012. Mathematical Programming.
Chen, Tianqi, Fox, and Guestrin. 2014. In Proceedings of the 31st International Conference on Machine Learning.
Chen, Zaiwei, Mou, and Maguluri. 2021.
Di Giovanni, Rowbottom, Chamberlain, et al. 2022.
Domingos. 2020. arXiv:2012.00152 [Cs, Stat].
Duchi, Hazan, and Singer. 2011. Journal of Machine Learning Research.
Feng, and Tu. 2021. Proceedings of the National Academy of Sciences.
Friedlander, and Schmidt. 2012. SIAM Journal on Scientific Computing.
Ghadimi, and Lan. 2013a. SIAM Journal on Optimization.
———. 2013b. arXiv:1310.3787 [Math].
Goh. 2017. Distill.
Hazan, Levy, and Shalev-Shwartz. 2015. In Advances in Neural Information Processing Systems 28.
Heyde. 1974. Stochastic Processes and Their Applications.
Hu, Pan, and Kwok. 2009. In Advances in Neural Information Processing Systems.
Jakovetic, Freitas Xavier, and Moura. 2014. IEEE Transactions on Signal Processing.
Kidambi, Netrapalli, Jain, et al. 2023.
Kingma, and Ba. 2015. In Proceedings of ICLR.
Lai. 2003. The Annals of Statistics.
Lee, Panageas, Piliouras, et al. 2017. arXiv:1710.07406 [Cs, Math, Stat].
Lee, Simchowitz, Jordan, et al. 2016. arXiv:1602.04915 [Cs, Math, Stat].
Liu, and Wang. 2019. In Advances In Neural Information Processing Systems.
Ljung, Pflug, and Walk. 1992. Stochastic Approximation and Optimization of Random Systems.
Ma, and Belkin. 2017. arXiv:1703.10622 [Cs, Stat].
Maclaurin, Duvenaud, and Adams. 2015. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics.
Mairal. 2013. In Advances in Neural Information Processing Systems.
Mandt, Hoffman, and Blei. 2017. Journal of Machine Learning Research.
McMahan, Holt, Sculley, et al. 2013. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’13.
Mitliagkas, Zhang, Hadjis, et al. 2016. arXiv:1605.09774 [Cs, Math, Stat].
Neu, Dziugaite, Haghifam, et al. 2021. arXiv:2102.00931 [Cs, Stat].
Nguyen, Liu, Scheinberg, et al. 2017. arXiv:1705.07261 [Cs, Math, Stat].
Patel. 2017. arXiv:1702.00317 [Cs, Math, Stat].
Polyak, and Juditsky. 1992. SIAM Journal on Control and Optimization.
Reddi, Hefny, Sra, et al. 2016. In PMLR.
Robbins, Herbert, and Monro. 1951. The Annals of Mathematical Statistics.
Robbins, H., and Siegmund. 1971. In Optimizing Methods in Statistics.
Ruder. 2017.
Sagun, Guney, Arous, et al. 2014. arXiv:1412.6615 [Cs, Stat].
Salimans, and Kingma. 2016. In Advances in Neural Information Processing Systems 29.
Shalev-Shwartz, and Tewari. 2011. Journal of Machine Learning Research.
Şimşekli, Sener, Deligiannidis, et al. 2020. CoRR.
Smith, Dherin, Barrett, et al. 2020.
Spall. 2000. IEEE Transactions on Automatic Control.
Sun, Yang, Xun, et al. 2023. ACM Transactions on Knowledge Discovery from Data.
Vishwanathan, Schraudolph, Schmidt, et al. 2006. “Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods.” In Proceedings of the 23rd International Conference on Machine Learning.
Welling, and Teh. 2011. In Proceedings of the 28th International Conference on Machine Learning. ICML’11.
Wright, and Recht. 2021. Optimization for Data Analysis.
Xu. 2011. arXiv:1107.2490 [Cs].
Zhang, Jian, and Mitliagkas. 2018.
Zhang, Xiao, Wang, and Gu. 2017. arXiv:1701.00481 [Stat].
Zinkevich, Weimer, Li, et al. 2010. In Advances in Neural Information Processing Systems 23.