Gradient descent

First order of business



Gradient descent, a classic first order optimisation], with many variants, and many things one might wish to understand.

There are only few things I wish to understand for the moment

Coordinate descent

Descent each coordinate individually.

Small clever hack for certain domains: log gradient descent.

Accelerated

How and when does it work? and how well? Moritz Hardt, The zen of gradient descent explains it through Chebychev polynomials. Cheng-Soon Ong recommends d’Aspremont, Scieur, and Taylor (2021) as an overview.

Sebastian Bubeck explains it from a different angle, Revisiting Nesterov’s Acceleration to expand upon the rather magical introduction given in his lecture Wibisono et al explain it in terms of variational approximation See also Accelerated gradient descent 1 and 2.

Continuous approximations of iterations

Recent papers (Wibisono and Wilson 2015; Wibisono, Wilson, and Jordan 2016) argue that the discrete time steps can be viewed as a discrete approximation to a continuous time ODE which approaches the optimum (which in itself is trivial), but moreover that many algorithms fit into the same families of ODEs, that these ODEs explain Nesterov acceleration and generate new, improved optimisation methods. (which is not trivial.)

🏗

Online versus stochastic

Technically, “online” optimisation in, say, bandit/RL problems might imply that you have to “minimise regret online”, which has a slightly different meaning and, e.g. involves seeing each training only as it arrives along some notional arrow of time, yet wishing to make the “best” decision at the next time, and possibly choosing your next experiment in order to trade-off exploration versus exploitation etc.

In SGD you can see your data as often as you want and in whatever order, but you only look at a bit at a time. Usually the data is given and predictions make no difference to what information is available to you.

Some of the same technology pops up in each of these notions of online optimisation, but I am really thinking about SGD here.

There are many more permutations and variations used in practice.

Conditional Gradient

a.k.a. Frank-Wolfe algorithm: Don’t know much about this.

References

Agarwal, Alekh, Olivier Chapelle, Miroslav Dudık, and John Langford. 2014. “A Reliable Effective Terascale Linear Learning System.” Journal of Machine Learning Research 15 (1): 1111–33. http://www.jmlr.org/papers/volume15/agarwal14a/agarwal14a.pdf.
Allen-Zhu, Zeyuan, and Elad Hazan. 2016. “Optimal Black-Box Reductions Between Optimization Objectives.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 1606–14. Curran Associates, Inc. http://papers.nips.cc/paper/6364-optimal-black-box-reductions-between-optimization-objectives.pdf.
Allen-Zhu, Zeyuan, David Simchi-Levi, and Xinshang Wang. 2019. “The Lingering of Gradients: How to Reuse Gradients over Time.” arXiv:1901.02871 [cs, Math, Stat], January. http://arxiv.org/abs/1901.02871.
Andersson, Joel A. E., Joris Gillis, Greg Horn, James B. Rawlings, and Moritz Diehl. 2019. “CasADi: A Software Framework for Nonlinear Optimization and Optimal Control.” Mathematical Programming Computation 11 (1): 1–36. https://doi.org/10.1007/s12532-018-0139-4.
Aspremont, Alexandre d’, Damien Scieur, and Adrien Taylor. 2021. “Acceleration Methods.” arXiv:2101.09545 [cs, Math], January. http://arxiv.org/abs/2101.09545.
Beck, Amir, and Marc Teboulle. 2003. “Mirror Descent and Nonlinear Projected Subgradient Methods for Convex Optimization.” Operations Research Letters 31 (3): 167–75. https://doi.org/10.1016/S0167-6377(02)00231-6.
———. 2009. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems.” SIAM Journal on Imaging Sciences 2 (1): 183–202. https://doi.org/10.1137/080716542.
Betancourt, Michael, Michael I. Jordan, and Ashia C. Wilson. 2018. “On Symplectic Optimization.” arXiv:1802.03653 [stat], February. http://arxiv.org/abs/1802.03653.
Botev, Aleksandar, Guy Lever, and David Barber. 2016. “Nesterov’s Accelerated Gradient and Momentum as Approximations to Regularised Update Descent.” arXiv:1607.01981 [cs, Stat], July. http://arxiv.org/abs/1607.01981.
Bubeck, Sébastien. 2015. Convex Optimization: Algorithms and Complexity. Vol. 8. Foundations and Trends in Machine Learning. Now Publishers. https://doi.org/10.1561/2200000050.
Chen, Xiaojun. 2012. “Smoothing Methods for Nonsmooth, Nonconvex Minimization.” Mathematical Programming 134 (1): 71–99. https://doi.org/10.1007/s10107-012-0569-0.
Choromanska, Anna, MIkael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. 2015. “The Loss Surfaces of Multilayer Networks.” In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 192–204. http://proceedings.mlr.press/v38/choromanska15.html.
Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien. 2014. “SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives.” In Advances in Neural Information Processing Systems 27. http://arxiv.org/abs/1407.0202.
DeVore, Ronald A. 1998. “Nonlinear Approximation.” Acta Numerica 7 (January): 51–150. https://doi.org/10.1017/S0962492900002816.
Domingos, Pedro. 2020. “Every Model Learned by Gradient Descent Is Approximately a Kernel Machine.” arXiv:2012.00152 [cs, Stat], November. http://arxiv.org/abs/2012.00152.
Hinton, Geoffrey, Nitish Srivastava, and Kevin Swersky. n.d. “Neural Networks for Machine Learning.”
Jakovetic, D., J.M. Freitas Xavier, and J.M.F. Moura. 2014. “Convergence Rates of Distributed Nesterov-Like Gradient Methods on Random Networks.” IEEE Transactions on Signal Processing 62 (4): 868–82. https://doi.org/10.1109/TSP.2013.2291221.
Langford, John, Lihong Li, and Tong Zhang. 2009. “Sparse Online Learning via Truncated Gradient.” In Advances in Neural Information Processing Systems 21, edited by D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, 905–12. Curran Associates, Inc. http://papers.nips.cc/paper/3585-sparse-online-learning-via-truncated-gradient.pdf.
Lee, Jason D., Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. 2017. “First-Order Methods Almost Always Avoid Saddle Points.” arXiv:1710.07406 [cs, Math, Stat], October. http://arxiv.org/abs/1710.07406.
Lee, Jason D., Max Simchowitz, Michael I. Jordan, and Benjamin Recht. 2016. “Gradient Descent Converges to Minimizers.” arXiv:1602.04915 [cs, Math, Stat], March. http://arxiv.org/abs/1602.04915.
Ma, Siyuan, and Mikhail Belkin. 2017. “Diving into the Shallows: A Computational Perspective on Large-Scale Shallow Learning.” arXiv:1703.10622 [cs, Stat], March. http://arxiv.org/abs/1703.10622.
Mandt, Stephan, Matthew D. Hoffman, and David M. Blei. 2017. “Stochastic Gradient Descent as Approximate Bayesian Inference.” JMLR, April. http://arxiv.org/abs/1704.04289.
Nesterov, Y. 2012. “Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems.” SIAM Journal on Optimization 22 (2): 341–62. https://doi.org/10.1137/100802001.
Nesterov, Yu. 2012. “Gradient Methods for Minimizing Composite Functions.” Mathematical Programming 140 (1): 125–61. https://doi.org/10.1007/s10107-012-0629-5.
Nocedal, Jorge, and S. Wright. 2006. Numerical Optimization. 2nd ed. Springer Series in Operations Research and Financial Engineering. New York: Springer-Verlag. https://www.springer.com/gp/book/9780387303031.
Ruder, Sebastian. 2016. “An Overview of Gradient Descent Optimization Algorithms.” arXiv:1609.04747 [cs], September. http://arxiv.org/abs/1609.04747.
Sagun, Levent, V. Ugur Guney, Gerard Ben Arous, and Yann LeCun. 2014. “Explorations on High Dimensional Landscapes.” arXiv:1412.6615 [cs, Stat], December. http://arxiv.org/abs/1412.6615.
Wainwright, Martin J. 2014. “Structured Regularizers for High-Dimensional Problems: Statistical and Computational Issues.” Annual Review of Statistics and Its Application 1 (1): 233–53. https://doi.org/10.1146/annurev-statistics-022513-115643.
Wibisono, Andre, and Ashia C. Wilson. 2015. “On Accelerated Methods in Optimization.” arXiv:1509.03616 [math], September. http://arxiv.org/abs/1509.03616.
Wibisono, Andre, Ashia C. Wilson, and Michael I. Jordan. 2016. “A Variational Perspective on Accelerated Methods in Optimization.” Proceedings of the National Academy of Sciences 113 (47): E7351–58. https://doi.org/10.1073/pnas.1614734113.
Zinkevich, Martin. 2003. “Online Convex Programming and Generalized Infinitesimal Gradient Ascent.” In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, 928–35. ICML’03. Washington, DC, USA: AAAI Press. http://dl.acm.org/citation.cfm?id=3041838.3041955.

No comments yet. Why not leave one?

GitHub-flavored Markdown & a sane subset of HTML is supported.