October 4, 2014 — August 24, 2023

functional analysis
optimization
statmech

Gradient descent, a classic first-order optimisation method, with many variants and many things one might wish to understand.

There are only a few things I wish to understand for the moment.

Very tidy introduction in Anupam Gupta’s notes for 15-850: CMU Advanced Algorithms, Fall 2020, in particular Lectures 18 and 19.

## 1 Coordinate descent

Descend along each coordinate individually.
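A minimal sketch of the idea, assuming a toy strongly convex quadratic $f(x) = \tfrac12 x^\top A x - b^\top x$ where each one-dimensional subproblem can be solved exactly. The function name `coordinate_descent` and the matrix `A` and vector `b` are made up for illustration. On this objective, cyclic exact coordinate minimisation coincides with Gauss-Seidel iteration.

```python
def coordinate_descent(A, b, x0, n_sweeps=50):
    """Minimise 0.5 * x^T A x - b^T x (A symmetric positive definite)
    by exactly minimising over one coordinate at a time, in cyclic order."""
    x = list(x0)
    n = len(x)
    for _ in range(n_sweeps):
        for i in range(n):
            # Set df/dx_i = sum_j A[i][j] * x[j] - b[i] to zero and solve for
            # x[i], holding every other coordinate fixed.
            off_diag = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - off_diag) / A[i][i]
    return x


# Hypothetical 2x2 example; the exact minimiser solves A x = b.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]
x_cd = coordinate_descent(A, b, x0=[0.0, 0.0])
```

For objectives without a closed-form coordinate update, one would typically take a gradient step along the chosen coordinate instead.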

Small clever hack for certain domains: log gradient descent.
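The note does not spell out what the hack is, so the following is only one plausible reading and should be treated as an assumption: for a parameter constrained to be positive, write $x = e^u$ and run ordinary gradient descent on $u$. The chain rule gives $\mathrm{d}f/\mathrm{d}u = x f'(x)$, and the iterate stays positive by construction. The function name and the toy objective $f(x) = x - \log x$ are hypothetical.

```python
import math


def log_space_gradient_descent(grad, x0, lr=0.1, n_steps=500):
    """Gradient descent on u = log(x), so that x = exp(u) stays positive.

    `grad` is the derivative of the objective with respect to x.
    """
    u = math.log(x0)
    for _ in range(n_steps):
        x = math.exp(u)
        # Chain rule: df/du = (df/dx) * (dx/du) = grad(x) * x.
        u -= lr * x * grad(x)
    return math.exp(u)


# Toy objective f(x) = x - log(x) on x > 0, minimised at x = 1;
# its derivative is f'(x) = 1 - 1/x.
x_pos = log_space_gradient_descent(lambda x: 1.0 - 1.0 / x, x0=5.0)
```

In terms of $x$ the update is multiplicative, $x \leftarrow x \exp(-\eta\, x f'(x))$, which is why no projection back onto the positive half-line is needed.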

## 2 Momentum

Polyak momentum (that’s the heavy ball one, right?), Nesterov momentum.

How and when does it work, and how well? Moritz Hardt, The zen of gradient descent, explains it through Chebyshev polynomials. Cheng-Soon Ong recommends d’Aspremont, Scieur, and Taylor (2021) as an overview. Gabriel Goh, Why Momentum Really Works (Goh 2017), is an incredible illustrated guide.

Sebastian Bubeck explains it from a different angle in Revisiting Nesterov’s Acceleration, expanding upon the rather magical introduction given in his lecture. Wibisono et al. explain it in terms of variational approximation. See also Accelerated gradient descent 1 and 2.

Trung Vu’s Convergence of Heavy-Ball Method and Nesterov’s Accelerated Gradient on Quadratic Optimization differentiates Nesterov momentum from heavy ball momentum.
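To make the difference between the two momentum schemes concrete, here is a small sketch on an ill-conditioned diagonal quadratic. Everything here is an illustrative assumption: the function names, the curvatures `MU` and `L`, and the step sizes, which are the textbook tunings for a quadratic with known extreme curvatures. The only structural difference is where the gradient is evaluated: heavy ball uses the current iterate, Nesterov uses the extrapolated look-ahead point.

```python
import math

MU, L = 1.0, 100.0  # smallest and largest curvature of the toy quadratic


def grad(x):
    """Gradient of f(x) = 0.5 * (MU * x[0]**2 + L * x[1]**2)."""
    return [MU * x[0], L * x[1]]


def gradient_descent(grad, x0, lr, n_steps):
    x = list(x0)
    for _ in range(n_steps):
        x = [xi - lr * gi for xi, gi in zip(x, grad(x))]
    return x


def heavy_ball(grad, x0, lr, beta, n_steps):
    """Polyak momentum: gradient taken at the current iterate."""
    x, x_prev = list(x0), list(x0)
    for _ in range(n_steps):
        g = grad(x)
        x_next = [xi - lr * gi + beta * (xi - pi)
                  for xi, gi, pi in zip(x, g, x_prev)]
        x_prev, x = x, x_next
    return x


def nesterov(grad, x0, lr, beta, n_steps):
    """Nesterov momentum: gradient taken at the look-ahead point y."""
    x, x_prev = list(x0), list(x0)
    for _ in range(n_steps):
        y = [xi + beta * (xi - pi) for xi, pi in zip(x, x_prev)]
        x_prev, x = x, [yi - lr * gi for yi, gi in zip(y, grad(y))]
    return x


sq_mu, sq_l = math.sqrt(MU), math.sqrt(L)
x0, n_steps = [1.0, 1.0], 200

x_gd = gradient_descent(grad, x0, lr=1.0 / L, n_steps=n_steps)
x_hb = heavy_ball(grad, x0, lr=4.0 / (sq_l + sq_mu) ** 2,
                  beta=((sq_l - sq_mu) / (sq_l + sq_mu)) ** 2, n_steps=n_steps)
x_nag = nesterov(grad, x0, lr=1.0 / L,
                 beta=(sq_l - sq_mu) / (sq_l + sq_mu), n_steps=n_steps)

# Distance to the minimiser at the origin for each method.
err_gd, err_hb, err_nag = (math.hypot(*x) for x in (x_gd, x_hb, x_nag))
```

With these tunings both momentum variants should end up many orders of magnitude closer to the optimum than plain gradient descent after the same number of steps. The tunings depend on knowing `MU` and `L`, and the heavy-ball guarantee in particular is specific to quadratics.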

## 3 Continuous approximations of iterations

Recent papers argue that the discrete time steps can be viewed as a discrete approximation to a continuous-time ODE which approaches the optimum (which in itself is trivial). Moreover, they argue that many algorithms fit into the same families of ODEs, and that these ODEs explain Nesterov acceleration and generate new, improved optimisation methods (which is not trivial).
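The "trivial" half of that claim can be checked in a few lines. The sketch below assumes the simplest such ODE, gradient flow $\dot x = -\nabla f(x)$, on a one-dimensional quadratic where the exact trajectory is known. A forward-Euler step of size `h` is literally a gradient-descent step with learning rate `h`. The function name and constants are invented for the example, and the accelerated ODEs discussed in the papers are not covered here.

```python
import math


def euler_gradient_flow(grad, x0, h, n_steps):
    """Forward-Euler integration of dx/dt = -grad(x).

    Each Euler step is exactly one gradient-descent step with learning rate h.
    """
    x = x0
    for _ in range(n_steps):
        x = x + h * (-grad(x))
    return x


a = 2.0  # curvature of the toy objective f(x) = 0.5 * a * x**2
t_final = 1.0
# Gradient flow on this objective has the closed form x(t) = x0 * exp(-a * t).
exact = 1.0 * math.exp(-a * t_final)

# Same total time, different step sizes.
coarse = euler_gradient_flow(lambda x: a * x, 1.0, h=0.1, n_steps=10)
fine = euler_gradient_flow(lambda x: a * x, 1.0, h=0.001, n_steps=1000)
```

Shrinking the step brings the discrete iterates closer to the continuous trajectory, which is the sense in which gradient descent discretises the flow.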

🏗

## 4 Online versus stochastic

Technically, “online” optimisation in, say, bandit/RL problems might imply that you have to “minimise regret online”, which has a slightly different meaning. For example, it involves seeing each training example only as it arrives along some notional arrow of time, yet wishing to make the “best” decision at the next time step, and possibly choosing your next experiment in order to trade off exploration against exploitation.
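A sketch of the regret-minimising flavour, without any exploration: online gradient descent on a stream of squared losses $f_t(x) = (x - z_t)^2$, where each $z_t$ is seen once and only after committing to $x_t$. The stream, the function name and the step size $1/(2t)$ are assumptions chosen for illustration. With that step size the iterate is the running mean of the data seen so far.

```python
import random


def online_gradient_descent(stream):
    """Play x_t, then observe z_t, suffer (x_t - z_t)**2, and update once."""
    x = 0.0
    total_loss = 0.0
    for t, z in enumerate(stream, start=1):
        total_loss += (x - z) ** 2      # loss is charged before z is used
        grad = 2.0 * (x - z)
        x -= grad / (2.0 * t)           # decaying step size 1 / (2 t)
    return x, total_loss


rng = random.Random(0)
zs = [rng.random() for _ in range(1000)]  # hypothetical data stream in [0, 1)

x_final, loss_online = online_gradient_descent(zs)

# Regret compares against the best single decision chosen in hindsight,
# which for squared loss is the mean of the stream.
best_fixed = sum(zs) / len(zs)
loss_best = sum((best_fixed - z) ** 2 for z in zs)
regret = loss_online - loss_best
```

The quantity of interest is `regret`, not the final iterate: for this strongly convex loss it grows only logarithmically in the length of the stream, so the average regret per round goes to zero.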

In SGD you can see your data as often as you want and in whatever order, but you only look at a bit at a time. Usually the data is given and predictions make no difference to what information is available to you.
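For contrast, a minimal minibatch SGD sketch on a fixed dataset, assuming a one-parameter least-squares problem. The synthetic data, the true slope of 3, the function name and all hyperparameters are invented for the example. The data are revisited every epoch in a freshly shuffled order, but each update only looks at a small batch.

```python
import random


def sgd_linear_regression(xs, ys, lr=0.05, n_epochs=20, batch_size=10, seed=0):
    """Fit y ~ w * x by minibatch SGD on the mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(n_epochs):
        rng.shuffle(indices)  # same data every epoch, new order each time
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the batch-mean squared error with respect to w.
            g = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * g
    return w


# Synthetic dataset, fixed up front: y = 3 x + small Gaussian noise.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [3.0 * x + data_rng.gauss(0.0, 0.1) for x in xs]

w_hat = sgd_linear_regression(xs, ys)
```

Nothing the optimiser does changes which data are available, which is the distinction from the regret-minimising setting.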

Some of the same technology pops up in each of these notions of online optimisation, but I am really thinking about SGD here.

There are many more permutations and variations used in practice.

## 6 Mirror descent

See mirror descent.

## 7 References

Agarwal, Chapelle, Dudík, et al. 2014. Journal of Machine Learning Research.
Allen-Zhu, and Hazan. 2016. In Advances in Neural Information Processing Systems 29.
Allen-Zhu, Simchi-Levi, and Wang. 2019. arXiv:1901.02871 [Cs, Math, Stat].
Andersson, Gillis, Horn, et al. 2019. Mathematical Programming Computation.
Bansal, and Gupta. 2019.
Beck, and Teboulle. 2003. Operations Research Letters.
———. 2009. SIAM Journal on Imaging Sciences.
Betancourt, Jordan, and Wilson. 2018. arXiv:1802.03653 [Stat].
Botev, Lever, and Barber. 2016. arXiv:1607.01981 [Cs, Stat].
Bubeck. 2015. Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning.
———. 2019. The Five Miracles of Mirror Descent.
Chen. 2012. Mathematical Programming.
Choromanska, Henaff, Mathieu, et al. 2015. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics.
d’Aspremont, Scieur, and Taylor. 2021. arXiv:2101.09545 [Cs, Math].
Defazio, Bach, and Lacoste-Julien. 2014. In Advances in Neural Information Processing Systems 27.
DeVore. 1998. Acta Numerica.
Domingos. 2020. arXiv:2012.00152 [Cs, Stat].
Goh. 2017. Distill.
Hinton, Srivastava, and Swersky. n.d. “Neural Networks for Machine Learning.”
Jacobsen, and Cutkosky. 2022.
Jakovetic, Freitas Xavier, and Moura. 2014. IEEE Transactions on Signal Processing.
Langford, Li, and Zhang. 2009. In Advances in Neural Information Processing Systems 21.
Lee, Panageas, Piliouras, et al. 2017. arXiv:1710.07406 [Cs, Math, Stat].
Lee, Simchowitz, Jordan, et al. 2016. arXiv:1602.04915 [Cs, Math, Stat].
Ma, and Belkin. 2017. arXiv:1703.10622 [Cs, Stat].
Mandt, Hoffman, and Blei. 2017. JMLR.
Nesterov, Y. 2012. SIAM Journal on Optimization.
Nesterov, Yu. 2012. Mathematical Programming.
Nocedal, and Wright. 2006. Numerical Optimization. Springer Series in Operations Research and Financial Engineering.
Richards, and Rabbat. 2021. arXiv:2101.04968 [Cs, Math, Stat].
Ruder. 2017.
Sagun, Guney, Arous, et al. 2014. arXiv:1412.6615 [Cs, Stat].
Wainwright. 2014. Annual Review of Statistics and Its Application.
Wibisono, and Wilson. 2015. arXiv:1509.03616 [Math].
Wibisono, Wilson, and Jordan. 2016. Proceedings of the National Academy of Sciences.
Wright, and Recht. 2021. Optimization for Data Analysis.
Zinkevich. 2003. In Proceedings of the Twentieth International Conference on Machine Learning. ICML’03.