Gradient flows

infinitesimal optimization



Hinze et al. (2021) depict a mosquito’s gradient flow in a 3D optimisation problem.

Stochastic models of optimisation, especially stochastic gradient descent.

Ordinary

Gradient flows can be thought of as the continuous-time limit of gradient descent: there is a (deterministic) ODE corresponding to an infinitesimally small training rate.
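Concretely (a standard sketch; notation mine): for loss $L$ and parameters $\theta$, gradient descent with learning rate $\eta$ iterates

$$\theta_{k+1} = \theta_k - \eta \nabla L(\theta_k),$$

and sending $\eta \to 0$ while holding $t = k\eta$ fixed gives the gradient flow ODE

$$\dot{\theta}(t) = -\nabla L(\theta(t)).$$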

Stochastic DE for early-stage training

SGD can be modelled as an SDE (Ljung, Pflug, and Walk 1992; Mandt, Hoffman, and Blei 2017), which is worth the price of dusting off the old stochastic calculus. Typically this is used for choosing scaling rules for model training (Q. Li, Tai, and Weinan 2019; Z. Li, Malladi, and Arora 2021; Malladi et al. 2022).
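In the stochastic-modified-equation picture (roughly the setup in Q. Li, Tai, and Weinan 2019; notation mine, so treat this as a sketch), minibatch noise enters as a diffusion term whose scale is set by the learning rate:

$$\mathrm{d}\Theta_t = -\nabla L(\Theta_t)\,\mathrm{d}t + \sqrt{\eta}\,\Sigma(\Theta_t)^{1/2}\,\mathrm{d}W_t,$$

where $\Sigma(\theta)$ is the covariance of the minibatch gradient noise and $W_t$ is a standard Wiener process. This SDE is a weak (in-distribution) approximation to the SGD iterates over a finite time horizon, which is what makes it usable for deriving scaling rules.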

Stochastic DE around the optimum

The limiting diffusion describes behaviour around an optimum, i.e. after we have converged. This is interesting for understanding generalisation (Gu et al. 2022; Z. Li, Wang, and Arora 2021; Lyu, Li, and Arora 2023; Wang et al. 2023).
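Near a local minimum $\theta^\ast$ with Hessian $H$, the usual move (e.g. in Mandt, Hoffman, and Blei 2017; again a sketch in the notation above) is to linearise the drift, giving an Ornstein–Uhlenbeck process

$$\mathrm{d}\Theta_t = -H(\Theta_t - \theta^\ast)\,\mathrm{d}t + \sqrt{\eta}\,\Sigma^{1/2}\,\mathrm{d}W_t,$$

whose stationary distribution is Gaussian, centred at $\theta^\ast$, with covariance $S$ solving $HS + SH^\top = \eta\,\Sigma$. The shape of that stationary distribution is what ties step size, batch size, and curvature to generalisation, and it is also the entry point for the posterior-sampling reading below.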

These diffusions also have an interpretation in terms of sampling from a Bayes posterior; see Bayes by Backprop.

References

Ambrosio, Luigi, and Nicola Gigli. 2013. “A User’s Guide to Optimal Transport.” In Modelling and Optimisation of Flows on Networks: Cetraro, Italy 2009, Editors: Benedetto Piccoli, Michel Rascle, edited by Luigi Ambrosio, Alberto Bressan, Dirk Helbing, Axel Klar, and Enrique Zuazua, 1–155. Lecture Notes in Mathematics. Berlin, Heidelberg: Springer.
Ambrosio, Luigi, Nicola Gigli, and Giuseppe Savare. 2008. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. 2nd ed. Lectures in Mathematics. ETH Zürich. Birkhäuser Basel.
Ancona, Marco, Enea Ceolini, Cengiz Öztireli, and Markus Gross. 2017. “Towards Better Understanding of Gradient-Based Attribution Methods for Deep Neural Networks,” November.
Bartlett, Peter L., Andrea Montanari, and Alexander Rakhlin. 2021. “Deep Learning: A Statistical Viewpoint.” Acta Numerica 30 (May): 87–201.
Chizat, Lénaïc, and Francis Bach. 2018. “On the Global Convergence of Gradient Descent for over-Parameterized Models Using Optimal Transport.” In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 3040–50. NIPS’18. Red Hook, NY, USA: Curran Associates Inc.
Chu, Casey, Kentaro Minami, and Kenji Fukumizu. 2022. “The Equivalence Between Stein Variational Gradient Descent and Black-Box Variational Inference.” In, 5.
Di Giovanni, Francesco, James Rowbottom, Benjamin P. Chamberlain, Thomas Markovich, and Michael M. Bronstein. 2022. “Graph Neural Networks as Gradient Flows.” arXiv.
Galy-Fajou, Théo, Valerio Perrone, and Manfred Opper. 2021. “Flexible and Efficient Inference with Particles for the Variational Gaussian Approximation.” Entropy 23 (8): 990.
Garbuno-Inigo, Alfredo, Franca Hoffmann, Wuchen Li, and Andrew M. Stuart. 2020. “Interacting Langevin Diffusions: Gradient Structure and Ensemble Kalman Sampler.” SIAM Journal on Applied Dynamical Systems 19 (1): 412–41.
Gu, Xinran, Kaifeng Lyu, Longbo Huang, and Sanjeev Arora. 2022. “Why (and When) Does Local SGD Generalize Better Than SGD?” In.
Hinze, Annika, Jörgen Lantz, Sharon R. Hill, and Rickard Ignell. 2021. “Mosquito Host Seeking in 3D Using a Versatile Climate-Controlled Wind Tunnel System.” Frontiers in Behavioral Neuroscience 15 (March): 643693.
Hochreiter, Sepp, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. 2001. “Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies.” In A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press.
Li, Qianxiao, Cheng Tai, and E. Weinan. 2019. “Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations.” The Journal of Machine Learning Research 20:1474–1520.
Li, Zhiyuan, Sadhika Malladi, and Sanjeev Arora. 2021. “On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs).” In Advances in Neural Information Processing Systems, 34:12712–25. Curran Associates, Inc.
Li, Zhiyuan, Tianhao Wang, and Sanjeev Arora. 2021. “What Happens After SGD Reaches Zero Loss? –A Mathematical Framework.” In.
Liu, Qiang. 2016. “Stein Variational Gradient Descent: Theory and Applications,” 6.
———. 2017. “Stein Variational Gradient Descent as Gradient Flow.” arXiv.
Ljung, Lennart, Georg Pflug, and Harro Walk. 1992. Stochastic Approximation and Optimization of Random Systems. Basel: Birkhäuser.
Lyu, Kaifeng, Zhiyuan Li, and Sanjeev Arora. 2023. “Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction.” In. arXiv.
Malladi, Sadhika, Kaifeng Lyu, Abhishek Panigrahi, and Sanjeev Arora. 2022. “On the SDEs and Scaling Rules for Adaptive Gradient Algorithms.” In Advances in Neural Information Processing Systems, 35:7697–7711.
Mandt, Stephan, Matthew D. Hoffman, and David M. Blei. 2017. “Stochastic Gradient Descent as Approximate Bayesian Inference.” JMLR, April.
Schillings, Claudia, and Andrew M. Stuart. 2017. “Analysis of the Ensemble Kalman Filter for Inverse Problems.” SIAM Journal on Numerical Analysis 55 (3): 1264–90.
Wang, Runzhe, Sadhika Malladi, Tianhao Wang, Kaifeng Lyu, and Zhiyuan Li. 2023. “The Marginal Value of Momentum for Small Learning Rate SGD.” arXiv.
