Gradient flows

infinitesimal optimization

January 30, 2020 — September 28, 2023

functional analysis
neural nets
optimization
SDEs
stochastic processes
Figure 1: Hinze et al. (2021) depict a mosquito’s gradient flow in a 3d optimisation problem.

Stochastic models of optimisation, especially stochastic gradient descent.

1 Ordinary DE

Gradient flows can be thought of as the continuous-time limit of gradient descent: there is a (deterministic) ODE, $\dot{\theta}(t) = -\nabla L(\theta(t))$, which gradient descent follows as the training rate becomes infinitesimal.
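A toy numerical sketch of this correspondence, on a quadratic loss with made-up constants (nothing here is tied to a particular reference):

```python
# Toy sketch: gradient descent as an Euler discretisation of the gradient
# flow ODE dθ/dt = -∇L(θ). The quadratic loss and constants are invented
# for illustration.
import numpy as np

A = np.diag([1.0, 10.0])            # Hessian of a toy quadratic loss L(θ) = ½ θᵀAθ
grad = lambda theta: A @ theta      # ∇L(θ) = Aθ

def gradient_descent(theta0, lr, steps):
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        theta -= lr * grad(theta)   # θ_{k+1} = θ_k - η ∇L(θ_k)
    return theta

def gradient_flow(theta0, t_end, dt=1e-4):
    # Fine Euler integration of dθ/dt = -∇L(θ); as dt → 0 this approaches the flow.
    theta = np.array(theta0, dtype=float)
    for _ in range(int(t_end / dt)):
        theta -= dt * grad(theta)
    return theta

theta0 = [1.0, 1.0]
# k steps of gradient descent at rate η track the flow at time t = kη as η → 0.
print(gradient_descent(theta0, lr=0.01, steps=100))
print(gradient_flow(theta0, t_end=1.0))
```

Shrinking the learning rate (and increasing the step count to match) pushes the gradient-descent iterate towards the flow solution at the same time horizon.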

2 Stochastic DE for early-stage training

SGD can be modelled as an SDE (Ljung, Pflug, and Walk 1992; Mandt, Hoffman, and Blei 2017), which is worth the price of dusting off the old stochastic calculus. Heuristically, SGD with learning rate $\eta$ and minibatch-gradient noise covariance $\Sigma(\theta)$ behaves like $\mathrm{d}\theta = -\nabla L(\theta)\,\mathrm{d}t + \sqrt{\eta}\,\Sigma(\theta)^{1/2}\,\mathrm{d}W$. This is typically used to derive scaling rules for model training (Q. Li, Tai, and Weinan 2019; Z. Li, Malladi, and Arora 2021; Malladi et al. 2022).
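A minimal sketch of that correspondence on a toy quadratic problem, assuming constant isotropic gradient noise $\Sigma = \sigma^2 I$ (an illustrative assumption, not the general case):

```python
# Sketch of the SDE picture of SGD: heuristically
#   dθ = -∇L(θ) dt + sqrt(η) Σ(θ)^{1/2} dW.
# Toy quadratic loss with constant isotropic noise covariance Σ = σ²I
# (an illustrative assumption, not the general case).
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([1.0, 10.0])
grad = lambda theta: A @ theta          # full-batch gradient
eta, sigma, steps = 0.01, 0.5, 2000     # learning rate, noise scale, iterations

def sgd(theta0):
    # SGD with additive Gaussian minibatch noise on the gradient.
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        noisy_grad = grad(theta) + sigma * rng.standard_normal(2)
        theta -= eta * noisy_grad
    return theta

def euler_maruyama(theta0):
    # Discretise the SDE with time step dt = η so that step counts match.
    theta, dt = np.array(theta0, dtype=float), eta
    for _ in range(steps):
        dW = np.sqrt(dt) * rng.standard_normal(2)
        theta += -grad(theta) * dt + np.sqrt(eta) * sigma * dW
    return theta

theta0 = [2.0, 2.0]
print(sgd(theta0))             # one SGD trajectory endpoint
print(euler_maruyama(theta0))  # one SDE trajectory endpoint, same time horizon
```

With time step $\mathrm{d}t = \eta$ the two updates coincide step-for-step in distribution for this toy problem, which is the intuition behind the SDE approximation.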

3 Stochastic DE around the optimum

The limiting diffusion also describes behaviour around an optimum, i.e. after we have (nearly) converged. This is interesting for understanding generalisation (Gu et al. 2022; Z. Li, Wang, and Arora 2021; Lyu, Li, and Arora 2023; Wang et al. 2023).
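A sketch of that picture, assuming a quadratic loss with Hessian $H$ and constant gradient-noise covariance $C$ (illustrative values only): near the optimum the diffusion is an Ornstein-Uhlenbeck process, whose stationary covariance solves a Lyapunov equation.

```python
# Sketch: near a minimum with Hessian H, the SGD diffusion is approximately
# an Ornstein-Uhlenbeck process  dθ = -Hθ dt + sqrt(η) C^{1/2} dW,
# whose stationary covariance S solves the Lyapunov equation H S + S Hᵀ = η C.
# All numbers below are illustrative assumptions.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(1)
H = np.diag([1.0, 4.0])            # Hessian at the optimum
C = 0.25 * np.eye(2)               # gradient-noise covariance
eta, dt, steps = 0.1, 1e-3, 200_000

theta = np.zeros(2)
samples = np.empty((steps, 2))
for k in range(steps):
    dW = np.sqrt(dt) * rng.standard_normal(2)
    theta += -H @ theta * dt + np.sqrt(eta) * np.linalg.cholesky(C) @ dW
    samples[k] = theta

# Empirical covariance of the tail of the trajectory vs. the Lyapunov solution.
print("empirical covariance:\n", np.cov(samples[steps // 2:].T))
print("Lyapunov solution:\n", solve_continuous_lyapunov(H, eta * C))
```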

These diffusions have an interpretation in terms of sampling from a Bayes posterior; see the Bayes by Backprop notes.
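Roughly, under the quadratic/Ornstein-Uhlenbeck approximation above, the stationary law of the iterates is Gaussian; the notation here ($H$, $C$, $S$ as in the sketch above) is mine rather than Mandt et al.'s:

$$
p_\infty(\theta) \propto \exp\!\Big(-\tfrac{1}{2}(\theta-\theta^*)^\top S^{-1}(\theta-\theta^*)\Big),
\qquad H S + S H^\top = \eta C .
$$

Mandt, Hoffman, and Blei (2017) exploit this by tuning the step size (and a preconditioner) so that this stationary Gaussian approximates the posterior over $\theta$.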

4 References

Ambrosio, and Gigli. 2013. “A User’s Guide to Optimal Transport.” In Modelling and Optimisation of Flows on Networks: Cetraro, Italy 2009, Editors: Benedetto Piccoli, Michel Rascle. Lecture Notes in Mathematics.
Ambrosio, Gigli, and Savare. 2008. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics. ETH Zürich.
Ancona, Ceolini, Öztireli, et al. 2017. “Towards Better Understanding of Gradient-Based Attribution Methods for Deep Neural Networks.”
Bartlett, Montanari, and Rakhlin. 2021. “Deep Learning: A Statistical Viewpoint.” Acta Numerica.
Chizat, and Bach. 2018. “On the Global Convergence of Gradient Descent for over-Parameterized Models Using Optimal Transport.” In Proceedings of the 32nd International Conference on Neural Information Processing Systems. NIPS’18.
Chu, Minami, and Fukumizu. 2022. “The Equivalence Between Stein Variational Gradient Descent and Black-Box Variational Inference.” In.
Di Giovanni, Rowbottom, Chamberlain, et al. 2022. “Graph Neural Networks as Gradient Flows.”
Galy-Fajou, Perrone, and Opper. 2021. “Flexible and Efficient Inference with Particles for the Variational Gaussian Approximation.” Entropy.
Garbuno-Inigo, Hoffmann, Li, et al. 2020. “Interacting Langevin Diffusions: Gradient Structure and Ensemble Kalman Sampler.” SIAM Journal on Applied Dynamical Systems.
Gu, Lyu, Huang, et al. 2022. “Why (and When) Does Local SGD Generalize Better Than SGD?” In.
Hinze, Lantz, Hill, et al. 2021. “Mosquito Host Seeking in 3D Using a Versatile Climate-Controlled Wind Tunnel System.” Frontiers in Behavioral Neuroscience.
Hochreiter, Bengio, Frasconi, et al. 2001. “Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies.” In A Field Guide to Dynamical Recurrent Neural Networks.
Li, Qianxiao, Tai, and Weinan. 2019. “Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations.” The Journal of Machine Learning Research.
Li, Zhiyuan, Malladi, and Arora. 2021. “On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs).” In Advances in Neural Information Processing Systems.
Li, Zhiyuan, Wang, and Arora. 2021. “What Happens After SGD Reaches Zero Loss? –A Mathematical Framework.” In.
Liu. 2016. “Stein Variational Gradient Descent: Theory and Applications.”
———. 2017. “Stein Variational Gradient Descent as Gradient Flow.”
Ljung, Pflug, and Walk. 1992. Stochastic Approximation and Optimization of Random Systems.
Lyu, Li, and Arora. 2023. “Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction.” In.
Malladi, Lyu, Panigrahi, et al. 2022. “On the SDEs and Scaling Rules for Adaptive Gradient Algorithms.” In Advances in Neural Information Processing Systems.
Mandt, Hoffman, and Blei. 2017. “Stochastic Gradient Descent as Approximate Bayesian Inference.” JMLR.
Schillings, and Stuart. 2017. “Analysis of the Ensemble Kalman Filter for Inverse Problems.” SIAM Journal on Numerical Analysis.
Wang, Malladi, Wang, et al. 2023. “The Marginal Value of Momentum for Small Learning Rate SGD.”