Gradient flows
infinitesimal optimisation
January 30, 2020 — September 28, 2023
Stochastic models of optimisation, especially stochastic gradient descent.
1 Ordinary
Gradient flows can be thought of as a continuous-time limit of gradient descent: there is a (deterministic) ODE, dθ/dt = -∇L(θ), corresponding to an infinitesimal learning rate.
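To make this concrete, here is a minimal sketch (toy quadratic loss, all names and numbers illustrative) comparing gradient descent at a small learning rate to the exact solution of the gradient-flow ODE:

```python
import numpy as np
from scipy.linalg import expm

# Toy quadratic loss L(theta) = 0.5 * theta^T A theta, so grad L(theta) = A theta.
A = np.array([[3.0, 0.0],
              [0.0, 1.0]])

def grad_L(theta):
    return A @ theta

theta0 = np.array([1.0, 1.0])

# Gradient descent: theta_{k+1} = theta_k - eta * grad_L(theta_k).
eta, n_steps = 0.01, 500
theta_gd = theta0.copy()
for _ in range(n_steps):
    theta_gd = theta_gd - eta * grad_L(theta_gd)

# Gradient flow: dtheta/dt = -grad_L(theta). For this quadratic the flow has
# the closed form theta(t) = expm(-A t) @ theta0; match total time t = eta * n_steps.
t = eta * n_steps
theta_flow = expm(-A * t) @ theta0

print("gradient descent:", theta_gd)
print("gradient flow:   ", theta_flow)  # nearly identical for small eta
```

Shrink eta while holding t = eta * n_steps fixed and the two trajectories agree to higher and higher order.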
2 Stochastic DE for early-stage training
SGD as an SDE (Ljung, Pflug, and Walk 1992; Mandt, Hoffman, and Blei 2017). Worth the price of dusting off the old stochastic calculus. This is typically used to choose scaling rules for model training, e.g. how the learning rate should scale with the batch size (Q. Li, Tai, and Weinan 2019; Z. Li, Malladi, and Arora 2021; Malladi et al. 2022).
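For intuition, here is a sketch of the kind of SDE these papers study, dθ_t = -∇L(θ_t) dt + √η Σ^{1/2} dW_t, simulated by Euler-Maruyama on a quadratic bowl. The constant noise covariance and all the specific numbers are assumptions for illustration, not taken from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_L(theta):
    return theta  # quadratic bowl L(theta) = 0.5 * ||theta||^2

eta = 0.1                     # SGD learning rate; sets the noise scale
sqrt_Sigma = 0.5 * np.eye(2)  # assumed (constant) gradient-noise covariance factor
dt, n_steps = 0.01, 10_000    # Euler-Maruyama discretisation of the SDE

theta = np.array([2.0, -2.0])
path = np.empty((n_steps, 2))
for k in range(n_steps):
    dW = np.sqrt(dt) * rng.normal(size=2)
    theta = theta - grad_L(theta) * dt + np.sqrt(eta) * sqrt_Sigma @ dW
    path[k] = theta

# Early on the drift dominates, so the path looks like a gradient flow;
# later the iterates just fluctuate around the minimum at a scale set by eta.
print("mean over last half:", path[n_steps // 2:].mean(axis=0))
print("std over last half: ", path[n_steps // 2:].std(axis=0))
```

Note the √η factor in the diffusion term: in this scaling the learning rate controls the size of the stationary fluctuations, which is one way to motivate learning-rate/batch-size scaling rules.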
3 Stochastic DE around the optimum
The limiting diffusion describes the dynamics around an optimum, i.e. after we have converged. Interesting for understanding generalisation (Gu et al. 2022; Z. Li, Wang, and Arora 2021; Lyu, Li, and Arora 2023; Wang et al. 2023).
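Near a minimum the loss is approximately quadratic, so the limiting diffusion is (approximately) an Ornstein-Uhlenbeck process, whose Gaussian stationary covariance solves a Lyapunov equation. A sketch checking this numerically, with a made-up local Hessian H and diffusion factor B:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(1)

# Ornstein-Uhlenbeck model near the optimum: dtheta = -H theta dt + B dW.
H = np.array([[2.0, 0.5],
              [0.5, 1.0]])   # local Hessian (assumed)
B = 0.3 * np.eye(2)          # diffusion factor (assumed)

# The stationary covariance S solves the Lyapunov equation H S + S H^T = B B^T.
S_theory = solve_continuous_lyapunov(H, B @ B.T)

# Euler-Maruyama simulation, discarding a burn-in, as an empirical check.
dt, n_steps, burn = 0.01, 200_000, 50_000
theta = np.zeros(2)
samples = np.empty((n_steps, 2))
for k in range(n_steps):
    dW = np.sqrt(dt) * rng.normal(size=2)
    theta = theta - H @ theta * dt + B @ dW
    samples[k] = theta

print("Lyapunov solution:\n", S_theory)
print("empirical covariance:\n", np.cov(samples[burn:].T))
```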
These diffusions also have an interpretation in terms of sampling from a Bayes posterior; see Bayes by Backprop.
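A hedged toy demonstration of that interpretation: for x_i ~ N(θ, 1) with a flat prior, the posterior is N(mean(x), 1/N), and constant-step minibatch SGD on the average negative log-likelihood has a stationary distribution that can be tuned to match it. The step-size recipe below is a back-of-envelope calculation in the spirit of Mandt, Hoffman, and Blei (2017) and is exact only for this quadratic model:

```python
import numpy as np

rng = np.random.default_rng(2)

# Data: x_i ~ N(theta_true, 1); flat prior => posterior is N(mean(x), 1/N).
N, b = 1_000, 10  # dataset size, minibatch size
x = rng.normal(loc=1.5, scale=1.0, size=N)

# For this model, eta = 2b/N makes the stationary variance of the SGD
# iterates equal the posterior variance 1/N (back-of-envelope tuning).
eta = 2 * b / N

theta, trace = 0.0, []
for _ in range(50_000):
    batch = rng.choice(x, size=b)
    grad = theta - batch.mean()  # minibatch gradient of mean of 0.5*(x_i - theta)^2
    theta -= eta * grad
    trace.append(theta)

trace = np.array(trace[10_000:])  # discard burn-in
print("posterior:    mean %.4f, sd %.4f" % (x.mean(), N ** -0.5))
print("SGD iterates: mean %.4f, sd %.4f" % (trace.mean(), trace.std()))
```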