Adaptive first-order gradient descent
Adam, RMSProp, and friends
January 30, 2020 — June 11, 2024
functional analysis
neural nets
optimization
SDEs
stochastic processes
Placeholder.
Pragmatically, modern SGD algorithms are often of the adaptive flavour: the learning rate is tuned separately for each parameter during training, usually from running estimates of the gradient's moments. The justifications offered for this are a mixture of theoretical and empirical, and the empirical advantage is itself contested (Wilson et al. 2017).
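As a concrete sketch of what "adaptive" means here, the Adam update of Kingma and Ba (2017) keeps exponential moving averages of the gradient and of its elementwise square, and scales each coordinate's step by the inverse square root of the latter. A minimal NumPy version (hyperparameter names follow the paper's defaults; the toy quadratic objective is my own):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: per-parameter step sizes from exponential moving
    averages of the gradient (m) and its elementwise square (v)."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive per-parameter step
    return theta, m, v

# Toy usage: minimise the quadratic ||theta||^2.
theta = np.array([5.0, -3.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```

RMSProp is essentially the same recipe without the first-moment average and the bias corrections.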
One interesting family of methods tweaks Adam to approximate Bayesian inference (Khan et al. 2018).
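A minimal caricature of that idea, and emphatically not the exact Vadam algorithm of Khan et al. (2018), whose scaling by dataset size and other details I have dropped: reuse the regularised second-moment estimate as a diagonal posterior precision, evaluate the gradient at weights sampled from the implied Gaussian, and otherwise update the posterior mean as Adam would. Everything here, including the `prior_prec` parameter and the toy objective, is my own illustrative choice:

```python
import numpy as np

def perturbed_adam_step(mu, grad_fn, m, s, t, rng,
                        lr=1e-3, beta1=0.9, beta2=0.999, prior_prec=1.0, eps=1e-8):
    """One weight-perturbation step in the spirit of variational Adam:
    the regularised second moment doubles as a diagonal posterior precision,
    and the gradient is taken at a sample from the implied Gaussian."""
    sigma = 1.0 / np.sqrt(s + prior_prec)             # per-parameter posterior std dev
    w = mu + sigma * rng.standard_normal(mu.shape)    # sampled (perturbed) weights
    g = grad_fn(w) + prior_prec * mu                  # minibatch gradient plus Gaussian prior term
    m = beta1 * m + (1 - beta1) * g                   # Adam-style moment estimates
    s = beta2 * s + (1 - beta2) * g ** 2
    m_hat, s_hat = m / (1 - beta1 ** t), s / (1 - beta2 ** t)
    mu = mu - lr * m_hat / (np.sqrt(s_hat) + eps)     # update the posterior mean
    return mu, m, s

# Toy usage: a crude Gaussian posterior over the minimiser of ||w||^2.
rng = np.random.default_rng(0)
mu = np.array([5.0, -3.0])
m, s = np.zeros_like(mu), np.zeros_like(mu)
for t in range(1, 2001):
    mu, m, s = perturbed_adam_step(mu, lambda w: 2 * w, m, s, t, rng)
```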
- ADAM: A Method for Stochastic Optimization | theberkeleyview
- An overview of gradient descent optimization algorithms
- Part V: Efficient Natural-gradient Methods for Exponential Family - Wu Lin
- Fascinating connection between Natural Gradients and the Exponential Family – Hodgepodge Notes – Gradient Descent by a Grad Student
References
Khan, Nielsen, Tangkaratt, et al. 2018. “Fast and Scalable Bayesian Deep Learning by Weight-Perturbation in Adam.” In Proceedings of the 35th International Conference on Machine Learning.
Kingma, and Ba. 2017. “Adam: A Method for Stochastic Optimization.”
Ruder. 2017. “An Overview of Gradient Descent Optimization Algorithms.”
Wilson, Roelofs, Stern, et al. 2017. “The Marginal Value of Adaptive Gradient Methods in Machine Learning.” In Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17.
Xie, Wang, Zhou, et al. 2018. “Interpolatron: Interpolation or Extrapolation Schemes to Accelerate Optimization for Deep Neural Networks.”
Zhang, and Mitliagkas. 2018. “YellowFin and the Art of Momentum Tuning.”