Adaptive 1st order gradient descent

I need to mention Adam and RMSProp etc somewhere

January 30, 2020 — June 11, 2024

functional analysis
neural nets
stochastic processes
Figure 1


Pragmatically, modern SGD algorithms are often of the adaptive flavour: the learning rate is tuned separately for each parameter during training, typically using running statistics of that parameter's past gradients, as in RMSProp and Adam.

The justifications for this are a mixture of theoretical and empirical arguments.
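
To make the per-parameter tuning concrete, here is a minimal pure-Python sketch of the Adam update rule. The function name `adam_minimize`, the toy quadratic objective and the hyperparameter values are assumptions chosen for illustration, not anything prescribed by the references.

```python
import math


def adam_minimize(grad, theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimise a function, given its gradient, with Adam-style updates."""
    theta = list(theta)
    m = [0.0] * len(theta)  # running mean of gradients (first moment)
    v = [0.0] * len(theta)  # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            # Bias correction: both averages start at zero.
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            # The step for parameter i is rescaled by its own gradient history.
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Toy objective (an assumption for illustration): a badly scaled quadratic,
# f(x, y) = 0.5 * (100 * x**2 + y**2), whose gradient is (100 * x, y).
def quad_grad(theta):
    x, y = theta
    return [100.0 * x, y]


theta_star = adam_minimize(quad_grad, [1.0, 1.0])
print(theta_star)
```

Because each coordinate is divided by the root of its own squared-gradient average, the two coordinates make progress at a similar rate even though their curvatures differ by a factor of 100. Dropping the first-moment average `m` and the bias correction leaves something close to RMSProp.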

One interesting family of methods tweaks Adam to approximate Bayesian inference (Khan et al. 2018).

1 References

Khan, Nielsen, Tangkaratt, et al. 2018. “Fast and Scalable Bayesian Deep Learning by Weight-Perturbation in Adam.” In Proceedings of the 35th International Conference on Machine Learning.
Kingma, and Ba. 2017. “Adam: A Method for Stochastic Optimization.”
Ruder. 2017. “An Overview of Gradient Descent Optimization Algorithms.”
Wilson, Roelofs, Stern, et al. 2017. “The Marginal Value of Adaptive Gradient Methods in Machine Learning.” In Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17.
Xie, Wang, Zhou, et al. 2018. “Interpolatron: Interpolation or Extrapolation Schemes to Accelerate Optimization for Deep Neural Networks.”
Zhang, and Mitliagkas. 2018. “YellowFin and the Art of Momentum Tuning.”