Gradient descent, Newton-like, stochastic



\[\renewcommand{\var}{\operatorname{Var}} \renewcommand{\dd}{\mathrm{d}} \renewcommand{\bb}[1]{\mathbb{#1}} \renewcommand{\vv}[1]{\boldsymbol{#1}} \renewcommand{\rv}[1]{\mathsf{#1}} \renewcommand{\vrv}[1]{\vv{\rv{#1}}} \renewcommand{\gvn}{\mid} \renewcommand{\Ex}{\mathbb{E}} \renewcommand{\Pr}{\mathbb{P}}\]

Stochastic Newton-type optimisation, unlike deterministic Newton optimisation, uses noisy (possibly approximate) second-order information to find the argument that minimises \[ x^*=\operatorname{arg\,min}_{x} f(x) \] for some objective function \(f:\mathbb{R}^n\to\mathbb{R}\).

Subsampling data

Most of the good tricks here are set up for ML-style training losses, where the bottleneck is summing a large number of per-datum loss terms.
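
As a concrete (and deliberately naive) baseline, here is a sketch of a subsampled Newton step for such a finite-sum loss: estimate the gradient and Hessian on a random minibatch and solve a damped linear system. The function names and the damping scheme are my own illustrative choices, not any particular published algorithm.

```python
import numpy as np

def subsampled_newton_step(x, data, grad_i, hess_i, batch_size=32,
                           damping=1e-3, rng=np.random.default_rng(0)):
    """One damped Newton step from minibatch gradient/Hessian estimates.

    grad_i(x, d) and hess_i(x, d) give the gradient and Hessian of the
    loss at a single datum d; minibatch averages are unbiased estimates
    of the full-sum quantities.
    """
    batch = rng.choice(len(data), size=batch_size, replace=False)
    g = np.mean([grad_i(x, data[i]) for i in batch], axis=0)
    H = np.mean([hess_i(x, data[i]) for i in batch], axis=0)
    # Damping keeps the solve well-posed when the minibatch Hessian is
    # ill-conditioned or indefinite.
    return x - np.linalg.solve(H + damping * np.eye(x.size), g)

# e.g. per-datum least squares, d = (a, b), loss 0.5 * (a @ x - b)**2:
#   grad_i = lambda x, d: (d[0] @ x - d[1]) * d[0]
#   hess_i = lambda x, d: np.outer(d[0], d[0])
```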

LiSSA attempts to make second-order gradient descent methods scale to large parameter sets (Agarwal, Bullins, and Hazan 2016):

a linear time stochastic second order algorithm that achieves linear convergence for typical problems in machine learning while still maintaining run-times theoretically comparable to state-of-the-art first order algorithms. This relies heavily on the special structure of the optimization problem that allows our unbiased hessian estimator to be implemented efficiently, using only vector-vector products.

David McAllester observes:

Since \(H^{t+1}y^t\) can be computed efficiently whenever we can run backpropagation, the conditions under which the LiSSA algorithm can be run are actually much more general than the paper suggests. Backpropagation can be run on essentially any natural loss function.
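
A minimal sketch of the estimator (my own notation, not the authors' code): write \(H^{-1}v\) as the Neumann series \(\sum_j (I-H)^j v\) (valid after scaling so that \(0 \prec H \preceq I\)), and approximate each truncation stochastically using one sampled Hessian-vector product per step, which backpropagation supplies cheaply, as McAllester notes.

```python
import numpy as np

def lissa_inverse_hvp(hvp_sample, v, n_terms, n_estimates=10, depth=100,
                      scale=1.0, rng=np.random.default_rng(0)):
    """Estimate H^{-1} v via a LiSSA-style stochastic Neumann recursion.

    hvp_sample(i, p) returns the Hessian-vector product of the i-th loss
    term with p.  `scale` should be chosen so the scaled Hessian obeys
    0 < H/scale <= I, which makes sum_j (I - H/scale)^j v converge.
    """
    estimates = []
    for _ in range(n_estimates):
        p = v.copy()
        for _ in range(depth):
            i = rng.integers(n_terms)              # one sampled loss term
            p = v + p - hvp_sample(i, p) / scale   # p <- v + (I - H_i/scale) p
        estimates.append(p / scale)                # undo the scaling
    return np.mean(estimates, axis=0)              # average to reduce variance
```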

(Kovalev, Mishchenko, and Richtárik 2019) uses a decomposition of the objective into a sum of simple functions (the classic SGD setup for neural nets, typical of online optimisation).

What do (F. Bach and Moulines 2011; F. R. Bach and Moulines 2013) get us?

Many machine learning and signal processing problems are traditionally cast as convex optimization problems. A common difficulty in solving these problems is the size of the data, where there are many observations ('large n') and each of these is large ('large p'). In this setting, online algorithms such as stochastic gradient descent which pass over the data only once, are usually preferred over batch algorithms, which require multiple passes over the data. In this talk, I will show how the smoothness of loss functions may be used to design novel algorithms with improved behavior, both in theory and practice: in the ideal infinite-data setting, an efficient novel Newton-based stochastic approximation algorithm leads to a convergence rate of O(1/n) without strong convexity assumptions, while in the practical finite-data setting, an appropriate combination of batch and online algorithms leads to unexpected behaviors, such as a linear convergence rate for strongly convex problems, with an iteration cost similar to stochastic gradient descent. (joint work with Nicolas Le Roux, Eric Moulines and Mark Schmidt).

General case

Rather than observing \(\nabla f\) and \(\nabla^2 f\), we observe some random variables \(G(x)\) and \(H(x)\) with \(\bb{E}[G(x)]=\nabla f(x)\) and \(\bb{E}[H(x)]=\nabla^2 f(x)\), not necessarily decomposable into a sum.
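
For illustration only, here is a generic stabilised iteration under these assumptions (not a scheme with guarantees from the literature): average the noisy Hessian observations over iterations and damp the solve.

```python
import numpy as np

def stochastic_newton(G, H, x0, n_iter=200, damping=1e-2):
    """Newton-type iteration with unbiased but noisy oracles G and H.

    A running average of the Hessian draws tames the curvature noise;
    damping keeps each linear solve well-posed.
    """
    x = np.asarray(x0, dtype=float)
    d = x.size
    H_bar = np.eye(d)
    for t in range(1, n_iter + 1):
        H_bar += (H(x) - H_bar) / t      # running mean of Hessian observations
        x = x - np.linalg.solve(H_bar + damping * np.eye(d), G(x))
    return x

# Toy check: f(x) = 0.5 x'Ax - b'x with additive Gaussian oracle noise.
rng = np.random.default_rng(1)
A, b = np.array([[3.0, 0.5], [0.5, 1.0]]), np.array([1.0, -2.0])
x_hat = stochastic_newton(lambda x: A @ x - b + 0.1 * rng.normal(size=2),
                          lambda x: A + 0.1 * rng.normal(size=(2, 2)),
                          np.zeros(2))
print(x_hat)  # hovers near the true minimiser A^{-1} b ≈ [0.73, -2.36]
```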

🏗

Online Newton’s method

Bookmarked: I am sitting in Iman Shames’ seminar on some recent papers (Gravell, Shames, and Summers 2021; Lesage-Landry, Taylor, and Shames 2021; Pavlov, Shames, and Manzie 2020):

In this talk we revisit one of the most important workhorses of numerical optimisation: the Newton’s method. We will (hopefully) provide some insight into the role it still plays in modern applications of control and learning theory. We first look into its regret analysis when applied to online nonconvex optimisation problems, then we shift our focus to how it and its very close relative, the midpoint Newton’s method, play leading roles in learning dynamical systems, and conclude with reviewing how it acts as a hero in disguise in solving differential dynamic programming problems.

Looks worthwhile.

Online Newton’s Method (ONM): \(\quad x_{t+1}=x_{t}-H_{t}^{-1}\left(x_{t}\right) \nabla f_{t}\left(x_{t}\right)\)

Let \(x_{t+1}^{*}=x_{t}^{*}+v_{t}\) with \(\max _{t}\left\|v_{t}\right\| \leq \bar{v}\) and \(V_{T}=\sum_{t=1}^{T}\left\|x_{t+1}^{*}-x_{t}^{*}\right\|=\sum_{t=1}^{T}\left\|v_{t}\right\|.\)

Some assumptions:
\[
\begin{aligned}
\exists h_{t}>0:\quad & \left\|H_{t}^{-1}\left(x_{t}^{*}\right)\right\| \leq \frac{1}{h_{t}} \\
\exists \beta_{t}, L_{t}>0:\quad & \left\|x-x_{t}^{*}\right\| \leq \beta_{t} \Longrightarrow \left\|H_{t}(x)-H_{t}\left(x_{t}^{*}\right)\right\| \leq L_{t}\left\|x-x_{t}^{*}\right\| \\
& \left\|x_{t}-x_{t}^{*}\right\| \leq \gamma_{t}=\min \left[\beta_{t}, \frac{2 h_{t}}{3 L_{t}}\right]
\end{aligned}
\]

Lemma:
\[
\left((\mathrm{ONM}) \wedge\left(\bar{v} \leq \gamma-\frac{3 L}{2 h} \gamma^{2}\right)\right) \Longrightarrow\left(\left\|x_{t+1}-x_{t+1}^{*}\right\|<\gamma\right)
\]

Another assumption: \(\exists \ell>0\) such that \(\left\|x-x_{t}^{*}\right\| \leq \gamma \Longrightarrow\left|f_{t}(x)-f_{t}\left(x_{t}^{*}\right)\right| \leq \ell\left\|x-x_{t}^{*}\right\|\) for all \(t=1,2, \ldots, T\).

Theorem: the regret of ONM is bounded above,
\[
\operatorname{Reg}_{T}^{D, \mathrm{ONM}} \leq \frac{\ell}{1-\frac{3 L}{2 h} \gamma}\left(V_{T}+\delta\right)
\]
where \(\delta=\frac{3 L}{2 h}\left(\left\|x_{0}-x_{0}^{*}\right\|^{2}-\left\|x_{T}-x_{T}^{*}\right\|^{2}\right).\)

It is the same as the regret for strongly-convex functions.

The take-home message seems to be that Newton’s method is efficient for sufficiently well-behaved online control problems, rather as you would expect. It is not clear to me, though, whether obtaining a decent quasi-Newton version is straightforward.
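
To make the ONM update above concrete, here is a toy tracking experiment of my own devising (not from the cited papers): the target is a time-varying quadratic with fixed curvature, whose minimiser \(x_t^*\) drifts slowly, and ONM plays \(x_{t+1}=x_{t}-H_{t}^{-1}(x_{t})\nabla f_{t}(x_{t})\).

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 2, 200
A = np.array([[2.0, 0.3], [0.3, 1.0]])     # fixed curvature: H_t(x) = A

# Drifting minimisers x_t^*, with f_t(x) = 0.5 (x - x_t^*)' A (x - x_t^*)
x_star = np.cumsum(0.01 * rng.normal(size=(T, d)), axis=0)

x = np.zeros(d)
errors = []
for t in range(T):
    errors.append(np.linalg.norm(x - x_star[t]))   # loss paid at round t
    grad = A @ (x - x_star[t])                      # gradient revealed after playing x
    x = x - np.linalg.solve(A, grad)                # ONM: x_{t+1} = x_t - H_t^{-1} grad

# The per-round error settles at roughly the drift ||v_t||, so cumulative
# regret grows like V_T, consistent with the bound above.
print(np.mean(errors), np.mean(np.linalg.norm(np.diff(x_star, axis=0), axis=1)))
```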

Subsampling parameters

Hu et al. (2022):

The success of deep learning comes at a tremendous computational and energy cost, and the scalability of training massively overparametrized neural networks is becoming a real barrier to the progress of artificial intelligence (AI). Despite the popularity and low cost-per-iteration of traditional backpropagation via gradient descent, SGD has prohibitive convergence rate in non-convex settings, both in theory and practice.

To mitigate this cost, recent works have proposed to employ alternative (Newton-type) training methods with much faster convergence rate, albeit with higher cost-per-iteration. For a typical neural network with \(m=\operatorname{poly}(n)\) parameters and input batch of \(n\) datapoints in \(\mathbb{R}^{d}\), the previous work of [Brand, Peng, Song, and Weinstein, ITCS'2021] requires \(\sim m n d+n^{3}\) time per iteration. In this paper, we present a novel training method that requires only \(m^{1-\alpha} n d+n^{3}\) amortized time in the same overparametrized regime. This method relies on a new and alternative view of neural networks, as a set of binary search trees, where each iteration corresponds to modifying a small subset of the nodes in the tree. We believe this view would have further applications in the design and analysis of DNNs.

References

Abt, Markus, and William J. Welch. 1998. “Fisher Information and Maximum-Likelihood Estimation of Covariance Parameters in Gaussian Stochastic Processes.” Canadian Journal of Statistics 26 (1): 127–37.
Agarwal, Naman, Brian Bullins, and Elad Hazan. 2016. “Second Order Stochastic Optimization in Linear Time.” arXiv:1602.03943 [Cs, Stat], February.
Amari, Shun-ichi. 1998. “Natural Gradient Works Efficiently in Learning.” Neural Computation 10 (2): 251–76.
Amari, Shun-ichi, Ryo Karakida, and Masafumi Oizumi. 2018. “Fisher Information and Natural Gradient Learning of Random Deep Networks.” arXiv:1808.07172 [Cond-Mat, Stat], August.
Amari, Shun-ichi, Hyeyoung Park, and Kenji Fukumizu. 2000. “Adaptive Method of Realizing Natural Gradient Learning for Multilayer Perceptrons.” Neural Computation 12 (6): 1399–409.
Arbel, Michael, Arthur Gretton, Wuchen Li, and Guido Montufar. 2020. “Kernelized Wasserstein Natural Gradient.” arXiv.
Arnold, Sébastien M. R., and Chunming Wang. 2017. “Accelerating SGD for Distributed Deep-Learning Using Approximated Hessian Matrix.” In arXiv:1709.05069 [Cs].
Ba, Jimmy, Roger Grosse, and James Martens. 2016. “Distributed Second-Order Optimization Using Kronecker-Factored Approximations,” November.
Bach, Francis R., and Eric Moulines. 2013. “Non-Strongly-Convex Smooth Stochastic Approximation with Convergence Rate O(1/n).” In arXiv:1306.2119 [Cs, Math, Stat], 773–81.
Bach, Francis, and Eric Moulines. 2011. “Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning.” In Advances in Neural Information Processing Systems (NIPS). Spain.
Battiti, Roberto. 1992. “First-and Second-Order Methods for Learning: Between Steepest Descent and Newton’s Method.” Neural Computation 4 (2): 141–66.
Bordes, Antoine, Léon Bottou, and Patrick Gallinari. 2009. “SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent.” Journal of Machine Learning Research 10 (December): 1737–54.
Botev, Aleksandar, Hippolyt Ritter, and David Barber. 2017. “Practical Gauss-Newton Optimisation for Deep Learning.” In Proceedings of the 34th International Conference on Machine Learning, 557–65. PMLR.
Bottou, Léon. 2012. “Stochastic Gradient Descent Tricks.” In Neural Networks: Tricks of the Trade, 421–36. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg.
Byrd, R. H., S. L. Hansen, Jorge Nocedal, and Y. Singer. 2016. “A Stochastic Quasi-Newton Method for Large-Scale Optimization.” SIAM Journal on Optimization 26 (2): 1008–31.
Cho, Minhyung, Chandra Shekhar Dhir, and Jaehyung Lee. 2015. “Hessian-Free Optimization for Learning Deep Multidimensional Recurrent Neural Networks.” In Advances In Neural Information Processing Systems.
Dangel, Felix, Frederik Kunstner, and Philipp Hennig. 2019. “BackPACK: Packing More into Backprop.” In International Conference on Learning Representations.
Dauphin, Yann, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. 2014. “Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization.” In Advances in Neural Information Processing Systems 27, 2933–41. Curran Associates, Inc.
Detommaso, Gianluca, Tiangang Cui, Alessio Spantini, Youssef Marzouk, and Robert Scheichl. 2018. “A Stein Variational Newton Method.” In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 9187–97. NIPS’18. Red Hook, NY, USA: Curran Associates Inc.
Efron, Bradley, and David V. Hinkley. 1978. “Assessing the Accuracy of the Maximum Likelihood Estimator: Observed Versus Expected Fisher Information.” Biometrika 65 (3): 457–83.
Gravell, Benjamin, Iman Shames, and Tyler Summers. 2021. “Approximate Midpoint Policy Iteration for Linear Quadratic Control.” arXiv:2011.14212 [Cs, Eess, Math], February.
Grosse, Roger. 2021. “Metrics.” In CSC2541 Winter 2021, Chapter 3.
Grosse, Roger, and James Martens. 2016. “A Kronecker-Factored Approximate Fisher Matrix for Convolution Layers.” In Proceedings of The 33rd International Conference on Machine Learning, 573–82. PMLR.
Hensman, James, Magnus Rattray, and Neil Lawrence. 2012. “Fast Variational Inference in the Conjugate Exponential Family.” In Advances in Neural Information Processing Systems. Vol. 25. Curran Associates, Inc.
Hu, Hang, Zhao Song, Omri Weinstein, and Danyang Zhuo. 2022. “Training Overparametrized Neural Networks in Sublinear Time.” arXiv.
Kakade, Sham M. 2002. “A Natural Policy Gradient.” In Advances In Neural Information Processing Systems, 8.
Karakida, Ryo, and Kazuki Osawa. 2020. “Understanding Approximate Fisher Information for Fast Convergence of Natural Gradient Descent in Wide Neural Networks.” Advances in Neural Information Processing Systems 33.
Khan, Mohammad Emtiyaz, and Håvard Rue. 2022. “The Bayesian Learning Rule.” arXiv.
Kovalev, Dmitry, Konstantin Mishchenko, and Peter Richtárik. 2019. “Stochastic Newton and Cubic Newton Methods with Simple Local Linear-Quadratic Rates.” arXiv:1912.01597 [Cs, Math, Stat], December.
Lesage-Landry, Antoine, Joshua A. Taylor, and Iman Shames. 2021. “Second-Order Online Nonconvex Optimization.” IEEE Transactions on Automatic Control 66 (10): 4866–72.
Ljung, Lennart, Georg Pflug, and Harro Walk. 1992. Stochastic Approximation and Optimization of Random Systems. Basel: Birkhäuser.
Lucchi, Aurelien, Brian McWilliams, and Thomas Hofmann. 2015. “A Variance Reduced Stochastic Newton Method.” arXiv:1503.08316 [Cs], March.
Ly, Alexander, Maarten Marsman, Josine Verhagen, Raoul P. P. P. Grasman, and Eric-Jan Wagenmakers. 2017. “A Tutorial on Fisher Information.” Journal of Mathematical Psychology 80 (October): 40–55.
Martens, James. 2010. “Deep Learning via Hessian-Free Optimization.” In Proceedings of the 27th International Conference on International Conference on Machine Learning, 735–42. ICML’10. USA: Omnipress.
———. 2016. “Second-Order Optimization for Neural Networks.” University of Toronto.
———. 2020. “New Insights and Perspectives on the Natural Gradient Method.” Journal of Machine Learning Research 21 (146): 1–76.
Martens, James, and Roger Grosse. 2015. “Optimizing Neural Networks with Kronecker-Factored Approximate Curvature.” In Proceedings of the 32nd International Conference on Machine Learning, 2408–17. PMLR.
Martens, James, and Ilya Sutskever. 2011. “Learning Recurrent Neural Networks with Hessian-Free Optimization.” In Proceedings of the 28th International Conference on International Conference on Machine Learning, 1033–40. ICML’11. USA: Omnipress.
———. 2012. “Training Deep and Recurrent Networks with Hessian-Free Optimization.” In Neural Networks: Tricks of the Trade, 479–535. Lecture Notes in Computer Science. Springer.
Mosegaard, Klaus, and Albert Tarantola. 2002. “Probabilistic Approach to Inverse Problems.” In International Geophysics, 81:237–65. Elsevier.
Nielsen, Frank. 2018. “An Elementary Introduction to Information Geometry.” arXiv:1808.08271 [Cs, Math, Stat], August.
Nurbekyan, Levon, Wanzhou Lei, and Yunan Yang. 2022. “Efficient Natural Gradient Descent Methods for Large-Scale Optimization Problems.” arXiv.
Ollivier, Yann. 2017. “Online Natural Gradient as a Kalman Filter.” arXiv:1703.00209 [Math, Stat], March.
Osawa, Kazuki, Satoki Ishikawa, Rio Yokota, Shigang Li, and Torsten Hoefler. 2023. “ASDL: A Unified Interface for Gradient Preconditioning in PyTorch.” arXiv.
Pavlov, Andrei, Iman Shames, and Chris Manzie. 2020. “Interior Point Differential Dynamic Programming.” arXiv:2004.12710 [Cs, Eess, Math], October.
Robbins, H., and D. Siegmund. 1971. “A Convergence Theorem for Non Negative Almost Supermartingales and Some Applications.” In Optimizing Methods in Statistics, edited by Jagdish S. Rustagi, 233–57. Academic Press.
Ruppert, David. 1985. “A Newton-Raphson Version of the Multivariate Robbins-Monro Procedure.” The Annals of Statistics 13 (1): 236–45.
Salimbeni, Hugh, Stefanos Eleftheriadis, and James Hensman. 2018. “Natural Gradients in Practice: Non-Conjugate Variational Inference in Gaussian Process Models.” In International Conference on Artificial Intelligence and Statistics, 689–97.
Schraudolph, Nicol N. 2002. “Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent.” Neural Computation 14 (7): 1723–38.
Schraudolph, Nicol N., Jin Yu, and Simon Günter. 2007. “A Stochastic Quasi-Newton Method for Online Convex Optimization.” In Artificial Intelligence and Statistics, 436–43.
Wilkinson, William J., Simo Särkkä, and Arno Solin. 2021. “Bayes-Newton Methods for Approximate Bayesian Inference with PSD Guarantees.” arXiv.
Yao, Zhewei, Amir Gholami, Kurt Keutzer, and Michael Mahoney. 2020. “PyHessian: Neural Networks Through the Lens of the Hessian.” In arXiv:1912.07145 [Cs, Math].
Yurtsever, Alp, Joel A. Tropp, Olivier Fercoq, Madeleine Udell, and Volkan Cevher. 2021. “Scalable Semidefinite Programming.” SIAM Journal on Mathematics of Data Science 3 (1): 171–200.
Zellner, Arnold. 1988. “Optimal Information Processing and Bayes’s Theorem.” The American Statistician 42 (4): 278–80.
Zhang, Guodong, Shengyang Sun, David Duvenaud, and Roger Grosse. 2018. “Noisy Natural Gradient as Variational Inference.” In Proceedings of the 35th International Conference on Machine Learning, 5852–61. PMLR.
