Gradient descent, Newton-like



NB ⚠️⛔️☣️: Under reconstruction; I think I need to characterise the 2nd-order Hessian approximations better.

Newton-type optimization, unlike basic gradient descent, uses (possibly approximate) 2nd-order gradient information to find the argument which minimises \[ x^*=\operatorname{arg\,min}_{x} f(x) \] for an objective function \(f:\mathbb{R}^n\to\mathbb{R}\).

Optimization over arbitrary functions typically gets discussed in terms of line-search and trust-region methods, both of which can be construed, AFAICT, as second-order methods. I don’t know of a first-order trust-region method; what would that even be? I think I don’t need to care about the distinction for the current purpose. We’ll see.

Can one do higher-than-2nd-order optimisation? Of course. In practice, 2nd order is usually fast enough, and has enough complications to keep one busy, so the improved per-iteration convergence rate of a 3rd-order method does not usually offer an attractive trade-off against its greater complexity.

Here are some scattered notes which make no claim to comprehensiveness, comprehensibility or coherence.

General

Roger Grosse, Second-Order Optimization is a beautiful introduction as part of CSC2541 Topics in Machine Learning: Neural Net Training Dynamics. There is a whole series (Grosse 2021a, 2021b, 2021c) of IMO beautifully written lectures on this.

One-dimensional Newton

The basic case is simple; we get into difficulties with the multidimensional stuff in a moment. From school one might remember the classic Newton method for iteratively improving a guessed location \(x_{k}\) for a root of a 1-dimensional function \(g:\mathbb{R}\to\mathbb{R}\):

\[ x_{k+1}=x_{k}-\frac{g\left(x_{k}\right)}{g'\left(x_{k}\right)} \]

Figure: Newton iteration animation.

Applying the same iteration to the derivative of a twice continuously differentiable function \(f:\mathbb{R}\to\mathbb{R}\), we can find its stationary points—zeros of the derivative—i.e. extrema and saddle points, which might in fact be minima: \[ x_{k+1}=x_{k}-\frac{f'\left(x_{k}\right)}{f''\left(x_{k}\right)} \]

How and why does this work again? Loosely, the argument is that we can use the Taylor expansion to produce a locally quadratic model for the function, even if the function is not at all globally quadratic, and then jump to the stationary point of that model.
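Concretely, a minimal sketch in Python; the example function, its derivatives and the starting point below are arbitrary choices of mine, not anything canonical.

```python
def newton_1d(df, d2f, x, n_steps=20, tol=1e-10):
    """Minimise f by Newton iteration on its derivative: x <- x - f'(x)/f''(x)."""
    for _ in range(n_steps):
        step = df(x) / d2f(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# Example: f(x) = x**4 - 3*x, so f'(x) = 4x^3 - 3 and f''(x) = 12x^2.
x_star = newton_1d(lambda x: 4 * x**3 - 3, lambda x: 12 * x**2, x=1.0)
print(x_star)  # ~ (3/4)**(1/3) ≈ 0.908, the minimiser
```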

Multidimensional Newton

Returning to the objective with a multidimensional argument \(f:\mathbb{R}^n\to\mathbb{R},\) Taylor’s theorem for twice continuously differentiable functions tells us that \[\begin{aligned} f(x+p)&= f(x)+\nabla f(x)^{\top} p +\frac{1}{2}p^{\top} \nabla^2f(x)p + p^{\top}R(x+p)p\\ \end{aligned}\] where \(\lim _{p \rightarrow 0} R(x+p)=0\).

Putting the remainder in Lagrange (mean value) form, this tells us \[ f(x+p)=f(x)+\nabla f(x)^{\top}p+\frac{1}{2}p^{\top}\nabla^2f(x+tp)p \] for some \(t\in(0,1)\).

Generally, we might take a local quadratic model for \(f\) at \(x+p\) as \[ f(x+p)\simeq m_x(p)= f(x)+\nabla f(x)^{\top}p+\frac{1}{2}p^{\top}\nabla^2f(x)p. \]

If we ignore some subtleties and imagine this quadratic model is exact, it has an extremum where its gradient with respect to \(p\) vanishes, i.e. \(\nabla m_x(p)= 0\), where \(\nabla m_x(p):=\nabla f(x)+\nabla^2 f(x)p\). This is satisfied at \[\begin{aligned} \nabla f(x)+\nabla^2 f(x)p &= 0\\ \nabla^2 f(x)p&=-\nabla f(x)\\ p&=-[\nabla^2 f(x)]^{-1}\nabla f(x).\\ \end{aligned}\]

From this we get the multidimensional Newton update rule, by choosing for each \(x_{k}\) the step \(p\) that is optimal for the local quadratic approximation: \[ x_{k+1}= x_{k}-\left[ \nabla^2 f\left( x_{k}\right)\right]^{-1} \nabla f\left( x_{k}\right), \quad k \geq 0 \]
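A minimal sketch in Python/NumPy, assuming the user supplies `grad` and `hess` callables; note that on an exactly quadratic objective a single Newton step lands on the minimiser.

```python
import numpy as np

def newton_step(grad, hess, x):
    """One Newton update x_{k+1} = x_k - [∇²f(x_k)]⁻¹ ∇f(x_k), via a linear solve."""
    return x - np.linalg.solve(hess(x), grad(x))

# Example: quadratic f(x) = 1/2 x^T A x - b^T x, with gradient A x - b and Hessian A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = newton_step(lambda x: A @ x - b, lambda x: A, np.zeros(2))
print(np.allclose(A @ x, b))  # True: one step solves the quadratic exactly
```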

In practice the quadratic model is not exact (or we would be done in one step), so we would choose a more robust method, using a trust-region or line-search algorithm to handle the mismatch between the function and its quadratic model. Either way the quadratic approximation is used, which means that the inverse Hessian matrix \(\left[ \nabla^2 f\left( x_{k}\right)\right]^{-1}\), or at least a linear solve against the Hessian, is important.
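For instance, here is a crude damped-Newton sketch with a backtracking (Armijo) line search, assuming the Hessian stays positive definite so that the Newton direction is a descent direction; the test function is just an illustrative strictly convex example of mine.

```python
import numpy as np

def damped_newton(f, grad, hess, x, n_steps=50, c=1e-4, tol=1e-8):
    """Newton's method with a backtracking (Armijo) line search along the Newton direction."""
    for _ in range(n_steps):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = -np.linalg.solve(hess(x), g)  # Newton direction
        t = 1.0
        # Halve the step until the Armijo sufficient-decrease condition holds.
        while f(x + t * p) > f(x) + c * t * (g @ p):
            t *= 0.5
        x = x + t * p
    return x

# A strictly convex test function (sum of exponentials): its Hessian is
# positive definite everywhere, so the Newton direction always descends.
def f(x):
    return np.exp(x[0] + 3 * x[1] - 0.1) + np.exp(x[0] - 3 * x[1] - 0.1) + np.exp(-x[0] - 0.1)

def grad(x):
    e1, e2, e3 = np.exp(x[0] + 3 * x[1] - 0.1), np.exp(x[0] - 3 * x[1] - 0.1), np.exp(-x[0] - 0.1)
    return np.array([e1 + e2 - e3, 3 * e1 - 3 * e2])

def hess(x):
    e1, e2, e3 = np.exp(x[0] + 3 * x[1] - 0.1), np.exp(x[0] - 3 * x[1] - 0.1), np.exp(-x[0] - 0.1)
    return np.array([[e1 + e2 + e3, 3 * e1 - 3 * e2],
                     [3 * e1 - 3 * e2, 9 * e1 + 9 * e2]])

print(damped_newton(f, grad, hess, np.array([-2.0, 2.0])))  # ≈ [-0.347, 0.0]
```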

🏗 Unconstrained assumption, need for saddle points not just extrema with Lagrange methods, positive-definiteness etc.

Gauss-Newton

A reasonably clear Stack Exchange answer that will do for now: Difference between Newton’s method and Gauss-Newton method.
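To summarise the idea as I understand it: for a least-squares objective built from residuals \(r:\mathbb{R}^n\to\mathbb{R}^m\) with Jacobian \(J\), Gauss-Newton keeps only the part of the Hessian built from first derivatives of \(r\):

\[\begin{aligned} f(x) &= \tfrac{1}{2}\|r(x)\|^2\\ \nabla f(x) &= J(x)^{\top} r(x)\\ \nabla^2 f(x) &= J(x)^{\top} J(x) + \sum_{i=1}^{m} r_i(x)\,\nabla^2 r_i(x) \;\approx\; J(x)^{\top} J(x). \end{aligned}\]

Dropping the second term avoids computing second derivatives of the residuals and keeps the curvature matrix positive semi-definite.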

Generalized Gauss-Newton

NB ⚠️⛔️☣️: This section is vague, half-arsed notes of dubious accuracy.

The generalized Gauss-Newton approximation (GGN) (Martens and Grosse 2015) replaces an expensive second-order derivative by a product of first-order derivatives. Used in (Foong et al. 2019; Immer, Korzepa, and Bauer 2021).
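My rough understanding, for the record: if the objective factors as \(f(\theta)=L(g(\theta))\) for some model \(g\) and convex loss \(L\), the GGN linearises \(g\) but keeps the curvature of \(L\),

\[ \nabla^2 f(\theta) \approx G(\theta) := J_g(\theta)^{\top}\, \nabla^2 L\bigl(g(\theta)\bigr)\, J_g(\theta), \]

where \(J_g\) is the Jacobian of \(g\) with respect to \(\theta\). This recovers the classic Gauss-Newton matrix when \(L\) is squared error, and it is positive semi-definite whenever \(L\) is convex.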

Fisher method

I haven’t actually worked out what this is yet, nor whether it relates to or is filed correctly there. But I think that it is natural gradient descent.

Quasi Newton methods

Secant conditions and approximations to the Hessian. Let’s say we are designing a second-order update method. See e.g. Nocedal and Wright (2006).

In BFGS the approximate Hessian \(B_k\) is used in place of the true Hessian.

Given the displacement \(s_{k}:=x_{k+1}-x_{k}\) and the change of gradients \(y_{k}:=\nabla f(x_{k+1})-\nabla f(x_{k})\), the secant equation requires that the symmetric positive definite matrix \(B_{k+1}\) map \(s_{k}\) into \(y_{k}\): the updated model should reproduce the observed change of gradient,

\[ B_{k+1}(x_{k+1} - x_{k}) = \nabla f(x_{k+1}) - \nabla f(x_{k}), \]

which is exactly the secant condition

\[ B_{k+1} s_{k}=y_{k} \]
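For concreteness, a sketch of the BFGS rank-two update in Python, which is one standard way of constructing such a \(B_{k+1}\); the quadratic test problem is a toy of my own, but the secant condition holds by construction.

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the Hessian approximation, built so that B_new @ s == y."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# Toy check on a quadratic f(x) = 1/2 x^T A x, for which grad f(x) = A x.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
x0, x1 = np.array([1.0, 1.0]), np.array([0.5, 0.2])
s = x1 - x0                # displacement s_k
y = A @ x1 - A @ x0        # change of gradients y_k
B_new = bfgs_update(np.eye(2), s, y)
print(np.allclose(B_new @ s, y))  # True: the secant condition holds
```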

TODO: compare and contrast to the Wolfe conditions on step length.

🏗

Hessian free

Second-order optimisation that does not require the Hessian matrix to be given explicitly, typically relying only on Hessian-vector products. I should work out the boundaries of this definition. Does it include quasi-Newton? Intersect with it? Overlap? 🏗

Andre Gibiansky wrote an example for coders.
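As a rough sketch of the flavour (not Gibiansky’s code; the finite-difference Hessian-vector product is just the simplest stand-in for an exact one computed by autodiff): solve the Newton system \(\nabla^2 f(x)\, p = -\nabla f(x)\) with conjugate gradients, touching the Hessian only through matrix-vector products.

```python
import numpy as np

def hvp(grad, x, v, eps=1e-6):
    """Finite-difference Hessian-vector product: H(x) v ≈ (g(x + eps v) - g(x)) / eps."""
    return (grad(x + eps * v) - grad(x)) / eps

def newton_cg_step(grad, x, cg_iters=20, tol=1e-8):
    """Approximately solve H p = -g with conjugate gradients, without forming H."""
    g = grad(x)
    p = np.zeros_like(x)
    r = -g.copy()          # residual of H p = -g at p = 0
    d = r.copy()
    for _ in range(cg_iters):
        Hd = hvp(grad, x, d)
        alpha = (r @ r) / (d @ Hd)
        p = p + alpha * d
        r_new = r - alpha * Hd
        if np.linalg.norm(r_new) < tol:
            break
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return x + p

# Toy usage on f(x) = 1/2 x^T A x - b^T x, whose gradient is A x - b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = newton_cg_step(lambda x: A @ x - b, np.zeros(2))
print(np.allclose(A @ x, b))  # True, up to CG / finite-difference error
```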

Stochastic

See stochastic 2nd order gradient descent for some bonus complications.

References

Allen-Zhu, Zeyuan, and Elad Hazan. 2016. Optimal Black-Box Reductions Between Optimization Objectives.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 1606–14. Curran Associates, Inc.
Anil, Rohan, Vineet Gupta, Tomer Koren, Kevin Regan, and Yoram Singer. 2021. Scalable Second Order Optimization for Deep Learning.” arXiv.
Arnold, Sébastien M. R., and Chunming Wang. 2017. Accelerating SGD for Distributed Deep-Learning Using Approximated Hessian Matrix.” In arXiv:1709.05069 [Cs].
Ba, Jimmy, Roger Grosse, and James Martens. 2016. Distributed Second-Order Optimization Using Kronecker-Factored Approximations,” November.
Bach, Francis, Rodolphe Jenatton, and Julien Mairal. 2011. Optimization With Sparsity-Inducing Penalties. Foundations and Trends(r) in Machine Learning 1.0. Now Publishers Inc.
Battiti, Roberto. 1992. First-and Second-Order Methods for Learning: Between Steepest Descent and Newton’s Method.” Neural Computation 4 (2): 141–66.
Bottou, Léon, Frank E. Curtis, and Jorge Nocedal. 2016. Optimization Methods for Large-Scale Machine Learning.” arXiv:1606.04838 [Cs, Math, Stat], June.
Boumal, Nicolas, Bamdev Mishra, P.-A. Absil, and Rodolphe Sepulchre. 2014. Manopt, a Matlab Toolbox for Optimization on Manifolds.” Journal of Machine Learning Research 15: 1455–59.
Cho, Minhyung, Chandra Shekhar Dhir, and Jaehyung Lee. 2015. Hessian-Free Optimization for Learning Deep Multidimensional Recurrent Neural Networks.” In Advances In Neural Information Processing Systems.
Dangel, Felix, Frederik Kunstner, and Philipp Hennig. 2019. BackPACK: Packing More into Backprop.” In International Conference on Learning Representations.
Dauphin, Yann, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. 2014. Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization.” In Advances in Neural Information Processing Systems 27, 2933–41. Curran Associates, Inc.
Dennis, J. E., and Robert B. Schnabel. 1989. Chapter I A View of Unconstrained Optimization.” In Handbooks in Operations Research and Management Science, 1:1–72. Optimization. Elsevier.
Dennis, J., and R. Schnabel. 1996. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Vol. 16. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics.
Foong, Andrew Y. K., Yingzhen Li, José Miguel Hernández-Lobato, and Richard E. Turner. 2019. ‘In-Between’ Uncertainty in Bayesian Neural Networks.” arXiv:1906.11537 [Cs, Stat], June.
Gill, Jeff, and Gary King. 2004. What to Do When Your Hessian Is Not Invertible: Alternatives to Model Respecification in Nonlinear Estimation.” Sociological Methods & Research 33 (1): 54–87.
Gravell, Benjamin, Iman Shames, and Tyler Summers. 2021. Approximate Midpoint Policy Iteration for Linear Quadratic Control.” arXiv:2011.14212 [Cs, Eess, Math], February.
Grosse, Roger. 2021a. Metrics.” In CSC2541 Winter 2021, Chapter 3.
———. 2021b. Second-Order Optimization.” In CSC2541 Winter 2021, Chapter 4.
———. 2021c. Taylor Approximations.” In CSC2541 Winter 2021, Chapter 2.
Gupta, Vineet, Tomer Koren, and Yoram Singer. 2018. Shampoo: Preconditioned Stochastic Tensor Optimization.” In Proceedings of the 35th International Conference on Machine Learning, 1842–50. PMLR.
Holl, Philipp, Vladlen Koltun, and Nils Thuerey. 2021. Physical Gradients for Deep Learning.” arXiv.
Immer, Alexander, Maciej Korzepa, and Matthias Bauer. 2021. Improving Predictions of Bayesian Neural Nets via Local Linearization.” In International Conference on Artificial Intelligence and Statistics, 703–11. PMLR.
Lesage-Landry, Antoine, Joshua A. Taylor, and Iman Shames. 2021. Second-Order Online Nonconvex Optimization.” IEEE Transactions on Automatic Control 66 (10): 4866–72.
Madsen, K, H.B. Nielsen, and O. Tingleff. 2004. Methods for Non-Linear Least Squares Problems.”
Martens, James. 2010. Deep Learning via Hessian-Free Optimization.” In Proceedings of the 27th International Conference on International Conference on Machine Learning, 735–42. ICML’10. USA: Omnipress.
———. 2020. New Insights and Perspectives on the Natural Gradient Method.” Journal of Machine Learning Research 21 (146): 1–76.
Martens, James, and Roger Grosse. 2015. Optimizing Neural Networks with Kronecker-Factored Approximate Curvature.” In Proceedings of the 32nd International Conference on Machine Learning, 2408–17. PMLR.
Martens, James, and Ilya Sutskever. 2011. Learning Recurrent Neural Networks with Hessian-Free Optimization.” In Proceedings of the 28th International Conference on International Conference on Machine Learning, 1033–40. ICML’11. USA: Omnipress.
———. 2012. Training Deep and Recurrent Networks with Hessian-Free Optimization.” In Neural Networks: Tricks of the Trade, 479–535. Lecture Notes in Computer Science. Springer.
Nesterov, Yu. 2007. Accelerating the Cubic Regularization of Newton’s Method on Convex Problems.” Mathematical Programming 112 (1): 159–81.
Nocedal, Jorge, and S. Wright. 2006. Numerical Optimization. 2nd ed. Springer Series in Operations Research and Financial Engineering. New York: Springer-Verlag.
Nocedal, Jorge, and Ya-xiang Yuan. 1993. Analysis of a Self-Scaling Quasi-Newton Method.” Mathematical Programming 61 (1-3): 19–37.
Pavlov, Andrei, Iman Shames, and Chris Manzie. 2020. Interior Point Differential Dynamic Programming.” arXiv:2004.12710 [Cs, Eess, Math], October.
Pilanci, Mert, and Martin J. Wainwright. 2016. Iterative Hessian Sketch: Fast and Accurate Solution Approximation for Constrained Least-Squares.” Journal of Machine Learning Research 17 (53): 1–38.
Ruppert, David. 1985. A Newton-Raphson Version of the Multivariate Robbins-Monro Procedure.” The Annals of Statistics 13 (1): 236–45.
Schmidt, Mark, Glenn Fung, and Romer Rosales. 2009. “Optimization Methods for L1-Regularization.” University of British Columbia, Technical Report TR-2009 19.
Schraudolph, Nicol N. 2002. Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent.” Neural Computation 14 (7): 1723–38.
Transtrum, Mark K, Benjamin B Machta, and James P Sethna. 2011. The Geometry of Nonlinear Least Squares with Applications to Sloppy Models and Optimization.” Physical Review E 83 (3): 036701.
Wilkinson, William J., Simo Särkkä, and Arno Solin. 2021. Bayes-Newton Methods for Approximate Bayesian Inference with PSD Guarantees.” arXiv.
Wright, John, and Yi Ma. 2022. High-dimensional data analysis with low-dimensional models: Principles, computation, and applications. S.l.: Cambridge University Press.
Wright, Stephen J., and Benjamin Recht. 2021. Optimization for Data Analysis. New York: Cambridge University Press.
Yao, Zhewei, Amir Gholami, Kurt Keutzer, and Michael Mahoney. 2020. PyHessian: Neural Networks Through the Lens of the Hessian.” In arXiv:1912.07145 [Cs, Math].
