Newton-type optimization, unlike basic gradient descent, uses (possibly approximate) 2nd-order gradient information to find the argument which minimises

\[ x^*=\operatorname{argmin}_{\mathbf{x}} f(x) \]

for some an objective function \(f:\mathbb{R}^n\to\mathbb{R}\).

Optimization over arbitrary functions typically gets discussed in terms of
line-search and trust-region methods, both of which can be construed, AFAICT,
as second order methods.
Although I don’t know of a *first* order trust region method; what would that even be?
I think I don’t need to care about the distinction for the current purpose. We’ll
see.

Can one do *higher*-than-2nd order optimisation?
Of course.
In practice 2nd order is usually fast enough, and has enough complications to
keep one busy, so the greater complexity and per-iteration convergence rate of a
3rd order method does not usually offer an attractive trade-off.

Here are some scattered notes which make no claim to comprehensiveness, comprehensibility or coherence.

## Vanilla Newton methods

From school one might remember that classic so-called Newton method for iteratively improving a guessed location \(x_{k}\) for a root of a 1-dimensional function \(g:\mathbb{R}\to\mathbb{R}\):

\[ x_{k+1}=x_{k}-\frac{g\left(x_{k}\right)}{g'\left(x_{k}\right)} \]

.

Under the assumption that it is twice continuously differentiable,
we can find stationary points—zeros of the *derivative*—of a function \(f:\mathbb{R}\to\mathbb{R}\).
(i.e. extrema and saddle points, which might in fact be minima.)

\[ x_{k+1}=x_{k}-\frac{f'\left(x_{k}\right)}{f''\left(x_{k}\right)} \]

How and why does this work again? Loosely, the argument is that we can use the Taylor expansion of a function to produce a locally quadratic model for the function, which might be not at all globally quadratic.

Returning to the objective with a multidimensional argument \(f:\mathbb{R}^n\to\mathbb{R},\) Taylor’s theorem for twice continuously differentiable functions tells us that

\[\begin{aligned} f(x+p)&= f(x)+\nabla^T f (x+p)p +\frac{1}{2}p^T \nabla^2f(x+p)p + p^TR(x+p)p\\ \end{aligned}\] and \(\lim _{p \rightarrow 0} R(x+p)=0\).

Putting the remainder in Cauchy form, this tells us

\[ f(x+p)=f(x)+\nabla f^T(x)p+\frac{1}{2}p^T\nabla^2f(x+tp)p \] for some \(t\in(0,1)\).

Generally, we might take a local quadratic model for \(f\) at \(x+p\) as

\[ \nabla f(x+p)\simeq m_x(p)= f(x)+\nabla f^T(x)p+\frac{1}{2}p^T\nabla^2f(x)p \]

for some \(t\in(0,1)\).`

If we ignore some subtleties and imagined this quadratic model was true, it would have an extremum where the directional derivative of this model vanished, i.e. \(\nabla m_x(p)\equiv 0\) where \(\nabla m_x(p):=\nabla f(x)+\nabla^2 f(x)p\). This would be satisfied at

\[\begin{aligned} \nabla f(x)+\nabla^2 f(x)p &= 0\\ \nabla f(x)&=-\nabla^2 f(x)p\\ p&=-[\nabla^2 f(x)]^{-1}\nabla f(x)\\ \end{aligned}\]

From this we get the multidimensional Newton update rule, by choosing for each \(x_{k}\) the locally optimal \(p\) quadratic approximation:

\[ x_{k+1}= x_{k}-\left[ \nabla^2 f\left( x_{k}\right)\right]^{-1} \nabla f\left( x_{k}\right), n \geq 0 \]

In practice the quadratic model is not true (or we would be done in one step) and we would choose a more robust method that by using a trust region or line-search algorithm to handle the divergence between the model and its quadratic approximation. Either way the quadratic approximation would be used, which would mean that the inverse Hessian matrix \(\left[ \nabla^2 f\left( x_{k}\right)\right]^{-1}\) would be important.

🏗 Unconstrained assumption, need for saddle points not just extrema with Lagrange methods, positive-definiteness etc.

## Quasi Newton methods

Secant conditions and approximations to the Hessian.

Let’s say we are designing a second-order update method.

See e.g. Nocedal and Wright (2006).

In BFGS the approximate Hessian \(B_k\) is used in place of the true Hessian.

Given the displacement \(s_{k}\) and the change of gradients \(y_{k}\), the secant equation requires that the symmetric positive definite matrix \(B_{k+1}\) map \(s_{k}\) into \(y_{k}.\)

\[ H_k(x^{(k)} − x^{(k−1)}) = \nabla f(x^{(k)}) − \nabla f(x^{(k−1)}) \]

…

We hope to get from this the *secant condition*.

\[ B_{k+1} s_{k}=y_{k} \]

TODO: compare and contrast to the Wolfe conditions on step length.

🏗

## Hessian free

Second order optimisation that does not require the Hessian matrix to be given explicitly. I should work out the boundaries of this definition. Does it include Quasi Newton? Intersect with? Overlap? 🏗

Andre Gibiansky has an example for coders.

## Natural gradient descent.

I haven’t actually worked out what this is yet, nor whether it relates or is filed correctly there. See natural gradient descent.

## Stochastic

See stochastic 2nd order gradient descent for some bonus complications.

Allen-Zhu, Zeyuan, and Elad Hazan. 2016. “Optimal Black-Box Reductions Between Optimization Objectives.” In *Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 1606–14. Curran Associates, Inc. http://papers.nips.cc/paper/6364-optimal-black-box-reductions-between-optimization-objectives.pdf.

Arnold, Sébastien M. R., and Chunming Wang. 2017. “Accelerating SGD for Distributed Deep-Learning Using Approximated Hessian Matrix.” In. http://arxiv.org/abs/1709.05069.

Ba, Jimmy, Roger Grosse, and James Martens. 2016. “Distributed Second-Order Optimization Using Kronecker-Factored Approximations,” November. https://openreview.net/forum?id=SkkTMpjex.

Bach, Francis, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. 2012. “Optimization with Sparsity-Inducing Penalties.” *Foundations and Trends® in Machine Learning* 4 (1): 1–106. https://doi.org/10.1561/2200000015.

Battiti, Roberto. 1992. “First-and Second-Order Methods for Learning: Between Steepest Descent and Newton’s Method.” *Neural Computation* 4 (2): 141–66. https://doi.org/10.1162/neco.1992.4.2.141.

Bottou, Léon, Frank E. Curtis, and Jorge Nocedal. 2016. “Optimization Methods for Large-Scale Machine Learning,” June. http://arxiv.org/abs/1606.04838.

Boumal, Nicolas, Bamdev Mishra, P.-A. Absil, and Rodolphe Sepulchre. 2014. “Manopt, a Matlab Toolbox for Optimization on Manifolds.” *Journal of Machine Learning Research* 15: 1455–9. http://jmlr.org/papers/v15/boumal14a.html.

Cho, Minhyung, Chandra Shekhar Dhir, and Jaehyung Lee. 2015. “Hessian-Free Optimization for Learning Deep Multidimensional Recurrent Neural Networks.” In *Advances in Neural Information Processing Systems*. http://arxiv.org/abs/1509.03475.

Dauphin, Yann, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. 2014. “Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization.” In *Advances in Neural Information Processing Systems 27*, 2933–41. Curran Associates, Inc. http://arxiv.org/abs/1406.2572.

Dennis, J. E., and Robert B. Schnabel. 1989. “Chapter I A View of Unconstrained Optimization.” In *Handbooks in Operations Research and Management Science*, 1:1–72. Optimization. Elsevier. https://doi.org/10.1016/S0927-0507(89)01002-9.

Dennis, J., and R. Schnabel. 1996. *Numerical Methods for Unconstrained Optimization and Nonlinear Equations*. Vol. 16. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9781611971200.

Gill, Jeff, and Gary King. 2004. “What to Do When Your Hessian Is Not Invertible: Alternatives to Model Respecification in Nonlinear Estimation.” *Sociological Methods & Research* 33 (1): 54–87. https://doi.org/10.1177/0049124103262681.

Madsen, K, H.B. Nielsen, and O. Tingleff. 2004. “Methods for Non-Linear Least Squares Problems.” http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3215/pdf/imm3215.pdf,

Martens, James. 2010. “Deep Learning via Hessian-Free Optimization.” In *Proceedings of the 27th International Conference on International Conference on Machine Learning*, 735–42. ICML’10. USA: Omnipress. http://www.cs.utoronto.ca/~jmartens/docs/Deep_HessianFree.pdf.

Martens, James, and Ilya Sutskever. 2011. “Learning Recurrent Neural Networks with Hessian-Free Optimization.” In *Proceedings of the 28th International Conference on International Conference on Machine Learning*, 1033–40. ICML’11. USA: Omnipress. http://dl.acm.org/citation.cfm?id=3104482.3104612.

———. 2012. “Training Deep and Recurrent Networks with Hessian-Free Optimization.” In *Neural Networks: Tricks of the Trade*, 479–535. Lecture Notes in Computer Science. Springer. http://www.cs.toronto.edu/~jmartens/docs/HF_book_chapter.pdf.

Nesterov, Yu. 2007. “Accelerating the Cubic Regularization of Newton’s Method on Convex Problems.” *Mathematical Programming* 112 (1): 159–81. https://doi.org/10.1007/s10107-006-0089-x.

Nocedal, Jorge, and S. Wright. 2006. *Numerical Optimization*. 2nd ed. Springer Series in Operations Research and Financial Engineering. New York: Springer-Verlag. https://www.springer.com/gp/book/9780387303031.

Nocedal, Jorge, and Ya-xiang Yuan. 1993. “Analysis of a Self-Scaling Quasi-Newton Method.” *Mathematical Programming* 61 (1-3): 19–37. https://doi.org/10.1007/BF01582136.

Pilanci, Mert, and Martin J. Wainwright. 2016. “Iterative Hessian Sketch: Fast and Accurate Solution Approximation for Constrained Least-Squares.” *Journal of Machine Learning Research* 17 (53): 1–38. http://www.jmlr.org/papers/volume17/14-460/14-460.pdf.

Ruppert, David. 1985. “A Newton-Raphson Version of the Multivariate Robbins-Monro Procedure.” *The Annals of Statistics* 13 (1): 236–45. https://doi.org/10.1214/aos/1176346589.

Schmidt, Mark, Glenn Fung, and Romer Rosales. 2009. “Optimization Methods for L1-Regularization.” *University of British Columbia, Technical Report TR-2009* 19.

Schraudolph, Nicol N. 2002. “Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent.” *Neural Computation* 14 (7): 1723–38. https://doi.org/10.1162/08997660260028683.