Gradient descent, Newton-like, stochastic

Stochastic Newton-type optimisation, unlike deterministic Newton optimisation, uses noisy (possibly approximate) second-order information, that is, estimates of gradients and Hessians, to find the argument which minimises

\[ x^* = \operatorname{argmin}_{x} f(x) \]

for some objective function \(f:\mathbb{R}^n\to\mathbb{R}\).
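The prototypical update replaces the exact Newton step with noisy estimates \(g_k\) and \(H_k\) of the gradient and Hessian at the current iterate,

\[ x_{k+1} = x_k - \eta_k H_k^{-1} g_k, \]

where \(\eta_k\) is a step size and the estimates may come from subsampling, sketching or some other randomised oracle.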

Subsampling

Most of the good tricks here are set up for ML-style training losses, where the bottleneck is summing a large number of per-example loss terms.
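Concretely, in that setting the objective is an average over training examples, and the gradient and Hessian are estimated from random minibatches:

\[ f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad \widehat{\nabla f}(x) = \frac{1}{|S|}\sum_{i\in S}\nabla f_i(x), \qquad \widehat{\nabla^2 f}(x) = \frac{1}{|S'|}\sum_{i\in S'}\nabla^2 f_i(x), \]

for random subsets \(S, S'\subseteq\{1,\dots,n\}\).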

LiSSA attempts to make second-order descent methods scale to models with many parameters (Agarwal, Bullins, and Hazan 2016):

a linear time stochastic second order algorithm that achieves linear convergence for typical problems in machine learning while still maintaining run-times theoretically comparable to state-of-the-art first order algorithms. This relies heavily on the special structure of the optimization problem that allows our unbiased hessian estimator to be implemented efficiently, using only vector-vector products.

David McAllester observes:

Since \(H^{t+1}y^t\) can be computed efficiently whenever we can run backpropagation, the conditions under which the LiSSA algorithm can be run are actually much more general than the paper suggests. Backpropagation can be run on essentially any natural loss function.
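To make the Hessian-vector-product observation concrete, here is a minimal sketch in JAX (hypothetical loss, data and function names) of the two ingredients: a Hessian-vector product computed by forward-over-reverse autodiff, and a LiSSA-style Neumann-series estimate of an inverse-Hessian-vector product built from minibatch Hessian-vector products. It is an illustration of the idea, not the authors' implementation.

```python
import jax
import jax.numpy as jnp

# Hypothetical ridge-regularised logistic loss over a data set (X, y);
# any smooth scalar loss expressible in JAX would do.
def loss(w, X, y):
    logits = X @ w
    return jnp.mean(jnp.logaddexp(0.0, -y * logits)) + 0.5 * 1e-2 * jnp.dot(w, w)

def hvp(f, w, v):
    # Hessian-vector product via forward-over-reverse autodiff:
    # costs about one extra backprop-like pass, never forms the full Hessian.
    return jax.jvp(jax.grad(f), (w,), (v,))[1]

def lissa_ihvp(f, w, v, key, X, y, steps=50, batch=32, scale=1.0):
    # LiSSA-style estimator of H^{-1} v via the Neumann recursion
    #   u_{j+1} = v + (I - H_j / scale) u_j,   u_0 = v,
    # with each H_j a Hessian on a fresh minibatch (an unbiased estimate).
    # `scale` should upper-bound the largest Hessian eigenvalue.
    u = v
    for _ in range(steps):
        key, sub = jax.random.split(key)
        idx = jax.random.choice(sub, X.shape[0], (batch,), replace=False)
        f_batch = lambda w_: f(w_, X[idx], y[idx])
        u = v + u - hvp(f_batch, w, u) / scale
    return u / scale

# Tiny synthetic demonstration.
key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (256, 10))
y = jnp.sign(X[:, 0] + 0.1)
w = jnp.zeros(10)
g = jax.grad(lambda w_: loss(w_, X, y))(w)
step = lissa_ihvp(loss, w, g, key, X, y)
w_new = w - step  # one approximate stochastic Newton step
```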

Kovalev, Mishchenko, and Richtárik (2019) use a decomposition of the objective into a sum of simple functions (the classic SGD setup for neural nets, typical of online optimisation).
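For the finite-sum case a subsampled Newton step is easy to sketch. The following (again JAX, hypothetical names; a generic damped subsampled-Newton update rather than the paper's exact rule) averages per-example gradients and Hessians over a minibatch and solves the resulting small linear system:

```python
import jax
import jax.numpy as jnp

# Hypothetical per-example loss; the objective is f(w) = (1/n) sum_i f_i(w).
def f_i(w, x_i, y_i):
    return jnp.logaddexp(0.0, -y_i * jnp.dot(x_i, w))

def subsampled_newton_step(w, X, y, key, batch=64, damping=1e-3):
    # Average gradient and Hessian over a random minibatch, then take a
    # damped Newton step. Forming the d x d Hessian is fine for small d.
    idx = jax.random.choice(key, X.shape[0], (batch,), replace=False)
    batch_loss = lambda w_: jnp.mean(jax.vmap(f_i, (None, 0, 0))(w_, X[idx], y[idx]))
    g = jax.grad(batch_loss)(w)
    H = jax.hessian(batch_loss)(w)
    return w - jnp.linalg.solve(H + damping * jnp.eye(w.shape[0]), g)
```

Forming the dense Hessian caps this at modest parameter dimension; for large models one would fall back to Hessian-vector products as in the LiSSA sketch above.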

What does Francis Bach’s finite sample guarantee research get us in this setting? (Bach and Moulines 2011, 2013)

Many machine learning and signal processing problems are traditionally cast as convex optimization problems. A common difficulty in solving these problems is the size of the data, where there are many observations (‘large n’) and each of these is large (‘large p’). In this setting, online algorithms such as stochastic gradient descent which pass over the data only once, are usually preferred over batch algorithms, which require multiple passes over the data. In this talk, I will show how the smoothness of loss functions may be used to design novel algorithms with improved behavior, both in theory and practice: in the ideal infinite-data setting, an efficient novel Newton-based stochastic approximation algorithm leads to a convergence rate of O(1/n) without strong convexity assumptions, while in the practical finite-data setting, an appropriate combination of batch and online algorithms leads to unexpected behaviors, such as a linear convergence rate for strongly convex problems, with an iteration cost similar to stochastic gradient descent. (joint work with Nicolas Le Roux, Eric Moulines and Mark Schmidt).

General case

Rather than observing \(\nabla f, \nabla^2 f\), we observe some random variables \(G(x), H(x)\) with \(\mathbb{E}[G(x)]=\nabla f(x)\) and \(\mathbb{E}[H(x)]=\nabla^2 f(x)\), not necessarily decomposable into a sum.
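A sketch of one natural scheme in this general setting (in the spirit of Ruppert's Newton-Raphson version of Robbins-Monro, though not a transcription of it): keep a running average of the noisy Hessian observations so their noise washes out, and take damped Newton steps with a Robbins-Monro step size. The oracle interface below is hypothetical.

```python
import jax.numpy as jnp

def stochastic_newton(x0, grad_oracle, hess_oracle, steps=1000, damping=1e-2):
    # grad_oracle(x) and hess_oracle(x) are assumed to return unbiased but
    # noisy estimates of the gradient and Hessian of f at x.
    x = x0
    d = x0.shape[0]
    H_bar = jnp.eye(d)
    for k in range(1, steps + 1):
        G = grad_oracle(x)
        H_bar = H_bar + (hess_oracle(x) - H_bar) / k  # running mean of Hessian estimates
        step = jnp.linalg.solve(H_bar + damping * jnp.eye(d), G)
        x = x - step / k                              # Robbins-Monro step size 1/k
    return x
```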

🏗

For now, see optimisation of experiments, which is an interesting application of this problem.

Agarwal, Naman, Brian Bullins, and Elad Hazan. 2016. “Second Order Stochastic Optimization in Linear Time,” February. http://arxiv.org/abs/1602.03943.

Arnold, Sébastien M. R., and Chunming Wang. 2017. “Accelerating SGD for Distributed Deep-Learning Using Approximated Hessian Matrix.” http://arxiv.org/abs/1709.05069.

Ba, Jimmy, Roger Grosse, and James Martens. 2016. “Distributed Second-Order Optimization Using Kronecker-Factored Approximations,” November. https://openreview.net/forum?id=SkkTMpjex.

Bach, Francis, and Eric Moulines. 2011. “Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning.” In Advances in Neural Information Processing Systems (NIPS). Spain. http://hal.archives-ouvertes.fr/hal-00608041.

Bach, Francis R., and Eric Moulines. 2013. “Non-Strongly-Convex Smooth Stochastic Approximation with Convergence Rate O(1/n).” In Advances in Neural Information Processing Systems (NIPS), 773–81. https://arxiv.org/abs/1306.2119v1.

Battiti, Roberto. 1992. “First- and Second-Order Methods for Learning: Between Steepest Descent and Newton’s Method.” Neural Computation 4 (2): 141–66. https://doi.org/10.1162/neco.1992.4.2.141.

Bordes, Antoine, Léon Bottou, and Patrick Gallinari. 2009. “SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent.” Journal of Machine Learning Research 10 (December): 1737–54. http://jmlr.org/papers/volume10/bordes09a/bordes09a.pdf.

Bottou, Léon. 2012. “Stochastic Gradient Descent Tricks.” In Neural Networks: Tricks of the Trade, 421–36. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35289-8_25.

Byrd, R. H., S. L. Hansen, Jorge Nocedal, and Y. Singer. 2016. “A Stochastic Quasi-Newton Method for Large-Scale Optimization.” SIAM Journal on Optimization 26 (2): 1008–31. https://doi.org/10.1137/140954362.

Cho, Minhyung, Chandra Shekhar Dhir, and Jaehyung Lee. 2015. “Hessian-Free Optimization for Learning Deep Multidimensional Recurrent Neural Networks.” In Advances in Neural Information Processing Systems. http://arxiv.org/abs/1509.03475.

Dauphin, Yann, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. 2014. “Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization.” In Advances in Neural Information Processing Systems 27, 2933–41. Curran Associates, Inc. http://arxiv.org/abs/1406.2572.

Kovalev, Dmitry, Konstantin Mishchenko, and Peter Richtárik. 2019. “Stochastic Newton and Cubic Newton Methods with Simple Local Linear-Quadratic Rates,” December. http://arxiv.org/abs/1912.01597.

Lucchi, Aurelien, Brian McWilliams, and Thomas Hofmann. 2015. “A Variance Reduced Stochastic Newton Method,” March. http://arxiv.org/abs/1503.08316.

Martens, James. 2010. “Deep Learning via Hessian-Free Optimization.” In Proceedings of the 27th International Conference on Machine Learning, 735–42. ICML’10. USA: Omnipress. http://www.cs.utoronto.ca/~jmartens/docs/Deep_HessianFree.pdf.

Martens, James, and Ilya Sutskever. 2011. “Learning Recurrent Neural Networks with Hessian-Free Optimization.” In Proceedings of the 28th International Conference on Machine Learning, 1033–40. ICML’11. USA: Omnipress. http://dl.acm.org/citation.cfm?id=3104482.3104612.

———. 2012. “Training Deep and Recurrent Networks with Hessian-Free Optimization.” In Neural Networks: Tricks of the Trade, 479–535. Lecture Notes in Computer Science. Springer. http://www.cs.toronto.edu/~jmartens/docs/HF_book_chapter.pdf.

Robbins, H., and D. Siegmund. 1971. “A Convergence Theorem for Non Negative Almost Supermartingales and Some Applications.” In Optimizing Methods in Statistics, edited by Jagdish S. Rustagi, 233–57. Academic Press. https://doi.org/10.1016/B978-0-12-604550-5.50015-8.

Ruppert, David. 1985. “A Newton-Raphson Version of the Multivariate Robbins-Monro Procedure.” The Annals of Statistics 13 (1): 236–45. https://doi.org/10.1214/aos/1176346589.

Schraudolph, Nicol N., Jin Yu, and Simon Günter. 2007. “A Stochastic Quasi-Newton Method for Online Convex Optimization.” In Artificial Intelligence and Statistics, 436–43. http://proceedings.mlr.press/v2/schraudolph07a.html.