Optimisation
October 4, 2014 — April 11, 2024
Crawling through alien landscapes in the fog, looking for mountain peaks.
I’m mostly interested in continuous optimisation, but, you know, combinatorial optimisation is a whole thing.
A vast topic, with many sub-topics. I have neither the time nor the expertise to construct a detailed map of these. As Moritz Hardt observes (and this is just in the convex context),
It’s easy to spend a semester of convex optimization on various guises of gradient descent alone. Simply pick one of the following variants and work through the specifics of the analysis: conjugate, accelerated, projected, conditional, mirrored, stochastic, coordinate, online. This is to name a few. You may also choose various pairs of attributes such as “accelerated coordinate” descent. Many triples are also valid such as “online stochastic mirror” descent. An expert unlike me would know exactly which triples are admissible. You get extra credit when you use “subgradient” instead of “gradient”. This is really only the beginning of optimization and it might already seem confusing.
When I was even younger and yet more foolish I decided the divide was between online optimization and offline optimization, which in hindsight is neither a clear nor useful taxonomy for the problems facing me. Now there are more tightly focused pages, such as gradient descent, 2nd order methods, surrogate optimisation and constrained optimisation, and I shall create more as circumstances demand.
TODO: insert brief taxonomy here.
🏗 Diagram.
See Zeyuan Allen-Zhu and Elad Hazan on their teaching strategy which splits it into 16 different areas:
The following dilemma is encountered by many of my friends when teaching basic optimization: which variant/proof of gradient descent should one start with? Of course, one needs to decide on which depth of convex analysis one should dive into, and decide on issues such as “should I define strong-convexity?”, “discuss smoothness?”, “Nesterov acceleration?”, etc.
[…] If one wishes to go into more depth, usually in convex optimization courses, one covers the full spectrum of different smoothness/ strong-convexity/ acceleration/ stochasticity regimes, each with a separate analysis (a total of 16 possible configurations!)
This year I’ve tried something different in COS511 @ Princeton, which turns out also to have research significance. We’ve covered basic GD for well-conditioned functions, i.e. smooth and strongly-convex functions, and then extended these results by reduction to all other cases! A (simplified) outline of this teaching strategy is given in chapter 2 of Introduction to Online Convex Optimization.
Classical Strong-Convexity and Smoothness Reductions:
Given any optimization algorithm A for the well-conditioned case (i.e., strongly convex and smooth case), we can derive an algorithm for smooth but not strongly convex functions as follows.
Given a non-strongly convex but smooth objective \(f\), define an objective by \(f_1(x)=f(x)+\epsilon\|x\|^2\).
It is straightforward to see that \(f_1\) differs from \(f\) by at most ϵ times a distance factor, and in addition it is ϵ-strongly convex. Thus, one can apply A to minimize \(f_1\) and get a solution which is not too far from the optimal solution for \(f\) itself. This simplistic reduction yields an almost optimal rate, up to logarithmic factors.
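To make the reduction concrete, here is a minimal numerical sketch. Everything in it is an illustrative assumption rather than part of the quoted post: the objective is an underdetermined least-squares problem (smooth, not strongly convex), algorithm A is plain fixed-step gradient descent, and the names `gradient_descent`, `grad_f1`, `x_eps` and `x_star` are made up for the example. Since \(x_\epsilon\) minimises \(f_1\), we have \(f(x_\epsilon) \le f(x^\star) + \epsilon\|x^\star\|^2\), which is the "ϵ times a distance factor" in the quote.

```python
import numpy as np

def gradient_descent(grad, x0, step, n_iter):
    """Plain fixed-step gradient descent (stands in for 'algorithm A')."""
    x = x0.copy()
    for _ in range(n_iter):
        x = x - step * grad(x)
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 10))  # 5 equations, 10 unknowns: f is smooth but not strongly convex
b = rng.normal(size=5)

def f(x):
    r = A @ x - b
    return 0.5 * (r @ r)

eps = 1e-3
L = np.linalg.norm(A, 2) ** 2  # smoothness constant of f (largest eigenvalue of A^T A)

def grad_f1(x):
    # Gradient of the regularised objective f_1(x) = f(x) + eps * ||x||^2.
    return A.T @ (A @ x - b) + 2 * eps * x

x_eps = gradient_descent(grad_f1, np.zeros(10), step=1.0 / (L + 2 * eps), n_iter=20000)
x_star = np.linalg.pinv(A) @ b  # minimum-norm minimiser of the original f

gap = f(x_eps) - f(x_star)       # suboptimality on the ORIGINAL objective
bound = eps * (x_star @ x_star)  # the "eps times a distance factor" bound
print(f"suboptimality {gap:.2e} <= eps * ||x*||^2 = {bound:.2e}")
```

Shrinking `eps` tightens the bound but worsens the conditioning of \(f_1\), which is the trade-off behind the logarithmic factors mentioned above.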
Keywords: Complementary slackness theorem, High or very high dimensional methods, approximate method, Lagrange multipliers, primal and dual problems, fixed point methods, gradient, subgradient, proximal gradient, optimal control problems, convexity, sparsity, ways to avoid wrecking finding the extrema of perfectly simple little 10000-parameter functions before everyone observes that I am a fool in the guise of a mathematician but everyone is not there because I wandered off the optimal path hours ago, and now I am alone and lost in a valley of lower-case Greek letters.
See also geometry of fitness landscapes, expectation maximisation, matrix factorisations, discrete optimisation, nature-inspired “meta-heuristic” optimisation.
0.1 History
Grötschel (2012)
0.2 Brief intro material
- Luca Trevisan, Posts on online optimisation.
- Zeyuan ALLEN-ZHU: Recent Advances in Stochastic Convex and Non-Convex Optimization. Clear, has good pointers.
- Basic but enlightening, John Nash’s graphical explanation of R’s optimization
- Martin Jaggi’s Optimization in two hours
- Celebrated union of optimisation and economics, market complexity
0.3 Textbooks
Whole free textbooks online. Mostly convex.
- Madsen, Nielsen, and Tingleff (2004) is super simple for least-squares type optimisations
- Ben-Tal and Nemirovski (2001)
- Nemirovski (1996)
- Boyd and Vandenberghe’s influential Convex Optimization (S. P. Boyd and Vandenberghe 2004)
- Bubeck (2015) based on Bubeck’s course notes
- Elad Hazan’s Introduction to Online Convex Optimization (Hazan 2022)
0.4 Alternating Direction Method of Multipliers
Dunno. It’s everywhere, though. (S. Boyd 2010)
In this review, we argue that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas. The method was developed in the 1970s, with roots in the 1950s, and is equivalent or closely related to many other algorithms, such as dual decomposition, the method of multipliers, Douglas-Rachford splitting, Spingarn’s method of partial inverses, Dykstra’s alternating projections, Bregman iterative algorithms for \(\ell_1\) problems, proximal methods, and others. After briefly surveying the theory and history of the algorithm, we discuss applications to a wide variety of statistical and machine learning problems of recent interest, including the lasso, sparse logistic regression, basis pursuit, covariance selection, support vector machines, and many others. We also discuss general distributed optimization, extensions to the nonconvex setting, and efficient implementation, including some details on distributed MPI and Hadoop MapReduce implementations.
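For orientation, a minimal sketch of ADMM applied to the lasso, one of the applications listed in the abstract. This is the textbook scaled-dual form, written from memory rather than taken from the review; the function names `soft_threshold` and `admm_lasso`, the penalty `rho=1.0` and the fixed iteration count are all assumptions for illustration, and there is no stopping criterion.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_lasso(A, b, lam, rho=1.0, n_iter=200):
    """Minimise 0.5 * ||A x - b||^2 + lam * ||x||_1 by ADMM (scaled dual form).

    The problem is split as f(x) + g(z) subject to x = z, with
    f the quadratic loss and g the l1 penalty.
    """
    n = A.shape[1]
    z = np.zeros(n)
    u = np.zeros(n)  # scaled dual variable
    AtA_rho = A.T @ A + rho * np.eye(n)
    Atb = A.T @ b
    for _ in range(n_iter):
        x = np.linalg.solve(AtA_rho, Atb + rho * (z - u))  # quadratic x-update
        z = soft_threshold(x + u, lam / rho)               # l1 proximal z-update
        u = u + x - z                                      # dual ascent step
    return z

# With an identity design matrix the lasso solution is soft-thresholding of b,
# which gives an easy sanity check.
b = np.array([3.0, -0.5, 0.2, -2.0])
x_hat = admm_lasso(np.eye(4), b, lam=1.0)
print(x_hat)
```

The appeal for distributed problems is that the x-update and z-update each touch only one of the two terms, so they can be farmed out separately.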
0.5 Optimisation on manifolds
See Nicolas Boumal’s introductory blog post.
Optimization on manifolds is about solving problems of the form
\[\mathrm{minimize}_{x\in\mathcal{M}} f(x),\]
where \(\mathcal{M}\) is a nice, known manifold. By “nice”, I mean a smooth, finite-dimensional Riemannian manifold.
Practical examples include the following (and all possible products of these):
- Euclidean spaces
- The sphere (set of vectors or matrices with unit Euclidean norm)
- The Stiefel manifold (set of orthonormal matrices)
- The Grassmann manifold (set of linear subspaces of a given dimension; this is a quotient space)
- The rotation group (set of orthogonal matrices with determinant +1)
- The manifold of fixed-rank matrices
- The same, further restricted to positive semidefinite matrices
- The cone of (strictly) positive definite matrices
- …
Conceptually, the key point is to think of optimization on manifolds as unconstrained optimization: we do not think of \(\mathcal{M}\) as being embedded in a Euclidean space. Rather, we think of \(\mathcal{M}\) as being “the only thing that exists,” and we strive for intrinsic methods. Besides making for elegant theory, it also makes it clear how to handle abstract spaces numerically (such as the Grassmann manifold for example); and it gives algorithms the “right” invariances (computations do not depend on an arbitrarily chosen representation of the manifold).
There are at least two reasons why this class of problems is getting much attention lately. First, it is because optimization problems over the aforementioned sets (mostly matrix sets) come up pervasively in applications, and at some point it became clear that the intrinsic viewpoint leads to better algorithms, as compared to general-purpose constrained optimization methods (where \(\mathcal{M}\) is considered as being inside a Euclidean space \(\mathcal{E}\), and algorithms move in \(\mathcal{E}\), while penalising distance to \(\mathcal{M}\)). The second is that, as I will argue momentarily, Riemannian manifolds are “the right setting” to talk about unconstrained optimization. And indeed, there is a beautiful book by [Absil, Sepulchre, Mahony], called Optimization algorithms on matrix manifolds (freely available), that shows how the classical methods for unconstrained optimization (gradient descent, Newton, trust-regions, conjugate gradients…) carry over seamlessly to the more general Riemannian framework.
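A minimal sketch of what "gradient descent carries over" can look like on the simplest example above, the sphere. The setup is my own assumption, not from the quoted post: minimise \(f(x) = -x^\top A x\) over unit vectors, which recovers a leading eigenvector of a symmetric matrix. The sphere is handled concretely here as a submanifold of Euclidean space (project the gradient onto the tangent space, then retract by renormalising); the function name, step size and test matrix are illustrative.

```python
import numpy as np

def riemannian_gd_sphere(A, x0, step=0.1, n_iter=500):
    """Riemannian gradient descent for f(x) = -x^T A x on the unit sphere."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(n_iter):
        egrad = -2.0 * A @ x             # Euclidean gradient of f
        rgrad = egrad - (x @ egrad) * x  # project onto the tangent space at x
        x = x - step * rgrad             # step along the tangent direction
        x = x / np.linalg.norm(x)        # retraction: map back onto the sphere
    return x

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
eigvals = np.array([5.0, 3.0, 2.0, 1.0, 0.5])
A = Q @ np.diag(eigvals) @ Q.T  # symmetric matrix with known spectrum

x = riemannian_gd_sphere(A, rng.normal(size=5))
rayleigh = x @ A @ x
print(rayleigh)  # should approach the largest eigenvalue of A
```

The iterate never leaves the manifold, so no penalty term or constraint handling is needed, which is the sense in which the problem is treated as unconstrained.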
0.6 Gradient-free optimization
Not all the methods described here use gradient information, but it’s frequently assumed to be something you can access easily. It’s worth considering which objectives you can optimize easily.
But not all objectives are easily differentiable, even when parameters are continuous. For example, if you are getting your measurement not from a mathematical model but from a physical experiment, you can’t differentiate it, since reality itself is usually not analytically differentiable. In this latter case, you are getting close to a question of online experiment design, as in ANOVA, with the further constraint that your function evaluations are possibly stupendously expensive. See Bayesian optimisation for one approach to this in the context of experiment design.
In situations like this we generally use gradient-free methods, such as simulated annealing, or approximate the gradient numerically.
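As a minimal sketch of the "numerical gradient" option, here is a central finite-difference approximation that needs only function evaluations. The function names, the step `h=1e-6` and the toy objective `black_box` are assumptions for illustration; the cost is two evaluations per coordinate, which is exactly what hurts when each evaluation is an expensive experiment.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central finite-difference approximation to the gradient of f at x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return grad

def black_box(x):
    # Stand-in for an objective we can evaluate but not differentiate by hand.
    return np.sum((x - 1.0) ** 2) + np.sin(x[0])

x = np.array([0.5, -2.0, 3.0])
g_numeric = numerical_gradient(black_box, x)
# The exact gradient is available here only so the approximation can be checked.
g_exact = 2.0 * (x - 1.0) + np.array([np.cos(x[0]), 0.0, 0.0])
print(np.max(np.abs(g_numeric - g_exact)))
```

With noisy measurements the difference quotient amplifies the noise by roughly `1/h`, so in practice `h` has to be chosen much larger than in this noiseless toy.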
0.6.1 Variational optimisation
(Bird, Kunze, and Barber 2018; Staines and Barber 2013, 2012)
0.6.2 “Meta-heuristic” methods
Biologically-inspired or arbitrary. Evolutionary algorithms, particle swarm optimisation, ant colony optimisation, harmony search. A lot of the tricks from these are adopted into mainstream stochastic methods. Some not.
See biomimetic algorithms for the care and husbandry of such as those.
0.6.3 Annealing and Monte Carlo optimisation methods
Simulated annealing: Constructing a process to yield maximally-likely estimates for the parameters. This has a statistical mechanics justification that makes it attractive to physicists, but it’s generally useful. You don’t necessarily need a gradient here, just the ability to evaluate something interpretable as a “likelihood ratio”. Long story. I don’t yet cover this at Monte Carlo methods but I should.
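A minimal sketch of simulated annealing with a Metropolis acceptance rule, to show that only function evaluations and an acceptance ratio are needed. All specifics are assumptions chosen for illustration: the Gaussian random-walk proposal, the geometric cooling schedule, the parameter defaults and the toy objective `bumpy` (global minimum at 0, with local minima near multiples of π/3).

```python
import math
import random

def simulated_annealing(f, x0, n_iter=5000, t_start=5.0, t_end=0.01,
                        proposal_sd=0.5, seed=0):
    """Minimise a 1-D function f by simulated annealing."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    for k in range(n_iter):
        # Geometric cooling from t_start down towards t_end.
        t = t_start * (t_end / t_start) ** (k / n_iter)
        y = x + rng.gauss(0.0, proposal_sd)  # random-walk proposal
        fy = f(y)
        # Metropolis rule: always accept improvements; accept a worse point
        # with probability exp(-(fy - fx) / t).
        if fy <= fx or rng.random() < math.exp(-(fy - fx) / t):
            x, fx = y, fy
            if fx < best_f:
                best_x, best_f = x, fx
    return best_x, best_f

def bumpy(x):
    # Quadratic bowl with ripples: many local minima, global minimum at x = 0.
    return x * x + 10.0 * math.sin(3.0 * x) ** 2

x0 = 4.0
best_x, best_f = simulated_annealing(bumpy, x0)
print(best_x, best_f)
```

The acceptance probability is a ratio of Boltzmann weights `exp(-f/t)`, which is the "likelihood ratio" reading; at high temperature the chain hops between basins, and at low temperature it behaves like a noisy local search.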
Elad Hazan, The two cultures of optimization:
The standard curriculum in high school math includes elementary functional analysis, and methods for finding the minima, maxima and saddle points of a single dimensional function. When moving to high dimensions, this becomes beyond the reach of your typical high-school student: mathematical optimization theory spans a multitude of involved techniques in virtually all areas of computer science and mathematics.
Iterative methods, in particular, are the most successful algorithms for large-scale optimization and the most widely used in machine learning. Of these, most popular are first-order gradient-based methods due to their very low per-iteration complexity.
However, way before these became prominent, physicists needed to solve large scale optimization problems, since the time of the Manhattan project at Los Alamos. The problems that they faced looked very different, essentially simulation of physical experiments, as were the solutions they developed. The Metropolis algorithm is the basis for randomized optimization methods and Markov Chain Monte Carlo algorithms. […]
In our recent paper (Abernethy and Hazan 2016), we show that for convex optimization, the heat path and central path for IPM for a particular barrier function (called the entropic barrier, following the terminology of the recent excellent work of Bubeck and Eldan) are identical! Thus, in some precise sense, the two cultures of optimization have been studied the same object in disguise and using different techniques.
0.6.4 Expectation maximization
0.7 Parallel
Classic, basic SGD takes walks through the data set example-wise or feature-wise — but this doesn’t work in parallel, so you tend to go for mini-batch gradient descent so that you can at least vectorize. Apparently you can make SGD work in “true” parallel across communication-constrained cores, but I don’t yet understand how.
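A minimal sketch of the mini-batch version, assuming a least-squares objective on synthetic noise-free data. The function name, batch size, learning rate and data are illustrative choices. Each update is one matrix-vector product over the whole batch, which is where the vectorisation comes from; nothing here is distributed across cores.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, lr=0.05, n_epochs=20, seed=0):
    """Mini-batch SGD for the least-squares loss 0.5 * mean((X w - y)^2)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        order = rng.permutation(n)  # reshuffle the examples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]            # vectorised over the batch
            w -= lr * X[idx].T @ residual / len(idx)  # averaged batch gradient
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true  # noise-free targets, so SGD can recover w_true exactly

w_hat = minibatch_sgd(X, y)
print(np.max(np.abs(w_hat - w_true)))
```

Setting `batch_size=1` recovers example-wise SGD, and `batch_size=len(X)` recovers full-batch gradient descent.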
0.8 Implementations
Specialised optimisation software.
See also statistical software, and gradient descent
GENO (Soeren Laue, Mitterreiter, and Giesen 2019; Sören Laue, Blacher, and Giesen 2022)
GENO provides optimization solvers for everyone. You can enter your optimization problem in an easy-to-read modeling language in the code editor below. Python code is then generated automatically that can solve this class of optimization problems on the CPU or on the GPU. The automatically generated solvers are often as fast as handwritten, specialized solvers…
The GENO solver combines an Augmented Lagrangian approach with a limited memory quasi-Newton method (L-BFGS-B) that can also handle bound constraints on the variables. Quasi-Newton methods are very efficient for problems involving thousands of optimization variables. The GENO solver is then instantiated by the automatically generated methods for computing function values and gradients that are provided by this website to solve the specified class of optimization problems. This approach is very well suited for optimization problems originating from classical machine learning problems.
Looks useful for an interesting class of semidefinite programming problems.
ensmallen (Bhardwaj et al. 2021)
We present ensmallen, a fast and flexible C++ library for mathematical optimization of arbitrary user-supplied functions, which can be applied to many machine learning problems. Several types of optimizations are supported, including differentiable, separable, constrained, and categorical objective functions. The library provides many pre-built optimizers (including numerous variants of SGD and Quasi-Newton optimizers) as well as a flexible framework for implementing new optimizers and objective functions. Implementation of a new optimizer requires only one method and a new objective function typically requires one or two C++ functions. This can aid in the quick implementation and prototyping of new machine learning algorithms. Due to the use of C++ template metaprogramming, ensmallen is able to support compiler optimizations that provide fast runtimes. Empirical comparisons show that ensmallen is able to outperform other optimization frameworks (like Julia and SciPy), sometimes by large margins. The library is distributed under the BSD license and is ready for use in production environments.
SPORCO a Python package for solving optimisation problems with sparsity-inducing regularisation. These consist primarily of sparse coding and dictionary learning problems, including convolutional sparse coding and dictionary learning, but there is also support for other problems such as Total Variation regularisation and Robust PCA. In the current version, all of the optimisation algorithms are based on the Alternating Direction Method of Multipliers (ADMM).
scipy.optimize.minimize: The Python default. Includes many different algorithms that can do whatever you want. Failure modes are opaque, online-only and they don’t support warm-restarts, which is a thing for me, but a good starting point unless you have reason to prefer others. (i.e. if all your data does not fit in RAM, don’t bother.)
-
SPAMS (SPArse Modeling Software) is an optimization toolbox for solving various sparse estimation problems. Dictionary learning and matrix factorization (NMF, sparse PCA, …). Solving sparse decomposition problems with LARS, coordinate descent, OMP, SOMP, proximal methods. Solving structured sparse decomposition problems (\(\ell_1/\ell_2,\) \(\ell_1/\ell_\infty,\) sparse group lasso, tree-structured regularization, structured sparsity with overlapping groups,…). It is developed by Julien Mairal, with the collaboration of Francis Bach, Jean Ponce, Guillermo Sapiro, Rodolphe Jenatton and Guillaume Obozinski. It is coded in C++ with a Matlab interface. Recently, interfaces for R and Python have been developed by Jean-Paul Chieze (INRIA), and archetypal analysis was written by Yuansi Chen (UC Berkeley).
-
…is a user-friendly interface to several conic and integer programming solvers, very much like YALMIP or CVX under MATLAB.
The main motivation for PICOS is to have the possibility to enter an optimization problem as a high-level model, and to be able to solve it with several different solvers. Multidimensional and matrix variables are handled in a natural fashion, which makes it painless to formulate an SDP or an SOCP. This is very useful for educational purposes, and to quickly implement some models and test their validity on simple examples.
also maintains a list of other solvers.
Manifold optimisation implementations (for e.g. learning on manifolds)
-
… is a free software package for convex optimization based on the Python programming language. It can be used with the interactive Python interpreter, on the command line by executing Python scripts, or integrated in other software via Python extension modules. Its main purpose is to make the development of software for convex optimization applications straightforward by building on Python’s extensive standard library and on the strengths of Python as a high-level programming language. […]
efficient Python classes for dense and sparse matrices (real and complex), with Python indexing and slicing and overloaded operations for matrix arithmetic
an interface to most of the double-precision real and complex BLAS
an interface to LAPACK routines for solving linear equations and least-squares problems, matrix factorisations (LU, Cholesky, LDLT and QR), symmetric eigenvalue and singular value decomposition, and Schur factorization
an interface to the fast Fourier transform routines from FFTW
interfaces to the sparse LU and Cholesky solvers from UMFPACK and CHOLMOD
routines for linear, second-order cone, and semidefinite programming problems
routines for nonlinear convex optimization
interfaces to the linear programming solver in GLPK, the semidefinite programming solver in DSDP5, and the linear, quadratic and second-order cone programming solvers in MOSEK
a modeling tool for specifying convex piecewise-linear optimization problems.
Seems to reinvent half of numpy and scipy. Also seems to be used by all the other Python packages.
-
Pyomo is a Python-based open-source software package that supports a diverse set of optimization capabilities for formulating, solving, and analyzing optimization models.
A core capability of Pyomo is modeling structured optimization applications. Pyomo can be used to define general symbolic problems, create specific problem instances, and solve these instances using commercial and open-source solvers. Pyomo’s modeling objects are embedded within a full-featured high-level programming language providing a rich set of supporting libraries, which distinguishes Pyomo from other algebraic modeling languages like AMPL, AIMMS and GAMS.…
Pyomo was formerly released as the Coopr software library.
-
…is a Python-embedded modeling language for convex optimization problems. It allows you to express your problem in a natural way that follows the math, rather than in the restrictive standard form required by solvers.
So it’s a DSL for convex constraint programming. Can be extended heuristically to nonconvex constraints by…
-
… is a package for modeling and solving problems with convex objectives and decision variables from a nonconvex set. This package provides heuristics such as NC-ADMM (a variation of alternating direction method of multipliers for nonconvex problems) and relax-round-polish, which can be viewed as a majorization-minimization algorithm. The solver methods provided and the syntax for constructing problems are discussed in our associated paper.
-
… is a free/open-source library for nonlinear optimization, providing a common interface for a number of different free optimization routines available online as well as original implementations of various other algorithms. Its features include:
Callable from C, C++, Fortran, Matlab or GNU Octave, Python, GNU Guile, Julia, GNU R, Lua, and OCaml.
A common interface for many different algorithms—try a different algorithm just by changing one parameter.
Support for large-scale optimization (some algorithms scalable to millions of parameters and thousands of constraints)…
Algorithms using function values only (derivative-free) and also algorithms exploiting user-supplied gradients.
-
…(pronounced tee-fox) provides a set of Matlab templates, or building blocks, that can be used to construct efficient, customized solvers for a variety of convex models, including in particular those employed in sparse recovery applications. It was conceived and written by Stephen Becker, Emmanuel J. Candès and Michael Grant.
Stan is famous for Monte Carlo sampling, but also does deterministic optimisation using automatic differentiation. This is a luxurious “full service” option, although with limited scope for customisation. I am curious how it performs in very high dimensions, as L-BFGS does not scale forever.
Optimization algorithms:
Limited-memory BFGS (Stan’s default optimization algorithm)
BFGS
Laplace’s method for classical standard error estimates and approximate Bayesian posteriors
Optim.jl is a generic optimizer for Julia.
JuMP.jl is a domain-specific modeling language for mathematical optimization embedded in Julia. It currently supports a number of open-source and commercial solvers (Bonmin, Cbc, Clp, Couenne, CPLEX, ECOS, FICO Xpress, GLPK, Gurobi, Ipopt, KNITRO, MOSEK, NLopt, SCS, BARON) for a variety of problem classes, including linear programming, (mixed) integer programming, second-order conic programming, semidefinite programming, and nonlinear programming.
NLsolve.jl solves systems of nonlinear equations. […]
The package is also able to solve mixed complementarity problems, which are similar to systems of nonlinear equations, except that the equality to zero is allowed to become an inequality if some boundary condition is satisfied. See further below for a formal definition and the related commands.
Since there is some overlap between optimizers and nonlinear solvers, this package borrows some ideas from the Optim package, and depends on it for linesearch algorithms.
Many of these solvers optionally use commercial backends such as Mosek.
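For a sense of what the lowest-effort entry in this list looks like in use, here is a minimal sketch of `scipy.optimize.minimize` on SciPy's built-in Rosenbrock test function, with its analytic gradient and the L-BFGS-B method. The starting point and the choice of method are assumptions for illustration; other methods are selected by changing the `method` string.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Rosenbrock's banana function: a standard smooth, non-convex test problem
# whose global minimum is at (1, 1).
x0 = np.array([-1.2, 1.0])

res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")

# The result object bundles the solution with solver diagnostics.
print(res.success, res.message)
print(res.x, res.fun, res.nit)
```

Omitting `jac` makes the solver fall back on finite-difference gradients, which is convenient but costs extra function evaluations.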
0.9 Non-convex
1 Incoming
- Elad Hazan and Satyan Kale’s tutorial on online convex optimisation.
- Elad Hazan’s Introduction to Online Convex Optimization.
- Suvrit Sra’s eye-bleeding ugly but pertinent introduction to this stuff
- Francis Bach’s slides on practical ML SGD.
1.0.1 Miscellaneous optimisation techniques suggested on LinkedIn
The whole world of exotic specialized optimisers. See, e.g. Nuit Blanche name-dropping Bregman iteration, alternating method, augmented Lagrangian…
1.0.2 Primal/dual problems
🏗
1.0.3 Majorization-minimization
🏗
1.0.4 Difference-of-Convex-objectives
When your objective function is not convex but you can represent it as a difference of convex functions, somehow or other, use DC optimisation (Gasso, Rakotomamonjy, and Canu 2009). I don’t think this guarantees you a global optimum, but rather faster convergence to a local one.
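A minimal sketch of the basic DC algorithm on a toy problem of my own choosing, not one from the cited paper: \(f(x) = x^4 - 2x^2\), split as \(g(x) = x^4\) minus \(h(x) = 2x^2\), both convex. Each step linearises \(h\) at the current point and minimises the resulting convex surrogate, which here has a closed form. The function names are illustrative.

```python
import math

def f(x):
    return x ** 4 - 2.0 * x ** 2

def dca(x0, n_iter=60):
    """DC algorithm for f = g - h with g(x) = x**4 and h(x) = 2 * x**2."""
    x = x0
    for _ in range(n_iter):
        slope = 4.0 * x  # h'(x_k): linearise the concave part -h at x_k
        # Convex subproblem: argmin_x  x**4 - slope * x,  i.e. 4 x^3 = slope.
        x = math.copysign(abs(slope / 4.0) ** (1.0 / 3.0), slope)
    return x

x_pos = dca(0.1)   # starts right of zero
x_neg = dca(-0.1)  # starts left of zero
print(x_pos, f(x_pos))
print(x_neg, f(x_neg))
```

The two runs land in different minima (near +1 and -1) depending on the starting point, and a start at exactly 0 never moves, which illustrates the local rather than global nature of the guarantee.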