
Beck and Teboulle (2003):

The mirror descent algorithm (MDA) was introduced by Nemirovsky and Yudin for solving convex optimization problems. This method exhibits an efficiency estimate that is mildly dependent on the decision variables dimension, and thus suitable for solving very large-scale optimization problems. We present a new derivation and analysis of this algorithm. We show that the MDA can be viewed as a nonlinear projected-subgradient type method, derived from using a general distance-like function instead of the usual Euclidean squared distance. Within this interpretation, we derive in a simple way convergence and efficiency estimates. We then propose an Entropic mirror descent algorithm for convex minimization over the unit simplex, with a global efficiency estimate proven to be mildly dependent on the dimension of the problem.

Bubeck’s lectures are good: Bubeck (2015).

Mirror Descent generalizes gradient descent to settings where the feasible set $\mathcal{X} \subseteq \mathbb{R}^r$ is not naturally Euclidean. It relies on:

  1. A mirror map $\Phi: \mathcal{C} \to \mathbb{R}$, a strictly convex, differentiable function whose gradient $\nabla\Phi$ maps the primal space $\mathcal{X}$ into the dual space $\mathbb{R}^r$.

  2. The associated Bregman divergence

    $$D_\Phi(p, q) = \Phi(p) - \Phi(q) - \langle \nabla\Phi(q),\, p - q \rangle.$$
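Concretely, the Bregman divergence is a one-liner for any smooth convex $\Phi$. A throwaway NumPy sketch (function names are my own) showing that the squared-Euclidean and negative-entropy choices recover familiar distances:

```python
import numpy as np

def bregman(phi, grad_phi, p, q):
    """D_Phi(p, q) = Phi(p) - Phi(q) - <grad Phi(q), p - q>."""
    return phi(p) - phi(q) - np.dot(grad_phi(q), p - q)

# Phi(x) = ||x||^2 / 2 recovers half the squared Euclidean distance.
sq = lambda x: 0.5 * np.dot(x, x)
sq_grad = lambda x: x

# Phi(x) = sum_i x_i log x_i (negative entropy) recovers the KL
# divergence when p and q both lie on the unit simplex.
negent = lambda x: np.sum(x * np.log(x))
negent_grad = lambda x: np.log(x) + 1.0

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
d_euc = bregman(sq, sq_grad, p, q)          # 0.5 * ||p - q||^2
d_kl = bregman(negent, negent_grad, p, q)   # KL(p || q)
```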

Starting from $x_0 \in \mathcal{X}$, each iteration does

$$\nabla\Phi(y_{k+1}) = \nabla\Phi(x_k) - \eta g_k, \qquad x_{k+1} = \operatorname*{arg\,min}_{x \in \mathcal{X}} D_\Phi(x, y_{k+1}),$$

where $g_k \in \partial f(x_k)$. So that’s two steps, really, interleaved: a gradient step in the dual space, then a Bregman projection back into the primal space. Equivalently,

$$x_{k+1} = \operatorname*{arg\,min}_{x \in \mathcal{X}} \left\{ \eta \langle g_k, x \rangle + D_\Phi(x, x_k) \right\}.$$
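For concreteness, here is the Entropic mirror descent of Beck and Teboulle (2003) as a few lines of NumPy. With $\Phi(x) = \sum_i x_i \log x_i$ on the unit simplex, the dual step and the Bregman projection collapse into a multiplicative update followed by renormalisation. (The toy linear objective and step size below are my own illustrative choices.)

```python
import numpy as np

def entropic_mirror_descent(grad, x0, eta, n_iters):
    """Mirror descent with Phi(x) = sum_i x_i log x_i over the simplex.

    The dual-space gradient step and the Bregman projection combine
    into a single multiplicative update: x <- x * exp(-eta * g),
    renormalised to sum to one.
    """
    x = x0.copy()
    for _ in range(n_iters):
        g = grad(x)               # (sub)gradient g_k of f at x_k
        x = x * np.exp(-eta * g)  # gradient step in the dual space
        x = x / x.sum()           # Bregman projection onto the simplex
    return x

# Toy problem: minimize f(x) = <c, x> over the simplex; the optimum
# puts all mass on the smallest coordinate of c.
c = np.array([0.9, 0.1, 0.5])
x = entropic_mirror_descent(lambda x: c, np.ones(3) / 3, eta=0.5, n_iters=200)
```

Note that the iterates stay strictly inside the simplex, which is exactly the sort of constraint handling the Euclidean projection makes awkward.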

As to why we’d bother: many reasons.

  1. Mirror descent corresponds to what we want to do anyway in situations where we have two natural representations of an estimand. So it’s easy.
  2. It is theoretically very tractable, so it’s easy to prove stuff about it.
  3. It is fast, in that it provably requires few iterations to get close to the optimum, and it’s a first-order method (kinda), so you might hope each iteration is cheap.
  4. It generalises easily to online/SGD settings where we observe data sequentially.

Rarely is the obvious thing the best thing, in practical mathematics. AFAICT Mirror descent is a nearly obvious thing that is nearly the best.

1 Incoming

2 References

Ajanthan, Gupta, Torr, et al. 2021. “Mirror Descent View for Neural Network Quantization.” In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics.
Bansal, and Gupta. 2019. “Potential-Function Proofs for First-Order Methods.”
Beck, and Teboulle. 2003. “Mirror Descent and Nonlinear Projected Subgradient Methods for Convex Optimization.” Operations Research Letters.
Bubeck. 2015. Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning.
———. 2019. The Five Miracles of Mirror Descent.
Crucinio. 2025. “A Mirror Descent Approach to Maximum Likelihood Estimation in Latent Variable Models.”
Gupta. 2020. “CMU 15-850 Advanced Algorithms.”
Jacobsen, and Cutkosky. 2022. “Parameter-Free Mirror Descent.”
Lee, Panageas, Piliouras, et al. 2017. “First-Order Methods Almost Always Avoid Saddle Points.” arXiv:1710.07406 [Cs, Math, Stat].
Wibisono, and Wilson. 2015. “On Accelerated Methods in Optimization.” arXiv:1509.03616 [Math].
Zhang. 2013. “Bregman Divergence and Mirror Descent.”