I presume there are other uses for optimal transport distances apart from as probability metrics, but so far I only care about them in that context, so this will be skewed that way.

I am about to do a reading group based on Peyré’s course, so I may be harmonising the notation here with that soon.

Let \((M,d)\) be a metric space for which every probability measure on \(M\) is a Radon measure. For \(p\ge 1\), let \(\mathcal{P}_p(M)\) denote the collection of all probability measures \(P\) on \(M\) with finite \(p^{\text{th}}\) moment about some (equivalently, every) \(x_0\) in \(M\),

\[\int_{M} d(x, x_{0})^{p} \, \mathrm{d} P (x) < +\infty.\]

Then the \(p^{\text{th}}\) *Wasserstein distance* between two probability measures
\(P\) and \(Q\) in \(\mathcal{P}_p(M)\) is defined as

\[W_{p} ( P , Q ):= \left( \inf_{\gamma \in \Pi ( P , Q )} \int_{M \times M} d(x, y)^{p} \, \mathrm{d} \gamma (x, y) \right)^{1/p},\]

where \(\Pi( P , Q )\) denotes the collection of all couplings, i.e. probability measures on \(M \times M\) with marginals \(P\) and \(Q\) respectively.

Practically, one usually sees \(p\in\{1,2\}\). For \(p=1\),

\[W_1(P,Q)=\inf_{\gamma \in \Pi( P , Q )}\mathbb{E}_{(x,y)\sim \gamma}\left[d(x,y)\right].\]

This is frequently intractable, or at least has no closed form, but you can find it for some useful special cases, or bound/approximate it in others.
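One tractable special case is measures on the real line, where the optimal coupling is monotone, so \(W_p\) between two equal-size empirical measures reduces to matching sorted samples. A minimal numpy sketch (the `wasserstein_1d` name is mine):

```python
import numpy as np

def wasserstein_1d(x, y, p=1):
    """W_p between two equal-size empirical measures on the line.

    In 1-D the optimal coupling is monotone, so it suffices to sort
    both samples and match them order statistic by order statistic.
    """
    x, y = np.sort(x), np.sort(y)
    return np.mean(np.abs(x - y) ** p) ** (1 / p)

# Shifting a point cloud by one unit moves it by exactly W_1 = 1.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
print(wasserstein_1d(x, x + 1.0))  # close to 1.0
```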

🏗 discuss favourable properties of this metric (triangle inequality, bounds on moments etc).

But why would we care about such an intractable distance? Because it gives good error bounds. We know that if \(W_p(\nu,\hat{\nu}) \leq \epsilon\), then for any \(L\)-Lipschitz function \(\phi\), \(|\nu(\phi) - \hat{\nu}(\phi)| \leq L\epsilon.\) See (J. H. Huggins et al. 2018b, 2018a) for some specifics.
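The bound is easy to check numerically for \(p=1\) in one dimension (since \(W_p \geq W_1\), a \(W_p\) bound implies it too). A sketch using scipy's exact 1-D \(W_1\), with \(\sin\) as an off-the-shelf 1-Lipschitz test function:

```python
import numpy as np
from scipy.stats import wasserstein_distance  # exact 1-D W_1

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=2000)   # samples from nu
y = rng.normal(0.3, 1.2, size=2000)   # samples from nu-hat

eps = wasserstein_distance(x, y)           # W_1 between empirical measures
phi = np.sin                               # a 1-Lipschitz test function
gap = abs(phi(x).mean() - phi(y).mean())   # |nu(phi) - nu_hat(phi)|
print(gap, "<=", 1.0 * eps)                # the bound with L = 1
```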

## Analytic expressions

### Gaussian

Useful: the Wasserstein-2 distance between two Gaussians has a closed form (Givens and Shortt 1984). Here \(W_2(\mu,\nu):=\left(\inf_{\gamma\in\Pi(\mu,\nu)}\mathbb{E}_{(X,Y)\sim\gamma}\Vert X-Y\Vert_2^2\right)^{1/2}\) for \(X\sim\mu\), \(Y\sim\nu\).

\[\begin{aligned} d&:= W_2(\mathcal{N}(m_1,\Sigma_1);\mathcal{N}(m_2,\Sigma_2))\\ \Rightarrow d^2&= \Vert m_1-m_2\Vert_2^2 + \mathrm{Tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}\]
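This formula is a one-liner given a matrix square root; a sketch (the `w2_gaussian` name is mine):

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(m1, S1, m2, S2):
    """Givens-Shortt closed form for W_2 between two Gaussians."""
    rS1 = sqrtm(S1)
    cross = sqrtm(rS1 @ S2 @ rS1)      # (S1^{1/2} S2 S1^{1/2})^{1/2}
    d2 = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * np.real(cross))
    return np.sqrt(max(d2, 0.0))       # clip tiny negative round-off

# Equal covariances: the trace term vanishes, leaving the mean shift.
m = np.zeros(2)
S = np.array([[2.0, 0.3], [0.3, 1.0]])
print(w2_gaussian(m, S, m + 1.0, S))   # close to sqrt(2)
```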

## Kantorovich-Rubinstein duality

Vincent Hermann gives an excellent practical introduction.
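In the discrete case the duality can be checked directly: \(W_1\) equals the supremum of \(\sum_i f_i(p_i - q_i)\) over 1-Lipschitz \(f\), which for finitely many atoms is a small linear program. A sketch using scipy (the atoms and weights are arbitrary made-up data):

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import wasserstein_distance

# Atoms on the line and two probability vectors supported on them.
x = np.array([0.0, 1.0, 2.0, 4.0])
p = np.array([0.4, 0.1, 0.3, 0.2])
q = np.array([0.1, 0.4, 0.2, 0.3])

# Dual LP: maximise f.(p - q) subject to f_i - f_j <= |x_i - x_j|.
n = len(x)
rows, b = [], []
for i in range(n):
    for j in range(n):
        if i != j:
            row = np.zeros(n)
            row[i], row[j] = 1.0, -1.0
            rows.append(row)
            b.append(abs(x[i] - x[j]))
# Pin f_0 = 0 to remove the additive-constant degeneracy.
res = linprog(-(p - q), A_ub=np.array(rows), b_ub=np.array(b),
              bounds=[(0, 0)] + [(None, None)] * (n - 1))
dual = -res.fun
primal = wasserstein_distance(x, x, p, q)  # 1-D W_1, computed directly
print(primal, dual)                        # the two values agree
```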

## “Neural Net distance”

A Wasserstein-type distance with a baked-in notion of the capacity of the function class that approximates the true Wasserstein distance (Arora et al. 2017). Is this actually used?

## Fisher distance

Specifically \((p,\nu)\)-Fisher distances, in the terminology of (J. H. Huggins et al. 2018b). They use these distances as a computationally tractable proxy (in fact, bound) for Wasserstein distances during inference. See Fisher distances.

## Sinkhorn divergence

A regularised version of a Wasserstein metric (Cuturi 2013):

\[W_{p,\eta} ( P , Q )^p:= \inf_{\gamma \in \Pi ( P , Q )} \int_{M \times M} d(x, y)^{p} \, \mathrm{d} \gamma (x, y) - \eta H(\gamma).\]

Here \(H(\gamma)\) is the entropy of the coupling and \(\eta>0\) sets the strength of the regularisation.

In practice this seems to be only applied to measures over finite sets (i.e. histograms, weighted point sets), where there are many neat tricks to make calculations tractable. (Altschuler et al. 2019; Blanchet et al. 2018)
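The neatest of those tricks is Sinkhorn's matrix-scaling iteration itself: rescale the rows and columns of the Gibbs kernel \(K = e^{-C/\eta}\) until the plan has the right marginals. A minimal numpy sketch (problem sizes and \(\eta\) chosen arbitrarily):

```python
import numpy as np

def sinkhorn_plan(a, b, C, eta, n_iter=200):
    """Entropy-regularised OT plan between histograms a and b, cost C.

    Alternately rescales rows and columns of K = exp(-C/eta) so that
    the plan diag(u) K diag(v) has marginals a and b (Cuturi-style).
    """
    K = np.exp(-C / eta)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # the transport plan gamma

x = np.linspace(0.0, 1.0, 5)
C = np.abs(x[:, None] - x[None, :])      # ground cost |x_i - x_j|
a = np.full(5, 0.2)
b = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
gamma = sinkhorn_plan(a, b, C, eta=0.5)
print(gamma.sum(axis=1), gamma.sum(axis=0))  # marginals ~ a and b
```

The regularised cost is then `np.sum(gamma * C)` (plus the entropy term, depending on which variant of the divergence one wants).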

TBD.

## Awaiting filing

Fundamental theorem of optimal transport. Michele Coscia’s new paper uses a graph Laplacian to calculate an approximate Earth mover distance over a graph topology. (buzzword use case: inferring graph transmission rate of a disease interpretably). This looks simple; surely it must be a known result in optimal transport metric studies?

## Recommended introductions

(Altschuler et al. 2019; Carlier et al. 2017; Thorpe 2018) have been recommended to me as compact modern introductions.

Peyré’s course to accompany Peyré and Cuturi (2019) has been recommended to me and comes with course notes.

## References

Altschuler, Jason, Francis Bach, Alessandro Rudi, and Jonathan Niles-Weed. 2019. “Massively Scalable Sinkhorn Distances via the Nyström Method.” In *Advances in Neural Information Processing Systems 32*. Curran Associates, Inc. https://papers.nips.cc/paper/8693-massively-scalable-sinkhorn-distances-via-the-nystrom-method.pdf.

*Modelling and Optimisation of Flows on Networks: Cetraro, Italy 2009, Editors: Benedetto Piccoli, Michel Rascle*, edited by Luigi Ambrosio, Alberto Bressan, Dirk Helbing, Axel Klar, and Enrique Zuazua, 1–155. Lecture Notes in Mathematics. Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-642-32160-3_1.

*Gradient Flows: In Metric Spaces and in the Space of Probability Measures*. 2nd ed. Lectures in Mathematics. ETH Zürich. Birkhäuser Basel. https://www.springer.com/gp/book/9783764387211.

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. 2017. “Wasserstein Generative Adversarial Networks.” In *International Conference on Machine Learning*, 214–23. http://proceedings.mlr.press/v70/arjovsky17a.html.

*Journal of Functional Analysis* 263 (8): 2430–57. https://doi.org/10.1016/j.jfa.2012.07.007.

*Annales de La Faculté Des Sciences de Toulouse Mathématiques* 14 (3): 331–52. https://doi.org/10.5802/afst.1095.

*SIAM Journal on Mathematical Analysis* 40 (1): 1–20. https://doi.org/10.1137/07069938X.

*Advances in Neural Information Processing Systems 31*, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 3036–46. Curran Associates, Inc. http://papers.nips.cc/paper/7567-on-the-global-convergence-of-gradient-descent-for-over-parameterized-models-using-optimal-transport.pdf.

Cuturi, Marco. 2013. “Sinkhorn Distances: Lightspeed Computation of Optimal Transport.” In *Advances in Neural Information Processing Systems 26*. https://arxiv.org/abs/1306.0895v1.

*Journal of Multivariate Analysis* 12 (3): 450–55. https://doi.org/10.1016/0047-259X(82)90077-X.

*Von Mises Calculus for Statistical Functionals*. Lecture Notes in Statistics 19. New York: Springer.

*Advances in Neural Information Processing Systems 28*, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2053–61. Curran Associates, Inc. http://papers.nips.cc/paper/5679-learning-with-a-wasserstein-loss.pdf.

Givens, Clark R., and Rae Michael Shortt. 1984. “A Class of Wasserstein Metrics for Probability Distributions.” *The Michigan Mathematical Journal* 31 (2): 231–40. https://doi.org/10.1307/mmj/1029003026.

*Model-Free Hedging: A Martingale Optimal Transport Viewpoint*. Chapman and Hall CRC Financial Mathematics 1.0. CRC Press.

*Advances in Neural Information Processing Systems 30*, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 3611–21. Curran Associates, Inc. http://papers.nips.cc/paper/6952-pass-glm-polynomial-approximate-sufficient-statistics-for-scalable-bayesian-glm-inference.pdf.

*Journal of Optimization Theory and Applications* 43 (1): 39–49. https://doi.org/10.1007/BF00934745.

*Discrete & Continuous Dynamical Systems - A* 34 (4): 1533. https://doi.org/10.3934/dcds.2014.34.1533.

*International Conference on Machine Learning*, 3159–68. http://proceedings.mlr.press/v80/liu18d.html.

*Advances In Neural Information Processing Systems*. http://arxiv.org/abs/1608.04471.

*Handbook of Uncertainty Quantification*, edited by Roger Ghanem, David Higdon, and Houman Owhadi, 1–41. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-11259-6_23-1.

*Mathematical Programming* 171 (1): 115–66. https://doi.org/10.1007/s10107-017-1172-1.

*Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 3711–19. Curran Associates, Inc. http://papers.nips.cc/paper/6248-wasserstein-training-of-restricted-boltzmann-machines.pdf.

*Linear Algebra and Its Applications* 48: 257–63. https://doi.org/10.1016/0024-3795(82)90112-4.

*Annual Review of Statistics and Its Application* 6 (1): 405–31. https://doi.org/10.1146/annurev-statistics-030718-104938.

Peyré, Gabriel, and Marco Cuturi. 2019. *Computational Optimal Transport*. Foundations and Trends in Machine Learning 11. https://doi.org/10.1561/2200000073.

*Optimal Transport for Applied Mathematicians*. Edited by Filippo Santambrogio. Progress in Nonlinear Differential Equations and Their Applications. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-20828-2_1.

*ACM Transactions on Graphics* 34 (4): 66:1–11. https://doi.org/10.1145/2766963.

*Journal of Machine Learning Research* 19 (66): 2639–709. http://arxiv.org/abs/1703.06131.

*Electronic Journal of Statistics* 13 (2): 5088–5119. https://doi.org/10.1214/19-EJS1639.

Villani, Cédric. 2009. *Optimal Transport: Old and New*. Grundlehren Der Mathematischen Wissenschaften. Berlin Heidelberg: Springer-Verlag. http://elenaher.dinauz.org/B07D.StFlour.pdf.

*IEEE Transactions on Information Theory* 66 (11): 7155–79. https://doi.org/10.1109/TIT.2020.2983698.