danger: I am halfway through editing this! Notation inconsistencies and indeed plain falsehoods abound.
There are presumably uses for optimal transport distances other than as probability metrics, but so far I only care about them in that context, so these notes will be skewed that way.
Encyclopedia of Math says:
Let \((M,d)\) be a metric space for which every probability measure on \(M\) is a Radon measure. For \(p\ge 1\), let \(\mathcal{P}_p(M)\) denote the collection of all probability measures \(P\) on \(M\) with finite \(p^{\text{th}}\) moment, i.e. such that for some (equivalently, any) \(x_0\) in \(M\),
\[\int_{M} d(x, x_{0})^{p} \, \mathrm{d} P (x) < +\infty.\]
Then the \(p^{\text{th}}\) Wasserstein distance between two probability measures \(P\) and \(Q\) in \(\mathcal{P}_p(M)\) is defined as
\[W_{p} (P, Q ):=
\left( \inf_{\gamma \in \Pi (P, Q )} \int_{M \times M} d(x, y)^{p} \, \mathrm{d} \gamma (x, y) \right)^{1/p},\] where \(\Pi(P, Q )\) denotes the collection of all measures on \(M \times M\) with marginal distributions \(P\) and \(Q\) respectively.
We can work equivalently with RVs distributed according to these measures, respectively \(X\sim P,Y\sim Q,\) and then consider \(\gamma\) to range over joint distributions with these marginals, so that \[W_p(X,Y):=W_p(P,Q)=\inf_{\gamma}\mathbb{E}_{(X,Y)\sim\gamma}[d(X,Y)^p]^{1/p}.\]
Also, on \(\mathbb{R}^n\) the ground metric is usually just the Euclidean/\(L_2\) distance, so that \[\begin{aligned}
W_p(X,Y):=W_p(P,Q)
&=\inf_{\gamma}\mathbb{E}_{(X,Y)\sim\gamma}[\Vert X-Y\Vert_2^p]^{1/p}.
\end{aligned}\]
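To make the definition concrete: for two discrete measures, the infimum over couplings \(\Pi(P,Q)\) is a finite-dimensional linear program. A minimal sketch (my own toy example; the function name and data are illustrative, not from any reference):

```python
# Sketch: W_p between two discrete measures on the line as a linear program.
import numpy as np
from scipy.optimize import linprog

def wasserstein_lp(x, y, p_w, q_w, p=2.0):
    """W_p between atoms at x (weights p_w) and atoms at y (weights q_w)."""
    C = np.abs(x[:, None] - y[None, :]) ** p   # cost matrix d(x_i, y_j)^p
    m, n = C.shape
    # Flatten gamma to a vector; equality constraints pin both marginals.
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0       # sum_j gamma_ij = p_w[i]
    for j in range(n):
        A_eq[m + j, j::n] = 1.0                # sum_i gamma_ij = q_w[j]
    b_eq = np.concatenate([p_w, q_w])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun ** (1.0 / p)

# Two little empirical measures.
x = np.array([0.0, 1.0, 2.0]); p_w = np.full(3, 1 / 3)
y = np.array([0.5, 1.5]);      q_w = np.full(2, 1 / 2)
print(wasserstein_lp(x, y, p_w, q_w, p=1.0))   # -> 0.5 for this toy pair
```

This is exponential in nothing but has \(mn\) variables, which is why the clever approximations later on this page exist.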
Practically, one usually sees \(p\in\{1,2\}\). Maybe \(p=\infty\) is interesting also (Champion, De Pascale, and Juutinen 2008). For \(p=1\) we have an extremely useful representation in terms of optimisation over functions, via Kantorovich duality: \[
W_1(P, Q)=\sup _{f,g:f(x)+g(y)\leq d(x,y)} \mathbb{E}_{x \sim P}[f(x)]+\mathbb{E}_{y \sim Q}[g(y)].
\] Because the cost here is itself a metric, the potentials can be taken to be \(f\) and \(g=-f\) with \(f\) 1-Lipschitz, which gives the Kantorovich-Rubinstein form \[
W_1(P, Q)=\sup _{\|f\|_{L} \leq 1} \mathbb{E}_{x \sim P}[f(x)]-\mathbb{E}_{x \sim Q}[f(x)].
\]
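On the real line, \(W_1\) also reduces to the \(L_1\) distance between the CDFs, \(W_1(P,Q)=\int|F_P(t)-F_Q(t)|\,\mathrm{d}t\), and scipy ships an implementation of exactly this. A quick check on toy data (mine, not from any reference):

```python
# W_1 between two 1-d samples; scipy computes the CDF-distance form,
# whose value coincides with the Kantorovich-Rubinstein dual above.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=100_000)  # sample from P
ys = rng.normal(0.5, 1.0, size=100_000)  # sample from Q
# A pure location shift, so W_1 should be close to 0.5; the optimal
# 1-Lipschitz witness f is (up to sign) the identity function.
print(wasserstein_distance(xs, ys))
```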
The Wasserstein distance between two distributions frequently has no closed form and is expensive to compute exactly, but one can find it for some useful special cases, or bound/approximate it in others.
🏗 discuss favourable properties of this metric (triangle inequality, bounds on moments etc).
But why do we care about such an intractable distance? Because it gives good error bounds for sufficiently nice expectations. If \(W_1(P,Q) \leq \epsilon\), then for any \(L\)-Lipschitz function \(f\), \(|\mathbb{E}[f(X)] - \mathbb{E}[f(Y)]| \leq L\epsilon\): for any coupling \(\gamma\) we have \(|\mathbb{E}_\gamma[f(X)-f(Y)]|\leq L\,\mathbb{E}_\gamma[d(X,Y)]\), and we take the infimum over \(\gamma\). Since \(W_1\leq W_p\) for all \(p\geq 1\) by Jensen's inequality, controlling any \(W_p\) suffices.
Analytic expressions
Gaussian
Useful: the \(W_2\) distance between two Gaussians has a closed form (Givens and Shortt 1984; see also Dowson and Landau 1982): \[\begin{aligned}
W_2(\mathcal{N}(m_1,\Sigma_1),\mathcal{N}(m_2,\Sigma_2))^2
&= \Vert m_1-m_2\Vert_2^2 + \mathrm{Tr}\left(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\right).
\end{aligned}\]
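As a sanity check, the closed form is a few lines of numpy (a sketch under my own naming; `scipy.linalg.sqrtm` does the matrix square roots):

```python
# Sketch of the closed-form W_2 between Gaussians given above.
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(m1, S1, m2, S2):
    rS1 = sqrtm(S1)                              # Sigma_1^{1/2}
    cross = sqrtm(rS1 @ S2 @ rS1)                # (Sigma_1^{1/2} Sigma_2 Sigma_1^{1/2})^{1/2}
    d2 = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * np.real(cross))
    return np.sqrt(max(d2, 0.0))                 # clip tiny negative round-off

# Sanity checks: identical Gaussians give 0; a pure mean shift gives its norm.
S = np.array([[2.0, 0.3], [0.3, 1.0]])
print(gaussian_w2(np.zeros(2), S, np.zeros(2), S))            # ~0
print(gaussian_w2(np.zeros(2), S, np.array([3.0, 4.0]), S))   # ~5
```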
Any others?
None that I know of.
“Neural Net distance”
A Wasserstein-type distance with a baked-in notion of the capacity of the function class over which the dual supremum is taken, so that it approximates the true Wasserstein distance (Arora et al. 2017). Is this actually used?
Fisher distance
Specifically \((p,\nu)\)-Fisher distances, in the terminology of (J. H. Huggins et al. 2018b). They use these distances as a computationally tractable proxy (in fact, a bound) for Wasserstein distances during inference. See Fisher distances.
Sinkhorn divergence
A regularised version of the Wasserstein metric (Cuturi 2013), in which the coupling is penalised by its entropy:
\[W_{p,\eta} (P, Q )^p:=
\inf_{\gamma \in \Pi (P, Q )} \int_{M \times M} d(x, y)^{p} \, \mathrm{d} \gamma (x, y) -\eta H(\gamma).\]
Here \(H(\gamma)\) is the entropy of the coupling and \(\eta>0\) sets the strength of the regularisation; as \(\eta\to 0\) we recover \(W_p\).
The examples I have seen apply only to measures over finite sets (i.e. histograms, weighted point sets), where there are many neat tricks to make the calculations tractable (Altschuler et al. 2019; Blanchet et al. 2018). Can it be more general?
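For intuition on the finite case, here is a minimal numpy sketch of the Sinkhorn fixed-point iteration (my own toy implementation, not from the papers above): the regularised optimum factorises as \(\gamma=\mathrm{diag}(u)\,K\,\mathrm{diag}(v)\) with \(K=e^{-C/\eta}\), and we alternately rescale \(u\) and \(v\) to match the marginals. A serious implementation would iterate in the log domain for small \(\eta\).

```python
# Sketch of the Sinkhorn fixed-point iteration for discrete measures.
import numpy as np

def sinkhorn(C, p_w, q_w, eta=0.05, n_iters=1000):
    """Entropically regularised coupling for cost matrix C."""
    K = np.exp(-C / eta)                 # Gibbs kernel
    u = np.ones_like(p_w)
    v = np.ones_like(q_w)
    for _ in range(n_iters):             # alternately match each marginal
        u = p_w / (K @ v)
        v = q_w / (K.T @ u)
    return u[:, None] * K * v[None, :]   # gamma = diag(u) K diag(v)

# Toy histograms on a grid; np.sum(gamma * C) approximates the W_1 cost
# (smaller eta tightens the approximation but needs log-domain care).
xs = np.linspace(0, 1, 50)
C = np.abs(xs[:, None] - xs[None, :])    # d(x, y)^1
p_w = np.full(50, 1 / 50)
q_w = np.exp(-(xs - 0.7) ** 2 / 0.02); q_w /= q_w.sum()
gamma = sinkhorn(C, p_w, q_w)
print(np.sum(gamma * C))
```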
TBD.
Incoming
Fundamental theorem of optimal transport.
Michele Coscia’s paper (Coscia 2020) uses a graph Laplacian to calculate an approximate earth-mover distance over a graph topology. Buzzword use case: interpretably inferring the graph transmission rate of a disease. This looks simple; surely it must be a known result in optimal transport metric studies?
References
Altschuler, Bach, Rudi, et al. 2019.
“Massively Scalable Sinkhorn Distances via the Nyström Method.” In
Advances in Neural Information Processing Systems 32.
Ambrosio, and Gigli. 2013.
“A User’s Guide to Optimal Transport.” In
Modelling and Optimisation of Flows on Networks: Cetraro, Italy 2009, Editors: Benedetto Piccoli, Michel Rascle. Lecture Notes in Mathematics.
Ambrosio, Gigli, and Savare. 2008.
Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics. ETH Zürich.
Angenent, Haker, and Tannenbaum. 2003.
“Minimizing Flows for the Monge-Kantorovich Problem.” SIAM Journal on Mathematical Analysis.
Arjovsky, Chintala, and Bottou. 2017.
“Wasserstein Generative Adversarial Networks.” In
International Conference on Machine Learning.
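Arora, Ge, Liang, et al. 2017.
“Generalization and Equilibrium in Generative Adversarial Nets (GANs).” In
International Conference on Machine Learning.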
Bachem, Lucic, and Krause. 2017.
“Practical Coreset Constructions for Machine Learning.” arXiv Preprint arXiv:1703.06476.
Blanchet, Jambulapati, Kent, et al. 2018.
“Towards Optimal Running Times for Optimal Transport.” arXiv:1810.07717 [Cs].
Carlier, Cuturi, Pass, et al. 2017. “Optimal Transport Meets Probability, Statistics and Machine Learning.”
Champion, De Pascale, and Juutinen. 2008.
“The ∞-Wasserstein Distance: Local Solutions and Existence of Optimal Transport Maps.” SIAM Journal on Mathematical Analysis.
Chizat, and Bach. 2018.
“On the Global Convergence of Gradient Descent for over-Parameterized Models Using Optimal Transport.” In
Proceedings of the 32nd International Conference on Neural Information Processing Systems. NIPS’18.
Chizat, Oyallon, and Bach. 2018.
“On Lazy Training in Differentiable Programming.” arXiv:1812.07956 [Cs, Math].
Coscia. 2020. “Generalized Euclidean Measure to Estimate Network Distances.”
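Cuturi. 2013.
“Sinkhorn Distances: Lightspeed Computation of Optimal Transport.” In
Advances in Neural Information Processing Systems 26.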
Dowson, and Landau. 1982.
“The Fréchet Distance Between Multivariate Normal Distributions.” Journal of Multivariate Analysis.
Dudley. 2002. Real Analysis and Probability. Cambridge Studies in Advanced Mathematics 74.
Fatras, Zine, Flamary, et al. 2020.
“Learning with Minibatch Wasserstein : Asymptotic and Gradient Properties.” In
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics.
Fernholz. 1983. Von Mises Calculus for Statistical Functionals. Lecture Notes in Statistics 19.
Feydy, Séjourné, Vialard, et al. 2019.
“Interpolating Between Optimal Transport and MMD Using Sinkhorn Divergences.” In
Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics.
Frogner, Zhang, Mobahi, et al. 2015.
“Learning with a Wasserstein Loss.” In
Advances in Neural Information Processing Systems 28.
Givens, and Shortt. 1984.
“A Class of Wasserstein Metrics for Probability Distributions.” The Michigan Mathematical Journal.
Gozlan, and Léonard. 2010.
“Transport Inequalities. A Survey.” arXiv:1003.3852 [Math].
Guo, Hong, Lin, et al. 2017.
“Relaxed Wasserstein with Applications to GANs.” arXiv:1705.07164 [Cs, Stat].
Henry-Labordère. 2017. Model-Free Hedging: A Martingale Optimal Transport Viewpoint. Chapman and Hall CRC Financial Mathematics 1.0.
Huggins, Jonathan, Adams, and Broderick. 2017.
“PASS-GLM: Polynomial Approximate Sufficient Statistics for Scalable Bayesian GLM Inference.” In
Advances in Neural Information Processing Systems 30.
Huggins, Jonathan H., Campbell, Kasprzak, et al. 2018a.
“Scalable Gaussian Process Inference with Finite-Data Mean and Variance Guarantees.” arXiv:1806.10234 [Cs, Stat].
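Huggins, Jonathan H., Kasprzak, Campbell, et al. 2018b.
“Practical Bounds on the Error of Bayesian Posterior Approximations: A Nonasymptotic Approach.” arXiv Preprint.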
Kloeckner. 2021.
“Optimal Transportation and Stationary Measures for Iterated Function Systems.” Mathematical Proceedings of the Cambridge Philosophical Society.
Knott, and Smith. 1984.
“On the Optimal Mapping of Distributions.” Journal of Optimization Theory and Applications.
Liu, Huidong, Gu, and Samaras. 2018.
“A Two-Step Computation of the Exact GAN Wasserstein Distance.” In
International Conference on Machine Learning.
Liu, Qiang, and Wang. 2019.
“Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm.” In
Advances In Neural Information Processing Systems.
Mahdian, Blanchet, and Glynn. 2019.
“Optimal Transport Relaxations with Application to Wasserstein GANs.” arXiv:1906.03317 [Cs, Math, Stat].
Marzouk, Moselhy, Parno, et al. 2016.
“Sampling via Measure Transport: An Introduction.” In
Handbook of Uncertainty Quantification.
Maurya. 2018. “Optimal Transport in Statistical Machine Learning: Selected Review and Some Open Questions.”
Montavon, Müller, and Cuturi. 2016.
“Wasserstein Training of Restricted Boltzmann Machines.” In
Advances in Neural Information Processing Systems 29.
Panaretos, and Zemel. 2019.
“Statistical Aspects of Wasserstein Distances.” Annual Review of Statistics and Its Application.
———. 2020.
An Invitation to Statistics in Wasserstein Space. SpringerBriefs in Probability and Mathematical Statistics.
Peyre. 2019. “Course Notes on Computational Optimal Transport.”
Piccoli, and Rossi. 2016.
“On Properties of the Generalized Wasserstein Distance.” Archive for Rational Mechanics and Analysis.
Santambrogio. 2015.
Optimal Transport for Applied Mathematicians. Progress in Nonlinear Differential Equations and Their Applications.
Singh, and Póczos. 2018.
“Minimax Distribution Estimation in Wasserstein Distance.” arXiv:1802.08855 [Cs, Math, Stat].
Spantini, Bigoni, and Marzouk. 2017.
“Inference via Low-Dimensional Couplings.” Journal of Machine Learning Research.
Thorpe. 2018. “Introduction to Optimal Transport.”
Verdinelli, and Wasserman. 2019.
“Hybrid Wasserstein Distance and Fast Distribution Clustering.” Electronic Journal of Statistics.
Villani. 2009.
Optimal Transport: Old and New. Grundlehren Der Mathematischen Wissenschaften.
Zemel. 2012. “Optimal Transportation: Continuous and Discrete.”
Zhu, Jiao, and Tse. 2020.
“Deconstructing Generative Adversarial Networks.” IEEE Transactions on Information Theory.