# Optimal transport metrics

## Wasserstein distances, Monge-Kantorovich metrics, Earthmover distances

I presume there are other uses for optimal transport distances apart from as probability metrics, but so far I only care about them in that context, so this will be skewed that way.

I am about to do a reading group based on Peyré’s course, so mayb be harmonising the notation here with that soon.

Let $$(M,d)$$ be a metric space for which every probability measure on $$M$$ is a Radon measure. For $$p\ge 1$$, let $$\mathcal{P}_p(M)$$ denote the collection of all probability measures $$P$$ on $$M$$ with finite $$p^{\text{th}}$$ moment for some $$x_0$$ in $$M$$,

$\int_{M} d(x, x_{0})^{p} \, \mathrm{d} P (x) < +\infty.$

Then the $$p^{\text{th}}$$ Wasserstein distance between two probability measures $$P$$ and $$Q$$ in $$\mathcal{P}_p(M)$$ is defined as

$W_{p} ( P , Q ):= \left( \inf_{\gamma \in \Pi ( P , Q )} \int_{M \times M} d(x, y)^{p} \, \mathrm{d} \gamma (x, y) \right)^{1/p},$

where $$\Pi( P , Q )$$ denotes the collection of all measures on $$M \times M$$ with marginal distributions $$P$$ and $$Q$$ respectively.

Practically, one usually sees $$p\in\{1,2\}$$. For $$p=1$$ then

$W_1(P,Q)=\inf_{\gamma \in \Pi( P , Q )}\mathbb{E}_{(x,y)\sim \gamma}\left[d(x,y)\right]$

This is frequently intractable, or at least has no closed form, but you can find it for some useful special cases, or bound/approximate it in others.

🏗 discuss favourable properties of this metric (triangle inequality, bounds on moments etc).

But why do you are about such an intractable distance? Because gives good error bounds. We know that if $$W_p(\nu\hat{nu}) \leq \epsilon$$, then for any L-Lipschitz function $$\phi$$, $$|\nu(\phi) - \hat{\nu}(\phi)| \leq L\epsilon.$$ See for some specifics.

## Analytic expressions

### Gaussian

Useful: Two Gaussians may be related thusly for a Wasserstein-2 $$W_2(\mu;\nu):=\inf\mathbb{E}(\Vert X-Y\Vert_2^2)^{1/2}$$ for $$X\sim\nu$$, $$Y\sim\mu$$.

\begin{aligned} d&:= W_2(\mathcal{N}(m_1,\Sigma_1);\mathcal{N}(m_2,\Sigma_2))\\ \Rightarrow d^2&= \Vert m_1-m_2\Vert_2^2 + \mathrm{Tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}

## Kontorovich-Rubinstein duality

Vincent Hermann gives an excellent practical introduction.

## “Neural Net distance”

Wasserstein distance with a baked in notion of the capacity of the function class which approximate the true Wasserstein. Is this actually used?

## Fisher distance

Specifically $$(p,\nu)$$-Fisher distances, in the terminology of . They use these distances as a computationally tractable proxy (in fact, bound) for Wasserstein distances during inference. See Fisher distances.

## Sinkhorn divergence

A regularised version of a Wasserstein metric.

$W_{p,\eta} ( P , Q )^p:= \inf_{\gamma \in \Pi ( P , Q )} \int_{M \times M} d(x, y)^{p} \, \mathrm{d} \gamma (x, y) -H(\gamma).$

Here $$H$$ is the entropy.

In practice this seems to be only applied to measures over finite sets (i.e. histograms, weighted point sets), where there are many neat tricks to make calculations tractable.

TBD.

## Awaiting filing

Fundamental theorem of optimal transport. Michele Coscia’s new paper uses a graph Laplacian to calculate an approximate Earth mover distance over a graph topology. (buzzword use case: inferring graph transmission rate of a disease interpretably). This looks simple; surely it must be a known result in optimal transport metric studies?

## References

Altschuler, Jason, Francis Bach, Alessandro Rudi, and Jonathan Niles-Weed. 2019. “Massively Scalable Sinkhorn Distances via the Nyström Method.” In Advances in Neural Information Processing Systems 32, 11. Curran Associates, Inc. https://papers.nips.cc/paper/8693-massively-scalable-sinkhorn-distances-via-the-nystrom-method.pdf.
Ambrosio, Luigi, and Nicola Gigli. 2013. “A User’s Guide to Optimal Transport.” In Modelling and Optimisation of Flows on Networks: Cetraro, Italy 2009, Editors: Benedetto Piccoli, Michel Rascle, edited by Luigi Ambrosio, Alberto Bressan, Dirk Helbing, Axel Klar, and Enrique Zuazua, 1–155. Lecture Notes in Mathematics. Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-642-32160-3_1.
Ambrosio, Luigi, Nicola Gigli, and Giuseppe Savare. 2008. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. 2nd ed. Lectures in Mathematics. ETH Zürich. Birkhäuser Basel. https://www.springer.com/gp/book/9783764387211.
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. 2017. “Wasserstein Generative Adversarial Networks.” In International Conference on Machine Learning, 214–23. http://proceedings.mlr.press/v70/arjovsky17a.html.
Arora, Sanjeev, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. 2017. “Generalization and Equilibrium in Generative Adversarial Nets (GANs).” March 1, 2017. http://arxiv.org/abs/1703.00573.
Arras, Benjamin, Ehsan Azmoodeh, Guillaume Poly, and Yvik Swan. 2017. “A Bound on the 2-Wasserstein Distance Between Linear Combinations of Independent Random Variables.” April 5, 2017. http://arxiv.org/abs/1704.01376.
Bachem, Olivier, Mario Lucic, and Andreas Krause. 2017. “Practical Coreset Constructions for Machine Learning.” 2017. https://arxiv.org/abs/1703.06476.
Blanchet, Jose, Lin Chen, and Xun Yu Zhou. 2018. “Distributionally Robust Mean-Variance Portfolio Selection with Wasserstein Distances.” February 13, 2018. http://arxiv.org/abs/1802.04885.
Blanchet, Jose, Arun Jambulapati, Carson Kent, and Aaron Sidford. 2018. “Towards Optimal Running Times for Optimal Transport.” October 17, 2018. http://arxiv.org/abs/1810.07717.
Blanchet, Jose, Yang Kang, and Karthyek Murthy. 2016. “Robust Wasserstein Profile Inference and Applications to Machine Learning.” October 18, 2016. http://arxiv.org/abs/1610.05627.
Blanchet, Jose, Karthyek Murthy, and Nian Si. 2019. “Confidence Regions in Wasserstein Distributionally Robust Estimation.” June 4, 2019. http://arxiv.org/abs/1906.01614.
Blanchet, Jose, Karthyek Murthy, and Fan Zhang. 2018. “Optimal Transport Based Distributionally Robust Optimization: Structural Properties and Iterative Schemes.” October 4, 2018. http://arxiv.org/abs/1810.02403.
Bolley, François, Ivan Gentil, and Arnaud Guillin. 2012. “Convergence to Equilibrium in Wasserstein Distance for FokkerPlanck Equations.” Journal of Functional Analysis 263 (8): 2430–57. https://doi.org/10.1016/j.jfa.2012.07.007.
Bolley, François, and Cédric Villani. 2005. “Weighted Csiszár-Kullback-Pinsker Inequalities and Applications to Transportation Inequalities.” Annales de La Faculté Des Sciences de Toulouse Mathématiques 14 (3): 331–52. https://doi.org/10.5802/afst.1095.
Canas, Guillermo D., and Lorenzo Rosasco. 2012. “Learning Probability Measures with Respect to Optimal Transport Metrics.” September 5, 2012. http://arxiv.org/abs/1209.1077.
Carlier, Guillaume, Marco Cuturi, Brendan Pass, and Carola Schoenlieb. 2017. “Optimal Transport Meets Probability, Statistics and Machine Learning,” 9.
Champion, T., L. De Pascale, and P. Juutinen. 2008. “The $\infty$-Wasserstein Distance: Local Solutions and Existence of Optimal Transport Maps.” SIAM Journal on Mathematical Analysis 40 (1): 1–20. https://doi.org/10.1137/07069938X.
Chizat, Lenaic, Edouard Oyallon, and Francis Bach. 2018. “On Lazy Training in Differentiable Programming.” December 19, 2018. http://arxiv.org/abs/1812.07956.
Chizat, Lénaïc, and Francis Bach. 2018. “On the Global Convergence of Gradient Descent for Over-Parameterized Models Using Optimal Transport.” In Advances in Neural Information Processing Systems 31, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 3036–46. Curran Associates, Inc. http://papers.nips.cc/paper/7567-on-the-global-convergence-of-gradient-descent-for-over-parameterized-models-using-optimal-transport.pdf.
Chu, Casey, Jose Blanchet, and Peter Glynn. 2019. “Probability Functional Descent: A Unifying Perspective on GANs, Variational Inference, and Reinforcement Learning.” January 30, 2019. http://arxiv.org/abs/1901.10691.
Coscia, Michele. 2020. “Generalized Euclidean Measure to Estimate Network Distances,” 11.
Cuturi, Marco. 2013. “Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances.” In Advances in Neural Information Processing Systems 26. https://arxiv.org/abs/1306.0895v1.
Dowson, D. C., and B. V. Landau. 1982. “The Fréchet Distance Between Multivariate Normal Distributions.” Journal of Multivariate Analysis 12 (3): 450–55. https://doi.org/10.1016/0047-259X(82)90077-X.
Fernholz, Luisa Turrin. 1983. Von Mises Calculus for Statistical Functionals. Lecture Notes in Statistics 19. New York: Springer.
Frogner, Charlie, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio. 2015. “Learning with a Wasserstein Loss.” In Advances in Neural Information Processing Systems 28, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2053–61. Curran Associates, Inc. http://papers.nips.cc/paper/5679-learning-with-a-wasserstein-loss.pdf.
Gao, Rui, and Anton J. Kleywegt. 2016. “Distributionally Robust Stochastic Optimization with Wasserstein Distance.” April 7, 2016. http://arxiv.org/abs/1604.02199.
Givens, Clark R., and Rae Michael Shortt. 1984. “A Class of Wasserstein Metrics for Probability Distributions.” The Michigan Mathematical Journal 31 (2): 231–40. https://doi.org/10.1307/mmj/1029003026.
Guo, Xin, Johnny Hong, Tianyi Lin, and Nan Yang. 2017. “Relaxed Wasserstein with Applications to GANs.” May 19, 2017. http://arxiv.org/abs/1705.07164.
Henry-Labordère, Pierre. 2017. Model-Free Hedging: A Martingale Optimal Transport Viewpoint. Chapman and Hall CRC Financial Mathematics 1.0. CRC Press.
Huggins, Jonathan H., Trevor Campbell, Mikołaj Kasprzak, and Tamara Broderick. 2018a. “Scalable Gaussian Process Inference with Finite-Data Mean and Variance Guarantees.” June 26, 2018. http://arxiv.org/abs/1806.10234.
———. 2018b. “Practical Bounds on the Error of Bayesian Posterior Approximations: A Nonasymptotic Approach.” September 25, 2018. http://arxiv.org/abs/1809.09505.
Huggins, Jonathan, Ryan P Adams, and Tamara Broderick. 2017. PASS-GLM: Polynomial Approximate Sufficient Statistics for Scalable Bayesian GLM Inference.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 3611–21. Curran Associates, Inc. http://papers.nips.cc/paper/6952-pass-glm-polynomial-approximate-sufficient-statistics-for-scalable-bayesian-glm-inference.pdf.
Ji, Kaiyi, and Yingbin Liang. 2018. “Minimax Estimation of Neural Net Distance,” November. https://arxiv.org/abs/1811.01054v1.
Knott, M., and C. S. Smith. 1984. “On the Optimal Mapping of Distributions.” Journal of Optimization Theory and Applications 43 (1): 39–49. https://doi.org/10.1007/BF00934745.
Kontorovich, Aryeh, and Maxim Raginsky. 2016. “Concentration of Measure Without Independence: A Unified Approach via the Martingale Method.” February 1, 2016. http://arxiv.org/abs/1602.00721.
Léonard, Christian. 2014. “A Survey of the Schrödinger Problem and Some of Its Connections with Optimal Transport.” Discrete & Continuous Dynamical Systems - A 34 (4): 1533. https://doi.org/10.3934/dcds.2014.34.1533.
Liu, Huidong, Xianfeng Gu, and Dimitris Samaras. 2018. “A Two-Step Computation of the Exact GAN Wasserstein Distance.” In International Conference on Machine Learning, 3159–68. http://proceedings.mlr.press/v80/liu18d.html.
Liu, Qiang, and Dilin Wang. 2019. “Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm.” In Advances In Neural Information Processing Systems. http://arxiv.org/abs/1608.04471.
Mahdian, Saied, Jose Blanchet, and Peter Glynn. 2019. “Optimal Transport Relaxations with Application to Wasserstein GANs.” June 7, 2019. https://arxiv.org/abs/1906.03317v1.
Marzouk, Youssef, Tarek Moselhy, Matthew Parno, and Alessio Spantini. 2016. “Sampling via Measure Transport: An Introduction.” In Handbook of Uncertainty Quantification, edited by Roger Ghanem, David Higdon, and Houman Owhadi, 1–41. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-11259-6_23-1.
Maurya, Abhinav. 2018. “Optimal Transport in Statistical Machine Learning : Selected Review and Some Open Questions.” In.
Mohajerin Esfahani, Peyman, and Daniel Kuhn. 2018. “Data-Driven Distributionally Robust Optimization Using the Wasserstein Metric: Performance Guarantees and Tractable Reformulations.” Mathematical Programming 171 (1): 115–66. https://doi.org/10.1007/s10107-017-1172-1.
Montavon, Grégoire, Klaus-Robert Müller, and Marco Cuturi. 2016. “Wasserstein Training of Restricted Boltzmann Machines.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 3711–19. Curran Associates, Inc. http://papers.nips.cc/paper/6248-wasserstein-training-of-restricted-boltzmann-machines.pdf.
Olkin, I., and F. Pukelsheim. 1982. “The Distance Between Two Random Vectors with Given Dispersion Matrices.” Linear Algebra and Its Applications 48: 257–63. https://doi.org/10.1016/0024-3795(82)90112-4.
Panaretos, Victor M., and Yoav Zemel. 2019. “Statistical Aspects of Wasserstein Distances.” Annual Review of Statistics and Its Application 6 (1): 405–31. https://doi.org/10.1146/annurev-statistics-030718-104938.
Peyre, Gabriel. 2019. “Course Notes on Computational Optimal Transport,” 46.
Peyré, Gabriel, and Marco Cuturi. 2019. Computational Optimal Transport. Vol. 11. https://doi.org/10.1561/2200000073.
Rustamov, Raif M. 2019. “Closed-Form Expressions for Maximum Mean Discrepancy with Applications to Wasserstein Auto-Encoders.” January 10, 2019. http://arxiv.org/abs/1901.03227.
Santambrogio, Filippo. 2015. Optimal Transport for Applied Mathematicians. Edited by Filippo Santambrogio. Progress in Nonlinear Differential Equations and Their Applications. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-20828-2_1.
Singh, Shashank, and Barnabás Póczos. 2018. “Minimax Distribution Estimation in Wasserstein Distance.” February 24, 2018. http://arxiv.org/abs/1802.08855.
Solomon, Justin, Fernando de Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. 2015. “Convolutional Wasserstein Distances: Efficient Optimal Transportation on Geometric Domains.” ACM Transactions on Graphics 34 (4): 66:1–11. https://doi.org/10.1145/2766963.
Spantini, Alessio, Daniele Bigoni, and Youssef Marzouk. 2017. “Inference via Low-Dimensional Couplings.” Journal of Machine Learning Research 19 (66): 2639–709. http://arxiv.org/abs/1703.06131.
Takatsu, Asuka. 2008. “On Wasserstein Geometry of the Space of Gaussian Measures,” January. https://arxiv.org/abs/0801.2250v3.
Thorpe, Matthew. 2018. “Introduction to Optimal Transport,” 56.
Verdinelli, Isabella, and Larry Wasserman. 2019. “Hybrid Wasserstein Distance and Fast Distribution Clustering.” Electronic Journal of Statistics 13 (2): 5088–5119. https://doi.org/10.1214/19-EJS1639.
Villani, Cédric. 2009. Optimal Transport: Old and New. Grundlehren Der Mathematischen Wissenschaften. Berlin Heidelberg: Springer-Verlag. http://elenaher.dinauz.org/B07D.StFlour.pdf.
Weed, Jonathan, and Francis Bach. 2017. “Sharp Asymptotic and Finite-Sample Rates of Convergence of Empirical Measures in Wasserstein Distance.” June 30, 2017. http://arxiv.org/abs/1707.00087.
Zemel, Yoav. 2012. “Optimal Transportation: Continuous and Discrete.”
Zhu, B., J. Jiao, and D. Tse. 2020. “Deconstructing Generative Adversarial Networks.” IEEE Transactions on Information Theory 66 (11): 7155–79. https://doi.org/10.1109/TIT.2020.2983698.