I presume there are other uses for optimal transport distances apart from as probability metrics, but so far I only care about them in that context, so this will be skewed that way.

I am about to do a reading group based on Peyré’s course, so I may be harmonising the notation here with that soon.

Let \((M,d)\) be a metric space for which every probability measure on \(M\) is a Radon measure. For \(p\ge 1\), let \(\mathcal{P}_p(M)\) denote the collection of all probability measures \(P\) on \(M\) with finite \(p^{\text{th}}\) moment about some (equivalently, every) \(x_0\) in \(M\),

\[\int_{M} d(x, x_{0})^{p} \, \mathrm{d} P (x) < +\infty.\]

Then the \(p^{\text{th}}\) *Wasserstein distance* between two probability measures
\(P\) and \(Q\) in \(\mathcal{P}_p(M)\) is defined as

\[W_{p} ( P , Q ):= \left( \inf_{\gamma \in \Pi ( P , Q )} \int_{M \times M} d(x, y)^{p} \, \mathrm{d} \gamma (x, y) \right)^{1/p},\]

where \(\Pi( P , Q )\) denotes the collection of all couplings, i.e. probability measures on \(M \times M\) with marginals \(P\) and \(Q\) respectively.

Practically, one usually sees \(p\in\{1,2\}\). For \(p=1\),

\[W_1(P,Q)=\inf_{\gamma \in \Pi( P , Q )}\mathbb{E}_{(x,y)\sim \gamma}\left[d(x,y)\right].\]

This is frequently intractable, or at least has no closed form, but you can find it for some useful special cases, or bound/approximate it in others.
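One tractable special case is measures on the real line, where the optimal coupling is monotone, so \(W_p\) between two equal-size empirical measures reduces to matching sorted samples. A minimal numpy sketch (the `wasserstein_1d` name is mine):

```python
import numpy as np

def wasserstein_1d(x, y, p=1):
    """W_p between two equal-size empirical measures on the line.

    In 1-D the optimal coupling is monotone, so it suffices to sort
    both samples and match them order statistic by order statistic.
    """
    x, y = np.sort(x), np.sort(y)
    return np.mean(np.abs(x - y) ** p) ** (1 / p)

# Shifting a point cloud by one unit moves it by exactly W_1 = 1.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
print(wasserstein_1d(x, x + 1.0))  # close to 1.0
```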

🏗 discuss favourable properties of this metric (triangle inequality, bounds on moments etc).

But why would we care about such an intractable distance? Because it gives good error bounds. We know that if \(W_p(\nu,\hat{\nu}) \leq \epsilon\), then for any \(L\)-Lipschitz function \(\phi\), \(|\nu(\phi) - \hat{\nu}(\phi)| \leq L\epsilon.\) See (J. H. Huggins et al. 2018b, 2018a) for some specifics.
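The bound is easy to check numerically for \(p=1\) in one dimension (since \(W_p \geq W_1\), a \(W_p\) bound implies it too). A sketch using scipy's exact 1-D \(W_1\), with \(\sin\) as an off-the-shelf 1-Lipschitz test function:

```python
import numpy as np
from scipy.stats import wasserstein_distance  # exact 1-D W_1

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=2000)   # samples from nu
y = rng.normal(0.3, 1.2, size=2000)   # samples from nu-hat

eps = wasserstein_distance(x, y)           # W_1 between empirical measures
phi = np.sin                               # a 1-Lipschitz test function
gap = abs(phi(x).mean() - phi(y).mean())   # |nu(phi) - nu_hat(phi)|
print(gap, "<=", 1.0 * eps)                # the bound with L = 1
```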

## Analytic expressions

### Gaussian

Useful: the Wasserstein-2 distance between two Gaussians has a closed form (Givens and Shortt 1984). Here \(W_2(\mu,\nu):=\left(\inf_{\gamma\in\Pi(\mu,\nu)}\mathbb{E}_{(X,Y)\sim\gamma}\Vert X-Y\Vert_2^2\right)^{1/2}\) for \(X\sim\mu\), \(Y\sim\nu\).

\[\begin{aligned} d&:= W_2(\mathcal{N}(m_1,\Sigma_1);\mathcal{N}(m_2,\Sigma_2))\\ \Rightarrow d^2&= \Vert m_1-m_2\Vert_2^2 + \mathrm{Tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}\]
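This formula is a one-liner given a matrix square root; a sketch (the `w2_gaussian` name is mine):

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(m1, S1, m2, S2):
    """Givens-Shortt closed form for W_2 between two Gaussians."""
    rS1 = sqrtm(S1)
    cross = sqrtm(rS1 @ S2 @ rS1)      # (S1^{1/2} S2 S1^{1/2})^{1/2}
    d2 = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * np.real(cross))
    return np.sqrt(max(d2, 0.0))       # clip tiny negative round-off

# Equal covariances: the trace term vanishes, leaving the mean shift.
m = np.zeros(2)
S = np.array([[2.0, 0.3], [0.3, 1.0]])
print(w2_gaussian(m, S, m + 1.0, S))   # close to sqrt(2)
```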

## Kantorovich-Rubinstein duality

Vincent Hermann gives an excellent practical introduction.
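In the discrete case the duality can be checked directly: \(W_1\) equals the supremum of \(\sum_i f_i(p_i - q_i)\) over 1-Lipschitz \(f\), which for finitely many atoms is a small linear program. A sketch using scipy (the atoms and weights are arbitrary made-up data):

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import wasserstein_distance

# Atoms on the line and two probability vectors supported on them.
x = np.array([0.0, 1.0, 2.0, 4.0])
p = np.array([0.4, 0.1, 0.3, 0.2])
q = np.array([0.1, 0.4, 0.2, 0.3])

# Dual LP: maximise f.(p - q) subject to f_i - f_j <= |x_i - x_j|.
n = len(x)
rows, b = [], []
for i in range(n):
    for j in range(n):
        if i != j:
            row = np.zeros(n)
            row[i], row[j] = 1.0, -1.0
            rows.append(row)
            b.append(abs(x[i] - x[j]))
# Pin f_0 = 0 to remove the additive-constant degeneracy.
res = linprog(-(p - q), A_ub=np.array(rows), b_ub=np.array(b),
              bounds=[(0, 0)] + [(None, None)] * (n - 1))
dual = -res.fun
primal = wasserstein_distance(x, x, p, q)  # 1-D W_1, computed directly
print(primal, dual)                        # the two values agree
```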

## “Neural Net distance”

A Wasserstein-type distance with a baked-in notion of the capacity of the function class that approximates the true Wasserstein distance (Arora et al. 2017). Is this actually used?

## Fisher distance

Specifically \((p,\nu)\)-Fisher distances, in the terminology of (J. H. Huggins et al. 2018b). They use these distances as a computationally tractable proxy (in fact, bound) for Wasserstein distances during inference. See Fisher distances.

## Sinkhorn divergence

A regularised version of a Wasserstein metric (Cuturi 2013):

\[W_{p,\eta} ( P , Q )^p:= \inf_{\gamma \in \Pi ( P , Q )} \int_{M \times M} d(x, y)^{p} \, \mathrm{d} \gamma (x, y) - \eta H(\gamma).\]

Here \(H(\gamma)\) is the entropy of the coupling and \(\eta>0\) sets the strength of the regularisation.

In practice this seems to be only applied to measures over finite sets (i.e. histograms, weighted point sets), where there are many neat tricks to make calculations tractable. (Altschuler et al. 2019; Blanchet et al. 2018)
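The neatest of those tricks is Sinkhorn's matrix-scaling iteration itself: rescale the rows and columns of the Gibbs kernel \(K = e^{-C/\eta}\) until the plan has the right marginals. A minimal numpy sketch (problem sizes and \(\eta\) chosen arbitrarily):

```python
import numpy as np

def sinkhorn_plan(a, b, C, eta, n_iter=200):
    """Entropy-regularised OT plan between histograms a and b, cost C.

    Alternately rescales rows and columns of K = exp(-C/eta) so that
    the plan diag(u) K diag(v) has marginals a and b (Cuturi-style).
    """
    K = np.exp(-C / eta)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # the transport plan gamma

x = np.linspace(0.0, 1.0, 5)
C = np.abs(x[:, None] - x[None, :])      # ground cost |x_i - x_j|
a = np.full(5, 0.2)
b = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
gamma = sinkhorn_plan(a, b, C, eta=0.5)
print(gamma.sum(axis=1), gamma.sum(axis=0))  # marginals ~ a and b
```

The regularised cost is then `np.sum(gamma * C)` (plus the entropy term, depending on which variant of the divergence one wants).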

TBD.

## Awaiting filing

Fundamental theorem of optimal transport. Michele Coscia’s new paper uses a graph Laplacian to calculate an approximate Earth mover distance over a graph topology. (buzzword use case: inferring graph transmission rate of a disease interpretably). This looks simple; surely it must be a known result in optimal transport metric studies?

## Recommended introductions

(Altschuler et al. 2019; Carlier et al. 2017; Thorpe 2018) have been recommended to me as compact modern introductions.

Peyré’s course to accompany Peyré and Cuturi (2019) has been recommended to me and comes with course notes.

## References

Altschuler, Jason, Francis Bach, Alessandro Rudi, and Jonathan Niles-Weed. 2019. “Massively Scalable Sinkhorn Distances via the Nyström Method.” In *Advances in Neural Information Processing Systems 32*. Curran Associates, Inc. https://papers.nips.cc/paper/8693-massively-scalable-sinkhorn-distances-via-the-nystrom-method.pdf.

*Modelling and Optimisation of Flows on Networks: Cetraro, Italy 2009, Editors: Benedetto Piccoli, Michel Rascle*, edited by Luigi Ambrosio, Alberto Bressan, Dirk Helbing, Axel Klar, and Enrique Zuazua, 1–155. Lecture Notes in Mathematics. Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-642-32160-3_1.

*Gradient Flows: In Metric Spaces and in the Space of Probability Measures*. 2nd ed. Lectures in Mathematics. ETH Zürich. Birkhäuser Basel. https://www.springer.com/gp/book/9783764387211.

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. 2017. “Wasserstein Generative Adversarial Networks.” In *International Conference on Machine Learning*, 214–23. http://proceedings.mlr.press/v70/arjovsky17a.html.

*Journal of Functional Analysis* 263 (8): 2430–57. https://doi.org/10.1016/j.jfa.2012.07.007.

*Annales de La Faculté Des Sciences de Toulouse Mathématiques* 14 (3): 331–52. https://doi.org/10.5802/afst.1095.

*SIAM Journal on Mathematical Analysis* 40 (1): 1–20. https://doi.org/10.1137/07069938X.

*Advances in Neural Information Processing Systems 31*, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 3036–46. Curran Associates, Inc. http://papers.nips.cc/paper/7567-on-the-global-convergence-of-gradient-descent-for-over-parameterized-models-using-optimal-transport.pdf.

Cuturi, Marco. 2013. “Sinkhorn Distances: Lightspeed Computation of Optimal Transport.” In *Advances in Neural Information Processing Systems 26*. https://arxiv.org/abs/1306.0895v1.

*Journal of Multivariate Analysis* 12 (3): 450–55. https://doi.org/10.1016/0047-259X(82)90077-X.

*Von Mises Calculus for Statistical Functionals*. Lecture Notes in Statistics 19. New York: Springer.

*Advances in Neural Information Processing Systems 28*, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2053–61. Curran Associates, Inc. http://papers.nips.cc/paper/5679-learning-with-a-wasserstein-loss.pdf.

Givens, Clark R., and Rae Michael Shortt. 1984. “A Class of Wasserstein Metrics for Probability Distributions.” *The Michigan Mathematical Journal* 31 (2): 231–40. https://doi.org/10.1307/mmj/1029003026.

*Model-Free Hedging: A Martingale Optimal Transport Viewpoint*. Chapman and Hall CRC Financial Mathematics 1.0. CRC Press.

*Advances in Neural Information Processing Systems 30*, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 3611–21. Curran Associates, Inc. http://papers.nips.cc/paper/6952-pass-glm-polynomial-approximate-sufficient-statistics-for-scalable-bayesian-glm-inference.pdf.

*Journal of Optimization Theory and Applications* 43 (1): 39–49. https://doi.org/10.1007/BF00934745.

*Discrete & Continuous Dynamical Systems - A* 34 (4): 1533. https://doi.org/10.3934/dcds.2014.34.1533.

*International Conference on Machine Learning*, 3159–68. http://proceedings.mlr.press/v80/liu18d.html.

*Advances In Neural Information Processing Systems*. http://arxiv.org/abs/1608.04471.

*Handbook of Uncertainty Quantification*, edited by Roger Ghanem, David Higdon, and Houman Owhadi, 1–41. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-11259-6_23-1.

*Mathematical Programming* 171 (1): 115–66. https://doi.org/10.1007/s10107-017-1172-1.

*Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 3711–19. Curran Associates, Inc. http://papers.nips.cc/paper/6248-wasserstein-training-of-restricted-boltzmann-machines.pdf.

*Linear Algebra and Its Applications* 48: 257–63. https://doi.org/10.1016/0024-3795(82)90112-4.

*Annual Review of Statistics and Its Application* 6 (1): 405–31. https://doi.org/10.1146/annurev-statistics-030718-104938.

Peyré, Gabriel, and Marco Cuturi. 2019. *Computational Optimal Transport*. Foundations and Trends in Machine Learning 11. https://doi.org/10.1561/2200000073.

*Optimal Transport for Applied Mathematicians*. Edited by Filippo Santambrogio. Progress in Nonlinear Differential Equations and Their Applications. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-20828-2_1.

*ACM Transactions on Graphics* 34 (4): 66:1–11. https://doi.org/10.1145/2766963.

*Journal of Machine Learning Research* 19 (66): 2639–709. http://arxiv.org/abs/1703.06131.

*Electronic Journal of Statistics* 13 (2): 5088–5119. https://doi.org/10.1214/19-EJS1639.

Villani, Cédric. 2009. *Optimal Transport: Old and New*. Grundlehren Der Mathematischen Wissenschaften. Berlin Heidelberg: Springer-Verlag. http://elenaher.dinauz.org/B07D.StFlour.pdf.

*IEEE Transactions on Information Theory* 66 (11): 7155–79. https://doi.org/10.1109/TIT.2020.2983698.