Distances between Gaussian distributions

Nearly equivalent to distances between symmetric positive definite matrices

June 27, 2016 — May 3, 2023

algebra
Gaussian
geometry
high d
linear algebra
measure
probability
signal processing
spheres
statistics

Since Gaussian approximations pop up a lot in e.g. variational approximation problems, it is nice to know how various probability metrics come out for them.

Since the “difficult” part of the problem is the distance between the covariances, this often ends up being the same as, or at least closely related to, the question of matrix norms, where the matrices in question are the positive (semi-)definite covariance/precision matrices.

1 Wasserstein

A useful analytic result about the Wasserstein-2 distance, i.e. $W_2(\mu;\nu) := \inf \mathbb{E}\bigl(\Vert X - Y\Vert_2^2\bigr)^{1/2}$ for $X\sim\nu$, $Y\sim\mu$. Two Gaussians may be related thusly (Givens and Shortt 1984; Takatsu 2008):

$$
\begin{aligned}
d &:= W_2\bigl(\mathcal{N}(\mu_1,\Sigma_1); \mathcal{N}(\mu_2,\Sigma_2)\bigr)\\
d^2 &= \Vert\mu_1-\mu_2\Vert_2^2 + \operatorname{tr}\Bigl(\Sigma_1+\Sigma_2-2\bigl(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}\bigr)^{1/2}\Bigr).
\end{aligned}
$$

In the centred case this is even simpler:

$$
\begin{aligned}
d &:= W_2\bigl(\mathcal{N}(0,\Sigma_1); \mathcal{N}(0,\Sigma_2)\bigr)\\
d^2 &= \operatorname{tr}\Bigl(\Sigma_1+\Sigma_2-2\bigl(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}\bigr)^{1/2}\Bigr).
\end{aligned}
$$
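As a sanity check, here is a minimal numerical sketch of the formula above, assuming NumPy/SciPy are available; the function name and the clipping of round-off error are my own choices, not from any particular library.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(mu1, Sigma1, mu2, Sigma2):
    """Wasserstein-2 distance between N(mu1, Sigma1) and N(mu2, Sigma2)."""
    root1 = sqrtm(Sigma1)                      # Sigma1^{1/2}
    cross = sqrtm(root1 @ Sigma2 @ root1)      # (Sigma1^{1/2} Sigma2 Sigma1^{1/2})^{1/2}
    d2 = np.sum((mu1 - mu2) ** 2) + np.trace(Sigma1 + Sigma2 - 2 * cross)
    return np.sqrt(max(d2.real, 0.0))          # discard tiny imaginary/negative round-off

# toy check: identical Gaussians should be at distance ~0
mu = np.zeros(3)
Sigma = np.array([[2.0, 0.3, 0.0], [0.3, 1.0, 0.1], [0.0, 0.1, 0.5]])
print(gaussian_w2(mu, Sigma, mu, Sigma))
```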

2 Kullback-Leibler

Pulled from Wikipedia:

$$
D_{\mathrm{KL}}\bigl(\mathcal{N}(\mu_1,\Sigma_1)\,\Vert\,\mathcal{N}(\mu_2,\Sigma_2)\bigr)
= \frac{1}{2}\left(\operatorname{tr}\bigl(\Sigma_2^{-1}\Sigma_1\bigr)
+ (\mu_2-\mu_1)^{T}\Sigma_2^{-1}(\mu_2-\mu_1)
- k
+ \ln\frac{\det\Sigma_2}{\det\Sigma_1}\right),
$$

where $k$ is the dimension.

In the centred case this reduces to

$$
D_{\mathrm{KL}}\bigl(\mathcal{N}(0,\Sigma_1)\,\Vert\,\mathcal{N}(0,\Sigma_2)\bigr)
= \frac{1}{2}\left(\operatorname{tr}\bigl(\Sigma_2^{-1}\Sigma_1\bigr)
- k
+ \ln\frac{\det\Sigma_2}{\det\Sigma_1}\right).
$$
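A small sketch of the general formula in code, assuming NumPy/SciPy; it uses a Cholesky solve rather than an explicit inverse, and the names are illustrative rather than from any library.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gaussian_kl(mu1, Sigma1, mu2, Sigma2):
    """KL( N(mu1, Sigma1) || N(mu2, Sigma2) ) for multivariate Gaussians."""
    k = mu1.shape[0]
    chol2 = cho_factor(Sigma2, lower=True)
    # tr(Sigma2^{-1} Sigma1) via a Cholesky solve
    trace_term = np.trace(cho_solve(chol2, Sigma1))
    diff = mu2 - mu1
    quad_term = diff @ cho_solve(chol2, diff)
    logdet_term = np.linalg.slogdet(Sigma2)[1] - np.linalg.slogdet(Sigma1)[1]
    return 0.5 * (trace_term + quad_term - k + logdet_term)
```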

3 Hellinger

Djalil Chafaï defines the Hellinger distance between measures $\mu$ and $\nu$, with respective densities $f$ and $g$ against a dominating measure $\lambda$, as

$$
H(\mu,\nu) = \bigl\Vert\sqrt{f}-\sqrt{g}\bigr\Vert_{L^2(\lambda)}
= \left(\int\bigl(\sqrt{f}-\sqrt{g}\bigr)^2\,d\lambda\right)^{1/2},
$$

via Hellinger affinity

$$
A(\mu,\nu) = \int\sqrt{fg}\,d\lambda,
\qquad
H(\mu,\nu)^2 = 2 - 2A(\mu,\nu).
$$

For univariate Gaussians, it apparently turns out that

$$
A\bigl(\mathcal{N}(m_1,\sigma_1^2),\mathcal{N}(m_2,\sigma_2^2)\bigr)
= \sqrt{\frac{2\sigma_1\sigma_2}{\sigma_1^2+\sigma_2^2}}\,
\exp\left(-\frac{(m_1-m_2)^2}{4(\sigma_1^2+\sigma_2^2)}\right).
$$

In multiple dimensions:

$$
A\bigl(\mathcal{N}(m_1,\Sigma_1),\mathcal{N}(m_2,\Sigma_2)\bigr)
= \frac{\det(\Sigma_1\Sigma_2)^{1/4}}{\det\left(\frac{\Sigma_1+\Sigma_2}{2}\right)^{1/2}}
\exp\left(-\frac{\bigl\langle\Delta m,(\Sigma_1+\Sigma_2)^{-1}\Delta m\bigr\rangle}{4}\right),
$$

where $\Delta m := m_1 - m_2$.
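A sketch of the multivariate affinity and the corresponding distance $H = \sqrt{2 - 2A}$, assuming NumPy; the determinant ratio is computed in log space for numerical stability, and the names are mine.

```python
import numpy as np

def gaussian_hellinger(m1, Sigma1, m2, Sigma2):
    """Hellinger distance H(N(m1, Sigma1), N(m2, Sigma2)) via the affinity A."""
    dm = m1 - m2
    avg = 0.5 * (Sigma1 + Sigma2)
    # log of det(Sigma1 Sigma2)^{1/4} / det((Sigma1+Sigma2)/2)^{1/2}
    log_prefactor = (
        0.25 * (np.linalg.slogdet(Sigma1)[1] + np.linalg.slogdet(Sigma2)[1])
        - 0.5 * np.linalg.slogdet(avg)[1]
    )
    quad = dm @ np.linalg.solve(Sigma1 + Sigma2, dm)
    affinity = np.exp(log_prefactor - quad / 4.0)
    return np.sqrt(max(2.0 - 2.0 * affinity, 0.0))
```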

4 “Natural” / geodesic

A high-speed introduction to the geometry of these distances is Moakher and Batchelor (2006), which discusses the problem of comparing covariance matrices on their natural manifold.

The geodesic distance between $P$ and $Q$ in $\mathcal{P}(n)$ is given by (Lang 1999)
$$
d_R(P,Q) = \bigl\Vert\log\bigl(P^{-1}Q\bigr)\bigr\Vert_F
= \left[\sum_{i=1}^n \log^2 \lambda_i\bigl(P^{-1}Q\bigr)\right]^{1/2},
$$
where $\lambda_i(P^{-1}Q),\ 1\le i\le n$, are the eigenvalues of the matrix $P^{-1}Q$. Because $P^{-1}Q$ is similar to $P^{-1/2}QP^{-1/2}$, the eigenvalues $\lambda_i(P^{-1}Q)$ are all positive and hence [this] is well defined for all $P$ and $Q$ in $\mathcal{P}(n)$.
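One way to compute this in practice (a sketch, assuming SciPy) is via the generalised symmetric eigenproblem $Qv = \lambda Pv$, whose eigenvalues coincide with those of $P^{-1}Q$:

```python
import numpy as np
from scipy.linalg import eigh

def spd_geodesic_distance(P, Q):
    """d_R(P, Q) = ||log(P^{-1} Q)||_F for symmetric positive-definite P, Q."""
    # eigh(Q, P) solves Q v = lambda P v, i.e. returns the eigenvalues of P^{-1} Q
    lam = eigh(Q, P, eigvals_only=True)
    return np.sqrt(np.sum(np.log(lam) ** 2))
```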

5 Maximum mean discrepancy

With the Gaussian kernel, there is a closed-form expression for the MMD of a sample from the standard Gaussian (Rustamov 2021).

The computation of the MMD requires specifying a positive-definite kernel; in this paper we always assume it to be the Gaussian RBF kernel of width $\gamma$, namely, $k(x,y) = e^{-\Vert x-y\Vert^2/(2\gamma^2)}$. Here, $x, y \in \mathbb{R}^d$, where $d$ is the dimension of the code/latent space, and we use $\Vert\cdot\Vert$ to denote the $\ell_2$ norm. The population MMD can be most straightforwardly computed via the formula (Gretton et al. 2012):
$$
\mathrm{MMD}^2(P,Q) = \mathbb{E}_{x,x'\sim P}[k(x,x')] - 2\,\mathbb{E}_{x\sim P,\,y\sim Q}[k(x,y)] + \mathbb{E}_{y,y'\sim Q}[k(y,y')].
$$

[U]sing the sample $Q_n = \{z_i\}_{i=1}^n$ of size $n$, we replace the last two terms by the sample average and the U-statistic respectively to obtain the unbiased estimator:
$$
\mathrm{MMD}_u^2(\mathcal{N}_d, Q_n)
= \mathbb{E}_{x,x'\sim \mathcal{N}_d}[k(x,x')]
- \frac{2}{n}\sum_{i=1}^n \mathbb{E}_{x\sim\mathcal{N}_d}[k(x,z_i)]
+ \frac{1}{n(n-1)}\sum_{i=1}^n\sum_{j\ne i} k(z_i,z_j).
$$
Our main result is the following proposition, whose proof can be found in Appendix C:

Proposition. The expectations in the expression above can be computed analytically to yield the formula:
$$
\mathrm{MMD}_u^2(\mathcal{N}_d, Q_n)
= \left(\frac{\gamma^2}{2+\gamma^2}\right)^{d/2}
- \frac{2}{n}\left(\frac{\gamma^2}{1+\gamma^2}\right)^{d/2}\sum_{i=1}^n e^{-\frac{\Vert z_i\Vert^2}{2(1+\gamma^2)}}
+ \frac{1}{n(n-1)}\sum_{i=1}^n\sum_{j\ne i} e^{-\frac{\Vert z_i - z_j\Vert^2}{2\gamma^2}}.
$$
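Here is a small sketch of that closed-form estimator, assuming NumPy; `gamma` is the RBF width from the quote and the function name is mine. A sample actually drawn from $\mathcal{N}(0, I_d)$ should give a value near zero (possibly slightly negative, since the estimator is unbiased rather than non-negative).

```python
import numpy as np

def mmd_u_vs_standard_normal(Z, gamma):
    """Unbiased MMD^2 between N(0, I_d) and the empirical sample Z of shape (n, d)."""
    n, d = Z.shape
    g2 = gamma ** 2
    # E_{x,x' ~ N_d}[k(x, x')]
    term1 = (g2 / (2.0 + g2)) ** (d / 2.0)
    # (2/n) * sum_i E_{x ~ N_d}[k(x, z_i)]
    term2 = (2.0 / n) * (g2 / (1.0 + g2)) ** (d / 2.0) * np.sum(
        np.exp(-np.sum(Z ** 2, axis=1) / (2.0 * (1.0 + g2)))
    )
    # U-statistic over the sample: mean of k(z_i, z_j) for i != j
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    off_diag = sq_dists[~np.eye(n, dtype=bool)]
    term3 = np.sum(np.exp(-off_diag / (2.0 * g2))) / (n * (n - 1))
    return term1 - term2 + term3

# toy check: a genuine standard-normal sample should score close to 0
rng = np.random.default_rng(0)
print(mmd_u_vs_standard_normal(rng.standard_normal((500, 2)), gamma=1.0))
```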

Extending this to arbitrary Gaussians, or analytic Gaussians, is left as an exercise.

Intuitively, we might prefer kernels other than the RBF if we are comparing Gaussians specifically; in particular, the RBF is “wasteful” in that it controls moments of all orders, whereas for Gaussians we might only care about the first two moments (mean and covariance).

6 Infinite-dimensional Gaussian measures

This gets much weirder, but we care about it in Function-valued GPs. See Quang (2021a, 2021b, 2022, 2023).

7 References

Givens, and Shortt. 1984. “A Class of Wasserstein Metrics for Probability Distributions.” The Michigan Mathematical Journal.
Gretton, Borgwardt, Rasch, et al. 2012. “A Kernel Two-Sample Test.” The Journal of Machine Learning Research.
Lang. 1999. Fundamentals of Differential Geometry. Graduate Texts in Mathematics.
Magnus, and Neudecker. 2019. Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley Series in Probability and Statistics.
Meckes. 2009. “On Stein’s Method for Multivariate Normal Approximation.” In High Dimensional Probability V: The Luminy Volume.
Minka. 2000. “Old and New Matrix Algebra Useful for Statistics.”
Moakher, and Batchelor. 2006. “Symmetric Positive-Definite Matrices: From Geometry to Applications and Visualization.” In Visualization and Processing of Tensor Fields.
Omladič, and Šemrl. 1990. “On the Distance Between Normal Matrices.” Proceedings of the American Mathematical Society.
Petersen, and Pedersen. 2012. “The Matrix Cookbook.”
Quang. 2021a. “Convergence and Finite Sample Approximations of Entropic Regularized Wasserstein Distances in Gaussian and RKHS Settings.”
———. 2021b. “Finite Sample Approximations of Exact and Entropic Wasserstein Distances Between Covariance Operators and Gaussian Processes.”
———. 2022. “Kullback-Leibler and Renyi Divergences in Reproducing Kernel Hilbert Space and Gaussian Process Settings.”
———. 2023. “Entropic Regularization of Wasserstein Distance Between Infinite-Dimensional Gaussian Measures and Gaussian Processes.” Journal of Theoretical Probability.
Rustamov. 2021. “Closed-Form Expressions for Maximum Mean Discrepancy with Applications to Wasserstein Auto-Encoders.” Stat.
Takatsu. 2008. “On Wasserstein Geometry of the Space of Gaussian Measures.”
Zhang, Liu, Chen, et al. 2022. “On the Properties of Kullback-Leibler Divergence Between Multivariate Gaussian Distributions.”