The Gaussian distribution

Also Erf and normality and such


Stunts with Gaussian distributions.

Let’s start with the basic thing: the (univariate) standard Gaussian pdf,

\[ \psi:x\mapsto \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right) \]

We define the corresponding cdf,

\[ \Psi:x\mapsto \int_{-\infty}^x\psi(t) dt \]

More generally we define

\[ \phi(x; \mu ,\sigma ^{2})={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}e^{-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}} \]

In the multivariate case, where the covariance \(\Sigma\) is strictly positive definite, we can write the density of the general normal distribution over \(\mathbb{R}^k\) as

\[ \psi({x}; \mu, \Sigma) = (2\pi )^{-{\frac {k}{2}}}\det({ {\Sigma }})^{-{\frac {1}{2}}}\,e^{-{\frac {1}{2}}( {x} -{ {\mu }})^{\!{\top}}{ {\Sigma }}^{-1}( {x} -{ {\mu }})} \]

If a random variable \(Y\) has a Gaussian distribution with parameters \(\mu, \Sigma\), we write

\[Y \sim \mathcal{N}(\mu, \Sigma)\]

What is Erf again?

The erf, or error function, is a rebranding and reparameterisation of the standard univariate normal cdf, popular in computer science; it gives a slightly differently ambiguous name to the already ambiguously named normal cdf.

But I can never remember what it is. There are scaling factors tacked on:

\[ \operatorname{erf}(x) = \frac{1}{\sqrt{\pi}} \int_{-x}^x e^{-t^2} \, dt \]

which is to say

\[\begin{aligned} \Psi(x) &={\frac {1}{2}}\left[1+\operatorname {erf} \left({\frac {x}{\sqrt {2}}}\right)\right]\\ \operatorname {erf}(x) &=2\Psi (\sqrt{2}x)-1\\ \end{aligned}\]
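A quick numerical sanity check of those identities, assuming scipy is available:

```python
# Verify the erf <-> standard normal cdf identities above.
import numpy as np
from scipy.special import erf
from scipy.stats import norm

x = np.linspace(-3, 3, 13)
# Psi(x) = (1/2)(1 + erf(x / sqrt(2)))
assert np.allclose(norm.cdf(x), 0.5 * (1 + erf(x / np.sqrt(2))))
# erf(x) = 2 Psi(sqrt(2) x) - 1
assert np.allclose(erf(x), 2 * norm.cdf(np.sqrt(2) * x) - 1)
```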

Done.

Representations of density and CDF

Left tail of icdf

For small \(p\), the quantile function has the useful asymptotic expansion

\[ \Psi^{-1}(p) = -\sqrt{\ln\frac{1}{p^2} - \ln\ln\frac{1}{p^2} - \ln(2\pi)} + o(1). \]
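A rough check of this expansion against scipy’s quantile function:

```python
# Compare the left-tail expansion with norm.ppf for progressively smaller p.
import numpy as np
from scipy.stats import norm

for p in [1e-3, 1e-6, 1e-10, 1e-15]:
    u = np.log(1 / p**2)  # = -2 ln p
    approx = -np.sqrt(u - np.log(u) - np.log(2 * np.pi))
    print(f"{p:.0e}  exact={norm.ppf(p):+.4f}  approx={approx:+.4f}")
# the absolute error shrinks (slowly) as p -> 0
```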

ODE representation for the univariate density

\[ \sigma ^2 \phi'(x)+\phi(x) (x-\mu )=0, \]

i.e. \(\phi\) is annihilated by the operator \(L = \sigma^2 D + (x-\mu)\), where \(D\) denotes differentiation.

With initial condition

\[ \phi(0) =\frac{e^{-\mu ^2/(2\sigma ^2)}}{\sqrt{2 \pi \sigma ^2 } }. \]

(The ODE is first order, so a single condition pins down the solution. Note that \(\phi'(0)=\mu\phi(0)/\sigma^2\), which vanishes only when \(\mu=0\).)

🏗 note where I learned this.
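A minimal sketch of using this representation, assuming scipy: integrate the ODE forward from \(x=0\) and compare with the closed-form density.

```python
# Integrate sigma^2 phi'(x) = (mu - x) phi(x) and compare with norm.pdf;
# mu and sigma here are arbitrary test values.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.stats import norm

mu, sigma = 1.0, 2.0
phi0 = np.exp(-mu**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

sol = solve_ivp(lambda x, phi: (mu - x) * phi / sigma**2,
                t_span=(0.0, 5.0), y0=[phi0], dense_output=True, rtol=1e-9)
xs = np.linspace(0.0, 5.0, 11)
print(np.max(np.abs(sol.sol(xs)[0] - norm.pdf(xs, mu, sigma))))  # tiny
```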

ODE representation for the univariate icdf

From (Steinbrecher and Shaw 2008) via Wikipedia.

Let us write \(w:=\Psi^{-1}\) to keep the notation clear.

\[\begin{aligned} {\frac {d^{2}w}{dp^{2}}} &=w\left({\frac {dw}{dp}}\right)^{2}\\ \end{aligned}\]

With initial conditions

\[\begin{aligned} w\left(1/2\right)&=0,\\ w'\left(1/2\right)&={\sqrt {2\pi }}. \end{aligned}\]
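Again this can be integrated numerically; a sketch, not a production quantile algorithm (the ODE blows up near \(p\in\{0,1\}\)):

```python
# Integrate the quantile ODE w'' = w (w')^2 from p = 1/2 and compare with norm.ppf.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.stats import norm

def rhs(p, y):
    w, dw = y  # y = (w, w')
    return [dw, w * dw**2]

sol = solve_ivp(rhs, t_span=(0.5, 0.99), y0=[0.0, np.sqrt(2 * np.pi)],
                dense_output=True, rtol=1e-10, atol=1e-12)
ps = np.linspace(0.5, 0.99, 9)
print(np.max(np.abs(sol.sol(ps)[0] - norm.ppf(ps))))  # tiny
```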

Density PDE representation as a diffusion equation

(Botev, Grotowski, and Kroese 2010)

\[\begin{aligned} \frac{\partial}{\partial t}\phi(x;t) &=\frac{1}{2}\frac{\partial^2}{\partial x^2}\phi(x;t)\\ \phi(x;0)&=\delta(x-\mu) \end{aligned}\]

Look, it’s the diffusion equation of a Wiener process, with \(t\) playing the role of the variance \(\sigma^2\). Surprise.
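A finite-difference spot check that the \(\mathcal{N}(\mu, t)\) density satisfies this heat equation, with arbitrarily chosen \(\mu\), \(t\):

```python
# Check d/dt phi = (1/2) d^2/dx^2 phi for phi(x; t) = N(mu, t) density,
# via central finite differences on the closed form.
import numpy as np
from scipy.stats import norm

mu, t, h = 0.5, 1.3, 1e-4
x = np.linspace(-3.0, 4.0, 15)

dphi_dt = (norm.pdf(x, mu, np.sqrt(t + h)) - norm.pdf(x, mu, np.sqrt(t - h))) / (2 * h)
d2phi_dx2 = (norm.pdf(x + h, mu, np.sqrt(t)) - 2 * norm.pdf(x, mu, np.sqrt(t))
             + norm.pdf(x - h, mu, np.sqrt(t))) / h**2
print(np.max(np.abs(dphi_dt - 0.5 * d2phi_dx2)))  # ~0 up to truncation error
```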

Rational approximations

🏗 For now, see e.g. Strecok (1968), Wichura (1988), Richards et al. (2010), and Roy and Choudhury (2012) in the references.

Roughness

Univariate: writing \(\phi_\sigma\) for a Gaussian density with standard deviation \(\sigma\) (the mean does not matter), the squared \(L^2\) norms of its derivatives are

\[\begin{aligned} \left\| \frac{d}{dx}\phi_\sigma \right\|_2^2 &= \frac{1}{4\sqrt{\pi}\sigma^3}\\ \left\| \left(\frac{d}{dx}\right)^n \phi_\sigma \right\|_2^2 &= \frac{(2n-1)!!}{2^{n+1}\sqrt{\pi}\,\sigma^{2n+1}} \end{aligned}\]
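Numerical confirmation by quadrature for \(n=0,1,2\), with hand-computed derivative expressions:

```python
# Check the squared-norm formula (2n-1)!! / (2^{n+1} sqrt(pi) sigma^{2n+1}).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def double_factorial(m):
    """(2n-1)!!, with the convention (-1)!! = 1."""
    out = 1
    while m > 1:
        out, m = out * m, m - 2
    return out

sigma = 1.7
derivs = [
    lambda x: norm.pdf(x, 0, sigma),                                   # n = 0
    lambda x: -(x / sigma**2) * norm.pdf(x, 0, sigma),                 # n = 1
    lambda x: ((x**2 - sigma**2) / sigma**4) * norm.pdf(x, 0, sigma),  # n = 2
]
for n, d in enumerate(derivs):
    lhs, _ = quad(lambda x: d(x)**2, -np.inf, np.inf)
    rhs = double_factorial(2 * n - 1) / (2**(n + 1) * np.sqrt(np.pi) * sigma**(2 * n + 1))
    print(n, lhs, rhs)  # each pair agrees
```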

Multidimensional marginals and conditionals

As made famous by Wiener processes in finance and Gaussian processes in Bayesian nonparametrics.

See, e.g., these lectures, or Michael I. Jordan’s backgrounders.
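For the record, the key closure facts. Partition a jointly Gaussian vector as

\[ \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right). \]

Then the marginal is simply \(X_1 \sim \mathcal{N}(\mu_1, \Sigma_{11})\), and the conditional is again Gaussian:

\[ X_1 \mid X_2 = x_2 \sim \mathcal{N}\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right). \]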

Transformed variables

An affine transform of a Gaussian is Gaussian: if \(Y \sim \mathcal{N}(\mu, \Sigma)\) then \(AY + b \sim \mathcal{N}(A\mu + b, A\Sigma A^{\top})\). A useful special case:

\[ Y \sim \mathcal{N}(X\beta, I) \]

implies

\[ W^{1/2}Y \sim \mathcal{N}(W^{1/2}X\beta, W) \]
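An empirical spot check of this, where \(W^{1/2}\) is the symmetric matrix square root and \(X\), \(\beta\), \(W\) are made-up test values:

```python
# Sample Y ~ N(X beta, I), apply W^{1/2}, and compare empirical moments
# with the claimed N(W^{1/2} X beta, W) law.
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 2))
beta = np.array([1.0, -2.0])
W = np.diag([0.5, 1.0, 2.0])
Wh = np.real(sqrtm(W))  # symmetric square root W^{1/2}

Y = rng.multivariate_normal(X @ beta, np.eye(3), size=200_000)
Z = Y @ Wh.T  # each row is W^{1/2} y
print(np.allclose(Z.mean(axis=0), Wh @ X @ beta, atol=0.02))  # True
print(np.allclose(np.cov(Z.T), W, atol=0.02))                 # True
```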

Metrics

Since Gaussian approximations pop up a lot in e.g. variational approximation problems, it is nice to know how to approximate them in probability metrics.

Wasserstein

Useful: two Gaussians may be related thusly in the Wasserstein-2 distance \(W_2(\mu;\nu):=\inf\mathbb{E}(\Vert X-Y\Vert_2^2)^{1/2}\), where the infimum is over all couplings of \(X\sim\nu\) and \(Y\sim\mu\).

\[\begin{aligned} d&:= W_2(\mathcal{N}(\mu_1,\Sigma_1);\mathcal{N}(\mu_2,\Sigma_2))\\ \Rightarrow d^2&= \Vert \mu_1-\mu_2\Vert_2^2 + \operatorname{tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}\]

In the centred case this is simply

\[\begin{aligned} d&:= W_2(\mathcal{N}(0,\Sigma_1);\mathcal{N}(0,\Sigma_2))\\ \Rightarrow d^2&= \operatorname{tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}\]

(Givens and Shortt 1984)
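The closed form translates directly to numpy, using scipy’s matrix square root:

```python
# Wasserstein-2 distance between two Gaussians via the closed form above.
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(mu1, S1, mu2, S2):
    rS1 = np.real(sqrtm(S1))
    cross = np.real(sqrtm(rS1 @ S2 @ rS1))  # (S1^{1/2} S2 S1^{1/2})^{1/2}
    d2 = np.sum((mu1 - mu2) ** 2) + np.trace(S1 + S2 - 2 * cross)
    return np.sqrt(max(d2, 0.0))

# e.g. d^2 = ||mu1 - mu2||^2 + tr(I + 4I - 2 * 2I) = 2 + 2, so d = 2
print(gaussian_w2(np.zeros(2), np.eye(2), np.ones(2), 4 * np.eye(2)))  # 2.0
```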

Kullback-Leibler

Pulled from Wikipedia, where \(k\) is the dimension:

\[ D_{\text{KL}}(\mathcal{N}(\mu_1,\Sigma_1)\parallel \mathcal{N}(\mu_2,\Sigma_2)) ={\frac {1}{2}}\left(\operatorname {tr} \left(\Sigma _{2}^{-1}\Sigma _{1}\right)+(\mu_{2}-\mu_{1})^{\mathsf {T}}\Sigma _{2}^{-1}(\mu_{2}-\mu_{1})-k+\ln \left({\frac {\det \Sigma _{2}}{\det \Sigma _{1}}}\right)\right).\]

In the centred case this reduces to

\[ D_{\text{KL}}(\mathcal{N}(0,\Sigma_1)\parallel \mathcal{N}(0, \Sigma_2)) ={\frac {1}{2}}\left(\operatorname{tr} \left(\Sigma _{2}^{-1}\Sigma _{1}\right)-k+\ln \left({\frac {\det \Sigma _{2}}{\det \Sigma _{1}}}\right)\right).\]
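In numpy, with slogdet for the log-determinant ratio:

```python
# KL divergence between multivariate Gaussians, straight from the formula above.
import numpy as np

def gaussian_kl(mu1, S1, mu2, S2):
    k = mu1.shape[0]
    S2inv = np.linalg.inv(S2)
    dmu = mu2 - mu1
    _, ld1 = np.linalg.slogdet(S1)
    _, ld2 = np.linalg.slogdet(S2)
    return 0.5 * (np.trace(S2inv @ S1) + dmu @ S2inv @ dmu - k + ld2 - ld1)

# 0.5 * (tr(0.5 I) - 2 + ln 4) = 0.5 * (1 - 2 + 1.386...) ~= 0.193
print(gaussian_kl(np.zeros(2), np.eye(2), np.zeros(2), 2 * np.eye(2)))
```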

Hellinger

Djalil defines both the Hellinger distance (here \(f, g\) are the densities of \(\mu, \nu\) with respect to a common dominating measure \(\lambda\))

\[\mathrm{H}(\mu,\nu) ={\Vert\sqrt{f}-\sqrt{g}\Vert}_{\mathrm{L}^2(\lambda)} =\Bigl(\int(\sqrt{f}-\sqrt{g})^2\,\mathrm{d}\lambda\Bigr)^{1/2}.\]

and Hellinger affinity

\[\mathrm{A}(\mu,\nu) =\int\sqrt{fg}\,\mathrm{d}\lambda, \quad \mathrm{H}(\mu,\nu)^2 =2-2\mathrm{A}(\mu,\nu).\]

For univariate Gaussians we can find this exactly:

\[\mathrm{A}(\mathcal{N}(m_1,\sigma_1^2),\mathcal{N}(m_2,\sigma_2^2)) =\sqrt{\frac{2\sigma_1\sigma_2}{\sigma_1^2+\sigma_2^2}} \exp\Bigl(-\frac{(m_1-m_2)^2}{4(\sigma_1^2+\sigma_2^2)}\Bigr).\]

In multiple dimensions, writing \(\Delta m := m_1 - m_2\):

\[\mathrm{A}(\mathcal{N}(m_1,\Sigma_1),\mathcal{N}(m_2,\Sigma_2)) =\frac{\det(\Sigma_1\Sigma_2)^{1/4}}{\det\bigl(\frac{\Sigma_1+\Sigma_2}{2}\bigr)^{1/2}} \exp\Bigl(-\frac{\langle\Delta m,(\Sigma_1+\Sigma_2)^{-1}\Delta m\rangle}{4}\Bigr).\]
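A quadrature check of the univariate affinity formula, with arbitrary parameters:

```python
# Compare the closed-form Hellinger affinity with direct integration of sqrt(f g).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

m1, s1, m2, s2 = 0.0, 1.0, 1.5, 2.0
closed = (np.sqrt(2 * s1 * s2 / (s1**2 + s2**2))
          * np.exp(-(m1 - m2)**2 / (4 * (s1**2 + s2**2))))
numeric, _ = quad(lambda x: np.sqrt(norm.pdf(x, m1, s1) * norm.pdf(x, m2, s2)),
                  -np.inf, np.inf)
print(closed, numeric)  # agree to quadrature tolerance
# and the Hellinger distance itself: H^2 = 2 - 2 A
print(np.sqrt(2 - 2 * closed))
```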

Botev, Z. I. 2017. “The Normal Law Under Linear Restrictions: Simulation and Estimation via Minimax Tilting.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79 (1): 125–48. https://doi.org/10.1111/rssb.12162.

Botev, Z. I., J. F. Grotowski, and D. P. Kroese. 2010. “Kernel Density Estimation via Diffusion.” The Annals of Statistics 38 (5): 2916–57. https://doi.org/10.1214/10-AOS799.

Givens, Clark R., and Rae Michael Shortt. 1984. “A Class of Wasserstein Metrics for Probability Distributions.” The Michigan Mathematical Journal 31 (2): 231–40. https://doi.org/10.1307/mmj/1029003026.

Richards, Winston A., Robin S, Ashok Sahai, and M. Raghunadh Acharya. 2010. “An Efficient Polynomial Approximation to the Normal Distribution Function and Its Inverse Function.” Journal of Mathematics Research 2 (4): p47. https://doi.org/10.5539/jmr.v2n4p47.

Roy, Paramita, and Amit Choudhury. 2012. “Approximate Evaluation of Cumulative Distribution Function of Central Sampling Distributions: A Review.” Electronic Journal of Applied Statistical Analysis 5 (1). https://doi.org/10.1285/i20705948v5n1p121.

Steinbrecher, György, and William T. Shaw. 2008. “Quantile Mechanics.” European Journal of Applied Mathematics 19 (2): 87–112. https://doi.org/10.1017/S0956792508007341.

Strecok, Anthony. 1968. “On the Calculation of the Inverse of the Error Function.” Mathematics of Computation 22 (101): 144–58. https://doi.org/10.1090/S0025-5718-1968-0223070-2.

Wichura, Michael J. 1988. “Algorithm AS 241: The Percentage Points of the Normal Distribution.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 37 (3): 477–84. https://doi.org/10.2307/2347330.