# The Gaussian distribution

## Also Erf and normality and such

Stunts with Gaussian distributions.

Let’s start here with the basic thing. The (univariate) standard Gaussian pdf is

$\psi:x\mapsto \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)$

We define the corresponding cdf

$\Psi:x\mapsto \int_{-\infty}^x\psi(t) dt$

More generally we define the density with mean $$\mu$$ and variance $$\sigma^2$$

$\phi(x; \mu ,\sigma ^{2})={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}e^{-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}}$

In the multivariate case, where the covariance $$\Sigma$$ is strictly positive definite, we can write the density of the general normal distribution over $$\mathbb{R}^k$$ as

$\psi({x}; \mu, \Sigma) = (2\pi )^{-{\frac {k}{2}}}\det({ {\Sigma }})^{-{\frac {1}{2}}}\,e^{-{\frac {1}{2}}( {x} -{ {\mu }})^{\!{\top}}{ {\Sigma }}^{-1}( {x} -{ {\mu }})}$

If a random variable $$Y$$ has a Gaussian distribution with parameters $$\mu, \Sigma$$, we write

$Y \sim \mathcal{N}(\mu, \Sigma)$
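As a quick sanity check of the univariate density formula, here is a sketch using only the Python standard library (the helper name `gaussian_pdf` is mine):

```python
import math
from statistics import NormalDist

def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    """phi(x; mu, sigma^2), transcribed from the formula above."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Cross-check against the standard library's implementation.
assert abs(gaussian_pdf(0.7, mu=1.0, sigma2=4.0)
           - NormalDist(mu=1.0, sigma=2.0).pdf(0.7)) < 1e-12
```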

## What is Erf again?

The erf, or error function, is a rebranding and reparameterisation of the standard univariate normal cdf, popular in computer science; it gives a slightly differently ambiguous name to the already ambiguously named normal cdf.

But I can never remember what it is. There are scaling factors tacked on.

$\operatorname{erf}(x) = \frac{1}{\sqrt{\pi}} \int_{-x}^x e^{-t^2} \, dt$

which is to say (writing $$\Phi := \Psi$$ for the standard normal cdf)

\begin{aligned} \Phi(x) &={\frac {1}{2}}\left[1+\operatorname {erf} \left({\frac {x}{\sqrt {2}}}\right)\right]\\ \operatorname {erf}(x) &=2\Phi (\sqrt{2}x)-1\\ \end{aligned}
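Both identities are easy to check numerically with the standard library’s `math.erf` and `statistics.NormalDist` (a sketch, not specific to anything here):

```python
import math
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal cdf

for x in (-2.0, -0.5, 0.0, 0.3, 1.7):
    # Phi(x) = (1 + erf(x / sqrt(2))) / 2
    assert abs(Phi(x) - 0.5 * (1 + math.erf(x / math.sqrt(2)))) < 1e-12
    # erf(x) = 2 Phi(sqrt(2) x) - 1
    assert abs(math.erf(x) - (2 * Phi(math.sqrt(2) * x) - 1)) < 1e-12
```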

Done.

## Left tail of icdf

For small $$p$$, the quantile function has the useful asymptotic expansion

$\Phi^{-1}(p) = -\sqrt{\ln\frac{1}{p^2} - \ln\ln\frac{1}{p^2} - \ln(2\pi)} + o(1).$
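A sketch of how good the expansion is in practice, comparing against the standard library’s quantile function (`tail_quantile_approx` is my own name):

```python
import math
from statistics import NormalDist

def tail_quantile_approx(p):
    """Leading-order tail expansion of the standard normal quantile."""
    L = math.log(1 / p**2)
    return -math.sqrt(L - math.log(L) - math.log(2 * math.pi))

inv_cdf = NormalDist().inv_cdf
for p in (1e-6, 1e-9, 1e-12):
    # The o(1) error term shrinks slowly; a few parts in a hundred here.
    assert abs(tail_quantile_approx(p) - inv_cdf(p)) < 0.05
```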

### ODE representation for the univariate density

\begin{aligned} \sigma ^2 \phi'(x)+\phi(x) (x-\mu )&=0, \text{ i.e.}\\ L(x) &=(\sigma^2 D+x-\mu)\\ \end{aligned}

With initial conditions

\begin{aligned} \phi(0) &=\frac{e^{-\mu ^2/(2\sigma ^2)}}{\sqrt{2 \sigma^2\pi } }\\ \phi'(0) &=\frac{\mu}{\sigma^2}\phi(0) \end{aligned}

(the second of these follows from the ODE itself; only $$\phi(0)$$ is needed for a first-order equation).

🏗 note where I learned this.
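The ODE itself is easy to verify numerically; the sketch below (pure standard library, names mine) checks the residual with a central finite difference:

```python
import math

def phi(x, mu, sigma2):
    """Univariate Gaussian density."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

mu, sigma2, h = 0.5, 2.0, 1e-6
for x in (-1.0, 0.0, 0.3, 2.5):
    dphi = (phi(x + h, mu, sigma2) - phi(x - h, mu, sigma2)) / (2 * h)
    # Residual of sigma^2 phi'(x) + (x - mu) phi(x) = 0.
    assert abs(sigma2 * dphi + (x - mu) * phi(x, mu, sigma2)) < 1e-8
```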

### ODE representation for the univariate icdf

From (Steinbrecher and Shaw 2008) via Wikipedia.

Let us write $$w:=\Psi^{-1}$$ to keep notation clear.

\begin{aligned} {\frac {d^{2}w}{dp^{2}}} &=w\left({\frac {dw}{dp}}\right)^{2}\\ \end{aligned}

With initial conditions

\begin{aligned} w\left(1/2\right)&=0,\\ w'\left(1/2\right)&={\sqrt {2\pi }}. \end{aligned}
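The quantile ODE can be integrated directly; the sketch below (classical RK4, pure standard library, names mine) recovers the quantile function from the initial conditions at $$p=1/2$$:

```python
import math
from statistics import NormalDist

def quantile_ode(p_target, steps=4000):
    """Integrate w'' = w (w')^2 from p = 1/2 with w(1/2) = 0, w'(1/2) = sqrt(2 pi)."""
    p, w, dw = 0.5, 0.0, math.sqrt(2 * math.pi)
    h = (p_target - 0.5) / steps
    f = lambda w, dw: (dw, w * dw * dw)  # first-order system (w', w'')
    for _ in range(steps):
        k1 = f(w, dw)
        k2 = f(w + h / 2 * k1[0], dw + h / 2 * k1[1])
        k3 = f(w + h / 2 * k2[0], dw + h / 2 * k2[1])
        k4 = f(w + h * k3[0], dw + h * k3[1])
        w += h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        dw += h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return w

# Matches the library quantile function away from the endpoints.
assert abs(quantile_ode(0.9) - NormalDist().inv_cdf(0.9)) < 1e-6
```

The ODE stiffens near $$p \in \{0, 1\}$$, where $$w$$ diverges, so a fixed-step integrator only works in the bulk.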

### Density PDE representation as a diffusion equation

(Botev, Grotowski, and Kroese 2010)

\begin{aligned} \frac{\partial}{\partial t}\phi(x;t) &=\frac{1}{2}\frac{\partial^2}{\partial x^2}\phi(x;t)\\ \phi(x;0)&=\delta(x-\mu) \end{aligned}

Look, it’s the diffusion equation of the Wiener process; the solution at time $$t$$ is the $$\mathcal{N}(\mu, t)$$ density. Surprise.

🏗
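A finite-difference spot check that the Gaussian density with variance $$t$$ satisfies the heat equation (pure standard library; names mine):

```python
import math

def heat_kernel(x, t, mu=0.0):
    """phi(x; t): the N(mu, t) density, i.e. the heat kernel started from delta(x - mu)."""
    return math.exp(-((x - mu) ** 2) / (2 * t)) / math.sqrt(2 * math.pi * t)

t, mu, h = 1.5, 0.3, 1e-4
for x in (-1.0, 0.2, 2.0):
    dt = (heat_kernel(x, t + h, mu) - heat_kernel(x, t - h, mu)) / (2 * h)
    dxx = (heat_kernel(x + h, t, mu) - 2 * heat_kernel(x, t, mu)
           + heat_kernel(x - h, t, mu)) / h**2
    # d/dt phi = (1/2) d^2/dx^2 phi
    assert abs(dt - 0.5 * dxx) < 1e-5
```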

## Roughness

Univariate:

\begin{aligned} \left\| \frac{d}{dx}\phi_\sigma \right\|_2^2 &= \frac{1}{4\sqrt{\pi}\sigma^3}\\ \left\| \left(\frac{d}{dx}\right)^n \phi_\sigma \right\|_2^2 &= \frac{\prod_{i=1}^{n}(2i-1)}{2^{n+1}\sqrt{\pi}\sigma^{2n+1}} \end{aligned}
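A numerical check of the $$n=1$$ case (trapezoidal integration is effectively exact for integrands with Gaussian decay; function name mine):

```python
import math

def grad_norm_sq(sigma, n=100001, span=12.0):
    """Trapezoid approximation of the integral of (phi_sigma'(x))^2 for a centred Gaussian."""
    a = -span * sigma
    h = 2 * span * sigma / (n - 1)
    total = 0.0
    for i in range(n):
        x = a + i * h
        phi = math.exp(-x * x / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))
        dphi = -x / sigma**2 * phi  # analytic derivative of the density
        w = 0.5 if i in (0, n - 1) else 1.0
        total += w * dphi * dphi * h
    return total

for sigma in (0.5, 1.0, 2.0):
    exact = 1 / (4 * math.sqrt(math.pi) * sigma**3)
    assert abs(grad_norm_sq(sigma) / exact - 1) < 1e-9
```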

## Multidimensional marginals and conditionals

As made famous by Wiener processes in finance and Gaussian processes in Bayesian nonparametrics.

See, e.g. these lectures, or Michael I Jordan’s backgrounders.

### Transformed variables

A useful special case of the general rule that affine transforms of Gaussians are Gaussian: if $$Y \sim \mathcal{N}(\mu, \Sigma)$$ then $$AY \sim \mathcal{N}(A\mu, A\Sigma A^{\top})$$. In particular,

$Y \sim \mathcal{N}(X\beta, I)$

implies

$W^{1/2}Y \sim \mathcal{N}(W^{1/2}X\beta, W)$
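A Monte Carlo spot check of the scalar version of this transform rule (pure standard library; `x_beta` and `w` are arbitrary made-up values):

```python
import math
import random

rng = random.Random(0)           # fixed seed for reproducibility
x_beta, w = 1.3, 4.0             # scalar stand-ins for X beta and the weight W
n = 200_000

# Sample Y ~ N(x_beta, 1) and transform by w^{1/2}.
z = [math.sqrt(w) * rng.gauss(x_beta, 1.0) for _ in range(n)]
mean = sum(z) / n
var = sum((v - mean) ** 2 for v in z) / n

# w^{1/2} Y should be N(w^{1/2} x_beta, w).
assert abs(mean - math.sqrt(w) * x_beta) < 0.03
assert abs(var - w) < 0.1
```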

## Metrics

Since Gaussian approximations pop up a lot in e.g. variational approximation problems, it is nice to know how to measure the quality of those approximations in various probability metrics.

### Wasserstein

Useful: two Gaussians may be related thusly in Wasserstein-2 distance, i.e. $$W_2(\mu;\nu):=\inf\mathbb{E}(\Vert X-Y\Vert_2^2)^{1/2}$$, where the infimum is over all couplings (joint laws) of $$X\sim\mu$$ and $$Y\sim\nu$$.

\begin{aligned} d&:= W_2(\mathcal{N}(\mu_1,\Sigma_1);\mathcal{N}(\mu_2,\Sigma_2))\\ \Rightarrow d^2&= \Vert \mu_1-\mu_2\Vert_2^2 + \operatorname{tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}

In the centred case this is simply

\begin{aligned} d&:= W_2(\mathcal{N}(0,\Sigma_1);\mathcal{N}(0,\Sigma_2))\\ \Rightarrow d^2&= \operatorname{tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}

(Givens and Shortt 1984)
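In one dimension the trace term collapses to $$(\sigma_1-\sigma_2)^2$$, and the optimal coupling is the monotone (quantile) one, which gives an independent way to check the formula (sketch; names mine):

```python
import math
from statistics import NormalDist

def w2_closed_form(m1, s1, m2, s2):
    """Scalar case of the Givens-Shortt formula."""
    return math.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

def w2_quantile(m1, s1, m2, s2, n=20000):
    """In 1-D, W2^2 is the integral over u of (F1^{-1}(u) - F2^{-1}(u))^2 (midpoint rule)."""
    q1, q2 = NormalDist(m1, s1).inv_cdf, NormalDist(m2, s2).inv_cdf
    total = sum((q1((i + 0.5) / n) - q2((i + 0.5) / n)) ** 2 for i in range(n))
    return math.sqrt(total / n)

assert abs(w2_closed_form(0.0, 1.0, 2.0, 3.0) - w2_quantile(0.0, 1.0, 2.0, 3.0)) < 5e-3
```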

### Kullback-Leibler

Pulled from Wikipedia; here $$k$$ is the dimension:

$D_{\text{KL}}(\mathcal{N}(\mu_1,\Sigma_1)\parallel \mathcal{N}(\mu_2,\Sigma_2)) ={\frac {1}{2}}\left(\operatorname {tr} \left(\Sigma _{2}^{-1}\Sigma _{1}\right)+(\mu_{2}-\mu_{1})^{\mathsf {T}}\Sigma _{2}^{-1}(\mu_{2}-\mu_{1})-k+\ln \left({\frac {\det \Sigma _{2}}{\det \Sigma _{1}}}\right)\right).$

In the centred case this reduces to

$D_{\text{KL}}(\mathcal{N}(0,\Sigma_1)\parallel \mathcal{N}(0, \Sigma_2)) ={\frac {1}{2}}\left(\operatorname{tr} \left(\Sigma _{2}^{-1}\Sigma _{1}\right)-k+\ln \left({\frac {\det \Sigma _{2}}{\det \Sigma _{1}}}\right)\right).$
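Specializing the general formula to $$k=1$$ and checking it against a direct numerical integral of $$\int f_1 \log(f_1/f_2)$$ (pure standard library; names mine):

```python
import math
from statistics import NormalDist

def kl_gaussian(m1, s1, m2, s2):
    """The closed form above, specialized to dimension k = 1."""
    return 0.5 * (s1**2 / s2**2 + (m2 - m1) ** 2 / s2**2 - 1
                  + math.log(s2**2 / s1**2))

def kl_numeric(m1, s1, m2, s2, n=50001, span=12.0):
    """Trapezoid approximation of the integral of f1 * log(f1 / f2)."""
    f1, f2 = NormalDist(m1, s1).pdf, NormalDist(m2, s2).pdf
    a = m1 - span * s1
    h = 2 * span * s1 / (n - 1)
    total = 0.0
    for i in range(n):
        x = a + i * h
        w = 0.5 if i in (0, n - 1) else 1.0
        total += w * f1(x) * math.log(f1(x) / f2(x)) * h
    return total

assert abs(kl_gaussian(0.0, 1.0, 1.0, 2.0) - kl_numeric(0.0, 1.0, 1.0, 2.0)) < 1e-8
```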

### Hellinger

Djalil Chafaï defines both the Hellinger distance

$\mathrm{H}(\mu,\nu) ={\Vert\sqrt{f}-\sqrt{g}\Vert}_{\mathrm{L}^2(\lambda)} =\Bigl(\int(\sqrt{f}-\sqrt{g})^2\,\mathrm{d}\lambda\Bigr)^{1/2}.$

and Hellinger affinity

$\mathrm{A}(\mu,\nu) =\int\sqrt{fg}\,\mathrm{d}\lambda, \quad \mathrm{H}(\mu,\nu)^2 =2-2\mathrm{A}(\mu,\nu).$

For Gaussians we can find this exactly:

$\mathrm{A}(\mathcal{N}(m_1,\sigma_1^2),\mathcal{N}(m_2,\sigma_2^2)) =\sqrt{2\frac{\sigma_1\sigma_2}{\sigma_1^2+\sigma_2^2}} \exp\Bigl(-\frac{(m_1-m_2)^2}{4(\sigma_1^2+\sigma_2^2)}\Bigr).$

In multiple dimensions, writing $$\Delta m := m_1 - m_2$$:

$\mathrm{A}(\mathcal{N}(m_1,\Sigma_1),\mathcal{N}(m_2,\Sigma_2)) =\frac{\det(\Sigma_1\Sigma_2)^{1/4}}{\det(\frac{\Sigma_1+\Sigma_2}{2})^{1/2}} \exp\Bigl(-\frac{\langle\Delta m,(\Sigma_1+\Sigma_2)^{-1}\Delta m\rangle}{4}\Bigr).$
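The univariate affinity formula checks out against direct numerical integration of $$\int\sqrt{fg}\,\mathrm{d}\lambda$$ (sketch, pure standard library; names mine):

```python
import math
from statistics import NormalDist

def affinity_closed_form(m1, s1, m2, s2):
    """Closed-form Hellinger affinity for two univariate Gaussians."""
    return (math.sqrt(2 * s1 * s2 / (s1**2 + s2**2))
            * math.exp(-((m1 - m2) ** 2) / (4 * (s1**2 + s2**2))))

def affinity_numeric(m1, s1, m2, s2, n=50001, span=12.0):
    """Trapezoid approximation of the integral of sqrt(f * g)."""
    f, g = NormalDist(m1, s1).pdf, NormalDist(m2, s2).pdf
    a = min(m1 - span * s1, m2 - span * s2)
    b = max(m1 + span * s1, m2 + span * s2)
    h = (b - a) / (n - 1)
    total = 0.0
    for i in range(n):
        x = a + i * h
        w = 0.5 if i in (0, n - 1) else 1.0
        total += w * math.sqrt(f(x) * g(x)) * h
    return total

A = affinity_closed_form(0.0, 1.0, 1.0, 2.0)
assert abs(A - affinity_numeric(0.0, 1.0, 1.0, 2.0)) < 1e-8
# Squared Hellinger distance then follows from H^2 = 2 - 2A.
assert 0.0 <= 2 - 2 * A <= 2.0
```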

Botev, Z. I. 2017. “The Normal Law Under Linear Restrictions: Simulation and Estimation via Minimax Tilting.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79 (1): 125–48. https://doi.org/10.1111/rssb.12162.

Botev, Z. I., J. F. Grotowski, and D. P. Kroese. 2010. “Kernel Density Estimation via Diffusion.” The Annals of Statistics 38 (5): 2916–57. https://doi.org/10.1214/10-AOS799.

Givens, Clark R., and Rae Michael Shortt. 1984. “A Class of Wasserstein Metrics for Probability Distributions.” The Michigan Mathematical Journal 31 (2): 231–40. https://doi.org/10.1307/mmj/1029003026.

Richards, Winston A., Robin S, Ashok Sahai, and M. Raghunadh Acharya. 2010. “An Efficient Polynomial Approximation to the Normal Distribution Function and Its Inverse Function.” Journal of Mathematics Research 2 (4): p47. https://doi.org/10.5539/jmr.v2n4p47.

Roy, Paramita, and Amit Choudhury. 2012. “Approximate Evaluation of Cumulative Distribution Function of Central Sampling Distributions: A Review.” Electronic Journal of Applied Statistical Analysis 5 (1). https://doi.org/10.1285/i20705948v5n1p121.

Steinbrecher, György, and William T. Shaw. 2008. “Quantile Mechanics.” European Journal of Applied Mathematics 19 (2): 87–112. https://doi.org/10.1017/S0956792508007341.

Strecok, Anthony. 1968. “On the Calculation of the Inverse of the Error Function.” Mathematics of Computation 22 (101): 144–58. https://doi.org/10.1090/S0025-5718-1968-0223070-2.

Wichura, Michael J. 1988. “Algorithm AS 241: The Percentage Points of the Normal Distribution.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 37 (3): 477–84. https://doi.org/10.2307/2347330.