The Gaussian distribution

The least interesting probability distribution

Bell curves

Many facts about the useful, boring, ubiquitous Gaussian.

Density, CDF

The standard (univariate) Gaussian pdf is \[ \psi:x\mapsto \frac{1}{\sqrt{2\pi}}\text{exp}\left(-\frac{x^2}{2}\right) \] Typically we allow a scale-location parameterised version \[ \phi(x; \mu ,\sigma ^{2})={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}e^{-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}} \] We will call the CDF \[ \Psi:x\mapsto \int_{-\infty}^x\psi(t) dt. \] In the multivariate case, where the covariance \(\Sigma\) is strictly positive definite we can write a density of the general normal distribution over \(\mathbb{R}^k\) as \[ \psi({x}; \mu, \Sigma) = (2\pi )^{-{\frac {k}{2}}}\det({ {\Sigma }})^{-{\frac {1}{2}}}\,e^{-{\frac {1}{2}}( {x} -{ {\mu }})^{\!{\top}}{ {\Sigma }}^{-1}( {x} -{ {\mu }})} \] If a random variable \(Y\) has a Gaussian distribution with parameters \(\mu, \Sigma\), we write \[Y \sim \mathcal{N}(\mu, \Sigma)\]

Differential representations

First, trivially, \(\phi'(x)=-\frac{e^{-\frac{x^2}{2}} x}{\sqrt{2 \pi }}.\)

Stein’s representation

Meckes (2009) explains Stein (1972)’s characterisation:

The normal distribution is the unique probability measure \(\mu\) for which \[ \int\left[f^{\prime}(x)-x f(x)\right] \mu(d x)=0 \] for all \(f\) for which the left-hand side exists and is finite. The operator \(T_{o}\) defined on \(C^{1}\) functions by \[ T_{o} f(x)=f^{\prime}(x)-x f(x) \] is called the \(characterizing operator\) of the standard normal distribution. The left inverse to \(T_{o},\) denoted \(U_{o},\) is defined by the equation \[ T_{o}\left(U_{o} f\right)(x)=f(x)-\mathbb{E} f(Z) \] where \(Z\) is a standard normal random variable.

This is incredibly useful in probability approximation by Gaussians where it justifies Stein’s method.

ODE representation for the univariate density

\[\begin{aligned} \sigma ^2 \phi'(x)+\phi(x) (x-\mu )&=0, \text{ i.e.}\\ L(x) &=(\sigma^2 D+x-\mu)\\ \end{aligned}\]

With initial conditions

\[\begin{aligned} \phi(0) &=\frac{e^{-\mu ^2/(2\sigma ^2)}}{\sqrt{2 \sigma^2\pi } }\\ \phi'(0) &=0 \end{aligned}\]

🏗 note where I learned this.

ODE representation for the univariate icdf

From (Steinbrecher and Shaw 2008) via Wikipedia.

Let us write \(w:=\Psi^{-1}\) to suppress keep notation clear.

\[\begin{aligned} {\frac {d^{2}w}{dp^{2}}} &=w\left({\frac {dw}{dp}}\right)^{2}\\ \end{aligned}\]

With initial conditions

\[\begin{aligned} w\left(1/2\right)&=0,\\ w'\left(1/2\right)&={\sqrt {2\pi }}. \end{aligned}\]

Density PDE representation as a diffusion equation

Botev, Grotowski, and Kroese (2010) notes

\[\begin{aligned} \frac{\partial}{\partial t}\phi(x;t) &=\frac{1}{2}\frac{\partial^2}{\partial x^2}\phi(x;t)\\ \phi(x;0)&=\delta(x-\mu) \end{aligned}\]

Look, it’s the diffusion equation of Wiener process. Surprise! If you think about this for a while you end up discovering Feynman-Kac formulate.


For small \(p\), the quantile function has the asymptotic expansion \[ \Phi^{-1}(p) = -\sqrt{\ln\frac{1}{p^2} - \ln\ln\frac{1}{p^2} - \ln(2\pi)} + \mathcal{o}(1). \]

Orthogonal basis

Polynomial basis? You want the Hermite polynomials.

Rational function approximations



Univariate -

\[\begin{aligned} \left\| \frac{d}{dx}\phi_\sigma \right\|_2 &= \frac{1}{4\sqrt{\pi}\sigma^3}\\ \left\| \left(\frac{d}{dx}\right)^n \phi_\sigma \right\|_2 &= \frac{\prod_{i<n}2n-1}{2^{n+1}\sqrt{\pi}\sigma^{2n+1}} \end{aligned}\]


The normal distribution is the least “surprising” distribution in the sense that out of all distributions with a given mean and variance the Gaussian has the maximum entropy. Or maybe that is the most surprising, depending on your definition.

Multidimensional marginals and conditionals

Linear transforms of Gaussians are especially convenient. You could say that this is a definitional property of the Gaussian, in fact, and it arises conveniently from linear superposition. As made famous by Wiener processes in finance and Gaussian process regression in Bayesian nonparametrics.

See, e.g. these lectures, or Michael I Jordan’s backgrounders.

In practice I look up my favourite useful Gaussian identities in Petersen and Pedersen (2012) and so does everyone else I know.

Fourier representation

The Fourier transform/Characteristic function of a Gaussian is still Gaussian.

\[\mathbb{E}\exp (i\mathbf{t}\cdot \mathbf {X}) =\exp \left( i\mathbf {t} ^{\top}{\boldsymbol {\mu }}-{\tfrac {1}{2}}\mathbf {t} ^{\top}{\boldsymbol {\Sigma }}\mathbf {t} \right).\]

Transformed variables

Special case.

\[ Y \sim \mathcal{N}(X\beta, I) \]


\[ W^{1/2}Y \sim \mathcal{N}(W^{1/2}X\beta, W) \]

For more general transforms you could try polynomial chaos.


Since Gaussian approximations pop up a lot in e.g. variational approximation problems, it is nice to know how to approximate them in probability metrics.


Useful: Two Gaussians may be related thusly in Wasserstein-2 distance, i.e. \(W_2(\mu;\nu):=\inf\mathbb{E}(\Vert X-Y\Vert_2^2)^{1/2}\) for \(X\sim\nu\), \(Y\sim\mu\).

\[\begin{aligned} d&:= W_2(\mathcal{N}(\mu_1,\Sigma_1);\mathcal{N}(\mu_2,\Sigma_2))\\ \Rightarrow d^2&= \Vert \mu_1-\mu_2\Vert_2^2 + \operatorname{tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}\]

In the centred case this is simply (Givens and Shortt 1984)

\[\begin{aligned} d&:= W_2(\mathcal{N}(0,\Sigma_1);\mathcal{N}(0,\Sigma_2))\\ \Rightarrow d^2&= \operatorname{tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}\]


Pulled from wikipedia:

\[ D_{\text{KL}}(\mathcal{N}(\mu_1,\Sigma_1)\parallel \mathcal{N}(\mu_2,\Sigma_2)) ={\frac {1}{2}}\left(\operatorname {tr} \left(\Sigma _{2}^{-1}\Sigma _{1}\right)+(\mu_{2}-\mu_{1})^{\mathsf {T}}\Sigma _{2}^{-1}(\mu_{2}-\mu_{1})-k+\ln \left({\frac {\det \Sigma _{2}}{\det \Sigma _{1}}}\right)\right).\]

In the centred case this reduces to

\[ D_{\text{KL}}(\mathcal{N}(0,\Sigma_1)\parallel \mathcal{N}(0, \Sigma_2)) ={\frac {1}{2}}\left(\operatorname{tr} \left(\Sigma _{2}^{-1}\Sigma _{1}\right)-k+\ln \left({\frac {\det \Sigma _{2}}{\det \Sigma _{1}}}\right)\right).\]


Djalil defines both Hellinger distance

\[\mathrm{H}(\mu,\nu) ={\Vert\sqrt{f}-\sqrt{g}\Vert}_{\mathrm{L}^2(\lambda)} =\Bigr(\int(\sqrt{f}-\sqrt{g})^2\mathrm{d}\lambda\Bigr)^{1/2}.\]

and Hellinger affinity

\[\mathrm{A}(\mu,\nu) =\int\sqrt{fg}\mathrm{d}\lambda, \quad \mathrm{H}(\mu,\nu)^2 =2-2A(\mu,\nu).\]

For Gaussians we can find this exactly:

\[\mathrm{A}(\mathcal{N}(m_1,\sigma_1^2),\mathcal{N}(m_2,\sigma_2^2)) =\sqrt{2\frac{\sigma_1\sigma_2}{\sigma_1^2+\sigma_2^2}} \exp\Bigr(-\frac{(m_1-m_2)^2}{4(\sigma_1^2+\sigma_2^2)}\Bigr),\]

In multiple dimensions:

\[\mathrm{A}(\mathcal{N}(m_1,\Sigma_1),\mathcal{N}(m_2,\Sigma_2)) =\frac{\det(\Sigma_1\Sigma_2)^{1/4}}{\det(\frac{\Sigma_1+\Sigma_2}{2})^{1/2}} \exp\Bigr(-\frac{\langle\Delta m,(\Sigma_1+\Sigma_2)^{-1}\Delta m)\rangle}{4}\Bigr).\]

What is Erf again?

This erf, or error function, is a rebranding and reparameterisation of the standard univariate normal cdf popular in computer science, to provide a slightly differently ambiguity to the one you are used to with the “normal” density. There are scaling factors tacked on.

\[ \operatorname{erf}(x) = \frac{1}{\sqrt{\pi}} \int_{-x}^x e^{-t^2} \, dt \] which is to say \[\begin{aligned} \Phi(x) &={\frac {1}{2}}\left[1+\operatorname {erf} \left({\frac {x}{\sqrt {2}}}\right)\right]\\ \operatorname {erf}(x) &=2\Phi (\sqrt{2}x)-1\\ \end{aligned}\]


Botev, Z. I. 2017. “The Normal Law Under Linear Restrictions: Simulation and Estimation via Minimax Tilting.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79 (1): 125–48.
Botev, Z. I., J. F. Grotowski, and D. P. Kroese. 2010. “Kernel Density Estimation via Diffusion.” The Annals of Statistics 38 (5): 2916–57.
Givens, Clark R., and Rae Michael Shortt. 1984. “A Class of Wasserstein Metrics for Probability Distributions.” The Michigan Mathematical Journal 31 (2): 231–40.
Magnus, Jan R., and Heinz Neudecker. 1999. Matrix Differential Calculus with Applications in Statistics and Econometrics. Rev. ed. New York: John Wiley.
Meckes, Elizabeth. 2009. “On Stein’s Method for Multivariate Normal Approximation.” In High Dimensional Probability V: The Luminy Volume, 153–78. Beachwood, Ohio, USA: Institute of Mathematical Statistics.
Minka, Thomas P. 2013. Old and New Matrix Algebra Useful for Statistics.
Petersen, Kaare Brandt, and Michael Syskind Pedersen. 2012. “The Matrix Cookbook.”
Richards, Winston A., Robin S, Ashok Sahai, and M. Raghunadh Acharya. 2010. “An Efficient Polynomial Approximation to the Normal Distribution Function and Its Inverse Function.” Journal of Mathematics Research 2 (4): p47.
Roy, Paramita, and Amit Choudhury. 2012. “Approximate Evaluation of Cumulative Distribution Function of Central Sampling Distributions: A Review.” Electronic Journal of Applied Statistical Analysis 5 (1).
Stein, Charles. 1972. “A Bound for the Error in the Normal Approximation to the Distribution of a Sum of Dependent Random Variables.” Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory, January, 583–602.
———. 1986. Approximate Computation of Expectations. Vol. 7. IMS.
Steinbrecher, György, and William T. Shaw. 2008. “Quantile Mechanics.” European Journal of Applied Mathematics 19 (2): 87–112.
Strecok, Anthony. 1968. “On the Calculation of the Inverse of the Error Function.” Mathematics of Computation 22 (101): 144–58.
Wichura, Michael J. 1988. “Algorithm AS 241: The Percentage Points of the Normal Distribution.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 37 (3): 477–84.

Warning! Experimental comments system! If is does not work for you, let me know via the contact form.

No comments yet!

GitHub-flavored Markdown & a sane subset of HTML is supported.