Approximating probability distributions by a Gaussian with the same mode. Thanks to limit theorems this is not always a terrible idea, especially since neural networks seem keen to converge to Gaussians in various senses.

Specifically, the Laplace approximation locally approximates the posterior by a Gaussian centred at the mode, \[ p(\theta \mid \mathcal{D}) \approx \mathcal{N}\left(\theta_{\mathrm{MAP}}, \Sigma\right). \]
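Here \(\Sigma\) falls out of a second-order Taylor expansion of the log posterior about the mode (the first-order term vanishes there):
\[
\log p(\theta \mid \mathcal{D}) \approx \log p\left(\theta_{\mathrm{MAP}} \mid \mathcal{D}\right)-\frac{1}{2}\left(\theta-\theta_{\mathrm{MAP}}\right)^{\top} \Lambda\left(\theta-\theta_{\mathrm{MAP}}\right),
\qquad
\Lambda:=-\left.\nabla_{\theta}^{2} \log p(\theta \mid \mathcal{D})\right|_{\theta_{\mathrm{MAP}}},
\]
so \(\Sigma = \Lambda^{-1}\), the inverse Hessian of the negative log posterior.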

This makes it look like a special case of an approximate method of moments, or of minimising an integral probability metric. Is that the case?

I am particularly keen to see it work for probabilistic neural nets. The approach is classic for neural nets (MacKay 1992), and there are many variants of the technique for different modelling assumptions. Laplace approximations have the attractive feature of providing uncertainty estimates for both forward and inverse problems (Foong et al. 2019; Immer, Korzepa, and Bauer 2021).

The basic idea is that we hold \(x \in \mathbb{R}^{n}\) fixed and use the Jacobian matrix \(J(x):=\left.\nabla_{\theta} f(x ; \theta)\right|_{\theta_{\mathrm{MAP}}} \in \mathbb{R}^{d \times k}\) to linearise the network as
\[
f(x ; \theta) \approx f\left(x ; \theta_{\mathrm{MAP}}\right)+J(x)^{\top}\left(\theta-\theta_{\mathrm{MAP}}\right)
\]
where the resulting output variance is now justified by a first-order Taylor expansion in \(\theta\).
Under this approximation, since \(\theta\) is a posteriori distributed as Gaussian \(\mathcal{N}\left(\theta_{\mathrm{MAP}}, \Sigma\right)\), it follows that the marginal distribution over the network output \(f(x)\) is also Gaussian, given by
\[
p(f(x) \mid x, \mathcal{D}) = \mathcal{N}\left(f\left(x ; \theta_{\mathrm{MAP}}\right), J(x)^{\top} \Sigma J(x)\right).
\]
For more on this, see (Bishop 2006, 5.167, 5.188).
It is essentially a gratis Laplace approximation, in the sense that if I have already fit the network I can already calculate those Jacobians, so I am probably one line of code away from getting some kind of uncertainty estimate.
However, I have no particular reason to hope that it is well calibrated, because the simplifications were chosen *a priori* and might not be appropriate to the combination of model and data that I actually have.
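Concretely, here is a minimal PyTorch sketch of that linearised predictive (my own sketch, not any particular library's API). It assumes a trained `net`, its parameter dict `params = dict(net.named_parameters())`, a precomputed \(d \times d\) posterior covariance `Sigma` from whatever Hessian approximation one trusts, and a single un-batched input `x`:

```python
import torch
from torch.func import functional_call, jacrev

def laplace_predictive(net, params, Sigma, x):
    """Mean and covariance of p(f(x) | x, D) under the linearisation."""
    names = list(params.keys())

    def f_flat(flat_theta):
        # Unflatten the parameter vector into a dict, then run the network.
        theta, i = {}, 0
        for n in names:
            numel = params[n].numel()
            theta[n] = flat_theta[i:i + numel].view_as(params[n])
            i += numel
        return functional_call(net, theta, (x,))

    flat = torch.cat([p.reshape(-1) for p in params.values()])
    mean = f_flat(flat)       # f(x; theta_MAP), shape (k,)
    J = jacrev(f_flat)(flat)  # shape (k, d); rows are output gradients
    cov = J @ Sigma @ J.T     # = J(x)^T Sigma J(x) in the notation above
    return mean, cov
```

With a diagonal `Sigma` the matrix product is cheap; the Jacobian itself is just one reverse-mode sweep per output.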

Recent work here ties many of these ideas together:

- AlexImmer/Laplace: Laplace approximations for Deep Learning (usage sketched below).
- API documentation for the above
- Agustinus Kristiadi’s Modern Arts of Laplace Approximations
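If I read the AlexImmer/Laplace README correctly, the library makes the above nearly a one-liner; the following is a sketch from memory of their docs (treat the exact argument values as my recollection, and `model`, `train_loader`, `x` as assumed to exist):

```python
from laplace import Laplace

# Post-hoc Laplace on an already-trained model.  Last-layer weights with a
# Kronecker-factored Hessian is, as far as I can tell, their cheap default.
la = Laplace(model, "classification",
             subset_of_weights="last_layer",
             hessian_structure="kron")
la.fit(train_loader)                           # accumulate the Hessian factors
la.optimize_prior_precision(method="marglik")  # tune the prior via evidence

pred = la(x, link_approx="probit")             # approximate p(y | x, D)
```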

## Learnable Laplace approximations

Agustinus Kristiadi and coauthors have created various methods for low-overhead neural uncertainty quantification via Laplace approximation, with greater flexibility to adaptively choose the type and manner of approximation. See, e.g., Painless Uncertainty for Deep Learning and their papers (Kristiadi, Hein, and Hennig 2020, 2021).

Kristiadi, Hein, and Hennig (2021) generalise this to *learnable* uncertainty, so that, for example, the distribution can reflect uncertainty about datapoints drawn far from the training distribution.
They define an augmented *Learnable Uncertainty under Laplace Approximations* (LULA) network \(\tilde{f}\) with extra parameters, \(\tilde{\theta}=(\theta_{\mathrm{MAP}}, \hat{\theta})\).

Let \(f: \mathbb{R}^{n} \times \mathbb{R}^{d} \rightarrow \mathbb{R}^{k}\) be an \(L\)-layer neural network with MAP-trained parameters \(\theta_{\mathrm{MAP}}\), and let \(\widetilde{f}: \mathbb{R}^{n} \times \mathbb{R}^{\widetilde{d}} \rightarrow \mathbb{R}^{k}\) along with \(\widetilde{\theta}_{\mathrm{MAP}}\) be obtained by adding LULA units. Let \(q(\widetilde{\theta}):=\mathcal{N}(\widetilde{\theta}_{\mathrm{MAP}}, \widetilde{\Sigma})\) be the Laplace-approximated posterior and \(p(y \mid x, \mathcal{D} ; \widetilde{\theta}_{\mathrm{MAP}})\) be the (approximate) predictive distribution under the LA. Furthermore, denote the dataset sampled i.i.d. from the data distribution by \(\mathcal{D}_{\text{in}}\) and that from some outlier distribution by \(\mathcal{D}_{\text{out}}\), and let \(H\) be the entropy functional. They construct the following loss to induce high uncertainty on outliers while maintaining high confidence over the data (inliers):
\[
\mathcal{L}_{\text{LULA}}\left(\widetilde{\theta}_{\text{MAP}}\right):=\frac{1}{\left|\mathcal{D}_{\text{in}}\right|} \sum_{x_{\text{in}} \in \mathcal{D}_{\text{in}}} H\left[p\left(y \mid x_{\text{in}}, \mathcal{D} ; \widetilde{\theta}_{\text{MAP}}\right)\right]
-\frac{1}{\left|\mathcal{D}_{\text{out}}\right|} \sum_{x_{\text{out}} \in \mathcal{D}_{\text{out}}} H\left[p\left(y \mid x_{\text{out}}, \mathcal{D} ; \widetilde{\theta}_{\text{MAP}}\right)\right]
\]
and minimise it with respect to the free parameters \(\hat{\theta}\).
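As a reading-comprehension check, here is that objective in code. `predictive` is my own placeholder for whatever approximation to \(p(y \mid x, \mathcal{D}; \widetilde{\theta}_{\mathrm{MAP}})\) one uses; this sketch just assumes it returns class probabilities:

```python
import torch

def categorical_entropy(probs: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Shannon entropy of each row of a (batch, n_classes) probability tensor."""
    return -(probs * torch.log(probs + eps)).sum(dim=-1)

def lula_loss(predictive, x_in, x_out):
    """Mean inlier entropy minus mean outlier entropy, per the display above."""
    H_in = categorical_entropy(predictive(x_in)).mean()
    H_out = categorical_entropy(predictive(x_out)).mean()
    return H_in - H_out  # minimise: confident on inliers, uncertain on outliers
```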

I am assuming that by the *entropy functional* they mean the differential entropy of the multivariate normal,
\[
H(\mathcal{N}(\mu, \Sigma)) = \frac{1}{2}\ln \left((2\pi \mathrm{e})^{k}\det \left(\Sigma\right)\right),
\]
here applied to the Gaussian predictive over the \(k\) network outputs, so the determinant is of the (small) \(k\times k\) covariance \(J(x)^{\top}\Sigma J(x)\); the expensive part is rather forming that matrix via the (large) \(d\times d\) posterior covariance \(\Sigma\).
Or possibly they mean the general differential entropy with respect to some density \(p\),
\[H(p)=\mathbb{E}_{p}\left[-\log p(x)\right],\]
which I suppose one could estimate by Monte Carlo as
\[\hat{H}(p)=\frac{1}{N}\sum_{i=1}^N \left[-\log p(x_i)\right]\] without taking the Gaussian Laplace approximation at this step, if we could find the density, and assuming the \(x_i\) were drawn from it.
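As a quick sanity check (mine, not the paper's), the Monte Carlo estimator does recover the closed-form Gaussian entropy:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Compare the Monte Carlo estimate (1/N) sum_i -log p(x_i) against the
# closed form 0.5 * ln((2*pi*e)^k det(Sigma)) for a 3-d Gaussian.
k = 3
Sigma = np.diag([1.0, 2.0, 0.5])
dist = multivariate_normal(mean=np.zeros(k), cov=Sigma)

x = dist.rvs(size=100_000, random_state=0)
H_mc = -dist.logpdf(x).mean()
H_exact = 0.5 * np.log((2 * np.pi * np.e) ** k * np.linalg.det(Sigma))
print(H_mc, H_exact)  # agree to a couple of decimal places
```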

The result is a slightly weird hybrid fitting procedure that requires two loss functions and which feels a little *ad hoc*, but maybe it works?

## By stochastic weight averaging

Stochastic Weight Averaging-Gaussian (SWAG) is a Bayesian extension of the gradient-descent trick Stochastic Weight Averaging: fit a Gaussian to the trajectory of SGD iterates and use it as an approximate posterior (Izmailov et al. 2018, 2020; Maddox et al. 2019; Wilson and Izmailov 2020).
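A minimal sketch of the diagonal variant as I understand Maddox et al. (2019), ignoring their low-rank covariance term; `weight_snapshots` would be flattened parameter vectors collected every few epochs of SGD after convergence:

```python
import torch

def swag_diag_moments(weight_snapshots):
    """Running moments of flattened SGD iterates (the SWAG-diagonal part)."""
    W = torch.stack(weight_snapshots)            # (n_snapshots, d)
    mean = W.mean(dim=0)
    second = (W ** 2).mean(dim=0)
    var = (second - mean ** 2).clamp_min(1e-30)  # diagonal posterior variance
    return mean, var

def swag_diag_sample(mean, var):
    """One posterior weight sample, theta ~ N(mean, diag(var))."""
    return mean + var.sqrt() * torch.randn_like(mean)
```

Predictions then average the network output over a handful of such samples.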

## For model selection

We can estimate the marginal likelihood as a function of hyperparameters by Laplace approximation, apparently, which makes gradient-based model selection feasible. See Immer et al. (2021).
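For reference, the classic Laplace approximation to the log evidence (as in MacKay's framework) expands the log joint to second order about the mode and integrates the resulting Gaussian:
\[
\log p(\mathcal{D}) \approx \log p\left(\mathcal{D} \mid \theta_{\mathrm{MAP}}\right)+\log p\left(\theta_{\mathrm{MAP}}\right)+\frac{d}{2} \log 2 \pi-\frac{1}{2} \log \operatorname{det} \Lambda,
\]
with \(\Lambda\) the Hessian of the negative log joint at \(\theta_{\mathrm{MAP}}\). The hyperparameters enter through the prior and likelihood terms, so we can take gradients of this quantity with respect to them.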

## In function spaces

Where the Laplace approximations are Gaussian Processes over some index space. TBC. See Piterbarg and Fatalov (1995); Wacker (2017); Alexanderian et al. (2016); Alexanderian (2021) and possibly Solin and Särkkä (2020) and possibly the INLA stuff.

## INLA

*Integrated nested Laplace approximation*
(Ingebrigtsen, Lindgren, and Steinsland 2014; Lindgren and Rue 2015; Rue et al. 2016).
TBC.

## Generalized Gauss-Newton and linearization

The *generalized Gauss-Newton approximation* (GGN) (Martens and Grosse 2015) replaces an expensive second-order derivative with a product of first-order derivatives.
As far as I can tell, *Kronecker-factored Approximate Curvature* (K-FAC) is neither a synonym nor a special case, but a further, Kronecker-factored approximation to the GGN/Fisher matrix that makes it cheap to store and invert.
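To see where the saving comes from: for a per-example loss \(\ell(f(x;\theta), y)\), the chain rule splits the Hessian into
\[
\nabla_{\theta}^{2} \ell=J(x)\left[\nabla_{f}^{2} \ell\right] J(x)^{\top}+\sum_{j=1}^{k} \frac{\partial \ell}{\partial f_{j}} \nabla_{\theta}^{2} f_{j}(x ; \theta),
\]
in the \(J(x) \in \mathbb{R}^{d \times k}\) convention used above. The GGN keeps only the first term, which needs no second derivatives of the network and is positive semi-definite whenever \(\ell\) is convex in the output \(f\).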

Various papers analyze this (Foong et al. 2019; Immer, Korzepa, and Bauer 2021; Martens and Grosse 2015; Ritter, Botev, and Barber 2018).

## Laplace in inverse problems

## References

Alexanderian, Alen. 2021. “Optimal Experimental Design for Infinite-Dimensional Bayesian Inverse Problems Governed by PDEs: A Review.” *arXiv:2005.12998 [Math]*, January.

Alexanderian, Alen, Noemi Petra, Georg Stadler, and Omar Ghattas. 2016. “A Fast and Scalable Method for A-Optimal Design of Experiments for Infinite-Dimensional Bayesian Nonlinear Inverse Problems.” *SIAM Journal on Scientific Computing* 38 (1): A243–72.

Bishop, Christopher M. 2006. *Pattern Recognition and Machine Learning*. Information Science and Statistics. New York: Springer.

Breslow, N. E., and D. G. Clayton. 1993. “Approximate Inference in Generalized Linear Mixed Models.” *Journal of the American Statistical Association* 88 (421): 9–25.

Daxberger, Erik, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Matthias Bauer, and Philipp Hennig. 2021. “Laplace Redux — Effortless Bayesian Deep Learning.” *arXiv:2106.14806 [Cs, Stat]*.

Foong, Andrew Y. K., Yingzhen Li, José Miguel Hernández-Lobato, and Richard E. Turner. 2019. “‘In-Between’ Uncertainty in Bayesian Neural Networks.” *arXiv:1906.11537 [Cs, Stat]*, June.

*arXiv:1809.09505 [Cs, Math, Stat]*, September.

Immer, Alexander, Matthias Bauer, Vincent Fortuin, Gunnar Rätsch, and Mohammad Emtiyaz Khan. 2021. “Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning.” *arXiv:2104.04975 [Cs, Stat]*, June.

Immer, Alexander, Maciej Korzepa, and Matthias Bauer. 2021. “Improving Predictions of Bayesian Neural Nets via Local Linearization.” In *International Conference on Artificial Intelligence and Statistics*, 703–11. PMLR.

Ingebrigtsen, Rikke, Finn Lindgren, and Ingelin Steinsland. 2014. “Spatial Models with Explanatory Variables in the Dependence Structure.” *Spatial Statistics*, Spatial Statistics Miami, 8 (May): 20–38.

Izmailov, Pavel, Wesley J. Maddox, Polina Kirichenko, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2020. “Subspace Inference for Bayesian Deep Learning.” In *Proceedings of The 35th Uncertainty in Artificial Intelligence Conference*, 1169–79. PMLR.

Khan, Mohammad Emtiyaz, Alexander Immer, Ehsan Abedi, and Maciej Korzepa. 2019. “Approximate Inference Turns Deep Networks into Gaussian Processes.” *arXiv:1906.01930 [Cs, Stat]*, July.

Kristiadi, Agustinus, Matthias Hein, and Philipp Hennig. 2020. “Being Bayesian, Even Just a Bit, Fixes Overconfidence in ReLU Networks.” In *ICML 2020*.

Kristiadi, Agustinus, Matthias Hein, and Philipp Hennig. 2021. “Learnable Uncertainty Under Laplace Approximations.” In *Uncertainty in Artificial Intelligence*.

Lindgren, Finn, and Håvard Rue. 2015. “Bayesian Spatial Modelling with R-INLA.” *Journal of Statistical Software* 63 (19): 1–25.

Long, Quan, Marco Scavino, Raúl Tempone, and Suojin Wang. 2013. “Fast Estimation of Expected Information Gains for Bayesian Experimental Designs Based on Laplace Approximations.” *Computer Methods in Applied Mechanics and Engineering* 259 (June): 24–39.

MacKay, David J. C. 2003. *Information Theory, Inference & Learning Algorithms*. Cambridge University Press.

MacKay, David J. C. 1992. “A Practical Bayesian Framework for Backpropagation Networks.” *Neural Computation* 4 (3): 448–72.

Martens, James, and Roger Grosse. 2015. “Optimizing Neural Networks with Kronecker-Factored Approximate Curvature.” In *Proceedings of the 32nd International Conference on Machine Learning*, 2408–17. PMLR.

*Extremes* 21 (3): 441–62.

Piterbarg, V. I., and V. R. Fatalov. 1995. “The Laplace Method for Probability Measures in Banach Spaces.” *Russian Mathematical Surveys* 50 (6): 1151.

Rue, Håvard, Andrea Riebler, Sigrunn H. Sørbye, Janine B. Illian, Daniel P. Simpson, and Finn K. Lindgren. 2016. “Bayesian Computing with INLA: A Review.” *arXiv:1604.00860 [Stat]*, September.

*arXiv:1404.5886 [Math, Stat]*, April.

Snoek, Jasper, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat, and Ryan P. Adams. 2015. “Scalable Bayesian Optimization Using Deep Neural Networks.” *arXiv:1502.05700 [Stat]*, July.

Solin, Arno, and Simo Särkkä. 2020. “Hilbert Space Methods for Reduced-Rank Gaussian Process Regression.” *Statistics and Computing* 30 (2): 419–46.

*arXiv:2107.10885 [Math, Stat]*, July.

Wacker, Philipp. 2017. “Laplace’s Method in Bayesian Inverse Problems.” *arXiv:1701.07989 [Math]*, April.
