# Laplace approximations in inference

## Lightweight uncertainties, especially for heavy neural nets

Second mode? I see no second mode.

Approximating probability distributions by a Gaussian with the same mode. Thanks to limit theorems this is not always a terrible idea, especially since neural networks seem pretty keen to converge to Gaussians in various senses.

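Specifically, we locally approximate the posterior by a Gaussian centred at the MAP estimate, $p(\theta \mid \mathcal{D}) \approx \mathcal{N}\left(\theta_{\mathrm{MAP}}, \Sigma\right),$ where, in the classic version, $$\Sigma$$ is the inverse Hessian of the negative log posterior evaluated at $$\theta_{\mathrm{MAP}}$$.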

This makes it look like a special case of approximate method of moments, or integral probability metrics. Is that the case?
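To make the Gaussian approximation concrete, here is a minimal sketch for Bayesian logistic regression. It assumes an isotropic Gaussian prior with variance `prior_var`; that prior, the synthetic data, and all function names are my choices for illustration. It finds $$\theta_{\mathrm{MAP}}$$ by Newton's method and takes $$\Sigma$$ to be the inverse Hessian of the negative log posterior at that point.

```python
import numpy as np


def neg_log_posterior_grad_hess(theta, X, y, prior_var):
    """Gradient and Hessian of the negative log posterior (up to a constant)."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))              # predicted P(y = 1 | x)
    grad = X.T @ (p - y) + theta / prior_var
    hess = (X.T * (p * (1.0 - p))) @ X + np.eye(theta.size) / prior_var
    return grad, hess


def laplace_approximation(X, y, prior_var=1.0, max_iter=100, tol=1e-10):
    """Return (theta_map, Sigma) under a N(0, prior_var * I) prior."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad, hess = neg_log_posterior_grad_hess(theta, X, y, prior_var)
        step = np.linalg.solve(hess, grad)            # Newton step
        theta = theta - step
        if np.linalg.norm(step) < tol:
            break
    # Curvature at the mode gives the covariance: Sigma = H^{-1}.
    _, hess = neg_log_posterior_grad_hess(theta, X, y, prior_var)
    return theta, np.linalg.inv(hess)


# Synthetic data from a hypothetical "true" parameter vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
theta_true = np.array([1.5, -1.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)

theta_map, Sigma = laplace_approximation(X, y)
print("MAP estimate:", theta_map)
print("posterior std:", np.sqrt(np.diag(Sigma)))
```

For a neural net the exact Hessian is rarely affordable, which is where the cheaper curvature approximations come in.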

I am particularly keen to see it work for probabilistic neural nets. Such an approach is classic for neural nets. There are many variants of this technique for different assumptions. Laplace approximations have the attractive feature of providing estimates for forward and inverse problems.

The basic idea is that we hold $$x \in \mathbb{R}^{n}$$ fixed and use the Jacobian matrix $$J(x):=\left.\nabla_{\theta} f(x ; \theta)\right|_{\theta_{\mathrm{MAP}}} \in \mathbb{R}^{d \times k}$$ to linearize the network around the MAP estimate, $f(x ; \theta) \approx f\left(x ; \theta_{\mathrm{MAP}}\right)+J(x)^{\top}\left(\theta-\theta_{\mathrm{MAP}}\right),$ so that the predictive variance is justified by a first-order Taylor expansion. Under this approximation, since $$\theta$$ is a posteriori distributed as Gaussian $$\mathcal{N}\left(\theta_{\mathrm{MAP}}, \Sigma\right)$$, it follows that the marginal distribution over the network output $$f(x)$$ is also Gaussian, given by $f(x) \mid x, \mathcal{D} \sim \mathcal{N}\left(f\left(x ; \theta_{\mathrm{MAP}}\right), J(x)^{\top} \Sigma J(x)\right).$

It is essentially a gratis Laplace approximation, in the sense that if I have fit the network I can already calculate those Jacobians, so I am probably one line of code away from getting some kind of uncertainty estimate. However, I have no particular guarantee that it is well calibrated, because the simplifications were chosen a priori and might not be appropriate to the combination of model and data that I actually have.
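As a sketch of the linearized predictive, here is a toy "network" with $$d=2$$ parameters and $$k=1$$ output, $$f(x;\theta)=\theta_0 \tanh(\theta_1 x)$$. The model, the values of `theta_map` and `Sigma`, and the function names are all made up for illustration, and the Jacobian is written by hand where in practice one would use automatic differentiation.

```python
import numpy as np


def f(x, theta):
    """Toy model with d = 2 parameters and k = 1 output."""
    return np.array([theta[0] * np.tanh(theta[1] * x)])


def jacobian(x, theta):
    """J(x) in R^{d x k}: derivative of the output w.r.t. each parameter."""
    t = np.tanh(theta[1] * x)
    return np.array([[t], [theta[0] * x * (1.0 - t**2)]])


def linearized_predictive(x, theta_map, Sigma):
    """Mean and covariance of f(x) under the linearization around theta_map."""
    J = jacobian(x, theta_map)
    mean = f(x, theta_map)
    cov = J.T @ Sigma @ J                             # k x k predictive covariance
    return mean, cov


# Hypothetical MAP estimate and Laplace covariance.
theta_map = np.array([2.0, 0.5])
Sigma = np.array([[0.01, 0.002],
                  [0.002, 0.02]])

mean, cov = linearized_predictive(1.0, theta_map, Sigma)
print("predictive mean:", mean)
print("predictive variance:", cov[0, 0])
```

When $$\Sigma$$ is small, pushing samples of $$\theta$$ through the nonlinear model should give nearly the same variance; as $$\Sigma$$ grows, the two drift apart, which is one way to see where the Taylor expansion stops being trustworthy.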

Recent work here ties many of these ideas together:

## Learnable Laplace approximations

Agustinus Kristiadi and team have created various methods for low-overhead neural uncertainty quantification via Laplace approximation that have greater flexibility for adaptively choosing the type and manner of approximation. See, e.g., Painless Uncertainty for Deep Learning and their papers.

Kristiadi, Hein, and Hennig (2021) generalise this to learnable uncertainty, for example to allow the predictive distribution to reflect uncertainty about datapoints drawn far from the training distribution. They define an augmented Learnable Uncertainty Laplace Approximation (LULA) network $$\tilde{f}$$ with more parameters $$\tilde{\theta}=(\theta_{\mathrm{MAP}}, \hat{\theta}).$$

Let $$f: \mathbb{R}^{n} \times \mathbb{R}^{d} \rightarrow \mathbb{R}^{k}$$ be an $$L$$-layer neural network with a MAP-trained parameters $$\theta_{\text{MAP}}$$ and let $$\widetilde{f}: \mathbb{R}^{n} \times \mathbb{R}^{\widetilde{d}} \rightarrow \mathbb{R}^{k}$$ along with $$\widetilde{\theta}_{\text{MAP}}$$ be obtained by adding LULA units. Let $$q(\widetilde{\theta}):=\mathcal{N}\left(\widetilde{\theta}_{\text{MAP}}, \widetilde{\Sigma}\right)$$ be the Laplace-approximated posterior and $$p\left(y \mid x, \mathcal{D} ; \widetilde{\theta}_{\text{MAP}}\right)$$ be the (approximate) predictive distribution under the LA. Furthermore, let us denote the dataset sampled i.i.d. from the data distribution as $$\mathcal{D}_{\text{in}}$$ and that from some outlier distribution as $$\mathcal{D}_{\text{out}}$$, and let $$H$$ be the entropy functional. We construct the following loss function to induce high uncertainty on outliers while maintaining high confidence over the data (inliers):

$\begin{aligned} \mathcal{L}_{\text{LULA}}\left(\widetilde{\theta}_{\text{MAP}}\right) &:= \frac{1}{\left|\mathcal{D}_{\text{in}}\right|} \sum_{x_{\text{in}} \in \mathcal{D}_{\text{in}}} H\left[p\left(y \mid x_{\text{in}}, \mathcal{D} ; \widetilde{\theta}_{\text{MAP}}\right)\right] \\ &\quad - \frac{1}{\left|\mathcal{D}_{\text{out}}\right|} \sum_{x_{\text{out}} \in \mathcal{D}_{\text{out}}} H\left[p\left(y \mid x_{\text{out}}, \mathcal{D} ; \widetilde{\theta}_{\text{MAP}}\right)\right] \end{aligned}$

and minimize it w.r.t. the free parameters $$\widehat{\theta}$$.

I am assuming that by the entropy functional they mean the entropy of the normal distribution, $H(\mathcal{N}(\mu, \Sigma)) = \frac{1}{2}\ln \left((2\pi \mathrm{e})^{k}\det \Sigma\right),$ but this looks expensive due to that determinant calculation in a (large) $$d\times d$$ matrix. Or possibly they mean some general entropy with respect to some density $$p$$, $H(p)=\mathbb{E}_{p}\left[-\log p(x)\right],$ which I suppose one could estimate as $H(p)\approx\frac{1}{N}\sum_{i=1}^N \left[-\log p(x_i)\right]$ without taking that normal Laplace approximation at this step, if we could find the density, and assuming the $$x_i$$ were drawn from it.
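The two candidate readings of the entropy can be compared numerically. The sketch below evaluates the closed-form Gaussian entropy via a log-determinant and the Monte Carlo estimate from samples, on a small made-up $$2 \times 2$$ covariance where both are cheap; the function names and the example covariance are mine.

```python
import numpy as np


def gaussian_entropy(Sigma):
    """Closed-form entropy of N(mu, Sigma); the mean does not matter."""
    k = Sigma.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)              # stable log-determinant
    return 0.5 * (k * np.log(2.0 * np.pi * np.e) + logdet)


def monte_carlo_entropy(log_density, samples):
    """Estimate H(p) = E_p[-log p(x)] from samples drawn from p."""
    return -np.mean(log_density(samples))


def gaussian_log_density(x, Sigma):
    """Log density of a zero-mean Gaussian, evaluated row-wise on x."""
    k = Sigma.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.einsum("ni,ij,nj->n", x, np.linalg.inv(Sigma), x)
    return -0.5 * (k * np.log(2.0 * np.pi) + logdet + quad)


Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(2), Sigma, size=100_000)

exact = gaussian_entropy(Sigma)
estimate = monte_carlo_entropy(lambda x: gaussian_log_density(x, Sigma), samples)
print("closed form:", exact, "Monte Carlo:", estimate)
```

The log-determinant of a dense matrix costs cubic time in its dimension, so the closed form is only attractive when the covariance in question is small or structured.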

The result is a slightly weird hybrid fitting procedure that requires two loss functions and which feels a little ad hoc, but maybe it works?
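For classification, one reading of the loss takes $$H$$ to be the entropy of the categorical predictive distribution. Under that assumption (mine, not necessarily the authors'), the objective itself is cheap to evaluate once predictive probabilities are available. The sketch below only computes the loss from given probability arrays. The function names are hypothetical, and obtaining the probabilities from the Laplace-approximated network, or differentiating with respect to $$\widehat{\theta}$$, is not shown.

```python
import numpy as np


def categorical_entropy(probs, eps=1e-12):
    """Entropy of each row of class probabilities."""
    probs = np.clip(probs, eps, 1.0)                  # avoid log(0)
    return -np.sum(probs * np.log(probs), axis=-1)


def lula_loss(probs_in, probs_out):
    """Mean predictive entropy on inliers minus mean predictive entropy on outliers."""
    return categorical_entropy(probs_in).mean() - categorical_entropy(probs_out).mean()


# Made-up predictive probabilities for a 3-class problem.
confident = np.array([[0.98, 0.01, 0.01],
                      [0.01, 0.98, 0.01]])
uniform = np.full((2, 3), 1.0 / 3.0)

good = lula_loss(confident, uniform)   # confident on inliers, unsure on outliers
bad = lula_loss(uniform, confident)    # the reverse
print(good, bad)
```

Minimizing this quantity rewards low entropy on inliers and high entropy on outliers, which matches the stated goal of the loss.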

## By stochastic weight averaging

A Bayesian extension of a gradient descent trick called Stochastic Weight Averaging.

## For model selection

We can estimate marginal likelihood with respect to hyperparameters by Laplace approximation, apparently. See Immer et al. (2021).
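The standard Laplace estimate of the log evidence is $\log p(\mathcal{D}) \approx \log p(\mathcal{D} \mid \theta_{\mathrm{MAP}}) + \log p(\theta_{\mathrm{MAP}}) + \frac{d}{2}\log 2\pi - \frac{1}{2}\log\det H,$ where $$H$$ is the Hessian of the negative log joint at $$\theta_{\mathrm{MAP}}$$. As a sanity check, the sketch below applies it to Bayesian linear regression, where the posterior is exactly Gaussian and so the estimate should agree with the exact evidence. The model, the data, and the hyperparameter grid are my own illustrative choices, not taken from Immer et al.

```python
import numpy as np


def laplace_log_evidence(X, y, noise_var, prior_var):
    """Laplace estimate of log p(y | X) for y = X theta + noise, theta ~ N(0, prior_var I)."""
    n, d = X.shape
    H = X.T @ X / noise_var + np.eye(d) / prior_var   # Hessian of negative log joint
    theta_map = np.linalg.solve(H, X.T @ y / noise_var)
    resid = y - X @ theta_map
    log_lik = -0.5 * (n * np.log(2 * np.pi * noise_var) + resid @ resid / noise_var)
    log_prior = -0.5 * (d * np.log(2 * np.pi * prior_var)
                        + theta_map @ theta_map / prior_var)
    _, logdet_H = np.linalg.slogdet(H)
    return log_lik + log_prior + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet_H


def exact_log_evidence(X, y, noise_var, prior_var):
    """Exact log evidence: y ~ N(0, noise_var I + prior_var X X^T)."""
    n = X.shape[0]
    K = noise_var * np.eye(n) + prior_var * (X @ X.T)
    _, logdet_K = np.linalg.slogdet(K)
    return -0.5 * (n * np.log(2 * np.pi) + logdet_K + y @ np.linalg.solve(K, y))


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ rng.normal(size=3) + 0.5 * rng.normal(size=30)

approx = laplace_log_evidence(X, y, noise_var=0.25, prior_var=1.0)
exact = exact_log_evidence(X, y, noise_var=0.25, prior_var=1.0)
print(approx, exact)

# Crude hyperparameter selection: pick the prior variance with the highest evidence.
prior_vars = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = [laplace_log_evidence(X, y, 0.25, pv) for pv in prior_vars]
best_prior_var = prior_vars[int(np.argmax(scores))]
print("selected prior variance:", best_prior_var)
```

For a neural network the same formula is only an approximation, and the log-determinant has to be computed on some tractable surrogate for $$H$$.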

## In function spaces

Where the Laplace approximations are Gaussian Processes over some index space. TBC. See Piterbarg and Fatalov (1995); Wacker (2017); Alexanderian et al. (2016); Alexanderian (2021) and possibly Solin and Särkkä (2020) and possibly the INLA stuff.

## INLA

Integrated nested Laplace approximation. TBC.

## Generalized Gauss-Newton and linearization

The generalized Gauss-Newton approximation (GGN) replaces an expensive second-order derivative by a product of first-order derivatives. I am not sure if the Kronecker-factored Approximate Curvature (K-FAC) is a special case of GGN or a synonym.

Various papers analyze this.
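A minimal sketch of the GGN idea for a squared-error loss, where it reduces to the Jacobian product $$J^{\top} J$$: the toy model $$f(x;\theta)=\theta_0\tanh(\theta_1 x)$$, the data, and the function names are assumptions for illustration, and the full Hessian is obtained by finite differences of the gradient purely for comparison.

```python
import numpy as np


def model(theta, x):
    return theta[0] * np.tanh(theta[1] * x)


def model_jacobian(theta, x):
    """Jacobian stacked over data: rows are data points, columns are parameters."""
    t = np.tanh(theta[1] * x)
    return np.stack([t, theta[0] * x * (1.0 - t**2)], axis=1)


def grad_loss(theta, x, y):
    """Gradient of the loss 0.5 * sum((f(x; theta) - y)^2)."""
    return model_jacobian(theta, x).T @ (model(theta, x) - y)


def ggn(theta, x):
    """Generalized Gauss-Newton matrix for squared error: first derivatives only."""
    J = model_jacobian(theta, x)
    return J.T @ J


def full_hessian(theta, x, y, h=1e-5):
    """Full Hessian of the loss by central differences of the gradient."""
    d = theta.size
    H = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = h
        H[:, j] = (grad_loss(theta + e, x, y) - grad_loss(theta - e, x, y)) / (2 * h)
    return H


theta = np.array([2.0, 0.5])
x = np.linspace(-2.0, 2.0, 50)
y_exact = model(theta, x)                              # zero residuals
y_noisy = y_exact + 0.3 * np.random.default_rng(0).normal(size=x.size)

G = ggn(theta, x)
H_exact = full_hessian(theta, x, y_exact)
H_noisy = full_hessian(theta, x, y_noisy)
print("GGN:\n", G)
print("Hessian, zero residuals:\n", H_exact)
print("Hessian, noisy targets:\n", H_noisy)
```

The term the GGN drops is the residual-weighted second derivative of the model, so the two matrices coincide when the residuals vanish and differ otherwise. Unlike the full Hessian, the GGN is positive semi-definite by construction, which is convenient when it has to be inverted into a covariance.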

## References

Alexanderian, Alen. 2021. arXiv:2005.12998 [Math], January.
Alexanderian, Alen, Noemi Petra, Georg Stadler, and Omar Ghattas. 2016. SIAM Journal on Scientific Computing 38 (1): A243–72.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Information Science and Statistics. New York: Springer.
Breslow, N. E., and D. G. Clayton. 1993. Journal of the American Statistical Association 88 (421): 9–25.
Daxberger, Erik, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Matthias Bauer, and Philipp Hennig. 2021. In arXiv:2106.14806 [Cs, Stat].
Flaxman, Seth, Andrew Gordon Wilson, Daniel B Neill, Hannes Nickisch, and Alexander J Smola. 2015. “Fast Kronecker Inference in Gaussian Processes with Non-Gaussian Likelihoods.” In, 10.
Foong, Andrew Y. K., Yingzhen Li, José Miguel Hernández-Lobato, and Richard E. Turner. 2019. arXiv:1906.11537 [Cs, Stat], June.
Gorad, Ajinkya, Zheng Zhao, and Simo Särkkä. 2020. “Parameter Estimation in Non-Linear State-Space Models by Automatic Differentiation of Non-Linear Kalman Filters.” In, 6.
Huggins, Jonathan H., Trevor Campbell, Mikołaj Kasprzak, and Tamara Broderick. 2018. arXiv:1809.09505 [Cs, Math, Stat], September.
Immer, Alexander, Matthias Bauer, Vincent Fortuin, Gunnar Rätsch, and Mohammad Emtiyaz Khan. 2021. arXiv:2104.04975 [Cs, Stat], June.
Immer, Alexander, Maciej Korzepa, and Matthias Bauer. 2021. In International Conference on Artificial Intelligence and Statistics, 703–11. PMLR.
Ingebrigtsen, Rikke, Finn Lindgren, and Ingelin Steinsland. 2014. Spatial Statistics, Spatial Statistics Miami, 8 (May): 20–38.
Izmailov, Pavel, Wesley J. Maddox, Polina Kirichenko, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2020. In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, 1169–79. PMLR.
Izmailov, Pavel, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2018. March.
Khan, Mohammad Emtiyaz, Alexander Immer, Ehsan Abedi, and Maciej Korzepa. 2020. arXiv:1906.01930 [Cs, Stat], July.
Kristiadi, Agustinus, Matthias Hein, and Philipp Hennig. 2020. In ICML 2020.
———. 2021. In Uncertainty in Artificial Intelligence.
Lindgren, Finn, and Håvard Rue. 2015. Journal of Statistical Software 63 (i19): 1–25.
Long, Quan, Marco Scavino, Raúl Tempone, and Suojin Wang. 2013. Computer Methods in Applied Mechanics and Engineering 259 (June): 24–39.
MacKay, David J C. 2002. Information Theory, Inference & Learning Algorithms. Cambridge University Press.
MacKay, David J. C. 1992. Neural Computation 4 (3): 448–72.
Maddox, Wesley, Timur Garipov, Pavel Izmailov, Dmitry Vetrov, and Andrew Gordon Wilson. 2019. February.
Martens, James, and Roger Grosse. 2015. In Proceedings of the 32nd International Conference on Machine Learning, 2408–17. PMLR.
Opitz, Thomas, Raphaël Huser, Haakon Bakka, and Håvard Rue. 2018. Extremes 21 (3): 441–62.
Petersen, Kaare Brandt, and Michael Syskind Pedersen. 2012.
Piterbarg, V. I., and V. R. Fatalov. 1995. Russian Mathematical Surveys 50 (6): 1151.
Ritter, Hippolyt, Aleksandar Botev, and David Barber. 2018. In.
Rue, Håvard, Andrea Riebler, Sigrunn H. Sørbye, Janine B. Illian, Daniel P. Simpson, and Finn K. Lindgren. 2016. arXiv:1604.00860 [Stat], September.
Saumard, Adrien, and Jon A. Wellner. 2014. arXiv:1404.5886 [Math, Stat], April.
Snoek, Jasper, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md Mostofa Ali Patwary, Prabhat, and Ryan P. Adams. 2015. arXiv:1502.05700 [Stat], July.
Solin, Arno, and Simo Särkkä. 2020. Statistics and Computing 30 (2): 419–46.
Tang, Yanbo, and Nancy Reid. 2021. arXiv:2107.10885 [Math, Stat], July.
Wacker, Philipp. 2017. arXiv:1701.07989 [Math], April.
Wilson, Andrew Gordon, and Pavel Izmailov. 2020. February.
