Laplace approximations in inference

Lightweight uncertainties, especially for heavy neural nets



Assumed audience:

People who vaguely remember the Laplace approximation, maybe from calculus, and want to know how to make it useful in modern inference, where it turns out there is a lot going on.

\(\renewcommand{\var}{\operatorname{Var}} \renewcommand{\corr}{\operatorname{Corr}} \renewcommand{\dd}{\mathrm{d}} \renewcommand{\bb}[1]{\mathbb{#1}} \renewcommand{\vv}[1]{\boldsymbol{#1}} \renewcommand{\mm}[1]{\mathrm{#1}} \renewcommand{\dist}[1]{\mathcal{#1}} \renewcommand{\rv}[1]{\mathsf{#1}} \renewcommand{\vrv}[1]{\vv{\rv{#1}}} \renewcommand{\disteq}{\stackrel{d}{=}} \renewcommand{\gvn}{\mid} \renewcommand{\Ex}{\mathbb{E}} \renewcommand{\Pr}{\mathbb{P}}\)

Second mode? I see no second mode.

Approximating probability distributions by a Gaussian with the same mode. Thanks to limit theorems, this is not always a terrible idea, especially since neural networks seem pretty keen to converge to Gaussians in various senses.

Specifically, we (possibly locally) approximate some posterior density of interest using a Gaussian \[ p(\theta \mid \mathcal{D}) \approx \mathcal{N}\left(\theta_{\mathrm{MAP}}, \Sigma\right). \] The Laplace trick in general uses a local curvature estimate to choose the covariance.
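
To make the curvature idea concrete, here is a minimal sketch of a one-dimensional Laplace approximation in plain Python. Everything in it is an illustrative assumption rather than something from the references: the target is an unnormalised Gamma(shape 5, rate 2) density standing in for a posterior, the helper name `laplace_1d` is hypothetical, and derivatives are taken by finite differences. It climbs to the mode by Newton's method and then sets the variance to the negative inverse second derivative of the log density there.

```python
import math

def log_density(theta, a=5.0, b=2.0):
    """Unnormalised log density of a Gamma(shape=a, rate=b) stand-in posterior."""
    return (a - 1.0) * math.log(theta) - b * theta

def laplace_1d(log_p, theta0, h=1e-4, steps=50):
    """Newton ascent to the mode, then variance = -1 / (second derivative)."""
    theta = theta0
    for _ in range(steps):
        # Central finite differences for the first and second derivatives.
        grad = (log_p(theta + h) - log_p(theta - h)) / (2 * h)
        hess = (log_p(theta + h) - 2 * log_p(theta) + log_p(theta - h)) / h**2
        theta -= grad / hess
    hess = (log_p(theta + h) - 2 * log_p(theta) + log_p(theta - h)) / h**2
    return theta, -1.0 / hess

# Gaussian approximation N(mode, var) to the stand-in posterior.
mode, var = laplace_1d(log_density, theta0=1.0)
```

For this target the sketch lands on a mode of about 2 and a variance of about 1, whereas the Gamma density itself has mean 2.5 and variance 1.25, so the approximation is blind to the skew.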

I am particularly keen to see it work for probabilistic neural nets, where it is a classic method (Mackay 1992). There are many variants of the technique for different assumptions. We often see this applied to estimating the posterior predictive density, which is usually compact enough to be tractable, but we can also apply it to input uncertainty and even parameters under certain simplifications (Foong et al. 2019; Immer, Korzepa, and Bauer 2021).

Basic

In classic Laplace approximation, we assume that the parameters of our model have a Gaussian distribution (independent, why not) both a priori and a posteriori. Specifically, we approximate the posterior using the local curvature of the log posterior (log-likelihood plus log-prior) at some maximum a posteriori estimate.

⚠️ The next bit needs a do-over. I completely lost track of the prior and the hyperparameters.

Writing that in symbols, the basic idea is that we hold the input \(x \in \mathbb{R}^{n}\) fixed and use the Jacobian matrix \(J(x):=\left.\nabla_{\theta} f(x ; \theta)\right|_{\theta_{\mathrm{MAP}}} \in \mathbb{R}^{d \times k}\), where \(d\) is the number of parameters and \(k\) is the output dimension, to linearise the network as \[ f(x ; \theta) \approx f\left(x ; \theta_{\mathrm{MAP}}\right)+J(x)^{\top}\left(\theta-\theta_{\mathrm{MAP}}\right) \] where the approximation is justified by a first-order Taylor expansion and some smoothness assumptions on \(f\). Under this approximation, since \(\theta\) is by assumption Gaussian, \(\theta \sim \mathcal{N}\left(\theta_{\mathrm{MAP}}, \Sigma\right)\), it follows that the marginal distribution over the network output \(f(x)\) is also Gaussian, given by \[ f(x) \mid x, \mathcal{D} \sim \mathcal{N}\left(f\left(x ; \theta_{\mathrm{MAP}}\right), J(x)^{\top} \Sigma J(x)\right). \] For more on this, see e.g. (Bishop 2006, 5.167, 5.188). This is essentially a gratis Laplace approximation, in the sense that if I have fit the network I can already calculate those Jacobians, so I am probably one line of code away from getting some kind of uncertainty estimate. However, I have no particular guarantee that it is well calibrated: is it meaningfully estimating the “true” uncertainty in the model, given all the simplifications in this model structure?
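
As a sanity check on how cheap this is, here is a minimal sketch of the linearised predictive for a toy model in plain Python. The model \(f(x;\theta)=\theta_1\tanh(\theta_2 x)\), the MAP value and the covariance \(\Sigma\) are all made-up numbers for illustration, and the output is scalar, so \(J(x)\) is just a gradient vector.

```python
import math

def f(x, theta):
    """Toy one-hidden-unit 'network' with scalar output."""
    w_out, w_in = theta
    return w_out * math.tanh(w_in * x)

def jacobian(x, theta):
    """Analytic gradient of f with respect to theta (a vector of length d = 2)."""
    w_out, w_in = theta
    t = math.tanh(w_in * x)
    return [t, w_out * x * (1.0 - t * t)]

def linearised_predictive(x, theta_map, Sigma):
    """Mean f(x; theta_MAP) and variance J(x)^T Sigma J(x)."""
    J = jacobian(x, theta_map)
    d = len(J)
    var = sum(J[i] * Sigma[i][j] * J[j] for i in range(d) for j in range(d))
    return f(x, theta_map), var

theta_map = [1.5, 0.8]                 # hypothetical MAP estimate
Sigma = [[0.04, 0.01], [0.01, 0.09]]   # hypothetical posterior covariance
mean0, var0 = linearised_predictive(0.0, theta_map, Sigma)
mean1, var1 = linearised_predictive(1.0, theta_map, Sigma)
```

Note that the variance at \(x=0\) is exactly zero for this toy model, because the output there does not depend on either parameter. The linearised variance only measures sensitivity of the output to the parameters, which is one reason to be wary about calibration.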

Last Layer Laplace

a.k.a. Neural Linear models.

AFAICT this case is the simplest. We are concerned with the density over the predictive, so we start with a neural network. Then we treat all the layers up to the last one as a feature generator, and treat the last layer probabilistically, as an adaptive-basis regression or classification problem, to get a decent learnable predictive uncertainty. I think this was implicit in Mackay (1992), but it was named in Snoek et al. (2015), and critiqued and extended in Lorsung (2021).

For a simple practical example, see the Probflow tutorial.

Under a last-layer Laplace approximation, we write the model as \(\vrv{y}= \Phi(\vrv{u})\vrv{r} + \vv{\varepsilon}\) with observation noise \(\vv{\varepsilon}\sim\dist{N}\left(\vv{0}, \sigma^2\mm{I}\right)\), so the joint distribution is \[\begin{align*} \left.\left[\begin{array}{c} \vrv{y} \\ \vrv{r} \end{array}\right]\right|\vrv{u} &\sim\dist{N}\left( \left[\begin{array}{c} \vv{m}_{\vrv{y}}\\ \vv{m}_{\vrv{r}} \end{array}\right], \left[\begin{array}{cc} \mm{K}_{\vrv{y}\vrv{y}} & \mm{K}_{\vrv{y}\vrv{r}} \\ \mm{K}_{\vrv{y}\vrv{r}}^{\top} & \mm{K}_{\vrv{r}\vrv{r}} \end{array}\right] \right) \end{align*}\] with \[\begin{align*} \vv{m}_{\vrv{y}} &=\Phi(\vrv{u})\vv{m}_{\vrv{r}} \\ \mm{K}_{\vrv{y}\vrv{r}} &=\Phi(\vrv{u}) \mm{K}_{\vrv{r}\vrv{r}}\\ \mm{K}_{\vrv{y}\vrv{y}} &= \Phi(\vrv{u})\mm{K}_{\vrv{r}\vrv{r}} \Phi^{\top} (\vrv{u})+ \sigma^2\mm{I}. \end{align*}\] Here \(\vrv{r}\sim \dist{N}\left(\vv{m}_{\vrv{r}}, \mm{K}_{\vrv{r}\vrv{r}}\right)\) is the random weighting, and \(\Phi(\vrv{u})\) is called the feature map.
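
Here is a minimal sketch of the last-layer idea for regression in plain Python, under assumptions that are mine rather than the references': a fixed two-dimensional feature map `features` stands in for all the trained layers before the last one, the prior on the weights \(\vrv{r}\) is zero-mean with precision `alpha`, and the noise variance `sigma2` is known. With a Gaussian likelihood the posterior over the last-layer weights is exactly Gaussian, so the Laplace approximation costs nothing here. The predictive variance has the same shape as the \(\Phi\mm{K}_{\vrv{r}\vrv{r}}\Phi^{\top}+\sigma^2\mm{I}\) term above, with the posterior covariance `S` in place of \(\mm{K}_{\vrv{r}\vrv{r}}\).

```python
import math

def features(u):
    """Hypothetical fixed feature map Phi(u), standing in for all layers but the last."""
    return [1.0, math.tanh(u)]

def last_layer_posterior(us, ys, alpha=1.0, sigma2=0.1):
    """Gaussian posterior N(m, S) over the two last-layer weights r."""
    # Precision A = alpha * I + Phi^T Phi / sigma2, and b = Phi^T y / sigma2.
    A = [[alpha, 0.0], [0.0, alpha]]
    b = [0.0, 0.0]
    for u, y in zip(us, ys):
        phi = features(u)
        for i in range(2):
            b[i] += phi[i] * y / sigma2
            for j in range(2):
                A[i][j] += phi[i] * phi[j] / sigma2
    # Invert the 2x2 precision matrix by hand to get the covariance S.
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    S = [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]
    m = [S[i][0] * b[0] + S[i][1] * b[1] for i in range(2)]
    return m, S

def predict(u, m, S, sigma2=0.1):
    """Predictive mean phi^T m and variance phi^T S phi + sigma2."""
    phi = features(u)
    mean = sum(phi[i] * m[i] for i in range(2))
    var = sum(phi[i] * S[i][j] * phi[j] for i in range(2) for j in range(2)) + sigma2
    return mean, var

us = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [0.5 + 2.0 * math.tanh(u) for u in us]   # noiseless toy targets
m, S = last_layer_posterior(us, ys)
mean, var = predict(0.5, m, S)
```

The posterior mean `m` is shrunk slightly towards zero relative to the generating weights (0.5, 2.0) because of the prior, and the predictive variance never drops below the noise floor `sigma2`.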

Learnable Laplace approximations

NB ⚠️⛔️☣️: This section is vague, half-arsed notes of dubious accuracy.

Agustinus Kristiadi and team have created various methods for low-overhead neural uncertainty quantification via Laplace approximation that have greater flexibility for adaptively choosing the type and manner of approximation. See, e.g. Painless Uncertainty for Deep Learning and their papers (Kristiadi, Hein, and Hennig 2020, 2021).

Kristiadi, Hein, and Hennig (2021) generalise this to learnable uncertainty, for example to allow the distribution to reflect uncertainty about datapoints drawn far from the training distribution. They define an augmented Learnable Uncertainty Laplace Approximation (LULA) network \(\tilde{f}\) with more parameters \(\tilde{\theta}=(\theta_{\mathrm{MAP}}, \hat{\theta}).\)

Let \(f: \mathbb{R}^{n} \times \mathbb{R}^{d} \rightarrow \mathbb{R}^{k}\) be an \(L\)-layer neural network with MAP-trained parameters \(\theta_{\mathrm{MAP}}\) and let \(\widetilde{f}: \mathbb{R}^{n} \times \mathbb{R}^{\widetilde{d}} \rightarrow \mathbb{R}^{k}\) along with \(\widetilde{\theta}_{\mathrm{MAP}}\) be obtained by adding LULA units. Let \(q(\widetilde{\theta}):=\mathcal{N}\left(\widetilde{\theta}_{\mathrm{MAP}}, \widetilde{\Sigma}\right)\) be the Laplace-approximated posterior and \(p\left(y \mid x, \mathcal{D} ; \widetilde{\theta}_{\mathrm{MAP}}\right)\) be the (approximate) predictive distribution under the LA. Furthermore, let us denote the dataset sampled i.i.d. from the data distribution as \(\mathcal{D}_{\mathrm{in}}\) and that from some outlier distribution as \(\mathcal{D}_{\mathrm{out}}\), and let \(H\) be the entropy functional. We construct the following loss function to induce high uncertainty on outliers while maintaining high confidence over the data (inliers): \[ \begin{array}{rl} \mathcal{L}_{\mathrm{LULA}}\left(\widetilde{\theta}_{\mathrm{MAP}}\right)&:=\frac{1}{\left|\mathcal{D}_{\mathrm{in}}\right|} \sum_{x_{\mathrm{in}} \in \mathcal{D}_{\mathrm{in}}} H\left[p\left(y \mid x_{\mathrm{in}}, \mathcal{D} ; \widetilde{\theta}_{\mathrm{MAP}}\right)\right] \\ &-\frac{1}{\left|\mathcal{D}_{\mathrm{out}}\right|} \sum_{x_{\mathrm{out}} \in \mathcal{D}_{\mathrm{out}}} H\left[p\left(y \mid x_{\mathrm{out}}, \mathcal{D} ; \widetilde{\theta}_{\mathrm{MAP}}\right)\right] \end{array} \] and minimize it w.r.t. the free parameters \(\widehat{\theta}\).

I am assuming that by the entropy functional they mean the entropy of the normal distribution, \[ H(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})) = \frac{1}{2}\ln \left((2\pi \mathrm{e})^{k}\det \left(\boldsymbol{\Sigma}\right)\right) \] but this looks expensive if it needs the determinant of a (large) \(d\times d\) matrix. Or possibly they mean some general entropy with respect to some density \(p\), \[H(p)=\mathbb{E}_{p}\left[-\log p(x)\right]\] which I suppose one could estimate as \[H(p)\approx\frac{1}{N}\sum_{i=1}^N \left[-\log p(x_i)\right]\] without taking that normal Laplace approximation at this step, if we could find the density, and assuming the \(x_i\) were drawn from it.
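
To see how the two readings of the entropy relate, here is a minimal sketch in plain Python that compares the closed-form Gaussian entropy with the sample-average estimate. It is one-dimensional, the mean and variance are arbitrary illustrative numbers, and it has nothing LULA-specific in it.

```python
import math
import random

def gaussian_entropy(var):
    """Closed-form entropy of a 1-D Gaussian, 0.5 * ln(2 * pi * e * var)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

def gaussian_log_pdf(x, mu, var):
    """Log density of N(mu, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mu) ** 2 / var

def mc_entropy(log_p, samples):
    """Monte Carlo estimate of H(p) = E_p[-log p(x)] from samples drawn from p."""
    return -sum(log_p(x) for x in samples) / len(samples)

rng = random.Random(0)      # fixed seed so the run is repeatable
mu, var = 1.0, 0.25         # arbitrary illustrative values
samples = [rng.gauss(mu, math.sqrt(var)) for _ in range(20000)]
h_exact = gaussian_entropy(var)
h_mc = mc_entropy(lambda x: gaussian_log_pdf(x, mu, var), samples)
```

The two numbers should agree up to Monte Carlo error. The sample-average version never needs a determinant, but it does need both samples from \(p\) and a way to evaluate \(\log p\).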

The result is a slightly weird hybrid fitting procedure that requires two loss functions and which feels a little ad hoc, but maybe it works?

By stochastic weight averaging

A quasi-Bayesian extension of a gradient descent trick called Stochastic Weight Averaging (Izmailov et al. 2018, 2020; Maddox et al. 2019; Wilson and Izmailov 2020). AFAICT it precludes using a prior? And it may not actually be a Laplace approximation per se?

Variational

This is a different objective, no longer centred at a MAP estimate. See Variational Gaussian Approximation.

For model selection

NB ⚠️⛔️☣️: This section is vague, half-arsed notes of dubious accuracy.

We can estimate marginal likelihood with respect to hyperparameters by Laplace approximation, apparently. See Immer et al. (2021).
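
For orientation, the textbook form of that estimate, as I understand it (this is the generic Laplace evidence approximation, not a transcription of the scalable variant in Immer et al.), for a model \(\mathcal{M}\) with \(d\) parameters is \[ \log p(\mathcal{D} \mid \mathcal{M}) \approx \log p\left(\mathcal{D} \mid \theta_{\mathrm{MAP}}, \mathcal{M}\right)+\log p\left(\theta_{\mathrm{MAP}} \mid \mathcal{M}\right)+\frac{d}{2} \log 2 \pi-\frac{1}{2} \log \det H, \qquad H:=-\left.\nabla_{\theta}^{2} \log p(\mathcal{D}, \theta \mid \mathcal{M})\right|_{\theta_{\mathrm{MAP}}}. \] With \(\Sigma=H^{-1}\) the last term is \(+\frac{1}{2}\log\det\Sigma\), so the same curvature that gives the posterior covariance also gives the evidence estimate. Hyperparameters enter through \(\mathcal{M}\) and can be chosen by maximising the right-hand side.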

In function spaces

Where the Laplace approximations are Gaussian Processes over some index space. TBC. See Piterbarg and Fatalov (1995); Wacker (2017); Alexanderian et al. (2016); Alexanderian (2021) and possibly Solin and Särkkä (2020) and possibly the INLA stuff.

INLA

Integrated nested Laplace approximation leverages the GP-as-SDE idea to generalise Matérn-type covariances to interesting domains and non-stationarity.

Second order gradient matrices

See 2nd order gradient descent.

Other covariance factorisations

I think the Kronecker-factored Approximate Curvature (K-FAC) is a famous one. There are others? (Flaxman et al. 2015; Martens and Grosse 2015; Ritter, Botev, and Barber 2018)

Tools

Agustinus Kristiadi’s Modern Arts of Laplace Approximations introduces a toolkit combining all the NN Laplace tricks, and is an excellent start.

References

Alexanderian, Alen. 2021. “Optimal Experimental Design for Infinite-Dimensional Bayesian Inverse Problems Governed by PDEs: A Review.” arXiv:2005.12998 [Math], January.
Alexanderian, Alen, Noemi Petra, Georg Stadler, and Omar Ghattas. 2016. “A Fast and Scalable Method for A-Optimal Design of Experiments for Infinite-Dimensional Bayesian Nonlinear Inverse Problems.” SIAM Journal on Scientific Computing 38 (1): A243–72.
Arras, Kai Oliver. 1998. “An Introduction To Error Propagation: Derivation, Meaning and Examples of Equation \(C_Y = F_X C_X F_X^{\top}\),” 22.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Information Science and Statistics. New York: Springer.
Breslow, N. E., and D. G. Clayton. 1993. “Approximate Inference in Generalized Linear Mixed Models.” Journal of the American Statistical Association 88 (421): 9–25.
Daxberger, Erik, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Matthias Bauer, and Philipp Hennig. 2021. “Laplace Redux – Effortless Bayesian Deep Learning.” In arXiv:2106.14806 [Cs, Stat].
Flaxman, Seth, Andrew Gordon Wilson, Daniel B Neill, Hannes Nickisch, and Alexander J Smola. 2015. “Fast Kronecker Inference in Gaussian Processes with Non-Gaussian Likelihoods.” In, 10.
Foong, Andrew Y. K., Yingzhen Li, José Miguel Hernández-Lobato, and Richard E. Turner. 2019. “‘In-Between’ Uncertainty in Bayesian Neural Networks.” arXiv:1906.11537 [Cs, Stat], June.
Gorad, Ajinkya, Zheng Zhao, and Simo Särkkä. 2020. “Parameter Estimation in Non-Linear State-Space Models by Automatic Differentiation of Non-Linear Kalman Filters.” In, 6.
Huggins, Jonathan H., Trevor Campbell, Mikołaj Kasprzak, and Tamara Broderick. 2018. “Practical Bounds on the Error of Bayesian Posterior Approximations: A Nonasymptotic Approach.” arXiv:1809.09505 [Cs, Math, Stat], September.
Immer, Alexander, Matthias Bauer, Vincent Fortuin, Gunnar Rätsch, and Mohammad Emtiyaz Khan. 2021. “Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning.” arXiv:2104.04975 [Cs, Stat], June.
Immer, Alexander, Maciej Korzepa, and Matthias Bauer. 2021. “Improving Predictions of Bayesian Neural Nets via Local Linearization.” In International Conference on Artificial Intelligence and Statistics, 703–11. PMLR.
Ingebrigtsen, Rikke, Finn Lindgren, and Ingelin Steinsland. 2014. “Spatial Models with Explanatory Variables in the Dependence Structure.” Spatial Statistics, Spatial Statistics Miami, 8 (May): 20–38.
Izmailov, Pavel, Wesley J. Maddox, Polina Kirichenko, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2020. “Subspace Inference for Bayesian Deep Learning.” In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, 1169–79. PMLR.
Izmailov, Pavel, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2018. “Averaging Weights Leads to Wider Optima and Better Generalization,” March.
Khan, Mohammad Emtiyaz, Alexander Immer, Ehsan Abedi, and Maciej Korzepa. 2020. “Approximate Inference Turns Deep Networks into Gaussian Processes.” arXiv:1906.01930 [Cs, Stat], July.
Kristiadi, Agustinus, Matthias Hein, and Philipp Hennig. 2020. “Being Bayesian, Even Just a Bit, Fixes Overconfidence in ReLU Networks.” In ICML 2020.
Kristiadi, Agustinus, Matthias Hein, and Philipp Hennig. 2021. “Learnable Uncertainty Under Laplace Approximations.” In Uncertainty in Artificial Intelligence.
Lindgren, Finn, and Håvard Rue. 2015. “Bayesian Spatial Modelling with R-INLA.” Journal of Statistical Software 63 (i19): 1–25.
Long, Quan, Marco Scavino, Raúl Tempone, and Suojin Wang. 2013. “Fast Estimation of Expected Information Gains for Bayesian Experimental Designs Based on Laplace Approximations.” Computer Methods in Applied Mechanics and Engineering 259 (June): 24–39.
Lorsung, Cooper. 2021. “Understanding Uncertainty in Bayesian Deep Learning.” arXiv.
MacKay, David J C. 2002. Information Theory, Inference & Learning Algorithms. Cambridge University Press.
Mackay, David J. C. 1992. “A Practical Bayesian Framework for Backpropagation Networks.” Neural Computation 4 (3): 448–72.
Maddox, Wesley, Timur Garipov, Pavel Izmailov, Dmitry Vetrov, and Andrew Gordon Wilson. 2019. “A Simple Baseline for Bayesian Uncertainty in Deep Learning,” February.
Margossian, Charles C., Aki Vehtari, Daniel Simpson, and Raj Agrawal. 2020. “Hamiltonian Monte Carlo Using an Adjoint-Differentiated Laplace Approximation: Bayesian Inference for Latent Gaussian Models and Beyond.” arXiv:2004.12550 [Stat], October.
Martens, James, and Roger Grosse. 2015. “Optimizing Neural Networks with Kronecker-Factored Approximate Curvature.” In Proceedings of the 32nd International Conference on Machine Learning, 2408–17. PMLR.
Martino, Sara, and Andrea Riebler. 2019. “Integrated Nested Laplace Approximations (INLA).” arXiv.
Ober, Sebastian W., and Carl E. Rasmussen. 2019. “Benchmarking the Neural Linear Model for Regression.” In. arXiv.
Opitz, Thomas, Raphaël Huser, Haakon Bakka, and Håvard Rue. 2018. “INLA Goes Extreme: Bayesian Tail Regression for the Estimation of High Spatio-Temporal Quantiles.” Extremes 21 (3): 441–62.
Papadopoulos, G., P.J. Edwards, and A.F. Murray. 2001. “Confidence Estimation Methods for Neural Networks: A Practical Comparison.” IEEE Transactions on Neural Networks 12 (6): 1278–87.
Petersen, Kaare Brandt, and Michael Syskind Pedersen. 2012. “The Matrix Cookbook.”
Piterbarg, V. I., and V. R. Fatalov. 1995. “The Laplace Method for Probability Measures in Banach Spaces.” Russian Mathematical Surveys 50 (6): 1151.
Ritter, Hippolyt, Aleksandar Botev, and David Barber. 2018. “A Scalable Laplace Approximation for Neural Networks.” In.
Rue, Håvard, Sara Martino, and Nicolas Chopin. 2009. “Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (2): 319–92.
Rue, Håvard, Andrea Riebler, Sigrunn H. Sørbye, Janine B. Illian, Daniel P. Simpson, and Finn K. Lindgren. 2016. “Bayesian Computing with INLA: A Review.” arXiv:1604.00860 [Stat], September.
Saumard, Adrien, and Jon A. Wellner. 2014. “Log-Concavity and Strong Log-Concavity: A Review.” arXiv:1404.5886 [Math, Stat], April.
Schraudolph, Nicol N. 2002. “Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent.” Neural Computation 14 (7): 1723–38.
Snoek, Jasper, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md Mostofa Ali Patwary, Prabhat, and Ryan P. Adams. 2015. “Scalable Bayesian Optimization Using Deep Neural Networks.” In Proceedings of the 32nd International Conference on Machine Learning.
Solin, Arno, and Simo Särkkä. 2020. “Hilbert Space Methods for Reduced-Rank Gaussian Process Regression.” Statistics and Computing 30 (2): 419–46.
Stuart, Andrew M., and Aretha L. Teckentrup. 2016. “Posterior Consistency for Gaussian Process Approximations of Bayesian Posterior Distributions.” arXiv:1603.02004 [Math], December.
Tang, Yanbo, and Nancy Reid. 2021. “Laplace and Saddlepoint Approximations in High Dimensions.” arXiv:2107.10885 [Math, Stat], July.
Wacker, Philipp. 2017. “Laplace’s Method in Bayesian Inverse Problems.” arXiv:1701.07989 [Math], April.
Watson, Joe, Jihao Andreas Lin, Pascal Klink, and Jan Peters. 2020. “Neural Linear Models with Functional Gaussian Process Priors,” 10.
Wilson, Andrew Gordon, and Pavel Izmailov. 2020. “Bayesian Deep Learning and a Probabilistic Perspective of Generalization,” February.
