Laplace approximations in inference

Lightweight uncertainties, especially for heavy neural nets



Second mode? I see no second mode.

Approximating probability distributions by a Gaussian with the same mode. Thanks to limit theorems this is not always a terrible idea, especially since neural networks seem pretty keen to converge to Gaussians in various senses.

Specifically, it locally approximates the posterior by a Gaussian, \[ p(\theta \mid \mathcal{D}) \approx \mathcal{N}\left(\theta_{\mathrm{MAP}}, \Sigma\right), \] where \(\theta_{\mathrm{MAP}}\) is the posterior mode and \(\Sigma\) is (typically) the inverse Hessian of the negative log posterior at that mode.
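Here is a minimal sketch of that recipe in JAX, on a made-up logistic-regression posterior; the toy data, the plain gradient-descent loop and all the names are mine, not from any particular library.

```python
# Sketch of a Laplace approximation: find theta_MAP, then take the inverse Hessian
# of the negative log joint at the mode as the Gaussian covariance.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (100, 3))                      # toy design matrix
y = (X @ jnp.array([1.0, -2.0, 0.5]) > 0).astype(jnp.float32)  # toy labels

def neg_log_joint(theta):
    """Negative log posterior (up to a constant): logistic likelihood + N(0, I) prior."""
    logits = X @ theta
    log_lik = jnp.sum(y * jax.nn.log_sigmoid(logits) + (1 - y) * jax.nn.log_sigmoid(-logits))
    log_prior = -0.5 * jnp.sum(theta ** 2)
    return -(log_lik + log_prior)

# Find theta_MAP by plain gradient descent; any optimiser would do.
theta_map = jnp.zeros(3)
grad = jax.jit(jax.grad(neg_log_joint))
for _ in range(2000):
    theta_map = theta_map - 0.01 * grad(theta_map)

# Laplace: curvature at the mode gives the Gaussian covariance.
H = jax.hessian(neg_log_joint)(theta_map)                 # d x d Hessian at theta_MAP
Sigma = jnp.linalg.inv(H)                                 # approximate posterior covariance
# p(theta | D) is then approximated by N(theta_map, Sigma).
```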

This makes it look like a special case of approximate method of moments, or integral probability metrics. Is that the case?

I am particularly keen to see it work for probabilistic neural nets. Such an approach is classic for neural nets (David J. C. MacKay 1992). There are many variants of this technique for different assumptions. Laplace approximations have the attractive feature of providing estimates for forward and inverse problems (Foong et al. 2019; Immer, Korzepa, and Bauer 2021).

The basic idea is that we hold \(x \in \mathbb{R}^{n}\) fixed and use the Jacobian matrix \(J(x):=\left.\nabla_{\theta} f(x ; \theta)\right|_{\theta_{\mathrm{MAP}}} \in \mathbb{R}^{d \times k}\) to linearize the network as \[ f(x ; \theta) \approx f\left(x ; \theta_{\mathrm{MAP}}\right)+J(x)^{\top}\left(\theta-\theta_{\mathrm{MAP}}\right), \] i.e. a first-order Taylor expansion about \(\theta_{\mathrm{MAP}}\). Under this approximation, since \(\theta\) is a posteriori distributed as Gaussian \(\mathcal{N}\left(\theta_{\mathrm{MAP}}, \Sigma\right)\), it follows that the marginal distribution over the network output \(f(x)\) is also Gaussian, \[ f(x) \mid x, \mathcal{D} \sim \mathcal{N}\left(f\left(x ; \theta_{\mathrm{MAP}}\right), J(x)^{\top} \Sigma J(x)\right). \] For more on this, see Bishop (2006, eqs. 5.167, 5.188). It is essentially a gratis Laplace approximation in the sense that if I have fit the network I can already calculate those Jacobians, so I am probably one line of code away from getting some kind of uncertainty estimate. However, I have no particular guarantee that it is well calibrated, because the simplifications were chosen a priori and might not be appropriate to the combination of model and data that I actually have.
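For concreteness, here is that one extra line plus scaffolding in JAX; the tiny packed-parameter network \(f\) and the particular \(\Sigma\) are placeholders of my own, not anything from the cited papers.

```python
# Sketch of the linearised ("gratis") predictive N(f(x; theta_MAP), J(x)^T Sigma J(x)).
import jax
import jax.numpy as jnp

def f(x, theta):
    """Stand-in network: a tiny 3 -> 2 -> 2 MLP with parameters packed into a flat vector."""
    W1, b1 = theta[:6].reshape(3, 2), theta[6:8]
    W2, b2 = theta[8:12].reshape(2, 2), theta[12:14]
    return jnp.tanh(x @ W1 + b1) @ W2 + b2                 # k = 2 outputs

def linearised_predictive(x, theta_map, Sigma):
    """Gaussian predictive from the local linearisation at theta_MAP."""
    mean = f(x, theta_map)
    J = jax.jacobian(f, argnums=1)(x, theta_map)           # (k, d); the transpose of J(x) above
    return mean, J @ Sigma @ J.T                           # (k, k) predictive covariance

theta_map = 0.3 * jax.random.normal(jax.random.PRNGKey(1), (14,))  # pretend this is a trained net
Sigma = 0.01 * jnp.eye(14)                                 # pretend this came from a Laplace fit
mean, cov = linearised_predictive(jnp.array([0.5, -1.0, 2.0]), theta_map, Sigma)
```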

Recent work ties many of these ideas together; see, e.g., the Laplace Redux paper (Daxberger et al. 2021).

Learnable Laplace approximations

Agustinus Kristiadi and team have created various methods for low-overhead neural uncertainty quantification via Laplace approximation that have greater flexibility for adaptively choosing the type and manner of approximation. See, e.g. Painless Uncertainty for Deep Learning and their papers (Kristiadi, Hein, and Hennig 2020, 2021).

Kristiadi, Hein, and Hennig (2021) generalise this to learnable uncertainty, allowing the approximation, for example, to reflect greater uncertainty for datapoints far from the training distribution. They define an augmented Learnable Uncertainty under Laplace Approximations (LULA) network \(\tilde{f}\) with extra parameters, \(\tilde{\theta}=(\theta_{\mathrm{MAP}}, \hat{\theta}).\)

Let \(f: \mathbb{R}^{n} \times \mathbb{R}^{d} \rightarrow \mathbb{R}^{k}\) be an \(L\)-layer neural network with MAP-trained parameters \(\theta_{\text {MAP }}\) and let \(\widetilde{f}: \mathbb{R}^{n} \times \mathbb{R}^{\widetilde{d}} \rightarrow \mathbb{R}^{k}\) along with \(\widetilde{\theta}_{\text {MAP }}\) be obtained by adding LULA units. Let \(q(\widetilde{\theta}):=\mathcal{N}\left(\tilde{\theta}_{\mathrm{MAP}}, \widetilde{\Sigma}\right)\) be the Laplace-approximated posterior and \(p\left(y \mid x, \mathcal{D} ; \widetilde{\theta}_{\mathrm{MAP}}\right)\) be the (approximate) predictive distribution under the LA. Furthermore, let us denote the dataset sampled i.i.d. from the data distribution as \(\mathcal{D}_{\text {in }}\) and that from some outlier distribution as \(\mathcal{D}_{\text {out }}\), and let \(H\) be the entropy functional. We construct the following loss function to induce high uncertainty on outliers while maintaining high confidence over the data (inliers): \[ \begin{array}{rl} \mathcal{L}_{\text {LULA }}\left(\widetilde{\theta}_{\text {MAP }}\right)&:=\frac{1}{\left|\mathcal{D}_{\text {in }}\right|} \sum_{x_{\text {in }} \in \mathcal{D}_{\text {in }}} H\left[p\left(y \mid x_{\text {in }}, \mathcal{D} ; \widetilde{\theta}_{\text {MAP }}\right)\right] \\ &-\frac{1}{\left|\mathcal{D}_{\text {out }}\right|} \sum_{x_{\text {out }} \in \mathcal{D}_{\text {out }}} H\left[p\left(y \mid x_{\text {out }}, \mathcal{D} ; \widetilde{\theta}_{\text {MAP }}\right)\right] \end{array} \] and minimize it w.r.t. the free parameters \(\hat{\theta}\).

I am assuming that by the entropy functional they mean the entropy of the Gaussian predictive, \[ H(\mathcal{N}(\mu, \Sigma)) = {\frac {1}{2}}\ln \left((2\pi \mathrm {e} )^{k}\det \left({\boldsymbol {\Sigma }}\right)\right), \] but this involves a determinant of the predictive covariance, which could get expensive if the output dimension \(k\) is large. Or possibly they mean some general entropy with respect to some density \(p\), \[H(p)=\mathbb{E}_{p}\left[-\log p( x)\right],\] which I suppose one could estimate as \[H(p)\approx\frac{1}{N}\sum_{i=1}^N \left[-\log p(x_i)\right]\] without taking the Gaussian approximation at this step, if we could find the density, and assuming the \(x_i\) were drawn from it.
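For what it is worth, here is how I would sketch both readings, and the resulting inlier-minus-outlier objective, on top of the linearised predictive sketched earlier; the function names and toy choices are mine, not the paper's.

```python
# Sketch: Gaussian predictive entropy, a Monte Carlo entropy estimate, and a
# LULA-style loss built from the former. `linearised_predictive` is the earlier sketch.
import jax
import jax.numpy as jnp

def gaussian_entropy(cov):
    """H(N(mu, Sigma)) = 0.5 * log((2 pi e)^k det(Sigma)), via a stable slogdet."""
    k = cov.shape[0]
    _, logdet = jnp.linalg.slogdet(cov)
    return 0.5 * (k * jnp.log(2 * jnp.pi * jnp.e) + logdet)

def mc_entropy(log_density, samples):
    """Monte Carlo estimate H(p) ~ -(1/N) sum_i log p(x_i), for x_i drawn from p."""
    return -jnp.mean(jax.vmap(log_density)(samples))

def lula_style_loss(x_in, x_out, theta_map, Sigma):
    """Mean predictive entropy on inliers minus mean predictive entropy on outliers."""
    def ent(x):
        # In LULA proper, Sigma would depend on the extra LULA parameters being optimised.
        _, cov = linearised_predictive(x, theta_map, Sigma)
        return gaussian_entropy(cov)
    return jnp.mean(jax.vmap(ent)(x_in)) - jnp.mean(jax.vmap(ent)(x_out))
```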

The result is a slightly weird hybrid fitting procedure that requires two loss functions and which feels a little ad hoc, but maybe it works?

By stochastic weight averaging

A Bayesian extension of the gradient-descent trick Stochastic Weight Averaging: the SWA-Gaussian (SWAG) family fits a Gaussian to the SGD weight iterates (Izmailov et al. 2018, 2020; Maddox et al. 2019; Wilson and Izmailov 2020).
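A minimal sketch of the diagonal version of that idea (running moments of the weight iterates, then sample); the flattened-weights bookkeeping is assumed rather than shown, and nothing here is the actual SWAG reference implementation.

```python
# Sketch of the SWA-Gaussian (SWAG) idea with a diagonal covariance: fit a Gaussian
# to the flattened weight vectors collected along the SGD trajectory.
import jax
import jax.numpy as jnp

def swag_diagonal(weight_iterates):
    """weight_iterates: (T, d) array of flattened weights, one row per collected epoch."""
    mean = jnp.mean(weight_iterates, axis=0)
    second_moment = jnp.mean(weight_iterates ** 2, axis=0)
    var = jnp.maximum(second_moment - mean ** 2, 1e-8)     # diagonal covariance estimate
    return mean, var

def sample_weights(key, mean, var, n_samples=10):
    """Draw weight vectors from N(mean, diag(var)) for Bayesian model averaging."""
    eps = jax.random.normal(key, (n_samples, mean.shape[0]))
    return mean + eps * jnp.sqrt(var)
```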

For model selection

We can estimate the marginal likelihood as a function of hyperparameters by Laplace approximation, apparently, and thereby select models or tune hyperparameters. See Immer et al. (2021).
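As I understand it, the basic trick is the classic Laplace expansion of the log evidence about the mode, \[ \log p(\mathcal{D}) \approx \log p\left(\mathcal{D} \mid \theta_{\mathrm{MAP}}\right) + \log p\left(\theta_{\mathrm{MAP}}\right) + \frac{d}{2} \log 2\pi - \frac{1}{2} \log \det H, \] where \(H := -\left.\nabla_{\theta}^{2} \log p(\mathcal{D}, \theta)\right|_{\theta_{\mathrm{MAP}}}\). Immer et al. (2021) make this workable for deep networks by swapping in GGN/Kronecker-factored curvature for \(H\) and differentiating the resulting estimate with respect to the hyperparameters.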

In function spaces

Where the Laplace approximations are Gaussian Processes over some index space. TBC. See Piterbarg and Fatalov (1995); Wacker (2017); Alexanderian et al. (2016); Alexanderian (2021) and possibly Solin and Särkkä (2020) and possibly the INLA stuff.

INLA

Integrated nested Laplace approximation (Ingebrigtsen, Lindgren, and Steinsland 2014; Lindgren and Rue 2015; Rue et al. 2016). TBC.

Generalized Gauss-Newton and linearization

The generalized Gauss-Newton approximation (GGN) (Martens and Grosse 2015) replaces an expensive second-order derivative by a product of first-order derivatives. As far as I can tell, Kronecker-factored Approximate Curvature (K-FAC) is not a synonym but a further approximation, factoring the GGN/Fisher blocks into Kronecker products to make them tractable.
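In symbols, for a loss \(\ell(f(x;\theta), y)\) the GGN keeps only the \(J^{\top}\left(\nabla^{2}_{f}\ell\right)J\) term of the Hessian. A throwaway JAX sketch of that sum over the data; the network function \(f\) and the loss are placeholders I pass in, not anything from the cited papers.

```python
# Sketch of the generalized Gauss-Newton (GGN) curvature: sum_i J_i^T H_out,i J_i,
# using only first derivatives of the network plus the small (k x k) Hessian of the
# loss with respect to the network outputs.
import jax
import jax.numpy as jnp

def ggn_matrix(f, loss, theta, xs, ys):
    """f(x, theta) -> k outputs; loss(out, y) -> scalar; xs, ys iterate over the data."""
    d = theta.shape[0]
    H = jnp.zeros((d, d))
    for x, y in zip(xs, ys):
        out = f(x, theta)
        J = jax.jacobian(f, argnums=1)(x, theta)             # (k, d) output Jacobian
        H_out = jax.hessian(lambda o: loss(o, y))(out)       # (k, k) loss Hessian
        H = H + J.T @ H_out @ J                              # accumulate GGN contribution
    return H

# Example loss: squared error over the k outputs (its output-Hessian is the identity).
squared_error = lambda out, y: 0.5 * jnp.sum((out - y) ** 2)
```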

Various papers analyze this (Foong et al. 2019; Immer, Korzepa, and Bauer 2021; Martens and Grosse 2015; Ritter, Botev, and Barber 2018).

Laplace in inverse problems

See Bayes inverse problems.

References

Alexanderian, Alen. 2021. “Optimal Experimental Design for Infinite-Dimensional Bayesian Inverse Problems Governed by PDEs: A Review.” arXiv:2005.12998 [Math], January.
Alexanderian, Alen, Noemi Petra, Georg Stadler, and Omar Ghattas. 2016. “A Fast and Scalable Method for A-Optimal Design of Experiments for Infinite-Dimensional Bayesian Nonlinear Inverse Problems.” SIAM Journal on Scientific Computing 38 (1): A243–72.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Information Science and Statistics. New York: Springer.
Breslow, N. E., and D. G. Clayton. 1993. “Approximate Inference in Generalized Linear Mixed Models.” Journal of the American Statistical Association 88 (421): 9–25.
Daxberger, Erik, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Matthias Bauer, and Philipp Hennig. 2021. “Laplace Redux – Effortless Bayesian Deep Learning.” arXiv:2106.14806 [Cs, Stat].
Flaxman, Seth, Andrew Gordon Wilson, Daniel B Neill, Hannes Nickisch, and Alexander J Smola. 2015. “Fast Kronecker Inference in Gaussian Processes with Non-Gaussian Likelihoods.” In, 10.
Foong, Andrew Y. K., Yingzhen Li, José Miguel Hernández-Lobato, and Richard E. Turner. 2019. “‘In-Between’ Uncertainty in Bayesian Neural Networks.” arXiv:1906.11537 [Cs, Stat], June.
Gorad, Ajinkya, Zheng Zhao, and Simo Särkkä. 2020. “Parameter Estimation in Non-Linear State-Space Models by Automatic Differentiation of Non-Linear Kalman Filters.” In, 6.
Huggins, Jonathan H., Trevor Campbell, Mikołaj Kasprzak, and Tamara Broderick. 2018. “Practical Bounds on the Error of Bayesian Posterior Approximations: A Nonasymptotic Approach.” arXiv:1809.09505 [Cs, Math, Stat], September.
Immer, Alexander, Matthias Bauer, Vincent Fortuin, Gunnar Rätsch, and Mohammad Emtiyaz Khan. 2021. “Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning.” arXiv:2104.04975 [Cs, Stat], June.
Immer, Alexander, Maciej Korzepa, and Matthias Bauer. 2021. “Improving Predictions of Bayesian Neural Nets via Local Linearization.” In International Conference on Artificial Intelligence and Statistics, 703–11. PMLR.
Ingebrigtsen, Rikke, Finn Lindgren, and Ingelin Steinsland. 2014. “Spatial Models with Explanatory Variables in the Dependence Structure.” Spatial Statistics, Spatial Statistics Miami, 8 (May): 20–38.
Izmailov, Pavel, Wesley J. Maddox, Polina Kirichenko, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2020. “Subspace Inference for Bayesian Deep Learning.” In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, 1169–79. PMLR.
Izmailov, Pavel, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2018. “Averaging Weights Leads to Wider Optima and Better Generalization,” March.
Khan, Mohammad Emtiyaz, Alexander Immer, Ehsan Abedi, and Maciej Korzepa. 2020. “Approximate Inference Turns Deep Networks into Gaussian Processes.” arXiv:1906.01930 [Cs, Stat], July.
Kristiadi, Agustinus, Matthias Hein, and Philipp Hennig. 2020. “Being Bayesian, Even Just a Bit, Fixes Overconfidence in ReLU Networks.” In ICML 2020.
———. 2021. “Learnable Uncertainty Under Laplace Approximations.” In Uncertainty in Artificial Intelligence.
Lindgren, Finn, and Håvard Rue. 2015. “Bayesian Spatial Modelling with R-INLA.” Journal of Statistical Software 63 (i19): 1–25.
Long, Quan, Marco Scavino, Raúl Tempone, and Suojin Wang. 2013. “Fast Estimation of Expected Information Gains for Bayesian Experimental Designs Based on Laplace Approximations.” Computer Methods in Applied Mechanics and Engineering 259 (June): 24–39.
MacKay, David J. C. 2002. Information Theory, Inference & Learning Algorithms. Cambridge University Press.
MacKay, David J. C. 1992. “A Practical Bayesian Framework for Backpropagation Networks.” Neural Computation 4 (3): 448–72.
Maddox, Wesley, Timur Garipov, Pavel Izmailov, Dmitry Vetrov, and Andrew Gordon Wilson. 2019. “A Simple Baseline for Bayesian Uncertainty in Deep Learning,” February.
Martens, James, and Roger Grosse. 2015. “Optimizing Neural Networks with Kronecker-Factored Approximate Curvature.” In Proceedings of the 32nd International Conference on Machine Learning, 2408–17. PMLR.
Opitz, Thomas, Raphaël Huser, Haakon Bakka, and Håvard Rue. 2018. “INLA Goes Extreme: Bayesian Tail Regression for the Estimation of High Spatio-Temporal Quantiles.” Extremes 21 (3): 441–62.
Petersen, Kaare Brandt, and Michael Syskind Pedersen. 2012. “The Matrix Cookbook.”
Piterbarg, V. I., and V. R. Fatalov. 1995. “The Laplace Method for Probability Measures in Banach Spaces.” Russian Mathematical Surveys 50 (6): 1151.
Ritter, Hippolyt, Aleksandar Botev, and David Barber. 2018. “A Scalable Laplace Approximation for Neural Networks.” In.
Rue, Håvard, Andrea Riebler, Sigrunn H. Sørbye, Janine B. Illian, Daniel P. Simpson, and Finn K. Lindgren. 2016. “Bayesian Computing with INLA: A Review.” arXiv:1604.00860 [Stat], September.
Saumard, Adrien, and Jon A. Wellner. 2014. “Log-Concavity and Strong Log-Concavity: A Review.” arXiv:1404.5886 [Math, Stat], April.
Snoek, Jasper, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md Mostofa Ali Patwary, Prabhat, and Ryan P. Adams. 2015. “Scalable Bayesian Optimization Using Deep Neural Networks.” arXiv:1502.05700 [Stat], July.
Solin, Arno, and Simo Särkkä. 2020. “Hilbert Space Methods for Reduced-Rank Gaussian Process Regression.” Statistics and Computing 30 (2): 419–46.
Tang, Yanbo, and Nancy Reid. 2021. “Laplace and Saddlepoint Approximations in High Dimensions.” arXiv:2107.10885 [Math, Stat], July.
Wacker, Philipp. 2017. “Laplace’s Method in Bayesian Inverse Problems.” arXiv:1701.07989 [Math], April.
Wilson, Andrew Gordon, and Pavel Izmailov. 2020. “Bayesian Deep Learning and a Probabilistic Perspective of Generalization,” February.
