# Probabilistic neural nets

Bayesian and other probabilistic inference in overparameterized ML

January 12, 2017 — April 27, 2023

Inferring densities and distributions in a massively parameterised deep learning setting, in a Bayesian manner. Probabilistic networks are more general than Bayesian ones.

Jospin et al. (2022) is a modern high-speed intro and summary of many approaches.

Radford Neal’s thesis is a foundational Bayesian use of neural networks in the wide NN and MCMC sampling settings. Diederik P. Kingma’s thesis is a blockbuster in the more recent variational tradition.

I found Alex Graves’ poster for his paper on perhaps the simplest weight-uncertainty scheme for recurrent nets (diagonal Gaussian weight uncertainty) elucidating. (There is a third-party quick-and-dirty implementation.)
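A minimal numpy sketch of that diagonal-Gaussian idea (all names and shapes are illustrative, not Graves’ actual parameterisation): each weight gets a variational mean and log-standard-deviation, weights are sampled by reparameterisation, and the KL term against a standard normal prior comes in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy layer: each weight has a mean and a log-std (diagonal Gaussian posterior).
mu = rng.normal(size=(3, 2))        # variational means
log_sigma = np.full((3, 2), -1.0)   # variational log standard deviations

def sample_weights(mu, log_sigma, rng):
    """Reparameterised draw: W = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(log_sigma) * eps

def kl_to_standard_normal(mu, log_sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over all weights."""
    sigma2 = np.exp(2 * log_sigma)
    return 0.5 * np.sum(sigma2 + mu**2 - 1.0 - 2 * log_sigma)

W = sample_weights(mu, log_sigma, rng)
print(W.shape, kl_to_standard_normal(mu, log_sigma))
```

Training then minimises negative expected log-likelihood plus this KL, i.e. the usual variational free energy.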

One could refer to the 2019 NeurIPS Bayesian deep learning workshop site, which stakes out some more modern positioning.

Generative methods are useful, e.g. the variational autoencoder and the affiliated reparameterization trick. Likelihood-free methods seem to be in the air too.
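The reparameterization trick in a few lines of numpy (a toy illustration, not the VAE itself): writing $z = \mu + \sigma\epsilon$ moves the randomness into $\epsilon$, so gradients flow through the sampling step. Here the derivative is taken by hand for a target whose true gradient we know.

```python
import numpy as np

rng = np.random.default_rng(1)

# Estimate d/dmu E_{z ~ N(mu, sigma^2)}[z^2] via the reparameterisation
# z = mu + sigma * eps, eps ~ N(0, 1).
mu, sigma, n = 0.7, 0.5, 100_000
eps = rng.normal(size=n)
z = mu + sigma * eps

# d(z^2)/dmu = 2 z * dz/dmu = 2 z, since dz/dmu = 1.
grad_est = np.mean(2 * z)
print(grad_est)  # analytic answer is 2 * mu = 1.4
```

In an autodiff framework the same trick means the sampling node is differentiable, which is what makes VAE-style training work.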

We are free to consider classic neural network inference as sort-of a special case of Bayes inference. Specifically, we interpret the loss function $$\mathcal{L}$$ of a net $$f:\mathbb{R}^n\times\mathbb{R}^d\to\mathbb{R}^k$$ in the likelihood setting

$$\begin{aligned}
\mathcal{L}(\theta)
&:=-\sum_{i=1}^{m} \log p\left(y_{i} \mid f\left(x_{i} ; \theta\right)\right)-\log p(\theta) \\
&=-\log p(\theta \mid \mathcal{D})+\text{const}.
\end{aligned}$$

Obviously a few things are different from the point-estimate case: the parameter vector $$\theta$$ is not interpretable, so what do posterior distributions over it even mean? What are sensible priors? Choosing priors over by-design-uninterpretable parameters such as NN weights is a fraught business, in ways we will mostly ignore for now. Usually the default prior is something like $p(\theta)=\mathcal{N}\left(0, \lambda^{-1} I\right)$, for want of a better idea. This ends up being equivalent to “weight decay” regularisation, in the sense that Bayesian priors and regularisation penalties often are.
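To see that equivalence concretely (a toy sketch; `lam` and `theta` are made-up numbers): the negative log-density of $\mathcal{N}(0,\lambda^{-1}I)$ is, up to an additive constant, exactly the L2 penalty, and its gradient is the weight-decay term.

```python
import numpy as np

lam = 0.1
theta = np.array([1.0, -2.0, 0.5])

# -log N(theta; 0, lam^{-1} I) = (lam/2) * ||theta||^2 + const
neg_log_prior = 0.5 * lam * np.sum(theta**2)

# Its gradient is the familiar weight-decay term added to the loss gradient.
weight_decay_grad = lam * theta

print(neg_log_prior, weight_decay_grad)
```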

With that basis we could do the usual stuff of Bayes inference, such as computing the posterior predictive $p(y \mid x, \mathcal{D})=\int p(y \mid f(x ; \theta)) p(\theta \mid \mathcal{D})\, d \theta.$ Usually this posterior turns out to be intractable in the very high-dimensional parameter spaces of NNs, so we choose something simpler. We could summarise our posterior update by a simple maximum a posteriori estimate, $\theta_{\mathrm{MAP}}:=\operatorname{arg\,min}_{\theta} \mathcal{L}(\theta).$ In that case we have recovered the classic training of non-Bayes nets, with some ad hoc regularisation which we claim was secretly a prior all along. But if we stop there, we have no notion of predictive uncertainty.
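A sketch of the Monte Carlo version of that predictive integral, assuming we somehow already have approximate posterior samples of $\theta$ (from VI, MCMC, an ensemble, …). Everything here is synthetic: a linear toy model stands in for the network.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x, theta):
    """Hypothetical tiny 'network': a linear model standing in for f(x; theta)."""
    return x @ theta

# Pretend these are S = 200 approximate posterior samples of theta.
theta_samples = rng.normal(loc=[1.0, -0.5], scale=0.1, size=(200, 2))

x = np.array([2.0, 1.0])
preds = np.array([f(x, th) for th in theta_samples])

# Monte Carlo approximation of the posterior predictive mean and its spread.
pred_mean, pred_std = preds.mean(), preds.std()
print(pred_mean, pred_std)
```

The spread of `preds` is the predictive uncertainty that the bare MAP estimate throws away.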

Usually the model will possess many optima, which should lead us to suspect that we have not found a good global one. And how do we maximise model evidence here, in any case?

Somewhere between the full belt-and-braces Bayes approach and the MAP point estimate there are various approximations to Bayes inference we might try. What follows is a non-exhaustive smörgåsbord of options to do probabilistic inference in neural nets with different trade-offs.

🏗 To discuss: so many options for predictive uncertainty, but fewer for inverse uncertainty.

## 1 Natural Posterior Network

borchero/natural-posterior-network : some kind of reparameterization uncertainty?

## 2 MC sampling of weights by low-rank Matheron updates

This uses GP Matheron updates. It needs a shorter name, but looks cool. The idea is that we keep the weights random, but then create a sparse representation of the weights.

• Mean-field variational inference (MFVI): variational inference with fully factorised Gaussian (FFG) approximation.
• Variational inference with full-covariance Gaussian approximation (for each layer).
• Variational inference with inducing weights: each layer is augmented with a small matrix of inducing weights, then MFVI is performed in the inducing-weight space.
• Ensemble in inducing weight space: same augmentation as above, but with ensembles in the inducing weight space.
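A very loose sketch of the inducing-weight construction (the projection maps `P` and `Q` and all shapes are my own illustrative choices, not the parameterisation of the paper): the uncertainty lives in a small inducing matrix, and the full layer weights are a deterministic low-rank function of it.

```python
import numpy as np

rng = np.random.default_rng(3)

# Full layer is 64x32; the stochastic "inducing" matrix U is only 4x4.
d_out, d_in, m = 64, 32, 4
P = rng.normal(size=(d_out, m)) / np.sqrt(m)  # fixed up-projection (illustrative)
Q = rng.normal(size=(m, d_in)) / np.sqrt(m)   # fixed up-projection (illustrative)

def sample_layer_weights(u_mean, u_logstd, rng):
    """Sample inducing weights U (mean-field Gaussian), map up to full W."""
    U = u_mean + np.exp(u_logstd) * rng.normal(size=u_mean.shape)
    return P @ U @ Q  # low-rank: only m*m random variables

W = sample_layer_weights(np.zeros((m, m)), np.full((m, m), -2.0), rng)
print(W.shape)  # (64, 32), driven by only 16 stochastic parameters
```

The point of the construction is that VI (or an ensemble) over the small inducing space is vastly cheaper than over the full weight matrix.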

## 5 Sampling via Monte Carlo

TBD. For now, if the number of parameters is smallish, see Hamiltonian Monte Carlo.
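For orientation, here is a minimal HMC sampler on a one-dimensional toy target; the leapfrog-plus-Metropolis structure is the standard one, but nothing in this sketch addresses the scaling problems of real NN posteriors.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy target: standard normal, so -log p(q) = q^2 / 2.
def neg_log_p(q): return 0.5 * q**2
def grad(q): return q

def hmc_step(q, eps=0.2, L=10):
    p = rng.normal()                     # resample momentum
    q_new, p_new = q, p
    p_new -= 0.5 * eps * grad(q_new)     # leapfrog: initial half step
    for _ in range(L - 1):
        q_new += eps * p_new
        p_new -= eps * grad(q_new)
    q_new += eps * p_new
    p_new -= 0.5 * eps * grad(q_new)     # final half step
    # Metropolis accept/reject on the change in total energy.
    dH = (neg_log_p(q_new) + 0.5 * p_new**2) - (neg_log_p(q) + 0.5 * p**2)
    return q_new if np.log(rng.uniform()) < -dH else q

q, samples = 0.0, []
for _ in range(5000):
    q = hmc_step(q)
    samples.append(q)
print(np.mean(samples), np.std(samples))  # should be near 0 and 1
```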

## 7 Via random projections

I do not have a single paper about this, but I have seen random projection pop up as a piece of the puzzle in other methods. TBC.

## 8 In Gaussian process regression

See kernel learning.

See wide NN.

## 11 Via NTK

How does this work? He, Lakshminarayanan, and Teh (2020).

## 12 Ensemble methods

Deep learning has its own variants of model averaging and bagging: neural ensembles. Yarin Gal’s PhD thesis (Gal 2016) summarizes some implicit approximate approaches (e.g. the Bayesian interpretation of dropout), although dropout as he frames it has since been contested as a means of inference.
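A toy version of the ensemble recipe (closed-form least-squares “members” standing in for trained networks; the bootstrap resampling is one of several ways to diversify members): fit $M$ models independently, then use the spread of their predictions as a crude predictive uncertainty.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic regression data.
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + 0.1 * rng.normal(size=50)

def fit(X, y):
    """One 'ensemble member': here just ordinary least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

M = 10
members = []
for _ in range(M):
    idx = rng.integers(0, len(y), size=len(y))  # bootstrap resample
    members.append(fit(X[idx], y[idx]))

x_star = np.array([1.0, 1.0, 1.0])
preds = np.array([x_star @ w for w in members])
print(preds.mean(), preds.std())  # ensemble mean and disagreement
```

In the deep-learning version, member diversity usually comes from random initialisation and SGD noise rather than explicit bootstrapping.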

## 13 Neural GLM

I think this has a sparse Bayes flavour. D. Tran et al. (2019); it seems to randomise over input parameters?

## 14 Practicalities

The computational toolsets for “neural” probabilistic programming and vanilla probabilistic programming are converging. See the tool listing under probabilistic programming.

See MCMC by SGD.

## 16 Khan and Rue’s Bayes Learning Rule

> We show that many machine-learning algorithms are specific instances of a single algorithm called the Bayesian learning rule. The rule, derived from Bayesian principles, yields a wide-range of algorithms from fields such as optimization, deep learning, and graphical models. This includes classical algorithms such as ridge regression, Newton’s method, and Kalman filter, as well as modern deep-learning algorithms such as stochastic-gradient descent, RMSprop, and Dropout. The key idea in deriving such algorithms is to approximate the posterior using candidate distributions estimated by using natural gradients. Different candidate distributions result in different algorithms and further approximations to natural gradients give rise to variants of those algorithms. Our work not only unifies, generalizes, and improves existing algorithms, but also helps us design new ones.

## 17 Incoming

Dustin Tran’s uncertainty layers [1812.03973] Bayesian Layers: A Module for Neural Network Uncertainty:

> In our work, we extend layers to capture “distributions over functions”, which we describe as a layer with uncertainty about some state in its computation — be it uncertainty in the weights, pre-activation units, activations, or the entire function. Each sample from the distribution instantiates a different function, e.g., a layer with a different weight configuration.…
>
> While the framework we laid out so far tightly integrates deep Bayesian modelling into existing ecosystems, we have deliberately limited our scope. In particular, our layers tie the model specification to the inference algorithm (typically, variational inference). Bayesian Layers’ core assumption is the modularization of inference per layer. This makes inference procedures which depend on the full parameter space, such as Markov chain Monte Carlo, difficult to fit within the framework.

## 18 References

Abbasnejad, Dick, and Hengel. 2016. In Advances in Neural Information Processing Systems 29.
Alexanderian. 2021. arXiv:2005.12998 [Math].
Alexanderian, Petra, Stadler, et al. 2016. SIAM Journal on Scientific Computing.
Alexos, Boyd, and Mandt. 2022. In Proceedings of the 39th International Conference on Machine Learning.
Alquier. 2021. arXiv:2110.11216 [Cs, Math, Stat].
———. 2023.
Archer, Park, Buesing, et al. 2015. arXiv:1511.07367 [Stat].
Bao, Ye, Zang, et al. 2020. Inverse Problems.
Baydin, Shao, Bhimji, et al. 2019. In arXiv:1907.03382 [Cs, Stat].
Bazzani, Torresani, and Larochelle. 2017. “Recurrent Mixture Density Network for Spatiotemporal Visual Attention.”
Bishop, Christopher. 1994. Microsoft Research.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Information Science and Statistics.
Blundell, Cornebise, Kavukcuoglu, et al. 2015. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37. ICML’15.
Bora, Jalal, Price, et al. 2017. In International Conference on Machine Learning.
Breslow, and Clayton. 1993. Journal of the American Statistical Association.
Bui, Ravi, and Ramavajjala. 2017. arXiv:1703.04818 [Cs].
Chada, and Tong. 2022. Mathematics of Computation.
Charpentier, Borchert, Zügner, et al. 2022. arXiv:2105.04471 [Cs, Stat].
Chen, Wilson Ye, Mackey, Gorham, et al. 2018. “Stein Points.” In Proceedings of the 35th International Conference on Machine Learning.
Chen, Tian Qi, Rubanova, Bettencourt, et al. 2018. In Advances in Neural Information Processing Systems 31.
Chu, Jin, Zhu, et al. 2022. In Proceedings of the 39th International Conference on Machine Learning.
Cutajar, Bonilla, Michiardi, et al. 2017. In PMLR.
Damianou, and Lawrence. 2013. In Artificial Intelligence and Statistics.
Dandekar, Chung, Dixit, et al. 2021. arXiv:2012.07244 [Cs].
Daxberger, Kristiadi, Immer, et al. 2021. In arXiv:2106.14806 [Cs, Stat].
de Castro, and Dorigo. 2019. Computer Physics Communications.
Dezfouli, and Bonilla. 2015. In Advances in Neural Information Processing Systems 28. NIPS’15.
Doerr, Daniel, Schiegg, et al. 2018. arXiv:1801.10395 [Stat].
Domingos. 2020. arXiv:2012.00152 [Cs, Stat].
Dunlop, Girolami, Stuart, et al. 2018. Journal of Machine Learning Research.
Dupont, Doucet, and Teh. 2019. arXiv:1904.01681 [Cs, Stat].
Dusenberry, Jerfel, Wen, et al. 2020. In Proceedings of the 37th International Conference on Machine Learning.
Dutordoir, Hensman, van der Wilk, et al. 2021. In arXiv:2105.04504 [Cs, Stat].
Dziugaite, and Roy. 2017. arXiv:1703.11008 [Cs].
Eleftheriadis, Nicholson, Deisenroth, et al. 2017. In Advances in Neural Information Processing Systems 30.
Fabius, and van Amersfoort. 2014. In Proceedings of ICLR.
Figurnov, Mohamed, and Mnih. 2018. In Advances in Neural Information Processing Systems 31.
Flaxman, Wilson, Neill, et al. 2015. “Fast Kronecker Inference in Gaussian Processes with Non-Gaussian Likelihoods.” In.
Flunkert, Salinas, and Gasthaus. 2017. arXiv:1704.04110 [Cs, Stat].
Foong, Li, Hernández-Lobato, et al. 2019. arXiv:1906.11537 [Cs, Stat].
Fortuin. 2022. International Statistical Review.
Gal. 2015. “Rapid Prototyping of Probabilistic Models: Emerging Challenges in Variational Inference.” In Advances in Approximate Bayesian Inference Workshop, NIPS.
———. 2016. “Uncertainty in Deep Learning.”
Gal, and Ghahramani. 2015a. “On Modern Deep Learning and Variational Inference.” In Advances in Approximate Bayesian Inference Workshop, NIPS.
———. 2015b. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16).
———. 2016a. In arXiv:1512.05287 [Stat].
———. 2016b. In 4th International Conference on Learning Representations (ICLR) Workshop Track.
———. 2016c. arXiv:1506.02157 [Stat].
Gal, Hron, and Kendall. 2017. arXiv:1705.07832 [Stat].
Garnelo, Rosenbaum, Maddison, et al. 2018. arXiv:1807.01613 [Cs, Stat].
Garnelo, Schwarz, Rosenbaum, et al. 2018.
Gholami, Keutzer, and Biros. 2019. arXiv:1902.10298 [Cs].
Giryes, Sapiro, and Bronstein. 2016. IEEE Transactions on Signal Processing.
Gorad, Zhao, and Särkkä. 2020. “Parameter Estimation in Non-Linear State-Space Models by Automatic Differentiation of Non-Linear Kalman Filters.” In.
Gourieroux, Monfort, and Renault. 1993. Journal of Applied Econometrics.
Graves. 2011. In Proceedings of the 24th International Conference on Neural Information Processing Systems. NIPS’11.
———. 2013. arXiv:1308.0850 [Cs].
Graves, Mohamed, and Hinton. 2013. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.
Gregor, Danihelka, Graves, et al. 2015. arXiv:1502.04623 [Cs].
Gu, Ghahramani, and Turner. 2015. In Advances in Neural Information Processing Systems 28.
Gu, Levine, Sutskever, et al. 2016. In Proceedings of ICLR.
Guo, Pleiss, Sun, et al. 2017.
Gurevich, and Stuke. 2019.
Guth, Mojahed, and Sapsis. 2023. SSRN Scholarly Paper.
Haber, Lucka, and Ruthotto. 2018. arXiv:1805.08034 [Cs, Math].
Haykin, ed. 2001. Kalman Filtering and Neural Networks. Adaptive and Learning Systems for Signal Processing, Communications, and Control.
He, Lakshminarayanan, and Teh. 2020. In Advances in Neural Information Processing Systems.
Hoffman, and Blei. 2015. In PMLR.
Huggins, Campbell, Kasprzak, et al. 2018. arXiv:1809.09505 [Cs, Math, Stat].
Hu, Yang, Salakhutdinov, et al. 2018. In arXiv:1706.00550 [Cs, Stat].
Immer, Bauer, Fortuin, et al. 2021. In Proceedings of the 38th International Conference on Machine Learning.
Immer, Korzepa, and Bauer. 2021. In International Conference on Artificial Intelligence and Statistics.
Ingebrigtsen, Lindgren, and Steinsland. 2014. Spatial Statistics.
Izmailov, Maddox, Kirichenko, et al. 2020. In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference.
Izmailov, Podoprikhin, Garipov, et al. 2018.
Jospin, Buntine, Boussaid, et al. 2022. arXiv:2007.06823 [Cs, Stat].
Karl, Soelch, Bayer, et al. 2016. In Proceedings of ICLR.
Khan, Immer, Abedi, et al. 2020. arXiv:1906.01930 [Cs, Stat].
Khan, and Lin. 2017. In Artificial Intelligence and Statistics.
Khan, and Rue. 2023.
Kingma, Diederik P. 2017.
Kingma, Diederik P., Salimans, Jozefowicz, et al. 2016. In Advances in Neural Information Processing Systems 29.
Kingma, Diederik P., and Welling. 2014. In ICLR 2014 Conference.
Kovachki, and Stuart. 2019. Inverse Problems.
Krauth, Bonilla, Cutajar, et al. 2016. In UAI17.
Krishnan, Shalit, and Sontag. 2015. arXiv Preprint arXiv:1511.05121.
Kristiadi, Hein, and Hennig. 2020. In ICML 2020.
———. 2021a. Advances in Neural Information Processing Systems.
———. 2021b. In Uncertainty in Artificial Intelligence.
———. 2022. In CoRR.
Larsen, Sønderby, Larochelle, et al. 2015. arXiv:1512.09300 [Cs, Stat].
Le, Baydin, and Wood. 2017. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS). Proceedings of Machine Learning Research.
Lee, Jaehoon, Bahri, Novak, et al. 2018. In ICLR.
Lee, Herbert K. H., Higdon, Bi, et al. 2002. Technometrics.
Le, Igl, Jin, et al. 2017. arXiv Preprint arXiv:1705.10306.
Lindgren, and Rue. 2015. Journal of Statistical Software.
Liu, Qiang, and Wang. 2019. In Advances In Neural Information Processing Systems.
Liu, Xiao, Yeo, and Lu. 2020. Journal of the American Statistical Association.
Lobacheva, Chirkova, and Vetrov. 2017. In Workshop on Learning to Generate Natural Language.
Long, Scavino, Tempone, et al. 2013. Computer Methods in Applied Mechanics and Engineering.
Lorsung. 2021.
Louizos, Shalit, Mooij, et al. 2017. In Advances in Neural Information Processing Systems 30.
Louizos, and Welling. 2016. In arXiv Preprint arXiv:1603.04733.
———. 2017. In PMLR.
Mackay. 1992. Neural Computation.
MacKay. 2002. Information Theory, Inference & Learning Algorithms.
Maddison, Lawson, Tucker, et al. 2017. arXiv Preprint arXiv:1705.09279.
Maddox, Garipov, Izmailov, et al. 2019.
Mandt, Hoffman, and Blei. 2017. JMLR.
Margossian, Vehtari, Simpson, et al. 2020. arXiv:2004.12550 [Stat].
Martens, and Grosse. 2015. In Proceedings of the 32nd International Conference on Machine Learning.
Matthews, Rowland, Hron, et al. 2018. In arXiv:1804.11271 [Cs, Stat].
Matthews, van der Wilk, Nickson, et al. 2016. arXiv:1610.08733 [Stat].
Molchanov, Ashukha, and Vetrov. 2017. In Proceedings of ICML.
Murphy. 2023. Probabilistic Machine Learning: Advanced Topics.
Neal. 1996.
Ngiam, Chen, Koh, et al. 2011. In Proceedings of the 28th International Conference on Machine Learning (ICML-11).
Ngufor, van Houten, Caffo, et al. 2019. Journal of Biomedical Informatics.
Oakley, and Youngman. 2017. Technometrics.
Ober, and Rasmussen. 2019. In.
Opitz, Huser, Bakka, et al. 2018. Extremes.
Ovadia, Fertig, Ren, et al. 2019. In Proceedings of the 33rd International Conference on Neural Information Processing Systems.
Pan, Kuo, Rilee, et al. 2021. arXiv:2111.08239 [Cs, Stat].
Papadopoulos, Edwards, and Murray. 2001. IEEE Transactions on Neural Networks.
Papamakarios, Murray, and Pavlakou. 2017. In Advances in Neural Information Processing Systems 30.
Papamarkou, Skoularidou, Palla, et al. 2024.
Partee, Ringenburg, Robbins, et al. 2019. “Model Parameter Optimization: ML-Guided Trans-Resolution Tuning of Physical Models.” In.
Peluchetti, and Favaro. 2020. In International Conference on Artificial Intelligence and Statistics.
Petersen, and Pedersen. 2012.
Piterbarg, and Fatalov. 1995. Russian Mathematical Surveys.
Psaros, Meng, Zou, et al. 2023. Journal of Computational Physics.
Raissi, Perdikaris, and Karniadakis. 2019. Journal of Computational Physics.
Rasmussen, and Williams. 2006. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning.
Rezende, Danilo Jimenez, and Mohamed. 2015. In International Conference on Machine Learning. ICML’15.
Rezende, Danilo J, Racanière, Higgins, et al. 2019. “Equivariant Hamiltonian Flows.” In Machine Learning and the Physical Sciences Workshop at the 33rd Conference on Neural Information Processing Systems (NeurIPS).
Ritter, Botev, and Barber. 2018. In.
Ritter, and Karaletsos. 2022. Proceedings of Machine Learning and Systems.
Ritter, Kukla, Zhang, et al. 2021. arXiv:2105.14594 [Cs, Stat].
Rue, Riebler, Sørbye, et al. 2016. arXiv:1604.00860 [Stat].
Ruiz, Titsias, and Blei. 2016. In Advances In Neural Information Processing Systems.
Ryder, Golightly, McGough, et al. 2018. arXiv:1802.03335 [Stat].
Sanchez-Gonzalez, Bapst, Battaglia, et al. 2019. “Hamiltonian Graph Networks with ODE Integrators.” In Machine Learning and the Physical Sciences Workshop at the 33rd Conference on Neural Information Processing Systems (NeurIPS).
Saumard, and Wellner. 2014. arXiv:1404.5886 [Math, Stat].
Shi, Sun, and Zhu. 2018. In.
Sigrist, Künsch, and Stahel. 2015. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Simchoni, and Rosset. 2023.
Snoek, Rippel, Swersky, et al. 2015. In Proceedings of the 32nd International Conference on Machine Learning.
Solin, and Särkkä. 2020. Statistics and Computing.
Sun, Zhang, Shi, et al. 2019. In.
Tang, and Reid. 2021. arXiv:2107.10885 [Math, Stat].
Thakur, Lorsung, Yacoby, et al. 2021. arXiv:2006.11695 [Cs, Stat].
Tran, Dustin, Dusenberry, van der Wilk, et al. 2019. “Bayesian Layers: A Module for Neural Network Uncertainty.” Advances in Neural Information Processing Systems.
Tran, Dustin, Hoffman, Saurous, et al. 2017. In ICLR.
Tran, Dustin, Kucukelbir, Dieng, et al. 2016. arXiv:1610.09787 [Cs, Stat].
Tran, Ba-Hien, Rossi, Milios, et al. 2021. In Advances in Neural Information Processing Systems.
Tran, Ba-Hien, Rossi, Milios, et al. 2022. Journal of Machine Learning Research.
van den Berg, Hasenclever, Tomczak, et al. 2018. In UAI18.
Wacker. 2017. arXiv:1701.07989 [Math].
Wainwright, and Jordan. 2005. “A Variational Principle for Graphical Models.” In New Directions in Statistical Signal Processing.
Watson, Lin, Klink, et al. 2020. “Neural Linear Models with Functional Gaussian Process Priors.” In.
Weber, Starc, Mittal, et al. 2018. In NeurIPS Workshop on Bayesian Deep Learning.
Wen, Tran, and Ba. 2020. In ICLR.
Wenzel, Roth, Veeling, et al. 2020. In Proceedings of the 37th International Conference on Machine Learning.
Wilson, and Izmailov. 2020.
Xu, and Darve. 2020. In arXiv:2011.11955 [Cs, Math].
Yang, Li, and Wang. 2021. arXiv:2101.12353 [Cs, Math, Stat].
Zeevi, and Meir. 1997. Neural Networks: The Official Journal of the International Neural Network Society.
Zellner. 1988. The American Statistician.