# Probabilistic neural nets

Bayesian and other probabilistic inference in overparameterized ML

January 12, 2017 — April 27, 2023

Inferring densities and distributions in a massively parameterised deep learning setting, in a Bayesian manner. Probabilistic networks are a more general category than Bayesian ones.

Jospin et al. (2022) is a modern high-speed intro and summary of many approaches.

Radford Neal’s thesis (Neal 1996) is a foundational Bayesian use of neural networks in the wide NN and MCMC sampling settings. Diederik P. Kingma’s thesis is a blockbuster in the more recent variational tradition.

I found Alex Graves’ poster for his paper (Graves 2011) elucidating; it presents about the simplest weight-uncertainty scheme for recurrent nets, a diagonal Gaussian over the weights. (There is a third-party quick-and-dirty implementation.)

See also the site of the 2019 NeurIPS Bayesian deep learning workshop, which introduces some more modern positioning.

Generative methods are useful, e.g. the variational autoencoder and the affiliated reparameterization trick. Likelihood-free methods seem to be in the air too.

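We are free to consider classic neural network inference as a sort-of special case of Bayesian inference. Specifically, we interpret the loss function \(\mathcal{L}\) of a net \(f:\mathbb{R}^n\times\mathbb{R}^d\to\mathbb{R}^k\), taking inputs \(x\in\mathbb{R}^n\) and parameters \(\theta\in\mathbb{R}^d\), in the likelihood setting \[ \begin{aligned} \mathcal{L}(\theta) &:=-\sum_{i=1}^{m} \log p\left(y_{i} \mid f\left(x_{i} ; \theta\right)\right)-\log p(\theta) \\ &=-\log p(\theta \mid \mathcal{D})-\log p(\mathcal{D}), \end{aligned} \] where the log evidence \(\log p(\mathcal{D})\) does not depend on \(\theta\), so \(\mathcal{L}\) is the negative log posterior up to an additive constant.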

Obviously a few things are different from the point-estimate case: the parameter vector \(\theta\) is not interpretable, so what do posterior distributions over it even mean? What are sensible priors? Choosing priors over by-design-uninterpretable parameters such as NN weights is a whole fraught thing in ways we will mostly ignore for now. Usually a prior is by default something like \[ p(\theta)=\mathcal{N}\left(0, \lambda^{-1} I\right) \] for want of a better idea. Under MAP estimation this ends up being equivalent to “weight decay” (\(L_2\)) regularisation, in the sense that Bayesian priors and regularisers often are.
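To spell out that correspondence, here is a short derivation, taking \(\theta\in\mathbb{R}^d\) as in the signature of \(f\) above. The negative log density of the Gaussian prior is \[ -\log p(\theta)=\frac{\lambda}{2}\|\theta\|_2^2-\frac{d}{2}\log\frac{\lambda}{2\pi}, \] and the second term does not depend on \(\theta\). Minimising \(\mathcal{L}(\theta)\) is therefore the same as minimising the negative log-likelihood plus the penalty \(\frac{\lambda}{2}\|\theta\|_2^2\), whose gradient \(\lambda\theta\) shrinks the weights towards zero at each plain gradient-descent step. With adaptive optimisers the penalty and decoupled weight decay are no longer the same thing, so the equivalence should be read as holding for vanilla gradient descent.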

With that basis we could do the usual stuff for Bayes inference, like considering the predictive posterior \[
p(y \mid x, \mathcal{D})=\int p(y \mid f(x ; \theta)) p(\theta \mid \mathcal{D}) d \theta
\] Usually this integral turns out to be intractable in the very-high-dimensional parameter spaces of NNs, so we choose something simpler. We could summarise our posterior update by a simple maximum a posteriori estimate \[
\theta_{\mathrm{MAP}}:=\operatorname{arg min}_{\theta} \mathcal{L}(\theta).
\] In this case we have recovered the classic training of non-Bayes nets with some *ad hoc* regularisation which we claim was secretly a prior. But we have no notion of predictive uncertainty if we stop there.
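To make the contrast between the MAP estimate and the predictive posterior concrete, here is a minimal sketch on a toy model where everything is tractable. It assumes a linear "network" \(f(x;\theta)=x^\top\theta\) with Gaussian noise of known variance, which is my choice for illustration and not something a real NN affords us. All names (`sigma2`, `lam`, `x_new` and so on) are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = x @ w_true + Gaussian noise with known variance sigma2.
n, d = 50, 3
sigma2 = 0.25   # assumed known observation-noise variance
lam = 1.0       # prior precision: p(theta) = N(0, I / lam)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# In this linear-Gaussian model the posterior over theta is Gaussian.
post_prec = X.T @ X / sigma2 + lam * np.eye(d)
post_cov = np.linalg.inv(post_prec)
post_mean = post_cov @ X.T @ y / sigma2

# MAP estimate: the minimiser of the negative log posterior, which here
# is ridge regression with penalty lam * sigma2.
theta_map = np.linalg.solve(X.T @ X + lam * sigma2 * np.eye(d), X.T @ y)

# Monte Carlo approximation of the predictive posterior at a new input:
# draw theta from the posterior, then y from the likelihood.
x_new = np.array([0.5, 0.5, 0.5])
n_draws = 20000
chol = np.linalg.cholesky(post_cov)
thetas = post_mean + rng.normal(size=(n_draws, d)) @ chol.T
y_draws = thetas @ x_new + rng.normal(scale=np.sqrt(sigma2), size=n_draws)
pred_mean_mc = y_draws.mean()
pred_var_mc = y_draws.var()

# Closed-form predictive moments, for comparison.
pred_mean_exact = x_new @ post_mean
pred_var_exact = x_new @ post_cov @ x_new + sigma2
```

The MAP estimate coincides with the posterior mean here only because the posterior is Gaussian. The point of the sketch is that `theta_map` alone gives a prediction but no `pred_var`, whereas the posterior draws give both.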

Usually the model will possess many optima, and this will lead to the suspicion that we have not found a good global one. How do we maximise model evidence here in any case?

Somewhere between the full belt-and-braces Bayes approach and the MAP point estimate there are various approximations to Bayes inference we might try. What follows is a non-exhaustive smörgåsbord of options to do probabilistic inference in neural nets with different trade-offs.

🏗 To discuss: so many options for predictive uncertainty, but fewer for inverse uncertainty.

## 1 Natural Posterior Network

borchero/natural-posterior-network (Charpentier et al. 2022): some kind of reparameterization uncertainty?

## 2 MC sampling of weights by low-rank Matheron updates

This uses GP Matheron updates. It needs a shorter name but looks cool (Ritter et al. 2021). The idea is that we keep the weights random, but then create a sparse representation of the weights.

microsoft/bayesianize: Bayesianize: A Bayesian neural network wrapper in pytorch.

- Mean-field variational inference (MFVI): variational inference with fully factorised Gaussian (FFG) approximation.
- Variational inference with full-covariance Gaussian approximation (for each layer).
- Variational inference with inducing weights: each of the layer is augmented with a small matrix of inducing weights, then MFVI is performed in the inducing weight space.
- Ensemble in inducing weight space: same augmentation as above, but with ensembles in the inducing weight space.
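The first item in that list, a fully factorised Gaussian over the weights, can be sketched in a few lines. This is not the bayesianize API; it is a hypothetical NumPy illustration of the reparameterised sampling and the closed-form KL term that such a method would need, assuming the isotropic prior \(\mathcal{N}(0,\lambda^{-1}I)\) from earlier. The function names and the softplus parameterisation of the standard deviation are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(x):
    # Numerically stable log(1 + exp(x)); keeps the std. dev. positive.
    return np.logaddexp(0.0, x)

def sample_ffg_weights(mu, rho, rng, n_samples):
    """Draw weights from q(w) = N(mu, diag(sigma^2)), sigma = softplus(rho).

    Uses the reparameterisation w = mu + sigma * eps with eps ~ N(0, I).
    """
    sigma = softplus(rho)
    eps = rng.normal(size=(n_samples,) + mu.shape)
    return mu + sigma * eps

def kl_ffg_to_prior(mu, rho, lam):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I / lam) )."""
    sigma2 = softplus(rho) ** 2
    return 0.5 * np.sum(lam * (sigma2 + mu**2) - 1.0 - np.log(lam * sigma2))

# A single linear layer with 4 inputs and 2 outputs (hypothetical sizes).
mu = rng.normal(size=(4, 2))      # variational means
rho = np.full((4, 2), -3.0)       # pre-softplus std. devs, small initially
W = sample_ffg_weights(mu, rho, rng, n_samples=1000)

# Each weight sample defines a different function; push one input through all.
x = np.array([1.0, 0.0, -1.0, 2.0])
outputs = np.einsum("i,sio->so", x, W)   # shape (n_samples, 2)
pred_mean = outputs.mean(axis=0)
pred_std = outputs.std(axis=0)
```

In an actual variational fit, `mu` and `rho` would be optimised to minimise the expected negative log-likelihood under the sampled weights plus `kl_ffg_to_prior`; that optimisation loop is omitted here.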

## 3 Bayes by backprop

See Bayes by backprop.

## 4 Variational autoencoders

## 5 Sampling via Monte Carlo

TBD. For now, if the number of parameters is smallish see Hamiltonian Monte Carlo.

## 6 Laplace approximation

See Laplace approximations, and AlexImmer/Laplace: Laplace approximations for Deep Learning.

## 7 Via random projections

I do not have a single paper about this, but I have seen random projection pop up as a piece of the puzzle in other methods. TBC.

## 8 In Gaussian process regression

See kernel learning.

## 9 Via measure transport

See reparameterization.

## 10 Via infinite-width random nets

See wide NN.

## 11 Via NTK

How does this work? He, Lakshminarayanan, and Teh (2020).

## 12 Ensemble methods

Deep learning has its own variants of model averaging and bagging: Neural ensembles. Yarin Gal’s PhD Thesis (Gal 2016) summarizes some implicit approximate approaches (e.g. the Bayesian interpretation of dropout), although dropout as he frames it has since been contested as a means of inference.

## 13 Neural GLM

I think this has a sparse Bayes flavour. D. Tran et al. (2019); seems to randomise over input params?

## 14 Practicalities

The computational toolsets for “neural” probabilistic programming and vanilla probabilistic programming are converging. See the tool listing under probabilistic programming.

## 15 Stochastic Gradient Descent as MC inference

See MCMC by SGD.

## 16 Khan and Rue’s Bayes Learning Rule

Bayes via natural gradient (Khan and Rue 2023; Zellner 1988).

> We show that many machine-learning algorithms are specific instances of a single algorithm called the Bayesian learning rule. The rule, derived from Bayesian principles, yields a wide-range of algorithms from fields such as optimization, deep learning, and graphical models. This includes classical algorithms such as ridge regression, Newton’s method, and Kalman filter, as well as modern deep-learning algorithms such as stochastic-gradient descent, RMSprop, and Dropout. The key idea in deriving such algorithms is to approximate the posterior using candidate distributions estimated by using natural gradients. Different candidate distributions result in different algorithms and further approximations to natural gradients give rise to variants of those algorithms. Our work not only unifies, generalizes, and improves existing algorithms, but also helps us design new ones.

## 17 Incoming

Dustin Tran’s uncertainty layers [1812.03973] Bayesian Layers: A Module for Neural Network Uncertainty:

> In our work, we extend layers to capture “distributions over functions”, which we describe as a layer with uncertainty about some state in its computation — be it uncertainty in the weights, pre-activation units, activations, or the entire function. Each sample from the distribution instantiates a different function, e.g., a layer with a different weight configuration.…
>
> While the framework we laid out so far tightly integrates deep Bayesian modelling into existing ecosystems, we have deliberately limited our scope. In particular, our layers tie the model specification to the inference algorithm (typically, variational inference). Bayesian Layers’ core assumption is the modularization of inference per layer. This makes inference procedures which depend on the full parameter space, such as Markov chain Monte Carlo, difficult to fit within the framework.

- Bayesian Neural Networks by Duvenaud’s team

## 18 References

*Advances in Neural Information Processing Systems 29*.

*arXiv:2005.12998 [Math]*.

*SIAM Journal on Scientific Computing*.

*Proceedings of the 39th International Conference on Machine Learning*.

*arXiv:2110.11216 [Cs, Math, Stat]*.

*arXiv:1511.07367 [Stat]*.

*Inverse Problems*.

*arXiv:1907.03382 [Cs, Stat]*.

*Microsoft Research*.

*Pattern Recognition and Machine Learning*. Information Science and Statistics.

*Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37*. ICML’15.

*International Conference on Machine Learning*.

*Journal of the American Statistical Association*.

*arXiv:1703.04818 [Cs]*.

*Mathematics of Computation*.

*arXiv:2105.04471 [Cs, Stat]*.

*Proceedings of the 35th International Conference on Machine Learning*.

*Advances in Neural Information Processing Systems 31*.

*Proceedings of the 39th International Conference on Machine Learning*.

*PMLR*.

*Artificial Intelligence and Statistics*.

*arXiv:2012.07244 [Cs]*.

*arXiv:2106.14806 [Cs, Stat]*.

*Computer Physics Communications*.

*Advances in Neural Information Processing Systems 28*. NIPS’15.

*arXiv:1801.10395 [Stat]*.

*arXiv:2012.00152 [Cs, Stat]*.

*Journal of Machine Learning Research*.

*arXiv:1904.01681 [Cs, Stat]*.

*Proceedings of the 37th International Conference on Machine Learning*.

*arXiv:2105.04504 [Cs, Stat]*.

*arXiv:1703.11008 [Cs]*.

*Advances in Neural Information Processing Systems 30*.

*Proceedings of ICLR*.

*Advances in Neural Information Processing Systems 31*.

*arXiv:1704.04110 [Cs, Stat]*.

*arXiv:1906.11537 [Cs, Stat]*.

*International Statistical Review*.

*Advances in Approximate Bayesian Inference Workshop, NIPS*.

*Advances in Approximate Bayesian Inference Workshop, NIPS*.

*Proceedings of the 33rd International Conference on Machine Learning (ICML-16)*.

*arXiv:1512.05287 [Stat]*.

*4th International Conference on Learning Representations (ICLR) Workshop Track*.

*arXiv:1506.02157 [Stat]*.

*arXiv:1705.07832 [Stat]*.

*arXiv:1807.01613 [Cs, Stat]*.

*arXiv:1902.10298 [Cs]*.

*IEEE Transactions on Signal Processing*.

*Journal of Applied Econometrics*.

*Proceedings of the 24th International Conference on Neural Information Processing Systems*. NIPS’11.

*arXiv:1308.0850 [Cs]*.

*2013 IEEE International Conference on Acoustics, Speech and Signal Processing*.

*arXiv:1502.04623 [Cs]*.

*Advances in Neural Information Processing Systems 28*.

*Proceedings of ICLR*.

*arXiv:1805.08034 [Cs, Math]*.

*Kalman Filtering and Neural Networks*. Adaptive and Learning Systems for Signal Processing, Communications, and Control.

*Advances in Neural Information Processing Systems*.

*PMLR*.

*arXiv:1809.09505 [Cs, Math, Stat]*.

*arXiv:1706.00550 [Cs, Stat]*.

*Proceedings of the 38th International Conference on Machine Learning*.

*International Conference on Artificial Intelligence and Statistics*.

*Spatial Statistics*, Spatial Statistics Miami.

*Proceedings of The 35th Uncertainty in Artificial Intelligence Conference*.

*arXiv:2007.06823 [Cs, Stat]*.

*Proceedings of ICLR*.

*arXiv:1906.01930 [Cs, Stat]*.

*Artificial Intelligence and Statistics*.

*Advances in Neural Information Processing Systems 29*.

*ICLR 2014 Conference*.

*Inverse Problems*.

*UAI17*.

*arXiv Preprint arXiv:1511.05121*.

*ICML 2020*.

*Advances in Neural Information Processing Systems*.

*Uncertainty in Artificial Intelligence*.

*CoRR*.

*arXiv:1512.09300 [Cs, Stat]*.

*Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS)*. Proceedings of Machine Learning Research.

*ICLR*.

*Technometrics*.

*arXiv Preprint arXiv:1705.10306*.

*Journal of Statistical Software*.

*Advances In Neural Information Processing Systems*.

*Journal of the American Statistical Association*.

*Workshop on Learning to Generate Natural Language*.

*Computer Methods in Applied Mechanics and Engineering*.

*Advances in Neural Information Processing Systems 30*.

*arXiv Preprint arXiv:1603.04733*.

*PMLR*.

*Neural Computation*.

*Information Theory, Inference & Learning Algorithms*.

*arXiv Preprint arXiv:1705.09279*.

*JMLR*.

*arXiv:2004.12550 [Stat]*.

*Proceedings of the 32nd International Conference on Machine Learning*.

*arXiv:1804.11271 [Cs, Stat]*.

*arXiv:1610.08733 [Stat]*.

*Proceedings of ICML*.

*Probabilistic Machine Learning: Advanced Topics*.

*Proceedings of the 28th International Conference on Machine Learning (ICML-11)*.

*Journal of Biomedical Informatics*.

*Technometrics*.

*Extremes*.

*Proceedings of the 33rd International Conference on Neural Information Processing Systems*.

*arXiv:2111.08239 [Cs, Stat]*.

*IEEE Transactions on Neural Networks*.

*Advances in Neural Information Processing Systems 30*.

*International Conference on Artificial Intelligence and Statistics*.

*Russian Mathematical Surveys*.

*Journal of Computational Physics*.

*Journal of Computational Physics*.

*Gaussian Processes for Machine Learning*. Adaptive Computation and Machine Learning.

*International Conference on Machine Learning*. ICML’15.

*Machine Learning and the Physical Sciences Workshop at the 33rd Conference on Neural Information Processing Systems (NeurIPS)*.

*Proceedings of Machine Learning and Systems*.

*arXiv:2105.14594 [Cs, Stat]*.

*arXiv:1604.00860 [Stat]*.

*Advances In Neural Information Processing Systems*.

*arXiv:1802.03335 [Stat]*.

*Machine Learning and the Physical Sciences Workshop at the 33rd Conference on Neural Information Processing Systems (NeurIPS)*.

*arXiv:1404.5886 [Math, Stat]*.

*Journal of the Royal Statistical Society: Series B (Statistical Methodology)*.

*Proceedings of the 32nd International Conference on Machine Learning*.

*Statistics and Computing*.

*arXiv:2107.10885 [Math, Stat]*.

*arXiv:2006.11695 [Cs, Stat]*.

*Advances in Neural Information Processing Systems*.

*ICLR*.

*arXiv:1610.09787 [Cs, Stat]*.

*Advances in Neural Information Processing Systems*.

*Journal of Machine Learning Research*.

*UAI18*.

*arXiv:1701.07989 [Math]*.

*New Directions in Statistical Signal Processing*.

*NeurIPS Workshop on Bayesian Deep Learning*.

*ICLR*.

*Proceedings of the 37th International Conference on Machine Learning*.

*arXiv:2011.11955 [Cs, Math]*.

*arXiv:2101.12353 [Cs, Math, Stat]*.

*Neural Networks: The Official Journal of the International Neural Network Society*.

*The American Statistician*.