# Probabilistic neural nets

Bayesian and other probabilistic inference in overparameterized ML

January 11, 2017 — April 27, 2023

Notes on inferring densities and distributions in a massively parameterized deep learning setting, mostly in a Bayesian manner. Probabilistic networks are a more general category than Bayesian ones.

Jospin et al. (2022) is a modern high-speed introduction and summary of many approaches.

Radford Neal’s thesis (Neal 1996) is a foundational Bayesian use of neural networks in the wide NN and MCMC sampling settings. Diederik P. Kingma’s thesis is a blockbuster in the more recent variational tradition.

Alex Graves’ poster of his paper (Graves 2011) presents a simple prior uncertainty method for recurrent nets (diagonal Gaussian weight uncertainty) that I found elucidating. (There is a 3rd party quick and dirty implementation.)

One could refer to the 2019 NeurIPS Bayes deep learning workshop site, which introduced some more modern positioning.

Generative methods are useful, e.g. the variational autoencoder and the affiliated reparameterization trick. Likelihood-free methods seem to be in the air too.

We are free to consider classic neural network inference as a special case of Bayes inference. Specifically, we interpret the loss function \(\mathcal{L}\) of a net \(f:\mathbb{R}^n\times\mathbb{R}^d\to\mathbb{R}^k\) in the likelihood setting:

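\[ \begin{aligned} \mathcal{L}(\theta) &:=-\sum_{i=1}^{m} \log p\left(y_{i} \mid f\left(x_{i} ; \theta\right)\right)-\log p(\theta) \\ &=-\log p(\theta \mid \mathcal{D})-\log p(\mathcal{D}). \end{aligned} \]

The last term, the log evidence \(\log p(\mathcal{D})\), does not depend on \(\theta\), so \(\mathcal{L}\) is the negative log posterior up to an additive constant.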

Obviously, a few things are different from the point-estimate case; the parameter vector \(\theta\) is not interpretable, so what do posterior distributions over it even mean? What are sensible priors? Choosing priors over by-design-uninterpretable parameters such as NN weights is a fraught issue we will mostly ignore for now. Usually, a prior is by default something like

\[ p(\theta)=\mathcal{N}\left(0, \lambda^{-1} I\right) \]

for want of a better idea. This ends up being equivalent to the “weight decay” regularization in the sense that Bayesian priors and regularizations often are.
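To make the weight-decay correspondence concrete, here is a minimal sketch in plain Python. It checks numerically that the negative log density of the isotropic Gaussian prior differs from the penalty \(\tfrac{\lambda}{2}\|\theta\|^2\) only by a constant that does not depend on \(\theta\). The helper names `neg_log_gaussian_prior` and `l2_penalty` and the test vectors are made up for illustration.

```python
import math

def neg_log_gaussian_prior(theta, lam):
    """-log N(theta; 0, lam^{-1} I) for a list of floats theta."""
    d = len(theta)
    sq_norm = sum(t * t for t in theta)
    # Log normalising constant of an isotropic Gaussian with precision lam.
    log_norm = 0.5 * d * math.log(2 * math.pi / lam)
    return 0.5 * lam * sq_norm + log_norm

def l2_penalty(theta, lam):
    """Weight-decay style penalty (lam / 2) * ||theta||^2."""
    return 0.5 * lam * sum(t * t for t in theta)

lam = 0.1
theta_a = [0.5, -1.2, 3.0]  # arbitrary illustrative weight vectors
theta_b = [2.0, 0.0, -0.7]

# The gap between the two quantities is the same for any theta,
# so both terms have the same gradient and the same minimiser.
gap_a = neg_log_gaussian_prior(theta_a, lam) - l2_penalty(theta_a, lam)
gap_b = neg_log_gaussian_prior(theta_b, lam) - l2_penalty(theta_b, lam)
print(abs(gap_a - gap_b) < 1e-12)
```

Since the two terms differ by a constant, adding either to the data-fit loss gives the same gradient with respect to \(\theta\).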

With that basis, we could do the usual stuff for Bayes inference, like considering the predictive posterior:

\[ p(y \mid x, \mathcal{D})=\int p(y \mid f(x ; \theta)) p(\theta \mid \mathcal{D}) d \theta \]
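In practice this integral is nearly always replaced by a Monte Carlo average over draws \(\theta_s\) from some approximation to the posterior. The following is a rough sketch under strong assumptions: the "network" is a single weight, the likelihood is Gaussian with known noise level, and the posterior draws are simply simulated from a Gaussian rather than produced by a real inference method. All names are illustrative.

```python
import math
import random

def gaussian_pdf(y, mean, std):
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def f(x, theta):
    # Toy stand-in for a network: a single weight.
    return theta * x

def predictive_density(y, x, theta_samples, noise_std):
    """Monte Carlo estimate of p(y | x, D): average the likelihood over posterior draws."""
    total = sum(gaussian_pdf(y, f(x, t), noise_std) for t in theta_samples)
    return total / len(theta_samples)

rng = random.Random(0)
# Assumption: pretend these are draws from p(theta | D).
theta_samples = [rng.gauss(2.0, 0.3) for _ in range(2000)]

x_new, noise_std = 1.5, 0.5
preds = [f(x_new, t) for t in theta_samples]
pred_mean = sum(preds) / len(preds)
# Law of total variance: observation noise plus spread due to weight uncertainty.
pred_var = noise_std ** 2 + sum((p - pred_mean) ** 2 for p in preds) / len(preds)

print(predictive_density(3.0, x_new, theta_samples, noise_std), pred_mean, pred_var)
```

The second term in `pred_var` is the part that a point estimate throws away.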

Usually, this posterior turns out to be intractable to calculate in the very high-dimensional parameter spaces of NNs, so we choose something simpler. We could summarize our posterior update by the simple maximum a posteriori estimate:

\[ \theta_{\mathrm{MAP}}:=\operatorname{arg min}_{\theta} \mathcal{L}(\theta). \]

In this case, we have recovered the classic training of non-Bayes nets with some *ad hoc* regularization which we claim was secretly a prior. But we have no notion of predictive uncertainty if we stop there.
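As a toy illustration of that equivalence, the sketch below minimizes \(\mathcal{L}\) by gradient descent for a one-weight linear model with unit-variance Gaussian noise and the Gaussian prior above, then compares the result with the closed-form ridge regression solution. The data, step size and iteration count are arbitrary choices for the example.

```python
def neg_log_posterior_grad(theta, xs, ys, lam):
    """Gradient of L(theta) for y_i ~ N(theta * x_i, 1) and theta ~ N(0, 1/lam)."""
    data_term = -sum(x * (y - theta * x) for x, y in zip(xs, ys))
    prior_term = lam * theta  # identical to the gradient of an L2 weight penalty
    return data_term + prior_term

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.1, 2.9]
lam = 1.0

# Plain gradient descent on the negative log posterior.
theta = 0.0
for _ in range(200):
    theta -= 0.05 * neg_log_posterior_grad(theta, xs, ys, lam)
theta_map = theta

# Closed-form ridge solution for comparison.
theta_ridge = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)
print(theta_map, theta_ridge)
```

The two numbers agree, and neither says anything about how uncertain we should be about `theta`.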

Usually, the model will possess many optima, leading to suspicion that we have not found a good global one. How do we maximize model evidence here in any case?

Somewhere between the full belt-and-braces Bayes approach and the MAP point estimate are various approximations to Bayes inference we might try. What follows is a non-exhaustive smörgåsbord of options to do probabilistic inference in neural nets with different trade-offs.

🏗 To discuss: so many options for predictive uncertainty, but fewer for inverse uncertainty.

## 1 Natural Posterior Network

borchero/natural-posterior-network (Charpentier et al. 2022): some kind of reparameterization uncertainty?

## 2 MC sampling of weights by low-rank Matheron updates

This uses GP Matheron updates. Needs a shorter name but looks cool (Ritter et al. 2021). The idea is that we keep weights random but then create a sparse representation of the weights.

microsoft/bayesianize: Bayesianize: A Bayesian neural network wrapper in PyTorch.

- Mean-field variational inference (MFVI): variational inference with fully factorized Gaussian (FFG) approximation.
- Variational inference with full-covariance Gaussian approximation (for each layer).
- Variational inference with inducing weights: each layer is augmented with a small matrix of inducing weights, then MFVI is performed in the inducing weight space.
- Ensemble in inducing weight space: same augmentation as above, but with ensembles in the inducing weight space.
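For orientation, the fully factorized Gaussian approximation in the first bullet can be sketched in a few lines of plain Python. This is not the bayesianize API, just a generic illustration: each weight gets its own mean `mu` and an unconstrained parameter `rho` mapped to a standard deviation by a softplus, samples are drawn by reparameterization, and the KL divergence to a zero-mean Gaussian prior is available in closed form. The function names and the softplus parameterization are my assumptions.

```python
import math
import random

def softplus(r):
    return math.log1p(math.exp(r))

def sample_weights(mu, rho, rng):
    """Draw w = mu + softplus(rho) * eps with eps ~ N(0, 1), independently per weight."""
    return [m + softplus(r) * rng.gauss(0.0, 1.0) for m, r in zip(mu, rho)]

def kl_to_prior(mu, rho, prior_std):
    """KL( N(mu, diag(sigma^2)) || N(0, prior_std^2 I) ), summed over weights."""
    total = 0.0
    for m, r in zip(mu, rho):
        s = softplus(r)
        total += math.log(prior_std / s) + (s * s + m * m) / (2 * prior_std ** 2) - 0.5
    return total

rng = random.Random(0)
mu, rho = [1.0, -0.5], [-2.0, -1.0]  # illustrative variational parameters
draws = [sample_weights(mu, rho, rng) for _ in range(5000)]
mean_w0 = sum(d[0] for d in draws) / len(draws)
print(mean_w0, kl_to_prior(mu, rho, prior_std=1.0))
```

A training loop would minimize the expected negative log likelihood under such samples plus this KL term. That loop is omitted here.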

## 3 Bayes by backprop

See Bayes by backprop.

## 4 Variational autoencoders

## 5 Sampling via Monte Carlo

TBD. For now, if the number of parameters is smallish, see Hamiltonian Monte Carlo.

## 6 Laplace approximation

See Laplace approximations. AlexImmer/Laplace: Laplace approximations for Deep Learning.
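The basic recipe is: find the MAP estimate, take the curvature of the negative log posterior there, and use a Gaussian with that mean and inverse curvature as the approximate posterior. Below is a one-parameter sketch for logistic regression with a Gaussian prior, where the Hessian is available in closed form. It does not use the Laplace library. At network scale the Hessian is typically replaced by some structured approximation. Data and names are invented for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_and_hess(theta, xs, ys, lam):
    """First and second derivative of the negative log posterior for 1-D
    logistic regression with prior theta ~ N(0, 1/lam)."""
    g, h = lam * theta, lam
    for x, y in zip(xs, ys):
        p = sigmoid(theta * x)
        g += x * (p - y)
        h += x * x * p * (1.0 - p)
    return g, h

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]
lam = 1.0

# Newton iterations to find the MAP estimate (the objective is convex here).
theta = 0.0
for _ in range(25):
    g, h = grad_and_hess(theta, xs, ys, lam)
    theta -= g / h

theta_map = theta
_, hessian = grad_and_hess(theta_map, xs, ys, lam)
laplace_var = 1.0 / hessian  # posterior approximated by N(theta_map, laplace_var)
print(theta_map, laplace_var)
```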

## 7 Via random projections

I do not have a single paper about this, but I have seen random projection pop up as a piece of the puzzle in other methods. TBC.

## 8 In Gaussian process regression

See kernel learning.

## 9 Via measure transport

See reparameterization.

## 10 Via infinite-width random nets

See wide NN.

## 11 Via NTK

How does this work? He, Lakshminarayanan, and Teh (2020).

## 12 Ensemble methods

Deep learning has its own variants of model averaging and bagging: Neural ensembles. Yarin Gal’s PhD Thesis (Gal 2016) summarizes some implicit approximate approaches (e.g. the Bayesian interpretation of dropout), although dropout, as he frames it, has been contested these days as a means of inference.
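A bagging-style ensemble can be sketched without any neural network at all. In the toy below each "member" is a one-weight ridge regression fitted to a bootstrap resample, and the spread of the members' predictions serves as a crude uncertainty estimate. Neural ensembles more commonly vary the random initialization instead of (or as well as) the data, so treat this as an illustration of the averaging idea only. The data and names are synthetic.

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit of a single slope parameter."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

rng = random.Random(0)
xs = [0.5 * i for i in range(1, 21)]
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]  # synthetic data, true slope 2

# Fit each ensemble member on a bootstrap resample of the data.
members = []
for _ in range(50):
    idx = [rng.randrange(len(xs)) for _ in xs]
    members.append(fit_ridge_1d([xs[i] for i in idx], [ys[i] for i in idx], lam=1e-3))

x_new = 4.0
preds = [w * x_new for w in members]
ens_mean = sum(preds) / len(preds)
ens_std = (sum((p - ens_mean) ** 2 for p in preds) / (len(preds) - 1)) ** 0.5
print(ens_mean, ens_std)
```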

## 13 Neural GLM

I think this has a sparse Bayes flavor. M.-N. Tran et al. (2019) seems to randomize over input params?

## 14 Practicalities

The computational toolsets for “neural” probabilistic programming and vanilla probabilistic programming are converging. See the tool listing under probabilistic programming.

## 15 Stochastic Gradient Descent as MC inference

See MCMC by SGD.

## 16 Khan and Rue’s Bayes Learning Rule

Bayes via natural gradient (Khan and Rue 2024; Zellner 1988).

> We show that many machine-learning algorithms are specific instances of a single algorithm called the Bayesian learning rule. The rule, derived from Bayesian principles, yields a wide-range of algorithms from fields such as optimization, deep learning, and graphical models. This includes classical algorithms such as ridge regression, Newton’s method, and Kalman filter, as well as modern deep-learning algorithms such as stochastic-gradient descent, RMSprop, and Dropout. The key idea in deriving such algorithms is to approximate the posterior using candidate distributions estimated by using natural gradients. Different candidate distributions result in different algorithms and further approximations to natural gradients give rise to variants of those algorithms. Our work not only unifies, generalizes, and improves existing algorithms, but also helps us design new ones.

## 17 Incoming

Seminars! Laplace’s Demon: A Seminar Series about Bayesian Machine Learning at Scale

Dustin Tran’s uncertainty layers (D. Tran et al. 2018):

> In our work, we extend layers to capture “distributions over functions”, which we describe as a layer with uncertainty about some state in its computation — be it uncertainty in the weights, pre-activation units, activations, or the entire function. Each sample from the distribution instantiates a different function, e.g., a layer with a different weight configuration.…
>
> While the framework we laid out so far tightly integrates deep Bayesian modelling into existing ecosystems, we have deliberately limited our scope. In particular, our layers tie the model specification to the inference algorithm (typically, variational inference). Bayesian Layers’ core assumption is the modularization of inference per layer. This makes inference procedures which depend on the full parameter space, such as Markov chain Monte Carlo, difficult to fit within the framework.

- Bayesian Neural Networks by Duvenaud’s team

## 18 References

*Advances in Neural Information Processing Systems 29*.

*arXiv:2005.12998 [Math]*.

*SIAM Journal on Scientific Computing*.

*Proceedings of the 39th International Conference on Machine Learning*.

*arXiv:2110.11216 [Cs, Math, Stat]*.

*arXiv:1511.07367 [Stat]*.

*Inverse Problems*.

*arXiv:1907.03382 [Cs, Stat]*.

*Microsoft Research*.

*Pattern Recognition and Machine Learning*. Information Science and Statistics.

*Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37*. ICML’15.

*International Conference on Machine Learning*.

*Journal of the American Statistical Association*.

*arXiv:1703.04818 [Cs]*.

*Mathematics of Computation*.

*arXiv:2105.04471 [Cs, Stat]*.

*Proceedings of the 35th International Conference on Machine Learning*.

*Advances in Neural Information Processing Systems 31*.

*Proceedings of the 39th International Conference on Machine Learning*.

*PMLR*.

*Artificial Intelligence and Statistics*.

*arXiv:2012.07244 [Cs]*.

*arXiv:2106.14806 [Cs, Stat]*.

*Computer Physics Communications*.

*Advances in Neural Information Processing Systems 28*. NIPS’15.

*arXiv:1801.10395 [Stat]*.

*arXiv:2012.00152 [Cs, Stat]*.

*Journal of Machine Learning Research*.

*arXiv:1904.01681 [Cs, Stat]*.

*Proceedings of the 37th International Conference on Machine Learning*.

*arXiv:2105.04504 [Cs, Stat]*.

*arXiv:1703.11008 [Cs]*.

*Advances in Neural Information Processing Systems 30*.

*Proceedings of ICLR*.

*Advances in Neural Information Processing Systems 31*.

*arXiv:1704.04110 [Cs, Stat]*.

*arXiv:1906.11537 [Cs, Stat]*.

*International Statistical Review*.

*Advances in Approximate Bayesian Inference Workshop, NIPS*.

*Advances in Approximate Bayesian Inference Workshop, NIPS*.

*Proceedings of the 33rd International Conference on Machine Learning (ICML-16)*.

*arXiv:1512.05287 [Stat]*.

*4th International Conference on Learning Representations (ICLR) Workshop Track*.

*arXiv:1506.02157 [Stat]*.

*arXiv:1705.07832 [Stat]*.

*arXiv:1807.01613 [Cs, Stat]*.

*arXiv:1902.10298 [Cs]*.

*IEEE Transactions on Signal Processing*.

*Journal of Applied Econometrics*.

*Proceedings of the 24th International Conference on Neural Information Processing Systems*. NIPS’11.

*arXiv:1308.0850 [Cs]*.

*2013 IEEE International Conference on Acoustics, Speech and Signal Processing*.

*arXiv:1502.04623 [Cs]*.

*Advances in Neural Information Processing Systems 28*.

*Proceedings of ICLR*.

*arXiv:1805.08034 [Cs, Math]*.

*Kalman Filtering and Neural Networks*. Adaptive and Learning Systems for Signal Processing, Communications, and Control.

*Advances in Neural Information Processing Systems*.

*PMLR*.

*arXiv:1809.09505 [Cs, Math, Stat]*.

*arXiv:1706.00550 [Cs, Stat]*.

*Proceedings of the 38th International Conference on Machine Learning*.

*International Conference on Artificial Intelligence and Statistics*.

*Spatial Statistics*, Spatial Statistics Miami,.

*Proceedings of The 35th Uncertainty in Artificial Intelligence Conference*.

*arXiv:2007.06823 [Cs, Stat]*.

*Proceedings of ICLR*.

*arXiv:1906.01930 [Cs, Stat]*.

*Artificial Intelligence and Statistics*.

*Advances in Neural Information Processing Systems 29*.

*ICLR 2014 Conference*.

*Inverse Problems*.

*UAI17*.

*arXiv Preprint arXiv:1511.05121*.

*ICML 2020*.

*Advances in Neural Information Processing Systems*.

*Uncertainty in Artificial Intelligence*.

*CoRR*.

*arXiv:1512.09300 [Cs, Stat]*.

*Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS)*. Proceedings of Machine Learning Research.

*ICLR*.

*Technometrics*.

*arXiv Preprint arXiv:1705.10306*.

*Journal of Statistical Software*.

*Advances In Neural Information Processing Systems*.

*Journal of the American Statistical Association*.

*Workshop on Learning to Generate Natural Language*.

*Computer Methods in Applied Mechanics and Engineering*.

*Advances in Neural Information Processing Systems 30*.

*arXiv Preprint arXiv:1603.04733*.

*PMLR*.

*Neural Computation*.

*Information Theory, Inference & Learning Algorithms*.

*arXiv Preprint arXiv:1705.09279*.

*JMLR*.

*arXiv:2004.12550 [Stat]*.

*Proceedings of the 32nd International Conference on Machine Learning*.

*arXiv:1804.11271 [Cs, Stat]*.

*arXiv:1610.08733 [Stat]*.

*Proceedings of ICML*.

*Probabilistic Machine Learning: Advanced Topics*.

*Proceedings of the 28th International Conference on Machine Learning (ICML-11)*.

*Journal of Biomedical Informatics*.

*Technometrics*.

*Extremes*.

*Proceedings of the 33rd International Conference on Neural Information Processing Systems*.

*arXiv:2111.08239 [Cs, Stat]*.

*IEEE Transactions on Neural Networks*.

*Advances in Neural Information Processing Systems 30*.

*International Conference on Artificial Intelligence and Statistics*.

*Russian Mathematical Surveys*.

*Journal of Computational Physics*.

*Journal of Computational Physics*.

*Gaussian Processes for Machine Learning*. Adaptive Computation and Machine Learning.

*International Conference on Machine Learning*. ICML’15.

*Machine Learning and the Physical Sciences Workshop at the 33rd Conference on Neural Information Processing Systems (NeurIPS)*.

*Proceedings of Machine Learning and Systems*.

*arXiv:2105.14594 [Cs, Stat]*.

*arXiv:1604.00860 [Stat]*.

*Advances In Neural Information Processing Systems*.

*arXiv:1802.03335 [Stat]*.

*Machine Learning and the Physical Sciences Workshop at the 33rd Conference on Neural Information Processing Systems (NeurIPS)*.

*arXiv:1404.5886 [Math, Stat]*.

*Journal of the Royal Statistical Society: Series B (Statistical Methodology)*.

*Proceedings of the 32nd International Conference on Machine Learning*.

*Statistics and Computing*.

*arXiv:2107.10885 [Math, Stat]*.

*arXiv:2006.11695 [Cs, Stat]*.

*ICLR*.

*arXiv:1610.09787 [Cs, Stat]*.

*Journal of Computational and Graphical Statistics*.

*Advances in Neural Information Processing Systems*.

*Journal of Machine Learning Research*.

*UAI18*.

*arXiv:1701.07989 [Math]*.

*New Directions in Statistical Signal Processing*.

*NeurIPS Workshop on Bayesian Deep Learning*.

*Journal of Computational and Graphical Statistics*.

*ICLR*.

*Proceedings of the 37th International Conference on Machine Learning*.

*arXiv:2011.11955 [Cs, Math]*.

*arXiv:2101.12353 [Cs, Math, Stat]*.

*Neural Networks: The Official Journal of the International Neural Network Society*.

*The American Statistician*.