- Backgrounders
- Natural Posterior Network
- MC sampling of weights by low-rank Matheron updates
- Mixture density networks
- Variational autoencoders
- Sampling via Monte Carlo
- Stochastic Gradient Descent as MC inference
- Laplace approximation
- Via random projections
- In Gaussian process regression
- Via measure transport
- Via infinite-width random nets
- Via NTK
- Ensemble methods
- Neural GLM
- Practicalities
- Incoming
- References

Inferring densities and distributions in a massively parameterised deep learning setting.

This is not intrinsically a Bayesian thing to do, but in practice much of the demand for probabilistic nets comes from the demand for Bayesian posterior inference in neural nets. Bayesian inference is, however, not the only way to do uncertainty quantification.

Neural networks are very far from the simple exponential families where conjugate distributions might help, and so we typically rely upon approximations or luck to reach our true target of interest.

Closely related: Generative models where we train a process to generate a (possibly stochastic) phenomenon of interest.

## Backgrounders

Jospin et al. (2022) is a modern high-speed intro and summary of many approaches.

Radford Neal’s thesis (Neal 1996) is a foundational Bayesian use of neural networks in the wide NN and MCMC sampling settings. Diederik P. Kingma’s thesis is a blockbuster in the more recent variational tradition.

I found Alex Graves’ poster of his paper (Graves 2011) elucidating; it covers about the simplest form of prior uncertainty for recurrent nets (diagonal Gaussian weight uncertainty). (There is a third-party quick-and-dirty implementation.)

One could refer to the 2019 NeurIPS Bayesian deep learning workshop site, which will have some more modern positioning. There was a tutorial in 2020 by Dustin Tran, Jasper Snoek, and Balaji Lakshminarayanan: Practical Uncertainty Estimation & Out-of-Distribution Robustness in Deep Learning.

Generative methods are useful, e.g. the variational autoencoder and the affiliated reparameterization trick. Likelihood-free methods seem to be in the air too.

We are free to consider classic neural network inference as a sort-of special case of Bayes inference. Specifically, we interpret the loss function \(\mathcal{L}\) of a net \(f:\mathbb{R}^n\times\mathbb{R}^d\to\mathbb{R}^k\) in the likelihood setting \[ \begin{aligned} \mathcal{L}(\theta) &:=-\sum_{i=1}^{m} \log p\left(y_{i} \mid f\left(x_{i} ; \theta\right)\right)-\log p(\theta) \\ &=-\log p(\theta \mid \mathcal{D})-\log p(\mathcal{D}), \end{aligned} \] where the final term, the log evidence, does not depend on \(\theta\) and so does not affect the minimiser.

Obviously a few things are different from the point-estimate case; the parameter vector \(\theta\) is not interpretable, so what do posterior distributions over it even mean? What are sensible priors? Choosing priors over by-design-uninterpretable parameters such as NN weights is a whole fraught thing in ways we will mostly ignore for now. Usually a prior is by default something like \[ p(\theta)=\mathcal{N}\left(0, \lambda^{-1} I\right) \] for want of a better idea. This ends up being equivalent to the “weight decay” regularisation in the sense that Bayesian priors and regularisations often are.
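To make the weight-decay equivalence concrete, here is a minimal sketch (plain Python; the function names are illustrative, not from any library). It checks that the negative log density of the Gaussian prior above differs from the penalty \(\tfrac{\lambda}{2}\|\theta\|^2\) only by a constant that does not depend on \(\theta\), so both give the same MAP estimate.

```python
import math


def neg_log_gaussian_prior(theta, lam):
    """-log N(theta; 0, (1/lam) I) for a flat list of weights."""
    d = len(theta)
    sq_norm = sum(t * t for t in theta)
    return 0.5 * lam * sq_norm + 0.5 * d * math.log(2 * math.pi / lam)


def weight_decay_penalty(theta, lam):
    """The usual L2 / weight-decay term, (lam / 2) * ||theta||^2."""
    return 0.5 * lam * sum(t * t for t in theta)


lam = 0.1  # prior precision, i.e. the weight-decay strength
theta_a = [0.5, -1.0, 2.0]
theta_b = [0.0, 0.0, 0.0]

# The gap between the two objectives is the same for every theta.
gap_a = neg_log_gaussian_prior(theta_a, lam) - weight_decay_penalty(theta_a, lam)
gap_b = neg_log_gaussian_prior(theta_b, lam) - weight_decay_penalty(theta_b, lam)
```

Note that optimiser-level "weight decay" coincides with this L2 penalty for plain gradient descent; for adaptive optimisers the two can differ.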

With that basis we could do the usual stuff for Bayes inference, like considering the predictive posterior \[ p(y \mid x, \mathcal{D})=\int p(y \mid f(x ; \theta)) p(\theta \mid \mathcal{D}) d \theta \]

Usually this turns out to be intractable to calculate in the very high-dimensional parameter spaces of NNs, so we choose something simpler.
We could summarise our posterior update by a simple maximum a posteriori estimate
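When we can draw samples \(\theta^{(s)}\) from the posterior (or from some approximation to it), the integral is typically estimated by the Monte Carlo average \(\frac{1}{S}\sum_s p(y \mid f(x;\theta^{(s)}))\). The sketch below assumes a toy one-parameter model \(f(x;\theta)=\theta x\) with Gaussian observation noise and a Gaussian stand-in for the posterior; these are chosen only because the exact predictive density is then known and can be compared with the estimate. All names are illustrative.

```python
import math
import random


def gaussian_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def mc_predictive_density(y, x, theta_samples, f, noise_sd):
    """Average the likelihood p(y | f(x; theta)) over posterior samples."""
    total = sum(gaussian_pdf(y, f(x, th), noise_sd) for th in theta_samples)
    return total / len(theta_samples)


def linear_model(x, theta):
    # Toy stand-in for a network f(x; theta).
    return theta * x


rng = random.Random(0)
post_mean, post_sd, noise_sd = 1.0, 0.5, 1.0  # assumed toy posterior and noise
theta_samples = [rng.gauss(post_mean, post_sd) for _ in range(20000)]

x, y = 2.0, 2.0
estimate = mc_predictive_density(y, x, theta_samples, linear_model, noise_sd)

# For this toy model the predictive is N(post_mean * x, post_sd^2 x^2 + noise_sd^2).
exact = gaussian_pdf(y, post_mean * x, math.sqrt(post_sd**2 * x**2 + noise_sd**2))
```

In a real network the hard part is obtaining the samples, which is what the approximations in the following sections are for.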
\[
\theta_{\mathrm{MAP}}:=\operatorname{arg min}_{\theta} \mathcal{L}(\theta).
\]
In this case we have recovered the classic training of non-Bayes nets with some *ad hoc* regularisation which we claimed was a prior.
But we have no notion of predictive uncertainty if we stop there.

Usually the model will possess many optima, and this will lead to the suspicion that we have not found a good global one. How do we maximise model evidence here in any case?

Somewhere between the full belt-and-braces Bayes approach and the MAP point estimate there are various approximations to Bayes inference we might try. What follows is a non-exhaustive smörgåsbord of options to do probabilistic inference in neural nets with different trade-offs.

To discuss: so many options for predictive uncertainty, but fewer for inverse uncertainty.

## Natural Posterior Network

borchero/natural-posterior-network (Charpentier et al. 2022): some kind of reparameterization uncertainty.

## MC sampling of weights by low-rank Matheron updates

This uses GP Matheron updates. Needs a shorter name but looks cool (Ritter et al. 2021).

microsoft/bayesianize: Bayesianize: A Bayesian neural network wrapper in pytorch.

- Mean-field variational inference (MFVI): variational inference with fully factorised Gaussian (FFG) approximation.
- Variational inference with full-covariance Gaussian approximation (for each layer).
- Variational inference with inducing weights: each layer is augmented with a small matrix of inducing weights, then MFVI is performed in the inducing weight space.
- Ensemble in inducing weight space: same augmentation as above, but with ensembles in the inducing weight space.
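As a rough illustration of the first item, the sketch below shows the two ingredients of a fully factorised Gaussian approximation over a handful of weights: reparameterised sampling and the closed-form KL divergence to an isotropic Gaussian prior. It is plain Python with hypothetical function names, and is not the bayesianize API.

```python
import math
import random


def sample_weights(mu, log_sigma, rng):
    """Draw w_j = mu_j + sigma_j * eps_j with eps_j ~ N(0, 1), independently."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_sigma)]


def kl_to_prior(mu, log_sigma, prior_sd=1.0):
    """KL( prod_j N(mu_j, sigma_j^2) || prod_j N(0, prior_sd^2) )."""
    total = 0.0
    for m, ls in zip(mu, log_sigma):
        sigma = math.exp(ls)
        total += (
            math.log(prior_sd / sigma)
            + (sigma**2 + m**2) / (2 * prior_sd**2)
            - 0.5
        )
    return total


rng = random.Random(0)
mu = [0.3, -1.2, 0.0]          # variational means, one per weight
log_sigma = [-1.0, -0.5, 0.0]  # log standard deviations, one per weight

w = sample_weights(mu, log_sigma, rng)  # one draw of the weights
kl = kl_to_prior(mu, log_sigma)         # complexity term of the ELBO
```

A full MFVI training loop would add the expected log-likelihood under such draws and optimise `mu` and `log_sigma` by gradient ascent on the resulting ELBO; that part is omitted here.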

## Mixture density networks

Nothing to say for now but here are some recommendations I received about this classic (C. Bishop 1994) method.

## Variational autoencoders

## Sampling via Monte Carlo

TBD. For now, if the number of parameters is smallish, see Hamiltonian Monte Carlo.

## Stochastic Gradient Descent as MC inference

See MCMC by SGD.

## Laplace approximation

See Laplace approximations. Also AlexImmer/Laplace: Laplace approximations for Deep Learning.

## Via random projections

I do not have a single paper about this, but I have seen random projections pop up as a piece of the puzzle in other methods. TBC.

## In Gaussian process regression

See kernel learning.

## Via measure transport

See reparameterization.

## Via infinite-width random nets

See wide NN.

## Via NTK

How does this work? He, Lakshminarayanan, and Teh (2020).

## Ensemble methods

Deep learning has its own variants of model averaging and bagging: Neural ensembles. Yarin Gal’s PhD Thesis (Gal 2016) summarizes some implicit approximate approaches (e.g. the Bayesian interpretation of dropout), although dropout as he frames it has become highly controversial these days as a means of inference.
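One common way to turn an ensemble into a predictive distribution, assuming each member outputs a Gaussian mean and variance for the same input, is to treat the members as an equally weighted mixture and report its moments. A minimal sketch (the function name is illustrative):

```python
def combine_ensemble(means, variances):
    """Mean and variance of a uniform mixture of Gaussians N(mean_i, var_i)."""
    k = len(means)
    mix_mean = sum(means) / k
    # E[y^2] under the mixture is the average of (var_i + mean_i^2).
    second_moment = sum(v + m * m for m, v in zip(means, variances)) / k
    mix_var = second_moment - mix_mean**2
    return mix_mean, mix_var


# Two members that agree on the noise level but disagree on the mean.
mean, var = combine_ensemble([0.0, 2.0], [1.0, 1.0])
```

The combined variance exceeds the members' own variance by the spread of their means, which is how disagreement between members shows up as extra predictive uncertainty.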

## Neural GLM

I think this has a sparse Bayes flavour (M.-N. Tran et al. 2019); it seems to randomise over input params?

## Practicalities

The computational toolsets for “neural” probabilistic programming and vanilla probabilistic programming are converging. See the tool listing under probabilistic programming.

## Incoming

Dustin Tran’s uncertainty layers, [1812.03973] Bayesian Layers: A Module for Neural Network Uncertainty:

In our work, we extend layers to capture “distributions over functions”, which we describe as a layer with uncertainty about some state in its computation – be it uncertainty in the weights, pre-activation units, activations, or the entire function. Each sample from the distribution instantiates a different function, e.g., a layer with a different weight configuration. …

While the framework we laid out so far tightly integrates deep Bayesian modelling into existing ecosystems, we have deliberately limited our scope. In particular, our layers tie the model specification to the inference algorithm (typically, variational inference). Bayesian Layers’ core assumption is the modularization of inference per layer. This makes inference procedures which depend on the full parameter space, such as Markov chain Monte Carlo, difficult to fit within the framework.

## References

*Advances in Neural Information Processing Systems 29*.

*arXiv:2005.12998 [Math]*, January.

*SIAM Journal on Scientific Computing* 38 (1): A243–72.

*Proceedings of the 39th International Conference on Machine Learning*, 414–34. PMLR.

*arXiv:2110.11216 [Cs, Math, Stat]*, October.

*arXiv:1511.07367 [Stat]*, November.

*Inverse Problems* 36 (11): 115003.

*arXiv:1907.03382 [Cs, Stat]*.

*UAI18*.

*Microsoft Research*, January.

*Pattern Recognition and Machine Learning*. Information Science and Statistics. New York: Springer.

*International Conference on Machine Learning*, 537–46.

*Journal of the American Statistical Association* 88 (421): 9–25.

*arXiv:1703.04818 [Cs]*, March.

*Computer Physics Communications* 244 (November): 170–79.

*Mathematics of Computation* 91 (335): 1247–80.

*arXiv:2105.04471 [Cs, Stat]*, March.

*Advances in Neural Information Processing Systems 31*, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 6572–83. Curran Associates, Inc.

*Proceedings of the 35th International Conference on Machine Learning*, 844–53. PMLR.

*PMLR*.

*Artificial Intelligence and Statistics*, 207–15.

*arXiv:2012.07244 [Cs]*, March.

*arXiv:2106.14806 [Cs, Stat]*.

*Advances in Neural Information Processing Systems 28*, 1414–22. NIPS’15. Cambridge, MA, USA: MIT Press.

*arXiv:1801.10395 [Stat]*, January.

*arXiv:2012.00152 [Cs, Stat]*, November.

*Journal of Machine Learning Research* 19 (1): 2100–2145.

*arXiv:1904.01681 [Cs, Stat]*, April.

*arXiv:2105.04504 [Cs, Stat]*, May.

*arXiv:1703.11008 [Cs]*, October.

*Advances in Neural Information Processing Systems 30*, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 5309–19. Curran Associates, Inc.

*Proceedings of ICLR*.

*Advances in Neural Information Processing Systems 31*, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 441–52. Curran Associates, Inc.

*arXiv:1704.04110 [Cs, Stat]*, April.

*arXiv:1906.11537 [Cs, Stat]*, June.

*Advances in Approximate Bayesian Inference Workshop, NIPS*.

*Advances in Approximate Bayesian Inference Workshop, NIPS*.

*Proceedings of the 33rd International Conference on Machine Learning (ICML-16)*.

*arXiv:1512.05287 [Stat]*.

*4th International Conference on Learning Representations (ICLR) Workshop Track*.

*arXiv:1506.02157 [Stat]*, May.

*arXiv:1705.07832 [Stat]*, May.

*arXiv:1807.01613 [Cs, Stat]*, July, 10.

*arXiv:1902.10298 [Cs]*, February.

*IEEE Transactions on Signal Processing* 64 (13): 3444–57.

*Proceedings of the 24th International Conference on Neural Information Processing Systems*, 2348–56. NIPS’11. USA: Curran Associates Inc.

*arXiv:1308.0850 [Cs]*, August.

*2013 IEEE International Conference on Acoustics, Speech and Signal Processing*.

*arXiv:1502.04623 [Cs]*, February.

*Advances in Neural Information Processing Systems 28*, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2629–37. Curran Associates, Inc.

*Proceedings of ICLR*.

*arXiv:1805.08034 [Cs, Math]*, May.

*Advances in Neural Information Processing Systems*. Vol. 33.

*PMLR*, 361–69.

*arXiv:1706.00550 [Cs, Stat]*.

*arXiv:1809.09505 [Cs, Math, Stat]*, September.

*arXiv:2104.04975 [Cs, Stat]*, June.

*International Conference on Artificial Intelligence and Statistics*, 703–11. PMLR.

*Spatial Statistics*, Spatial Statistics Miami, 8 (May): 20–38.

*Proceedings of The 35th Uncertainty in Artificial Intelligence Conference*, 1169–79. PMLR.

*arXiv:2007.06823 [Cs, Stat]*, January.

*Proceedings of ICLR*.

*arXiv:1906.01930 [Cs, Stat]*, July.

*Advances in Neural Information Processing Systems 29*. Curran Associates, Inc.

*ICLR 2014 Conference*.

*Inverse Problems* 35 (9): 095005.

*UAI17*.

*arXiv Preprint arXiv:1511.05121*.

*ICML 2020*.

*arXiv:2010.02709 [Cs, Stat]*, May.

*Uncertainty in Artificial Intelligence*.

*arXiv:1512.09300 [Cs, Stat]*, December.

*Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS)*, 54:1338–48. Proceedings of Machine Learning Research. Fort Lauderdale, FL, USA: PMLR.

*arXiv Preprint arXiv:1705.10306*.

*Technometrics* 44 (3): 230–41.

*ICLR*.

*Journal of Statistical Software* 63 (i19): 1–25.

*Advances In Neural Information Processing Systems*.

*Journal of the American Statistical Association* 0 (0): 1–18.

*Workshop on Learning to Generate Natural Language*.

*Computer Methods in Applied Mechanics and Engineering* 259 (June): 24–39.

*Advances in Neural Information Processing Systems 30*, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 6446–56. Curran Associates, Inc.

*arXiv Preprint arXiv:1603.04733*, 1708–16.

*PMLR*, 2218–27.

*Information Theory, Inference & Learning Algorithms*. Cambridge University Press.

*Neural Computation* 4 (3): 448–72.

*arXiv Preprint arXiv:1705.09279*.

*JMLR*, April.

*arXiv:2004.12550 [Stat]*, October.

*Proceedings of the 32nd International Conference on Machine Learning*, 2408–17. PMLR.

*arXiv:1804.11271 [Cs, Stat]*.

*arXiv:1610.08733 [Stat]*, October.

*Proceedings of ICML*.

*Proceedings of the 28th International Conference on Machine Learning (ICML-11)*, 1105–12.

*Journal of Biomedical Informatics* 89 (January): 56–67.

*Technometrics* 59 (1): 80–92.

*Extremes* 21 (3): 441–62.

*Proceedings of the 33rd International Conference on Neural Information Processing Systems*, 14003–14. Red Hook, NY, USA: Curran Associates Inc.

*arXiv:2111.08239 [Cs, Stat]*, November.

*IEEE Transactions on Neural Networks* 12 (6): 1278–87.

*Advances in Neural Information Processing Systems 30*, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 2338–47. Curran Associates, Inc.

*International Conference on Artificial Intelligence and Statistics*, 1126–36. PMLR.

*Russian Mathematical Surveys* 50 (6): 1151.

*Journal of Computational Physics* 378 (February): 686–707.

*Gaussian Processes for Machine Learning*. Adaptive Computation and Machine Learning. Cambridge, Mass: MIT Press.

*International Conference on Machine Learning*, 1530–38. ICML’15. Lille, France: JMLR.org.

*Machine Learning and the Physical Sciences Workshop at the 33rd Conference on Neural Information Processing Systems (NeurIPS)*, 6.

*arXiv:2105.14594 [Cs, Stat]*, May.

*arXiv:1604.00860 [Stat]*, September.

*Advances In Neural Information Processing Systems*.

*arXiv:1802.03335 [Stat]*, February.

*Machine Learning and the Physical Sciences Workshop at the 33rd Conference on Neural Information Processing Systems (NeurIPS)*, 11.

*arXiv:1404.5886 [Math, Stat]*, April.

*Journal of the Royal Statistical Society: Series B (Statistical Methodology)* 77 (1): 3–33.

*Proceedings of the 32nd International Conference on Machine Learning*.

*Statistics and Computing* 30 (2): 419–46.

*arXiv:2107.10885 [Math, Stat]*, July.

*arXiv:2006.11695 [Cs, Stat]*, December.

*Journal of Machine Learning Research* 23 (74): 1–56.

*Advances in Neural Information Processing Systems*, 34:19730–42. Curran Associates, Inc.

*Advances in Neural Information Processing Systems* 32.

*ICLR*.

*arXiv:1610.09787 [Cs, Stat]*, October.

*Journal of Computational and Graphical Statistics* 0 (ja): 1–40.

*arXiv:1701.07989 [Math]*, April.

*New Directions in Statistical Signal Processing*. Vol. 155. MIT Press.

*ICLR*.

*Proceedings of the 37th International Conference on Machine Learning*, 10248–59. PMLR.

*arXiv:2011.11955 [Cs, Math]*.

*arXiv:2101.12353 [Cs, Math, Stat]*, January.

*Neural Networks: The Official Journal of the International Neural Network Society* 10 (1): 99–109.
