Probabilistic neural nets
Bayesian and other probabilistic inference in overparameterized ML
January 12, 2017 — April 27, 2023
Inferring densities and distributions in a massively parameterised deep learning setting, in a Bayesian manner. Probabilistic networks are more general than Bayesian ones.
Jospin et al. (2022) is a modern high-speed intro and summary of many approaches.
Radford Neal’s thesis (Neal 1996) is a foundational Bayesian use of neural networks in the wide NN and MCMC sampling settings. Diederik P. Kingma’s thesis is a blockbuster in the more recent variational tradition.
I found Alex Graves’ poster for his paper (Graves 2011) elucidating: it presents about the simplest weight-uncertainty scheme for recurrent nets (diagonal Gaussian weight uncertainty). (There is a third-party quick-and-dirty implementation.)
See also the 2019 NeurIPS Bayesian deep learning workshop site, which introduced some more modern positioning.
Generative methods are useful, e.g. the variational autoencoder and the affiliated reparameterization trick. Likelihood-free methods seem to be in the air too.
We are free to consider classic neural network inference as sort-of a special case of Bayes inference. Specifically, we interpret the loss function \(\mathcal{L}\) of a net \(f:\mathbb{R}^n\times\mathbb{R}^d\to\mathbb{R}^k\), with inputs \(x\in\mathbb{R}^n\) and parameters \(\theta\in\mathbb{R}^d\), in the likelihood setting \[ \begin{aligned} \mathcal{L}(\theta) &:=-\sum_{i=1}^{m} \log p\left(y_{i} \mid f\left(x_{i} ; \theta\right)\right)-\log p(\theta) \\ &=-\log p(\theta \mid \mathcal{D})+\text{const}, \end{aligned} \] where the constant is the negative log evidence, which does not depend on \(\theta\).
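To make this reading of the loss concrete, here is a minimal sketch of an unnormalised negative log posterior. It assumes a Gaussian observation likelihood, a zero-mean isotropic Gaussian prior, and a tiny one-hidden-layer net; the names `net` and `neg_log_posterior`, the parameter layout and all sizes are illustrative choices, not taken from any of the works cited here.

```python
import numpy as np

def net(x, theta, hidden=8):
    """Tiny one-hidden-layer tanh net f(x; theta) mapping R -> R.

    theta is a flat vector of length 3 * hidden + 1 (illustrative layout).
    """
    w1 = theta[:hidden]                 # input-to-hidden weights
    b1 = theta[hidden:2 * hidden]       # hidden biases
    w2 = theta[2 * hidden:3 * hidden]   # hidden-to-output weights
    b2 = theta[3 * hidden]              # output bias
    h = np.tanh(np.outer(x, w1) + b1)   # shape (m, hidden)
    return h @ w2 + b2                  # shape (m,)

def neg_log_posterior(theta, x, y, noise_var=1.0, prior_prec=1.0):
    """Unnormalised negative log posterior: -log p(y | f) - log p(theta).

    Assumes a Gaussian likelihood and an N(0, prior_prec^{-1} I) prior;
    additive terms that do not depend on theta are dropped.
    """
    resid = y - net(x, theta)
    nll = 0.5 * np.sum(resid ** 2) / noise_var
    neg_log_prior = 0.5 * prior_prec * np.sum(theta ** 2)
    return nll + neg_log_prior

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 20)
y = np.sin(x) + 0.1 * rng.normal(size=x.shape)
theta = 0.1 * rng.normal(size=3 * 8 + 1)
loss = neg_log_posterior(theta, x, y)
```

Minimising `neg_log_posterior` over `theta` with any standard optimiser is then ordinary regularised training of the net.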
Obviously a few things are different from the point-estimate case; the parameter vector \(\theta\) is not interpretable, so what do posterior distributions over it even mean? What are sensible priors? Choosing priors over by-design-uninterpretable parameters such as NN weights is a whole fraught thing in ways we will mostly ignore for now. Usually a prior is by default something like \[ p(\theta)=\mathcal{N}\left(0, \lambda^{-1} I\right) \] for want of a better idea. This ends up being equivalent to the “weight decay” regularisation in the sense that Bayesian priors and regularisations often are.
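The equivalence with weight decay can be read off by taking the negative log of this Gaussian prior (a standard identity, written here for \(\theta\in\mathbb{R}^d\)): \[ -\log p(\theta)=\frac{\lambda}{2}\|\theta\|_2^2-\frac{d}{2}\log\frac{\lambda}{2\pi}, \qquad \nabla_\theta\left[-\log p(\theta)\right]=\lambda\theta. \] The first term is the usual \(L_2\) penalty with coefficient \(\lambda/2\) and the second does not depend on \(\theta\). Under plain gradient descent, the gradient contribution \(\lambda\theta\) shrinks the weights towards zero at each step, which is where the name comes from.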
With that basis we could do the usual stuff for Bayes inference, like considering the posterior predictive \[ p(y \mid x, \mathcal{D})=\int p(y \mid f(x ; \theta)) p(\theta \mid \mathcal{D}) d \theta \] Usually this integral turns out to be intractable to calculate in the very-high-dimensional parameter spaces of NNs, so we choose something simpler. We could summarise our posterior update by a simple maximum a posteriori estimate \[ \theta_{\mathrm{MAP}}:=\operatorname{arg min}_{\theta} \mathcal{L}(\theta). \] In this case we have recovered the classic training of non-Bayes nets with some ad hoc regularisation which we claim was secretly a prior. But we have no notion of predictive uncertainty if we stop there.
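As a sketch of what the predictive integral means in practice, the following replaces it with a Monte Carlo average over posterior samples. To keep the posterior exactly tractable, it swaps the neural net for a Bayesian linear regression with known noise variance; that substitution, and all names and constants, are assumptions of this example. For a real net, the samples would have to come from one of the approximations discussed below.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data from a linear model; features phi(x) = [1, x] stand in for a net.
x = rng.uniform(-1.0, 1.0, size=30)
Phi = np.stack([np.ones_like(x), x], axis=1)
noise_var, prior_prec = 0.1 ** 2, 1.0
y = 0.5 - 1.5 * x + np.sqrt(noise_var) * rng.normal(size=x.shape)

# Exact Gaussian posterior N(mean, cov) for this conjugate model.
cov = np.linalg.inv(prior_prec * np.eye(2) + Phi.T @ Phi / noise_var)
mean = cov @ Phi.T @ y / noise_var      # also the MAP estimate in this model

# Monte Carlo estimate of the predictive integral at a test input.
x_star = 0.3
phi_star = np.array([1.0, x_star])
thetas = rng.multivariate_normal(mean, cov, size=20000)  # theta_s ~ p(theta | D)
f_samples = thetas @ phi_star                            # f(x*; theta_s)
mc_mean = f_samples.mean()
mc_var = f_samples.var() + noise_var    # law of total variance

# Closed-form predictive moments for comparison.
exact_mean = phi_star @ mean
exact_var = phi_star @ cov @ phi_star + noise_var
```

The MAP estimate alone would return `exact_mean` and nothing else; the spread of `f_samples` is the part of the predictive uncertainty that a point estimate throws away.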
Usually the model will possess many optima, and this will lead to the suspicion that we have not found a good global one. How do we maximise model evidence here in any case?
Somewhere between the full belt-and-braces Bayes approach and the MAP point estimate there are various approximations to Bayes inference we might try. What follows is a non-exhaustive smörgåsbord of options to do probabilistic inference in neural nets with different trade-offs.
🏗 To discuss: so many options for predictive uncertainty, but fewer for inverse uncertainty.
1 Natural Posterior Network
borchero/natural-posterior-network (Charpentier et al. 2022): some kind of reparameterization uncertainty?
2 MC sampling of weights by low-rank Matheron updates
This uses GP Matheron updates. Needs a shorter name but looks cool (Ritter et al. 2021). The idea is that we keep weights random, but then create a sparse representation of the weights.
microsoft/bayesianize: Bayesianize: A Bayesian neural network wrapper in pytorch.
- Mean-field variational inference (MFVI): variational inference with fully factorised Gaussian (FFG) approximation.
- Variational inference with full-covariance Gaussian approximation (for each layer).
- Variational inference with inducing weights: each layer is augmented with a small matrix of inducing weights, then MFVI is performed in the inducing weight space.
- Ensemble in inducing weight space: same augmentation as above, but with ensembles in the inducing weight space.
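For orientation, the first item in that list (mean-field variational inference with a fully factorised Gaussian) can be sketched in a few lines. This is a generic NumPy illustration and not the bayesianize API; the softplus parameterisation of the standard deviation, the function names and the zero-mean Gaussian prior are assumptions of this sketch.

```python
import numpy as np

def softplus(rho):
    """Map an unconstrained parameter to a positive standard deviation."""
    return np.log1p(np.exp(rho))

def sample_weights(mu, rho, rng):
    """Reparameterised draw w = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + softplus(rho) * eps

def kl_to_prior(mu, rho, prior_std=1.0):
    """KL( N(mu, diag(sigma^2)) || N(0, prior_std^2 I) ), summed over weights."""
    sigma = softplus(rho)
    return np.sum(
        np.log(prior_std / sigma)
        + (sigma ** 2 + mu ** 2) / (2.0 * prior_std ** 2)
        - 0.5
    )

rng = np.random.default_rng(0)
mu = np.zeros((4, 3))             # variational means for a 4x3 weight matrix
rho = np.full((4, 3), -1.0)       # unconstrained scale parameters
w = sample_weights(mu, rho, rng)  # one stochastic weight draw for a forward pass
kl = kl_to_prior(mu, rho)         # penalty term in the negative ELBO
```

Training would minimise the expected negative log likelihood under such draws plus `kl`, with gradients taken with respect to `mu` and `rho`; an automatic differentiation framework would handle that part.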
3 Bayes by backprop
See Bayes by backprop.
4 Variational autoencoders
5 Sampling via Monte Carlo
TBD. For now, if the number of parameters is smallish, see Hamiltonian Monte Carlo.
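As a placeholder, here is a minimal sketch of basic Hamiltonian Monte Carlo with an identity mass matrix and a fixed step size. The target is a 2-D Gaussian standing in for a small-model posterior; the function name `hmc_sample` and all tuning constants are assumptions of this example, and practical samplers add step-size adaptation and diagnostics.

```python
import numpy as np

def hmc_sample(log_prob, grad_log_prob, theta0, n_samples=2000,
               step_size=0.2, n_leapfrog=10, seed=0):
    """Basic HMC with identity mass matrix and a fixed step size."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    samples = np.empty((n_samples, theta.size))
    for i in range(n_samples):
        p0 = rng.normal(size=theta.shape)        # resample momentum
        q, p = theta.copy(), p0.copy()
        # Leapfrog integration of the Hamiltonian dynamics.
        p += 0.5 * step_size * grad_log_prob(q)
        for step in range(n_leapfrog):
            q += step_size * p
            if step < n_leapfrog - 1:
                p += step_size * grad_log_prob(q)
        p += 0.5 * step_size * grad_log_prob(q)
        # Metropolis correction using the change in total energy.
        h_old = -log_prob(theta) + 0.5 * p0 @ p0
        h_new = -log_prob(q) + 0.5 * p @ p
        if np.log(rng.uniform()) < h_old - h_new:
            theta = q
        samples[i] = theta
    return samples

# Toy target: a correlated 2-D Gaussian (unnormalised log density).
target_mean = np.array([1.0, -1.0])
target_prec = np.array([[2.0, 0.5], [0.5, 1.0]])

def log_prob(t):
    d = t - target_mean
    return -0.5 * d @ target_prec @ d

def grad_log_prob(t):
    return -target_prec @ (t - target_mean)

samples = hmc_sample(log_prob, grad_log_prob, theta0=np.zeros(2))
```

For a net, `log_prob` would be the negative of the loss \(\mathcal{L}(\theta)\) and `grad_log_prob` its gradient from backpropagation, which is why cost grows quickly with the number of parameters and the size of the data set.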
6 Laplace approximation
See Laplace approximations. AlexImmer/Laplace: Laplace approximations for Deep Learning.
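The basic idea can be sketched on a model small enough to do by hand: find the MAP estimate, then approximate the posterior by a Gaussian whose covariance is the inverse Hessian of the negative log posterior there. The example below uses Bayesian logistic regression rather than a deep net, and does not use the AlexImmer/Laplace library; the function name `laplace_logistic` and the data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_logistic(X, y, prior_prec=1.0, n_newton=25):
    """Return (theta_map, cov) for Bayesian logistic regression.

    The posterior is approximated by N(theta_map, H^{-1}), where H is the
    Hessian of the negative log posterior at the MAP estimate.
    """
    d = X.shape[1]
    theta = np.zeros(d)
    for _ in range(n_newton):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) + prior_prec * theta
        hess = X.T @ (X * (p * (1 - p))[:, None]) + prior_prec * np.eye(d)
        theta = theta - np.linalg.solve(hess, grad)   # Newton step
    # Recompute the Hessian at the final iterate.
    p = sigmoid(X @ theta)
    hess = X.T @ (X * (p * (1 - p))[:, None]) + prior_prec * np.eye(d)
    return theta, np.linalg.inv(hess)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_theta = np.array([1.5, -1.0])
y = (rng.uniform(size=100) < sigmoid(X @ true_theta)).astype(float)

theta_map, cov = laplace_logistic(X, y)
```

For deep nets the full Hessian is far too large to form or invert, so practical methods use structured approximations such as diagonal, Kronecker-factored or last-layer-only curvature.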
7 Via random projections
I do not have a single paper about this, but I have seen random projection pop up as a piece of the puzzle in other methods. TBC.
8 In Gaussian process regression
See kernel learning.
9 Via measure transport
See reparameterization.
10 Via infinite-width random nets
See wide NN.
11 Via NTK
How does this work? He, Lakshminarayanan, and Teh (2020).
12 Ensemble methods
Deep learning has its own variants of model averaging and bagging: Neural ensembles. Yarin Gal’s PhD Thesis (Gal 2016) summarizes some implicit approximate approaches (e.g. the Bayesian interpretation of dropout), although dropout as he frames it has since been contested as a means of inference.
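A minimal sketch of the ensemble idea: fit several members that differ in their random initialisation and bootstrap resample, then read the predictive mean and the disagreement between members off the stacked predictions. To stay short, each member here is a one-hidden-layer net with fixed random hidden weights and a ridge-fitted readout, which is a stand-in for a fully trained net; `fit_member` and all constants are hypothetical.

```python
import numpy as np

def fit_member(x, y, seed, n_features=50, ridge=1e-2):
    """One ensemble member: random tanh features with a ridge-fitted readout."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=2.0, size=n_features)      # fixed random hidden weights
    b = rng.uniform(-2.0, 2.0, size=n_features)     # fixed random hidden biases

    def feats(x_):
        return np.tanh(np.outer(x_, w) + b)

    idx = rng.integers(0, len(x), size=len(x))      # bootstrap resample (bagging)
    F = feats(x[idx])
    coef = np.linalg.solve(F.T @ F + ridge * np.eye(n_features), F.T @ y[idx])
    return lambda x_: feats(x_) @ coef

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=40)
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=x.shape)

members = [fit_member(x, y, seed) for seed in range(10)]
x_test = np.array([0.0, 3.0])                       # inside and outside the data range
preds = np.stack([m(x_test) for m in members])      # shape (n_members, n_test)
ens_mean = preds.mean(axis=0)                       # averaged prediction
ens_std = preds.std(axis=0)                         # disagreement between members
```

The spread `ens_std` is the ensemble's proxy for predictive uncertainty; whether it behaves like a posterior standard deviation is exactly the kind of question the references above argue about.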
13 Neural GLM
I think this has a sparse Bayes flavour. D. Tran et al. (2019); seems to randomise over input params?
14 Practicalities
The computational toolsets for “neural” probabilistic programming and vanilla probabilistic programming are converging. See the tool listing under probabilistic programming.
15 Stochastic Gradient Descent as MC inference
See MCMC by SGD.
16 Khan and Rue’s Bayes Learning Rule
Bayes via natural gradient (Khan and Rue 2023; Zellner 1988).
We show that many machine-learning algorithms are specific instances of a single algorithm called the Bayesian learning rule. The rule, derived from Bayesian principles, yields a wide-range of algorithms from fields such as optimization, deep learning, and graphical models. This includes classical algorithms such as ridge regression, Newton’s method, and Kalman filter, as well as modern deep-learning algorithms such as stochastic-gradient descent, RMSprop, and Dropout. The key idea in deriving such algorithms is to approximate the posterior using candidate distributions estimated by using natural gradients. Different candidate distributions result in different algorithms and further approximations to natural gradients give rise to variants of those algorithms. Our work not only unifies, generalizes, and improves existing algorithms, but also helps us design new ones.
17 Incoming
Dustin Tran’s uncertainty layers, Bayesian Layers: A Module for Neural Network Uncertainty (arXiv:1812.03973):
In our work, we extend layers to capture “distributions over functions”, which we describe as a layer with uncertainty about some state in its computation — be it uncertainty in the weights, pre-activation units, activations, or the entire function. Each sample from the distribution instantiates a different function, e.g., a layer with a different weight configuration.…
While the framework we laid out so far tightly integrates deep Bayesian modelling into existing ecosystems, we have deliberately limited our scope. In particular, our layers tie the model specification to the inference algorithm (typically, variational inference). Bayesian Layers’ core assumption is the modularization of inference per layer. This makes inference procedures which depend on the full parameter space, such as Markov chain Monte Carlo, difficult to fit within the framework.
- Bayesian Neural Networks by Duvenaud’s team