Probabilistic neural nets
Bayesian and other probabilistic inference in overparameterized ML
January 11, 2017 — April 27, 2023
Inferring densities and distributions, in a Bayesian manner, in the massively parameterized setting of deep learning. Probabilistic networks are more general than Bayesian ones.
Jospin et al. (2022) is a modern high-speed introduction and summary of many approaches.
Radford Neal’s thesis (Neal 1996) is a foundational Bayesian use of neural networks in the wide NN and MCMC sampling settings. Diederik P. Kingma’s thesis is a blockbuster in the more recent variational tradition.
The poster for Alex Graves’ paper (Graves 2011) presents a simple weight-uncertainty method for recurrent nets (a diagonal Gaussian over the weights) that I found elucidating. (There is a quick-and-dirty third-party implementation.)
One could also refer to the 2019 NeurIPS Bayesian deep learning workshop site for some more modern positioning.
Generative methods are useful, e.g. the variational autoencoder and its associated reparameterization trick. Likelihood-free methods seem to be in the air too.
We are free to consider classic neural network inference as a special case of Bayes inference. Specifically, we interpret the loss function \(\mathcal{L}\) of a net \(f:\mathbb{R}^n\times\mathbb{R}^d\to\mathbb{R}^k\) as a negative log likelihood plus a negative log prior, i.e. a negative log posterior up to an additive constant:
\[ \begin{aligned} \mathcal{L}(\theta) &:=-\sum_{i=1}^{m} \log p\left(y_{i} \mid f\left(x_{i} ; \theta\right)\right)-\log p(\theta) \\ &=-\log p(\theta \mid \mathcal{D})+\text{const}. \end{aligned} \]
Obviously, a few things are different from the point-estimate case. The parameter vector \(\theta\) is not interpretable, so what do posterior distributions over it even mean? What are sensible priors? Choosing priors over by-design-uninterpretable parameters such as NN weights is a fraught issue that we will mostly ignore for now. The default prior is usually something like
\[ p(\theta)=\mathcal{N}\left(0, \lambda^{-1} I\right) \]
for want of a better idea. This ends up being equivalent to “weight decay” regularization, in the sense that Bayesian priors and regularization penalties often correspond.
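To see the equivalence, note that the negative log density of this prior is just a quadratic penalty on the weights,
\[ -\log p(\theta)=\frac{\lambda}{2}\|\theta\|^{2}-\frac{d}{2}\log\frac{\lambda}{2\pi}, \]
so adding it to the negative log likelihood changes the training objective only by an \(L_2\) (“weight decay”) penalty with coefficient \(\lambda/2\), plus a constant that does not affect the optimum.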
With that basis, we could do the usual stuff for Bayes inference, like considering the predictive posterior:
\[ p(y \mid x, \mathcal{D})=\int p(y \mid f(x ; \theta)) p(\theta \mid \mathcal{D}) d \theta \]
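If we can obtain (approximate) posterior samples of \(\theta\) by any of the methods below, this integral is usually approximated by Monte Carlo. A minimal sketch, with a hypothetical regression net `f(x, theta)`, a list of weight samples `thetas`, and an assumed Gaussian observation noise:

```python
import torch

def mc_predictive(f, thetas, x, noise_var=0.01):
    """Monte Carlo approximation of p(y | x, D) from posterior weight samples.

    f(x, theta) returns the predictive mean for a flat weight vector theta;
    thetas is a list of (approximate) posterior samples of the weights.
    Returns a moment-matched predictive mean and variance.
    """
    preds = torch.stack([f(x, theta) for theta in thetas])  # (S, ...) samples
    return preds.mean(dim=0), preds.var(dim=0) + noise_var
```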
Usually, this integral turns out to be intractable in the very high-dimensional parameter spaces of NNs, so we settle for something simpler. We could summarize our posterior update by the simple maximum a posteriori (MAP) estimate:
\[ \theta_{\mathrm{MAP}}:=\operatorname{arg min}_{\theta} \mathcal{L}(\theta). \]
In this case, we have recovered the classic training of non-Bayes nets with some ad hoc regularization which we claim was secretly a prior. But we have no notion of predictive uncertainty if we stop there.
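In code this is just ordinary training. A minimal PyTorch sketch on toy data (everything here is illustrative): with plain SGD, the `weight_decay` argument supplies exactly the gradient of the Gaussian prior term; decoupled variants such as AdamW differ in detail.

```python
import torch

# MAP training on toy regression data; model, data and hyperparameters illustrative
x = torch.randn(200, 4)
y = x @ torch.randn(4, 1) + 0.1 * torch.randn(200, 1)
model = torch.nn.Sequential(
    torch.nn.Linear(4, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))

prior_precision = 1e-2  # lambda in the Gaussian prior above
opt = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=prior_precision)

for step in range(500):
    opt.zero_grad()
    # negative log likelihood under a unit-variance Gaussian observation model
    nll = 0.5 * torch.nn.functional.mse_loss(model(x), y, reduction="sum")
    nll.backward()
    opt.step()  # weight_decay adds lambda * theta, the gradient of -log p(theta)

theta_map = [p.detach().clone() for p in model.parameters()]
```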
Usually, the model will possess many optima, leading to the suspicion that we have not found a good global one. And how would we maximize model evidence here, in any case?
Somewhere between the full belt-and-braces Bayes approach and the MAP point estimate are various approximations to Bayes inference we might try. What follows is a non-exhaustive smörgåsbord of options to do probabilistic inference in neural nets with different trade-offs.
🏗 To discuss: so many options for predictive uncertainty, but fewer for inverse uncertainty.
1 Natural Posterior Network
borchero/natural-posterior-network (Charpentier et al. 2022): some kind of reparameterization uncertainty?
2 MC sampling of weights by low-rank Matheron updates
This uses GP Matheron updates. It needs a shorter name but looks cool (Ritter et al. 2021). The idea is to keep the weights random but to give their randomness a sparse, low-rank representation via inducing weights.
microsoft/bayesianize: a Bayesian neural network wrapper in PyTorch. It supports:
- Mean-field variational inference (MFVI): variational inference with fully factorized Gaussian (FFG) approximation.
- Variational inference with full-covariance Gaussian approximation (for each layer).
- Variational inference with inducing weights: each layer is augmented with a small matrix of inducing weights, then MFVI is performed in the inducing weight space.
- Ensemble in inducing weight space: same augmentation as above, but with ensembles in the inducing weight space.
3 Bayes by backprop
See Bayes by backprop.
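A minimal sketch of the fully factorized Gaussian (mean-field) idea trained by the reparameterization trick, which is the flavor shared by the MFVI option above and Bayes by backprop. Everything here (the layer class, initialization, the 100-minibatch scaling) is illustrative rather than any particular library’s API.

```python
import math
import torch
from torch import nn

class GaussianLinear(nn.Module):
    """Linear layer with a fully factorized Gaussian posterior over its weights."""

    def __init__(self, d_in, d_out, prior_prec=1.0):
        super().__init__()
        self.w_mu = nn.Parameter(0.1 * torch.randn(d_in, d_out))
        self.w_rho = nn.Parameter(torch.full((d_in, d_out), -5.0))  # softplus -> std
        self.b = nn.Parameter(torch.zeros(d_out))
        self.prior_prec = prior_prec

    def forward(self, x):
        std = nn.functional.softplus(self.w_rho)
        w = self.w_mu + std * torch.randn_like(std)  # reparameterized weight sample
        return x @ w + self.b

    def kl(self):
        # KL( N(mu, std^2) || N(0, 1/prior_prec) ), summed over all weights
        var = nn.functional.softplus(self.w_rho) ** 2
        return 0.5 * (self.prior_prec * (var + self.w_mu ** 2)
                      - 1.0 - var.log() - math.log(self.prior_prec)).sum()

# training minimizes the negative ELBO: one-sample minibatch NLL plus a KL penalty
layer = GaussianLinear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
nll = 0.5 * nn.functional.mse_loss(layer(x), y, reduction="sum")
loss = nll + layer.kl() / 100  # KL amortized over (say) 100 minibatches per epoch
loss.backward()
```

At test time, averaging a few stochastic forward passes gives a cheap predictive distribution.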
4 Variational autoencoders
5 Sampling via Monte Carlo
TBD. For now, if the number of parameters is smallish, see Hamiltonian Monte Carlo.
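To make this concrete, here is a self-contained sketch of HMC over the weights of a tiny one-hidden-layer regression net, with the gradient of the log posterior taken by autodiff. The data, step size and trajectory length are all illustrative; real use needs full-batch gradients (as here), careful tuning and convergence diagnostics.

```python
import torch

torch.manual_seed(0)
# toy 1-D regression data and a tiny net whose weights live in one flat vector
x = torch.linspace(-2, 2, 40).unsqueeze(1)
y = torch.sin(3 * x) + 0.1 * torch.randn_like(x)
H = 8
D = 3 * H + 1                      # w1 (1xH), b1 (H), w2 (Hx1), b2 (scalar)
noise_prec, prior_prec = 100.0, 1.0

def f(x, theta):
    w1, b1 = theta[:H].view(1, H), theta[H:2 * H]
    w2, b2 = theta[2 * H:3 * H].view(H, 1), theta[3 * H]
    return torch.tanh(x @ w1 + b1) @ w2 + b2

def log_post(theta):               # unnormalized log posterior
    resid = y - f(x, theta)
    return (-0.5 * noise_prec * (resid ** 2).sum()
            - 0.5 * prior_prec * (theta ** 2).sum())

def grad_log_post(theta):
    theta = theta.detach().requires_grad_(True)
    g, = torch.autograd.grad(log_post(theta), theta)
    return g

def hmc_step(theta, eps=5e-3, n_leapfrog=20):
    p0 = torch.randn_like(theta)   # resample momentum
    th, p = theta.clone(), p0 + 0.5 * eps * grad_log_post(theta)
    for _ in range(n_leapfrog - 1):  # leapfrog integration
        th = th + eps * p
        p = p + eps * grad_log_post(th)
    th = th + eps * p
    p = p + 0.5 * eps * grad_log_post(th)
    # Metropolis correction on the Hamiltonian
    h_old = -log_post(theta) + 0.5 * (p0 ** 2).sum()
    h_new = -log_post(th) + 0.5 * (p ** 2).sum()
    return (th, True) if torch.rand(()) < torch.exp(h_old - h_new) else (theta, False)

theta, samples = torch.zeros(D), []
for i in range(2000):
    theta, _ = hmc_step(theta)
    if i >= 1000 and i % 10 == 0:  # keep thinned post-burn-in samples
        samples.append(theta.clone())
```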
6 Laplace approximation
See Laplace approximations. AlexImmer/Laplace: Laplace approximations for Deep Learning.
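A rough sketch of the simplest, diagonal flavor: fit the MAP as usual, approximate the Hessian of the negative log posterior by per-example squared gradients (an empirical-Fisher-style diagonal) plus the prior precision, then sample weights from the resulting Gaussian for prediction. The `model`, data and hyperparameters below are placeholders; the AlexImmer/Laplace package does this, and better-structured variants, properly.

```python
import torch

def diag_laplace_predict(model, x_train, y_train, x_test,
                         prior_prec=1.0, noise_var=0.01, n_samples=50):
    """Diagonal Laplace sketch around a MAP-trained regression `model`."""
    params = [p for p in model.parameters() if p.requires_grad]
    fisher = [torch.zeros_like(p) for p in params]
    for xi, yi in zip(x_train, y_train):          # accumulate squared gradients
        model.zero_grad()
        nll = 0.5 * (model(xi.unsqueeze(0)) - yi).pow(2).sum() / noise_var
        nll.backward()
        for f_, p in zip(fisher, params):
            f_ += p.grad.detach() ** 2
    post_var = [1.0 / (prior_prec + f_) for f_ in fisher]  # diagonal covariance

    map_state = [p.detach().clone() for p in params]
    preds = []
    with torch.no_grad():
        for _ in range(n_samples):                # sample weights around the MAP
            for p, m, v in zip(params, map_state, post_var):
                p.copy_(m + v.sqrt() * torch.randn_like(m))
            preds.append(model(x_test))
        for p, m in zip(params, map_state):
            p.copy_(m)                            # restore the MAP weights
    preds = torch.stack(preds)
    return preds.mean(0), preds.var(0) + noise_var
```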
7 Via random projections
I do not have a single paper about this, but I have seen random projection pop up as a piece of the puzzle in other methods. TBC.
8 In Gaussian process regression
See kernel learning.
9 Via measure transport
See reparameterization.
10 Via infinite-width random nets
See wide NN.
11 Via NTK
How does this work? He, Lakshminarayanan, and Teh (2020).
12 Ensemble methods
Deep learning has its own variants of model averaging and bagging: Neural ensembles. Yarin Gal’s PhD Thesis (Gal 2016) summarizes some implicit approximate approaches (e.g. the Bayesian interpretation of dropout), although that framing of dropout as a means of inference has since been contested.
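The simplest explicit version, the deep ensemble, just trains a few independently initialized copies of the same net and treats their predictions as samples from an implicit posterior predictive. A self-contained sketch on toy data (all settings illustrative):

```python
import torch

# toy 1-D regression data
x = torch.linspace(-2, 2, 100).unsqueeze(1)
y = torch.sin(3 * x) + 0.1 * torch.randn_like(x)
x_test = torch.linspace(-3, 3, 50).unsqueeze(1)

def make_model():
    return torch.nn.Sequential(
        torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))

preds = []
for _ in range(5):                       # 5 independently initialized members
    model = make_model()
    opt = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-3)
    for step in range(500):              # ordinary (MAP-ish) training
        opt.zero_grad()
        torch.nn.functional.mse_loss(model(x), y).backward()
        opt.step()
    with torch.no_grad():
        preds.append(model(x_test))

preds = torch.stack(preds)
# members agree where data constrain them; their spread is the epistemic term
pred_mean, pred_var = preds.mean(0), preds.var(0) + 0.01  # + observation noise
```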
13 Neural GLM
I think this has a sparse Bayes flavor. M.-N. Tran et al. (2019) seems to randomize over input params?
14 Practicalities
The computational toolsets for “neural” probabilistic programming and vanilla probabilistic programming are converging. See the tool listing under probabilistic programming.
15 Stochastic Gradient Descent as MC inference
See MCMC by SGD.
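The canonical instance is stochastic gradient Langevin dynamics: take SGD steps on the minibatch-rescaled negative log joint and inject Gaussian noise with variance equal to the step size, so that (with a decaying step size) the iterates approximately sample the posterior. A self-contained sketch with illustrative data and hyperparameters:

```python
import torch

# toy regression data and a small net; all hyperparameters would need tuning
x = torch.randn(500, 4)
y = x @ torch.randn(4, 1) + 0.1 * torch.randn(500, 1)
model = torch.nn.Sequential(
    torch.nn.Linear(4, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))
n_data, batch, lr, prior_prec = x.shape[0], 50, 1e-5, 1.0

samples = []
for step in range(5000):
    idx = torch.randint(0, n_data, (batch,))
    model.zero_grad()
    nll = 0.5 * torch.nn.functional.mse_loss(model(x[idx]), y[idx], reduction="sum")
    nll = nll * (n_data / batch)  # rescale minibatch NLL to the full dataset
    log_prior = -0.5 * prior_prec * sum((p ** 2).sum() for p in model.parameters())
    (nll - log_prior).backward()
    with torch.no_grad():
        for p in model.parameters():
            # theta += (eps/2) * grad log p(theta | D)  +  N(0, eps)
            p.add_(-0.5 * lr * p.grad + lr ** 0.5 * torch.randn_like(p))
    if step > 2500 and step % 50 == 0:   # keep thinned post-burn-in samples
        samples.append([p.detach().clone() for p in model.parameters()])
```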
16 Khan and Rue’s Bayes Learning Rule
Bayes via natural gradient (Khan and Rue 2024; Zellner 1988).
We show that many machine-learning algorithms are specific instances of a single algorithm called the Bayesian learning rule. The rule, derived from Bayesian principles, yields a wide-range of algorithms from fields such as optimization, deep learning, and graphical models. This includes classical algorithms such as ridge regression, Newton’s method, and Kalman filter, as well as modern deep-learning algorithms such as stochastic-gradient descent, RMSprop, and Dropout. The key idea in deriving such algorithms is to approximate the posterior using candidate distributions estimated by using natural gradients. Different candidate distributions result in different algorithms and further approximations to natural gradients give rise to variants of those algorithms. Our work not only unifies, generalizes, and improves existing algorithms, but also helps us design new ones.
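Roughly, as I read Khan and Rue, the rule is a natural-gradient update of the natural parameter \(\lambda\) of an exponential-family candidate \(q_\lambda\):
\[ \lambda \leftarrow \lambda-\rho\, \tilde{\nabla}_{\lambda}\left(\mathbb{E}_{q_{\lambda}}\!\left[\bar{\ell}(\theta)\right]-\mathcal{H}\left(q_{\lambda}\right)\right), \]
where \(\bar{\ell}\) is the loss (a negative log joint), \(\mathcal{H}\) the entropy, and \(\tilde{\nabla}_{\lambda}\) the natural gradient; different candidates \(q\) and different approximations to the natural gradient recover the algorithms listed in the abstract.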
17 Incoming
Seminars! Laplace’s Demon: A Seminar Series about Bayesian Machine Learning at Scale
Dustin Tran’s uncertainty layers (D. Tran et al. 2018):
In our work, we extend layers to capture “distributions over functions”, which we describe as a layer with uncertainty about some state in its computation — be it uncertainty in the weights, pre-activation units, activations, or the entire function. Each sample from the distribution instantiates a different function, e.g., a layer with a different weight configuration.…
While the framework we laid out so far tightly integrates deep Bayesian modelling into existing ecosystems, we have deliberately limited our scope. In particular, our layers tie the model specification to the inference algorithm (typically, variational inference). Bayesian Layers’ core assumption is the modularization of inference per layer. This makes inference procedures which depend on the full parameter space, such as Markov chain Monte Carlo, difficult to fit within the framework.
- Bayesian Neural Networks by Duvenaud’s team