Ensembling is one of the practical forms of Bayesian inference for massively parameterised networks.
Train a collection of networks and calculate empirical means and variances across them to estimate the posterior predictive (He, Lakshminarayanan, and Teh 2020; Huang et al. 2016; Lakshminarayanan, Pritzel, and Blundell 2017; Wen, Tran, and Ba 2020; Xie, Xu, and Chuang 2013). This is neat, and we might think there is nothing special to do here, since as near as I can tell it is more or less classical model ensembling. In practice, though, lots of tricks are needed to make this work in a neural network context, in particular because a single model is already supposed to be big enough to strain the GPU, so training many such models is presumably ridiculous. BatchEnsemble is one famous such trick (Wen, Tran, and Ba 2020).
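To make "empirical means and variances" concrete, here is a minimal sketch. It is not any of the cited methods verbatim. As a cheap stand-in for a trained network, each ensemble member is a random-feature regressor with a ridge-fitted output layer; that choice, and the names `fit_member`, `pred_mean` and `pred_var`, are my assumptions for illustration. Each member is treated as a Gaussian predictive, and the ensemble is treated as an equally weighted mixture of them, so the predictive variance is the average noise variance plus the between-member spread of the means.

```python
import numpy as np


def fit_member(x, y, seed, n_features=50, ridge=1e-3):
    """Fit one stand-in 'network': random tanh features, ridge-fitted output layer."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=n_features)  # random input weights, differ per member
    b = rng.normal(size=n_features)

    def features(x_new):
        return np.tanh(np.outer(x_new, w) + b)  # shape (n_points, n_features)

    phi = features(x)
    beta = np.linalg.solve(phi.T @ phi + ridge * np.eye(n_features), phi.T @ y)
    noise_var = np.mean((y - phi @ beta) ** 2)  # crude residual noise estimate

    def predict(x_new):
        return features(x_new) @ beta

    return predict, noise_var


# Toy 1-d regression problem (assumed for illustration).
rng = np.random.default_rng(0)
x_train = rng.uniform(-2.0, 2.0, size=100)
y_train = np.sin(2.0 * x_train) + 0.1 * rng.normal(size=100)
x_test = np.linspace(-4.0, 4.0, 50)

# Members differ only in their random seed.
members = [fit_member(x_train, y_train, seed) for seed in range(10)]
mus = np.stack([predict(x_test) for predict, _ in members])  # (members, points)
noise_vars = np.array([nv for _, nv in members])

# Equal-weight Gaussian mixture: mean of means, and
# variance = average member noise variance + variance of member means.
pred_mean = mus.mean(axis=0)
pred_var = noise_vars.mean() + mus.var(axis=0)
```

Note that this keeps every member in memory separately, which is exactly the cost that tricks like BatchEnsemble are meant to avoid.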
Dropout is by contrast an implicit ensembling method. Or maybe the implicit ensembling method; I am not aware of others. Recommended reading: Foong et al. (2019); Gal, Hron, and Kendall (2017); Gal and Ghahramani (2015).
Dropout is a popular kind of noise layer which randomly zeroes out some coefficients in the net during training (and optionally during prediction). A coarse resemblance to random forests etc. is pretty immediate, and indeed you can just use those instead. Here, however, we are trying to average over strong learners, not weak learners.
The key insight here is that dropout has a rationale in terms of model averaging and as a kind of implicit probabilistic learning, because in the limit it approaches a certain Gaussian process (Gal and Ghahramani 2016a, 2015). Claims about using this for intriguing types of inference are floating around, e.g. (M. Kasim et al. 2019; M. F. Kasim et al. 2020).
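A minimal sketch of the "keep dropout switched on while predicting" idea: run many stochastic forward passes and summarise them by their empirical mean and variance. The tiny MLP here has random, untrained weights standing in for a trained network, and the names `init_mlp`, `forward` and `mc_dropout_predict` are hypothetical. The variance reported is only the spread across dropout masks; it leaves out any observation-noise term.

```python
import numpy as np


def init_mlp(rng, n_in=1, n_hidden=64, n_out=1):
    """Random weights standing in for a trained network (assumption)."""
    return {
        "w1": rng.normal(size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "w2": rng.normal(scale=1.0 / np.sqrt(n_hidden), size=(n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }


def forward(params, x, drop_rate, rng):
    """One forward pass with a fresh dropout mask on the hidden layer."""
    h = np.tanh(x @ params["w1"] + params["b1"])
    if drop_rate > 0.0:
        keep = rng.random(h.shape) >= drop_rate
        h = h * keep / (1.0 - drop_rate)  # inverted dropout keeps the expected activation
    return h @ params["w2"] + params["b2"]


def mc_dropout_predict(params, x, drop_rate, n_samples, rng):
    """Empirical mean and variance over stochastic forward passes."""
    draws = np.stack(
        [forward(params, x, drop_rate, rng) for _ in range(n_samples)]
    )
    return draws.mean(axis=0), draws.var(axis=0)


rng = np.random.default_rng(0)
params = init_mlp(rng)
x = np.linspace(-3.0, 3.0, 20)[:, None]

# Dropout on at prediction time: each pass is a different thinned sub-network.
mc_mean, mc_var = mc_dropout_predict(params, x, drop_rate=0.1, n_samples=200, rng=rng)

# Dropout off: every pass is identical, so the spread collapses to zero.
det_mean, det_var = mc_dropout_predict(params, x, drop_rate=0.0, n_samples=5, rng=rng)
```

Each mask picks out one thinned sub-network, so the average over passes is an implicit ensemble over sub-networks that all share the same weights.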
Is this dangerous?
The widely recommended procedure of Bayesian model averaging is flawed in the M-open setting in which the true data-generating process is not one of the candidate models being fit. We take the idea of stacking from the point estimation literature and generalize to the combination of predictive distributions, extending the utility function to any proper scoring rule, using Pareto smoothed importance sampling to efficiently compute the required leave-one-out posterior distributions and regularization to get more stability. We compare stacking of predictive distributions to several alternatives: stacking of means, Bayesian model averaging (BMA), pseudo-BMA using AIC-type weighting, and a variant of pseudo-BMA that is stabilized using the Bayesian bootstrap. Based on simulations and real-data applications, we recommend stacking of predictive distributions, with BB-pseudo-BMA as an approximate alternative when computation cost is an issue.
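To see what stacking of predictive distributions does differently from likelihood-based weighting, here is a toy sketch under strong simplifying assumptions. The candidate "models" are fixed Gaussian densities with no fitted parameters, so their leave-one-out predictive densities are just the densities themselves and no Pareto smoothed importance sampling is needed. The stacking weights maximise the mean log score of the weighted mixture over the simplex, which I solve with a plain EM-style fixed-point update rather than a proper optimiser. All function names are hypothetical.

```python
import numpy as np


def normal_logpdf(y, mean, sd):
    return -0.5 * np.log(2.0 * np.pi * sd**2) - (y - mean) ** 2 / (2.0 * sd**2)


def stacking_weights(lpd, n_iter=500):
    """Maximise mean_i log sum_k w_k exp(lpd[i, k]) over the simplex.

    lpd[i, k] is the (leave-one-out) log predictive density of point i
    under model k. Uses the EM update for mixture weights with fixed components.
    """
    n_models = lpd.shape[1]
    # Rescaling each row by a constant does not change the maximiser.
    dens = np.exp(lpd - lpd.max(axis=1, keepdims=True))
    w = np.full(n_models, 1.0 / n_models)
    for _ in range(n_iter):
        resp = dens * w
        resp /= resp.sum(axis=1, keepdims=True)
        w = resp.mean(axis=0)
    return w


def pseudo_bma_weights(lpd):
    """Weights proportional to exp(summed log predictive density) per model."""
    elpd = lpd.sum(axis=0)
    w = np.exp(elpd - elpd.max())
    return w / w.sum()


def log_score(lpd, w):
    """Mean log density of the weighted mixture of the candidates."""
    return np.mean(np.log(np.exp(lpd) @ w))


# M-open toy: the truth is a two-component mixture, which no single candidate matches.
rng = np.random.default_rng(0)
n = 500
from_left = rng.random(n) < 0.3
y = np.where(from_left, rng.normal(-2.0, 1.0, n), rng.normal(2.0, 1.0, n))

candidates = [(-2.0, 1.0), (2.0, 1.0), (0.0, 3.0)]  # (mean, sd) of each model
lpd = np.column_stack([normal_logpdf(y, m, s) for m, s in candidates])

w_stack = stacking_weights(lpd)
w_pbma = pseudo_bma_weights(lpd)
score_stack = log_score(lpd, w_stack)
score_pbma = log_score(lpd, w_pbma)
```

In this setup the pseudo-BMA weights concentrate almost entirely on whichever single candidate has the best total log score, while the stacking weights are free to spread mass across candidates to improve the combined predictive.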
These methods focus generally on the posterior predictive. How do I find posteriors for other (interpretable) values in my model?