Ensembling is one of the more practical forms of approximate Bayesian inference for massively parameterised networks.

## Explicit ensembles

Train a collection of networks and calculate empirical means and variances to estimate the posterior predictive mean and variance
(He, Lakshminarayanan, and Teh 2020; Huang et al. 2016; Lakshminarayanan, Pritzel, and Blundell 2017; Wen, Tran, and Ba 2020; Xie, Xu, and Chuang 2013).
This is neat, and on one hand there seems to be nothing special to do here, since as near as I can tell it is already more or less classical model ensembling.
In practice, though, various tricks are needed to make this work in a neural network context, in particular because a single model is often already big enough to strain the GPU; training and storing *many* such models naively is presumably ridiculous.
BatchEnsemble (Wen, Tran, and Ba 2020) is one famous such trick.
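The basic recipe can be sketched in a few lines of numpy, in the spirit of deep ensembles (Lakshminarayanan, Pritzel, and Blundell 2017): train a few small networks from independent random initialisations and read the spread of their predictions as uncertainty. The architecture, data, and learning rate here are all illustrative choices, not anyone's recommended settings.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X) + 0.1 * rng.standard_normal(X.shape)

def train_mlp(X, y, hidden=16, steps=2000, lr=0.05, seed=0):
    """Full-batch gradient descent on a one-hidden-layer tanh net."""
    r = np.random.default_rng(seed)
    W1 = r.standard_normal((1, hidden)) * 0.5
    b1 = np.zeros(hidden)
    W2 = r.standard_normal((hidden, 1)) * 0.5
    b2 = np.zeros(1)
    for _ in range(steps):
        H = np.tanh(X @ W1 + b1)           # forward pass
        pred = H @ W2 + b2
        err = pred - y                     # squared-error residual
        gW2 = H.T @ err / len(X)           # backprop through output layer
        gb2 = err.mean(0)
        dH = (err @ W2.T) * (1 - H**2)     # backprop through tanh
        gW1 = X.T @ dH / len(X)
        gb1 = dH.mean(0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return lambda Xq: np.tanh(Xq @ W1 + b1) @ W2 + b2

# K = 5 members, differing only in random initialisation
members = [train_mlp(X, y, seed=k) for k in range(5)]

Xq = np.linspace(-1, 1, 50)[:, None]
preds = np.stack([f(Xq) for f in members])  # (K, 50, 1)
mean = preds.mean(0)                        # ensemble mean
var = preds.var(0)                          # member disagreement as uncertainty proxy
```

The point of the exercise is that `var` is obtained for free from member disagreement; for a real network this is exactly where the memory-saving tricks above come in, since holding K full copies of the weights is the expensive part.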

## Implicit ensembles

Dropout is the approach I am most familiar with here, though there may be others.
It is a popular kind of noise layer which randomly zeroes out some coefficients in the net during training (and optionally during prediction).
The coarse resemblance to random forests etc. is pretty immediate, and indeed you can just use those instead.
Here, however, we are trying to average over *strong* learners, not weak learners.

The key insight here is that dropout has a rationale in terms of model averaging, and as a kind of implicit probabilistic learning, because in a certain limit it approximates a Gaussian process (Gal and Ghahramani 2015, 2016). Claims about using this for intriguing types of inference are about, e.g. M. Kasim et al. (2019) and M. F. Kasim et al. (2020).
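Mechanically, the Monte Carlo dropout recipe of Gal and Ghahramani amounts to keeping the dropout mask active at prediction time and averaging over stochastic forward passes. A minimal numpy sketch of just the prediction step, with arbitrary untrained weights purely to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(1)
# Arbitrary fixed weights for a two-layer net; in practice these
# would come from training with dropout.
W1 = rng.standard_normal((1, 32))
W2 = rng.standard_normal((32, 1))

def forward(x, p=0.5):
    """One stochastic forward pass with inverted dropout on the hidden layer."""
    h = np.tanh(x @ W1)
    mask = rng.random(h.shape) < (1 - p)   # keep each unit with prob 1 - p
    h = h * mask / (1 - p)                 # rescale so the expectation matches
    return h @ W2

x = np.array([[0.3]])
# T stochastic passes; their spread is the (approximate) predictive uncertainty
samples = np.stack([forward(x) for _ in range(100)])
mu, sigma2 = samples.mean(0), samples.var(0)
```

Note that this is the opposite of the usual deployment convention, where dropout is switched off at test time; here the test-time noise is the whole point.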

## Is this dangerous?

Should we stop weighting and start “stacking”? Yao et al. (2018) argue yes:

> The widely recommended procedure of Bayesian model averaging is flawed in the M-open setting in which the true data-generating process is not one of the candidate models being fit. We take the idea of stacking from the point estimation literature and generalize to the combination of predictive distributions, extending the utility function to any proper scoring rule, using Pareto smoothed importance sampling to efficiently compute the required leave-one-out posterior distributions and regularization to get more stability. We compare stacking of predictive distributions to several alternatives: stacking of means, Bayesian model averaging (BMA), pseudo-BMA using AIC-type weighting, and a variant of pseudo-BMA that is stabilized using the Bayesian bootstrap. Based on simulations and real-data applications, we recommend stacking of predictive distributions, with BB-pseudo-BMA as an approximate alternative when computation cost is an issue.
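The core optimisation in stacking of predictive distributions is easy to state: given each model's leave-one-out predictive density evaluated at the held-out points, choose simplex weights maximising the summed log score of the mixture. A toy numpy sketch with fabricated densities standing in for the PSIS-LOO values the paper actually uses:

```python
import numpy as np

rng = np.random.default_rng(0)
# Fake LOO predictive densities p_k(y_i | y_{-i}): n = 100 points, K = 3 models.
P = rng.uniform(0.05, 1.0, size=(100, 3))

def stack_weights(P, steps=500, lr=0.1):
    """Maximise sum_i log sum_k w_k * P[i, k] over the simplex,
    via gradient ascent on a softmax parameterisation."""
    z = np.zeros(P.shape[1])                # z = 0 gives uniform weights
    for _ in range(steps):
        w = np.exp(z) / np.exp(z).sum()
        mix = P @ w                         # mixture density at each point
        grad_w = (P / mix[:, None]).sum(0)  # d(log score)/dw
        grad_z = w * (grad_w - w @ grad_w)  # chain rule through softmax
        z += lr * grad_z / len(P)
    return np.exp(z) / np.exp(z).sum()

w = stack_weights(P)  # simplex weights; should beat the uniform mixture
```

Note the contrast with BMA: the weights here are chosen for out-of-sample predictive performance of the *mixture*, not for posterior probability of each model being true.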

## Questions

These methods generally focus on the posterior predictive. How do I find posteriors for other (interpretable) quantities in my model?

## References

Gal, Yarin, and Zoubin Ghahramani. 2016. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” In *Proceedings of the 33rd International Conference on Machine Learning (ICML-16)*. http://arxiv.org/abs/1506.02142.

Gal, Yarin, and Zoubin Ghahramani. 2015. “Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference.” In *4th International Conference on Learning Representations (ICLR) Workshop Track*. http://arxiv.org/abs/1506.02158.

He, Bobby, Balaji Lakshminarayanan, and Yee Whye Teh. 2020. “Bayesian Deep Ensembles via the Neural Tangent Kernel.” In *Advances in Neural Information Processing Systems*. Vol. 33. https://proceedings.neurips.cc//paper_files/paper/2020/hash/0b1ec366924b26fc98fa7b71a9c249cf-Abstract.html.

Huang, Gao, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. 2016. “Deep Networks with Stochastic Depth.” In *Computer Vision – ECCV 2016*, edited by Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, 646–61. Lecture Notes in Computer Science. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-46493-0_39.

Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. 2017. “Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles.” In *Proceedings of the 31st International Conference on Neural Information Processing Systems*, 6405–16. NIPS’17. Red Hook, NY, USA: Curran Associates Inc. http://arxiv.org/abs/1612.01474.

Mandt, Stephan, Matthew D. Hoffman, and David M. Blei. 2017. “Stochastic Gradient Descent as Approximate Bayesian Inference.” *JMLR*, April. http://arxiv.org/abs/1704.04289.

Molchanov, Dmitry, Arsenii Ashukha, and Dmitry Vetrov. 2017. “Variational Dropout Sparsifies Deep Neural Networks.” In *Proceedings of ICML*. http://arxiv.org/abs/1701.05369.

*Third Workshop on Bayesian Deep Learning (NeurIPS 2018), Montréal, Canada.*, 5.

Wen, Yeming, Dustin Tran, and Jimmy Ba. 2020. “BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning.” In *ICLR*. http://arxiv.org/abs/2002.06715.

Yao, Yuling, Aki Vehtari, Daniel Simpson, and Andrew Gelman. 2018. “Using Stacking to Average Bayesian Predictive Distributions (with Discussion).” *Bayesian Analysis* 13 (3): 917–1007. https://doi.org/10.1214/17-BA1091.