Probabilistic neural nets

Machinelearnese for “Bayesian”


One of the practical forms of Bayesian inference for massively parameterised networks.

Explicit ensembles

Train a collection of networks and use their empirical means and variances to approximate the posterior predictive distribution (He, Lakshminarayanan, and Teh 2020; Huang et al. 2016; Lakshminarayanan, Pritzel, and Blundell 2017; Wen, Tran, and Ba 2020; Xie, Xu, and Chuang 2013). This is neat, and on one hand there seems to be nothing special to do here, since it is more or less classical model ensembling, as near as I can tell. In practice, though, it takes some effort to make this work in a neural network context, in particular because a single model is already supposed to be big enough to strain the GPU, so keeping many such models around sounds ridiculous. You need tricks; BatchEnsemble is one famous one (Wen, Tran, and Ba 2020).
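
For concreteness, here is a minimal deep-ensemble sketch in PyTorch: independently initialised networks trained on the same data, with the ensemble mean and variance standing in for the posterior predictive. The toy regression task, architecture, optimiser settings and number of members are arbitrary illustrative choices, not anything prescribed by the papers above.

# Minimal deep-ensemble sketch (PyTorch assumed; toy 1-D regression for illustration only).
import torch
import torch.nn as nn

def make_net():
    # Each ensemble member is an independently initialised small MLP.
    return nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))

torch.manual_seed(0)
x = torch.linspace(-3, 3, 200).unsqueeze(-1)
y = torch.sin(x) + 0.1 * torch.randn_like(x)

ensemble = [make_net() for _ in range(5)]
for net in ensemble:
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(500):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), y)
        loss.backward()
        opt.step()

# Empirical mean and variance across members stand in for the posterior predictive.
with torch.no_grad():
    preds = torch.stack([net(x) for net in ensemble])   # (members, N, 1)
    mean, var = preds.mean(dim=0), preds.var(dim=0)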

Implicit ensembles

Dropout is the approach I am most familiar with here, but there are surely others. It is a popular kind of noise layer which randomly zeroes out some coefficients in the net during training (and, optionally, while predicting). A coarse resemblance to random forests etc. is immediate, and indeed you can just use those instead; here, however, we are trying to average over strong learners rather than weak ones.

The key insight here is that dropout has a rationale in terms of model averaging and as a kind of implicit probabilistic learning, because in a certain limit it approximates a Gaussian process (Gal and Ghahramani 2015, 2016). Claims of intriguing types of inference built on this are about, e.g. M. Kasim et al. (2019) and M. F. Kasim et al. (2020).
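
A minimal sketch of MC-dropout prediction, again assuming PyTorch: the trick is simply to leave dropout switched on at prediction time and average over repeated stochastic forward passes. The architecture, dropout rate and number of samples below are illustrative choices of mine, not the settings of Gal and Ghahramani.

# Minimal MC-dropout sketch (PyTorch assumed; sizes and rates are illustrative only).
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(64, 1),
)
# ... train `net` as usual, with dropout active during training ...

def mc_predict(net, x, n_samples=100):
    # Keep dropout switched on at prediction time so each forward pass
    # samples a different thinned network.
    net.train()
    with torch.no_grad():
        samples = torch.stack([net(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.var(dim=0)

x_test = torch.linspace(-3, 3, 50).unsqueeze(-1)
pred_mean, pred_var = mc_predict(net, x_test)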

Is this dangerous?

Should we stop weighting and start “stacking”? From the abstract of Yao et al. (2018):

The widely recommended procedure of Bayesian model averaging is flawed in the M-open setting in which the true data-generating process is not one of the candidate models being fit. We take the idea of stacking from the point estimation literature and generalize to the combination of predictive distributions, extending the utility function to any proper scoring rule, using Pareto smoothed importance sampling to efficiently compute the required leave-one-out posterior distributions and regularization to get more stability. We compare stacking of predictive distributions to several alternatives: stacking of means, Bayesian model averaging (BMA), pseudo-BMA using AIC-type weighting, and a variant of pseudo-BMA that is stabilized using the Bayesian bootstrap. Based on simulations and real-data applications, we recommend stacking of predictive distributions, with BB-pseudo-BMA as an approximate alternative when computation cost is an issue.
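
To make the weighting step concrete, here is a sketch of stacking of predictive distributions in plain NumPy/SciPy, assuming we already have a matrix of leave-one-out predictive densities for each candidate model (which Yao et al. obtain via Pareto smoothed importance sampling rather than brute-force refitting). It just maximises the stacked leave-one-out log score over the simplex of model weights; the function names and setup are my own illustration, not their implementation.

# Sketch of stacking of predictive distributions, given a matrix `loo_dens` of
# leave-one-out predictive densities p(y_i | y_{-i}, M_k), shape (N, K).
# (These are assumed to be precomputed, e.g. via PSIS-LOO.)
import numpy as np
from scipy.optimize import minimize

def stacking_weights(loo_dens):
    n, k = loo_dens.shape

    def neg_log_score(w):
        # Stacked leave-one-out log score: sum_i log sum_k w_k p_ik.
        return -np.sum(np.log(loo_dens @ w + 1e-300))

    res = minimize(
        neg_log_score,
        x0=np.full(k, 1.0 / k),
        bounds=[(0.0, 1.0)] * k,
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
        method="SLSQP",
    )
    return res.x

# Example with fake densities for 3 candidate models on 100 held-out points.
rng = np.random.default_rng(0)
fake = rng.uniform(0.05, 1.0, size=(100, 3))
print(stacking_weights(fake))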

Questions

These methods generally focus on the posterior predictive. How do I find posteriors over other (interpretable) quantities in my model?

References

Chipman, Hugh A., Edward I. George, and Robert E. McCulloch. 2006. “Bayesian Ensemble Learning.” In Advances in Neural Information Processing Systems 19.
Gal, Yarin, and Zoubin Ghahramani. 2015. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” In Proceedings of the 33rd International Conference on Machine Learning (ICML-16). http://arxiv.org/abs/1506.02142.
———. 2016. “Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference.” In 4th International Conference on Learning Representations (ICLR) Workshop Track. http://arxiv.org/abs/1506.02158.
Gal, Yarin, Jiri Hron, and Alex Kendall. 2017. “Concrete Dropout.” May 22, 2017. http://arxiv.org/abs/1705.07832.
He, Bobby, Balaji Lakshminarayanan, and Yee Whye Teh. 2020. “Bayesian Deep Ensembles via the Neural Tangent Kernel.” In Advances in Neural Information Processing Systems. Vol. 33. https://proceedings.neurips.cc//paper_files/paper/2020/hash/0b1ec366924b26fc98fa7b71a9c249cf-Abstract.html.
Huang, Gao, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. 2016. “Deep Networks with Stochastic Depth.” In Computer Vision – ECCV 2016, edited by Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, 646–61. Lecture Notes in Computer Science. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-46493-0_39.
Kasim, M. F., D. Watson-Parris, L. Deaconu, S. Oliver, P. Hatfield, D. H. Froula, G. Gregori, et al. 2020. “Up to Two Billion Times Acceleration of Scientific Simulations with Deep Neural Architecture Search.” January 17, 2020. http://arxiv.org/abs/2001.08055.
Kasim, Muhammad, J. Topp-Mugglestone, P. Hatfield, D. H. Froula, G. Gregori, M. Jarvis, E. Viezzer, and Sam Vinko. 2019. “A Million Times Speed up in Parameters Retrieval with Deep Learning.”
Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. 2017. “Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles.” In Proceedings of the 31st International Conference on Neural Information Processing Systems, 6405–16. NIPS’17. Red Hook, NY, USA: Curran Associates Inc. http://arxiv.org/abs/1612.01474.
Mandt, Stephan, Matthew D. Hoffman, and David M. Blei. 2017. “Stochastic Gradient Descent as Approximate Bayesian Inference.” JMLR, April. http://arxiv.org/abs/1704.04289.
Molchanov, Dmitry, Arsenii Ashukha, and Dmitry Vetrov. 2017. “Variational Dropout Sparsifies Deep Neural Networks.” In Proceedings of ICML. http://arxiv.org/abs/1701.05369.
Pearce, Tim, Mohamed Zaki, and Andy Neely. 2018. “Bayesian Neural Network Ensembles.” In Third Workshop on Bayesian Deep Learning (NeurIPS 2018), Montréal, Canada.
Wen, Yeming, Dustin Tran, and Jimmy Ba. 2020. “BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning.” In ICLR. http://arxiv.org/abs/2002.06715.
Xie, Jingjing, Bing Xu, and Zhang Chuang. 2013. “Horizontal and Vertical Ensemble with Deep Representation for Classification.” June 12, 2013. http://arxiv.org/abs/1306.2759.
Yao, Yuling, Aki Vehtari, Daniel Simpson, and Andrew Gelman. 2018. “Using Stacking to Average Bayesian Predictive Distributions.” Bayesian Analysis 13 (3): 917–1007. https://doi.org/10.1214/17-BA1091.