Probabilistic neural nets

Machinelearnese for “Bayesian”


One of the practical forms of Bayesian inference for massively parameterised networks.

Explicit ensembles

Train a collection of networks and use the empirical mean and variance of their predictions to estimate the posterior predictive (He, Lakshminarayanan, and Teh 2020; Huang et al. 2016; Lakshminarayanan, Pritzel, and Blundell 2017; Wen, Tran, and Ba 2020; Xie, Xu, and Chuang 2013). This is neat, and on one hand there is nothing special to do here, since it is already more or less classical model ensembling, as near as I can tell. In practice, though, you need tricks to make this work in a neural network context, in particular because a single model is often already big enough to strain the GPU, and keeping many such models around is presumably ridiculous. BatchEnsemble (Wen, Tran, and Ba 2020) is one famous such trick.
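
To make the basic recipe concrete, here is a minimal sketch in PyTorch, assuming a toy one-dimensional regression problem; the member architecture, training loop, and ensemble size are placeholder choices of mine, and none of the memory-saving tricks are shown.

```python
import torch
import torch.nn as nn

def make_member():
    # Toy regressor; any architecture will do for the sketch.
    return nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))

def train_member(model, x, y, epochs=200, lr=1e-2):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return model

# Toy data.
x = torch.linspace(-3, 3, 200).unsqueeze(-1)
y = torch.sin(x) + 0.1 * torch.randn_like(x)

# Each member gets its own random initialisation; that is the whole
# "explicit ensemble" in its naive form.
ensemble = [train_member(make_member(), x, y) for _ in range(5)]

# Empirical mean and variance of the members' predictions stand in for
# the posterior predictive mean and (epistemic) variance.
with torch.no_grad():
    preds = torch.stack([m(x) for m in ensemble])  # (members, n, 1)
    mean, var = preds.mean(0), preds.var(0)
```

Random initialisation alone already gives some diversity here; Lakshminarayanan, Pritzel, and Blundell (2017) additionally use proper scoring rules and adversarial training, which this sketch omits.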

Distilling

So apparently you can train a single model to emulate an ensemble of similar models? Great terminology here: Hinton, Vinyals, and Dean (2015) refer to distilling “dark knowledge”.
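
A rough sketch of the kind of objective Hinton, Vinyals, and Dean (2015) describe, assuming a classification setting; the temperature and mixing weight below are illustrative defaults I chose, not prescribed values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Soft-target loss in the spirit of Hinton, Vinyals, and Dean (2015)."""
    # Temperature-softened teacher and student distributions.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL term, rescaled by T^2 as in the paper so its gradients stay
    # comparable to the hard-label term.
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2
    # Ordinary cross-entropy on the hard labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

Here `teacher_logits` would be, say, the averaged logits of an ensemble, and the student is trained on this loss like any other.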

See Bubeck on this: Three mysteries in deep learning: Ensemble, knowledge distillation, and self-distillation.

Dropout

Dropout is by contrast an implicit ensembling method. Or maybe the implicit ensembling method; I am not aware of others. Recommended reading: Foong et al. (2019); Gal, Hron, and Kendall (2017); Gal and Ghahramani (2015).

It is a popular kind of noise layer which randomly zeroes out some coefficients in the net during training (and optionally while predicting). A coarse resemblance to random forests etc. is pretty immediate, and indeed you could just use those instead. Here, however, we are trying to average over strong learners, not weak ones.

The key insight here is that dropout has a rationale in terms of model averaging and as a kind of implicit probabilistic learning, because in a certain limit it approximates a Gaussian process (Gal and Ghahramani 2015, 2016a). Claims about using this for intriguing types of inference are floating about, e.g. M. Kasim et al. (2019) and M. F. Kasim et al. (2020).
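
In practice this means leaving the dropout layers switched on at prediction time and averaging over stochastic forward passes. A minimal sketch in PyTorch, assuming an arbitrary toy network; the architecture, dropout rate, and number of samples are all placeholders.

```python
import torch
import torch.nn as nn

# Toy network with dropout; the architecture is illustrative only.
model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 1),
)

def mc_dropout_predict(model, x, n_samples=100):
    """Monte Carlo dropout: keep dropout stochastic at prediction time."""
    model.train()  # leaves the Dropout layers active
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    model.eval()
    # Empirical mean and variance over the stochastic forward passes.
    return samples.mean(0), samples.var(0)

x = torch.randn(32, 10)
mean, var = mc_dropout_predict(model, x)
```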

Is this dangerous?

Should we stop weighting and start “stacking”? Yao et al. (2018) argue:

The widely recommended procedure of Bayesian model averaging is flawed in the M-open setting in which the true data-generating process is not one of the candidate models being fit. We take the idea of stacking from the point estimation literature and generalize to the combination of predictive distributions, extending the utility function to any proper scoring rule, using Pareto smoothed importance sampling to efficiently compute the required leave-one-out posterior distributions and regularization to get more stability. We compare stacking of predictive distributions to several alternatives: stacking of means, Bayesian model averaging (BMA), pseudo-BMA using AIC-type weighting, and a variant of pseudo-BMA that is stabilized using the Bayesian bootstrap. Based on simulations and real-data applications, we recommend stacking of predictive distributions, with BB-pseudo-BMA as an approximate alternative when computation cost is an issue.
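
For flavour, here is a minimal sketch of what stacking of predictive distributions amounts to computationally, assuming we already have a matrix of leave-one-out predictive densities for each candidate model (faked with random numbers below; the paper obtains them via Pareto smoothed importance sampling). The softmax parameterisation and generic optimiser are my shortcuts, not the authors' procedure.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical leave-one-out predictive densities p(y_i | x_i, M_k):
# rows are data points, columns are candidate models.
rng = np.random.default_rng(0)
loo_densities = rng.uniform(0.05, 1.0, size=(200, 3))

def neg_log_score(z):
    # Softmax keeps the weights on the simplex while we optimise unconstrained.
    w = np.exp(z - z.max())
    w /= w.sum()
    mixture = loo_densities @ w  # density of the weighted mixture at each point
    return -np.sum(np.log(mixture))

res = minimize(neg_log_score, x0=np.zeros(loo_densities.shape[1]))
weights = np.exp(res.x - res.x.max())
weights /= weights.sum()
print(weights)  # stacking weights for combining the predictive distributions
```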

Questions

These methods focus generally on the posterior predictive. How do I find posteriors for other (interpretable) values in my model?

References

Chipman, Hugh A, Edward I George, and Robert E McCulloch. 2006. “Bayesian Ensemble Learning.” In, 8.
Foong, Andrew Y K, David R Burt, Yingzhen Li, and Richard E Turner. 2019. “Pathologies of Factorised Gaussian and MC Dropout Posteriors in Bayesian Neural Networks.” In 4th Workshop on Bayesian Deep Learning (NeurIPS 2019), 17.
Gal, Yarin, and Zoubin Ghahramani. 2015. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” In Proceedings of the 33rd International Conference on Machine Learning (ICML-16). http://arxiv.org/abs/1506.02142.
———. 2016a. “Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference.” In 4th International Conference on Learning Representations (ICLR) Workshop Track. http://arxiv.org/abs/1506.02158.
———. 2016b. “Dropout as a Bayesian Approximation: Appendix.” May 25, 2016. http://arxiv.org/abs/1506.02157.
Gal, Yarin, Jiri Hron, and Alex Kendall. 2017. “Concrete Dropout.” May 22, 2017. http://arxiv.org/abs/1705.07832.
He, Bobby, Balaji Lakshminarayanan, and Yee Whye Teh. 2020. “Bayesian Deep Ensembles via the Neural Tangent Kernel.” In Advances in Neural Information Processing Systems. Vol. 33. https://proceedings.neurips.cc//paper_files/paper/2020/hash/0b1ec366924b26fc98fa7b71a9c249cf-Abstract.html.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling the Knowledge in a Neural Network.” March 9, 2015. http://arxiv.org/abs/1503.02531.
Huang, Gao, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. 2016. “Deep Networks with Stochastic Depth.” In Computer Vision – ECCV 2016, edited by Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, 646–61. Lecture Notes in Computer Science. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-46493-0_39.
Kasim, M. F., D. Watson-Parris, L. Deaconu, S. Oliver, P. Hatfield, D. H. Froula, G. Gregori, et al. 2020. “Up to Two Billion Times Acceleration of Scientific Simulations with Deep Neural Architecture Search.” January 17, 2020. http://arxiv.org/abs/2001.08055.
Kasim, Muhammad, J Topp-Mugglestone, P Hatfield, D H Froula, G Gregori, M Jarvis, E Viezzer, and Sam Vinko. 2019. “A Million Times Speed up in Parameters Retrieval with Deep Learning.” In, 5.
Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. 2017. “Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles.” In Proceedings of the 31st International Conference on Neural Information Processing Systems, 6405–16. NIPS’17. Red Hook, NY, USA: Curran Associates Inc. http://arxiv.org/abs/1612.01474.
Mandt, Stephan, Matthew D. Hoffman, and David M. Blei. 2017. “Stochastic Gradient Descent as Approximate Bayesian Inference.” JMLR, April. http://arxiv.org/abs/1704.04289.
Molchanov, Dmitry, Arsenii Ashukha, and Dmitry Vetrov. 2017. “Variational Dropout Sparsifies Deep Neural Networks.” In Proceedings of ICML. http://arxiv.org/abs/1701.05369.
Pearce, Tim, Mohamed Zaki, and Andy Neely. 2018. “Bayesian Neural Network Ensembles.” Third Workshop on Bayesian Deep Learning (NeurIPS 2018), Montréal, Canada., 5.
Wen, Yeming, Dustin Tran, and Jimmy Ba. 2020. “BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning.” In ICLR. http://arxiv.org/abs/2002.06715.
Wortsman, Mitchell, Maxwell Horton, Carlos Guestrin, Ali Farhadi, and Mohammad Rastegari. 2021. “Learning Neural Network Subspaces.” February 20, 2021. http://arxiv.org/abs/2102.10472.
Xie, Jingjing, Bing Xu, and Zhang Chuang. 2013. “Horizontal and Vertical Ensemble with Deep Representation for Classification.” June 12, 2013. http://arxiv.org/abs/1306.2759.
Yao, Yuling, Aki Vehtari, Daniel Simpson, and Andrew Gelman. 2018. “Using Stacking to Average Bayesian Predictive Distributions.” Bayesian Analysis 13 (3): 917–1007. https://doi.org/10.1214/17-BA1091.
